[00:00:50] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P75572 and previous config saved to /var/cache/conftool/dbconfig/20250429-000049-fceratto.json
[00:03:42] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:08:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:09:52] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1139576
[00:09:52] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1139576 (owner: TrainBranchBot)
[00:12:08] <wikibugs>	 (PS1) Zabe: enwiki and commons: Increase revision-slots cache expiry again [mediawiki-config] - https://gerrit.wikimedia.org/r/1139577 (https://phabricator.wikimedia.org/T183490)
[00:15:57] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P75573 and previous config saved to /var/cache/conftool/dbconfig/20250429-001557-fceratto.json
[00:18:42] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[00:31:05] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T392806)', diff saved to https://phabricator.wikimedia.org/P75574 and previous config saved to /var/cache/conftool/dbconfig/20250429-003104-fceratto.json
[00:31:24] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance
[00:31:31] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1226 (T392806)', diff saved to https://phabricator.wikimedia.org/P75575 and previous config saved to /var/cache/conftool/dbconfig/20250429-003131-fceratto.json
[00:38:27] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1139576 (owner: TrainBranchBot)
[00:39:48] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T392806)', diff saved to https://phabricator.wikimedia.org/P75576 and previous config saved to /var/cache/conftool/dbconfig/20250429-003948-fceratto.json
[00:47:21] <icinga-wm>	 RECOVERY - Disk space on centrallog2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops
[00:50:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[00:54:55] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P75577 and previous config saved to /var/cache/conftool/dbconfig/20250429-005455-fceratto.json
[01:03:18] <wikibugs>	 SRE, serviceops-radar, Patch-For-Review: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10774987 (Legoktm) >>! In T392834#10773349, @elukey wrote: > ` > elukey@mwmaint1002:/home$ sudo du -hs /home/* | sort -h | tail > ... > 11G /home/legoktm > ` >  > The home dirs may be...
[01:09:38] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.44.0-wmf.27 [core] (wmf/1.44.0-wmf.27) - https://gerrit.wikimedia.org/r/1139578 (https://phabricator.wikimedia.org/T386222)
[01:09:40] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.44.0-wmf.27 [core] (wmf/1.44.0-wmf.27) - https://gerrit.wikimedia.org/r/1139578 (https://phabricator.wikimedia.org/T386222) (owner: TrainBranchBot)
[01:10:02] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P75578 and previous config saved to /var/cache/conftool/dbconfig/20250429-011002-fceratto.json
[01:21:45] <wikibugs>	 SRE, serviceops-radar, Patch-For-Review: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10774996 (Dzahn) >>! In T392834#10774930, @bd808 wrote: > Dropping priority to High as it seems @Dzahn's cleanup work has taken care of the immediate problem. I'll leave it to him and...
[01:22:44] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.44.0-wmf.27 [core] (wmf/1.44.0-wmf.27) - https://gerrit.wikimedia.org/r/1139578 (https://phabricator.wikimedia.org/T386222) (owner: TrainBranchBot)
[01:25:10] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T392806)', diff saved to https://phabricator.wikimedia.org/P75579 and previous config saved to /var/cache/conftool/dbconfig/20250429-012509-fceratto.json
[01:25:28] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1255.eqiad.wmnet with reason: Maintenance
[01:25:45] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db[1256-1257].eqiad.wmnet with reason: Maintenance
[01:35:28] <jinxer-wm>	 FIRING: KeyholderUnarmed: 1 unarmed Keyholder key(s) on acmechief1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[01:50:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[02:00:05] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250429T0200)
[02:23:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:23:55] <jinxer-wm>	 FIRING: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:33:45] <wikibugs>	 SRE-swift-storage, Commons, MediaWiki-File-management, MediaWiki-Uploading, Wikimedia-production-error: Wikimedia\RequestTimeout\RequestTimeoutException: The maximum execution time of {limit} seconds was exceeded (via Special:UploadStash) - https://phabricator.wikimedia.org/T381109#10775079 (...
[02:33:55] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:35:27] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:38:55] <jinxer-wm>	 FIRING: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:54:42] <wikibugs>	 SRE, [Archived]Wikidata Dev Team, Prod-Kubernetes, Traffic, and 5 others: Frequent 500 Errors and Timeouts When Adding Statements to New Item or Lexeme-typed Properties - https://phabricator.wikimedia.org/T374230#10775088 (Kirilloparma)    >>! In T374230#10771849, @Silvan_WMDE wrote: > @Kirillopa...
[03:00:05] <jouncebot>	 Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250429T0300)
[03:01:46] <wikibugs>	 (PS1) TrainBranchBot: testwikis to 1.44.0-wmf.27 [mediawiki-config] - https://gerrit.wikimedia.org/r/1139583 (https://phabricator.wikimedia.org/T386222)
[03:01:48] <wikibugs>	 (CR) TrainBranchBot: [C:+2] testwikis to 1.44.0-wmf.27 [mediawiki-config] - https://gerrit.wikimedia.org/r/1139583 (https://phabricator.wikimedia.org/T386222) (owner: TrainBranchBot)
[03:02:37] <wikibugs>	 (Merged) jenkins-bot: testwikis to 1.44.0-wmf.27 [mediawiki-config] - https://gerrit.wikimedia.org/r/1139583 (https://phabricator.wikimedia.org/T386222) (owner: TrainBranchBot)
[03:03:00] <logmsgbot>	 !log mwpresync@deploy1003 Started scap sync-world: testwikis to 1.44.0-wmf.27  refs T386222
[03:03:05] <stashbot>	 T386222: 1.44.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T386222
[03:03:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[03:13:42] <jinxer-wm>	 FIRING: [7x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:19:18] <wikibugs>	 SRE, [Archived]Wikidata Dev Team, Prod-Kubernetes, Traffic, and 5 others: Frequent 500 Errors and Timeouts When Adding Statements to New Item or Lexeme-typed Properties - https://phabricator.wikimedia.org/T374230#10775097 (Jakob_WMDE) >>! In T374230#10775088, @Kirilloparma wrote: >  > @Silvan_WMD...
[03:23:55] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:24:55] <jinxer-wm>	 FIRING: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:53:42] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:53:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[04:00:04] <jouncebot>	 Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250429T0400)
[04:03:42] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:04:24] <logmsgbot>	 !log mwpresync@deploy1003 Finished scap sync-world: testwikis to 1.44.0-wmf.27  refs T386222 (duration: 61m 23s)
[04:04:28] <stashbot>	 T386222: 1.44.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T386222
[04:06:25] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1071 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:11:09] <icinga-wm>	 PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:11:44] <jinxer-wm>	 FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[04:12:07] <icinga-wm>	 RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:16:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[04:17:47] <wikibugs>	 (CR) Pppery: "Ptwikibooks isn't ready yet, it has its own separate ugly set of special cases:" [mediawiki-config] - https://gerrit.wikimedia.org/r/1139517 (https://phabricator.wikimedia.org/T380909) (owner: Zoe)
[04:18:42] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[04:21:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[04:40:11] <icinga-wm>	 PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:41:07] <icinga-wm>	 RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:41:25] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1071 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:59:55] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:00:55] <jinxer-wm>	 FIRING: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:03:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[05:07:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:08:07] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:08:27] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 207, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:08:27] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 111, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:09:05] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 46, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:09:07] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 129, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:09:39] <jinxer-wm>	 FIRING: TransitBGPDown: Transit BGP session down between cr2-codfw and Lumen (2001:1900:2100::4b41) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-codfw:9804&var-bgp_group=Transit6&var-bgp_neighbor=Lumen - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[05:09:51] <jinxer-wm>	 FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:10:27] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 111, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:11:39] <jinxer-wm>	 FIRING: [8x] CoreBGPDown: Core BGP session down between cr1-codfw and cr2-eqsin (103.102.166.130) - group Confed_eqsin - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[05:14:39] <jinxer-wm>	 FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Lumen (2001:1900:2100::4b41) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[05:14:51] <jinxer-wm>	 FIRING: [7x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:14:55] <wikibugs>	 (CR) Arnaudb: "kudos for the ascii arts 😊" [puppet] - https://gerrit.wikimedia.org/r/1137840 (https://phabricator.wikimedia.org/T392212) (owner: Dzahn)
[05:15:21] <wikibugs>	 (PS2) Dzahn: gerrit: replace legacy fact with modern fact [puppet] - https://gerrit.wikimedia.org/r/1137842
[05:15:32] <wikibugs>	 (CR) Arnaudb: [C:+1] "lgtm!" [puppet] - https://gerrit.wikimedia.org/r/1137842 (owner: Dzahn)
[05:20:17] <wikibugs>	 (CR) Arnaudb: "looks good!" [dns] - https://gerrit.wikimedia.org/r/1139546 (https://phabricator.wikimedia.org/T392128) (owner: AOkoth)
[05:20:25] <wikibugs>	 (CR) Arnaudb: [C:+1] wmnet: change active aphlict host [dns] - https://gerrit.wikimedia.org/r/1139546 (https://phabricator.wikimedia.org/T392128) (owner: AOkoth)
[05:35:28] <jinxer-wm>	 FIRING: KeyholderUnarmed: 1 unarmed Keyholder key(s) on acmechief1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[05:48:37] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed again - https://phabricator.wikimedia.org/T391372#10775197 (Marostegui) Thank you @VRiley-WMF - I will reimage the host.
[05:48:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[05:57:01] <Bsadowski1>	 "Error: 503, Backend fetch failed at Tue, 29 Apr 2025 05:56:47 GMT"
[05:57:01] <Bsadowski1>	 :O
[05:57:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250429T0600)
[06:00:05] <jouncebot>	 marostegui, Amir1, and federico3: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250429T0600).
[06:00:24] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.ipmi-password-reset
[06:00:25] <logmsgbot>	 !log marostegui@cumin1002 END (FAIL) - Cookbook sre.hosts.ipmi-password-reset (exit_code=99)
[06:00:58] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.ipmi-password-reset
[06:00:59] <logmsgbot>	 !log marostegui@cumin1002 END (FAIL) - Cookbook sre.hosts.ipmi-password-reset (exit_code=99)
[06:03:57] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1246.eqiad.wmnet with OS bookworm
[06:04:04] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed again - https://phabricator.wikimedia.org/T391372#10775225 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1002 for host db1246.eqiad.wmnet with OS bookworm
[06:04:09] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:05:09] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:20:23] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7002.magru.wmnet
[06:21:50] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1246.eqiad.wmnet with reason: host reimage
[06:22:46] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet
[06:25:08] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1246.eqiad.wmnet with reason: host reimage
[06:26:37] <wikibugs>	 (CR) Marostegui: [C:+2] instance.schema: Add x3 [puppet] - https://gerrit.wikimedia.org/r/1139350 (https://phabricator.wikimedia.org/T390530) (owner: Marostegui)
[06:28:08] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 130, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:28:08] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:29:39] <jinxer-wm>	 RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Lumen (2001:1900:2100::4b41) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[06:29:51] <jinxer-wm>	 FIRING: [7x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:30:27] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet
[06:30:31] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7002.magru.wmnet
[06:31:39] <jinxer-wm>	 FIRING: [8x] CoreBGPDown: Core BGP session down between cr1-codfw and cr2-eqsin (103.102.166.130) - group Confed_eqsin - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[06:31:59] <wikibugs>	 (PS1) Brouberol: airflow: separate postgresql and airflow helmfiles [deployment-charts] - https://gerrit.wikimedia.org/r/1139657 (https://phabricator.wikimedia.org/T391348)
[06:32:07] <wikibugs>	 (PS1) Brouberol: deployment_server: provision dedicated kubeconfigs for airflow PGs [puppet] - https://gerrit.wikimedia.org/r/1139659 (https://phabricator.wikimedia.org/T391348)
[06:32:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1033 es2033 T391921', diff saved to https://phabricator.wikimedia.org/P75580 and previous config saved to /var/cache/conftool/dbconfig/20250429-063219-marostegui.json
[06:32:24] <stashbot>	 T391921: Migrate read only external store to MariaDB 10.11 - https://phabricator.wikimedia.org/T391921
[06:32:43] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2033.codfw.wmnet,es1033.eqiad.wmnet with reason: Maintenance
[06:33:00] <wikibugs>	 (PS1) Marostegui: es1033: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1139672 (https://phabricator.wikimedia.org/T391921)
[06:33:24] <icinga-wm>	 PROBLEM - ganeti-wconfd running on ganeti7004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[06:33:40] <wikibugs>	 (CR) Marostegui: [C:+2] es1033: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1139672 (https://phabricator.wikimedia.org/T391921) (owner: Marostegui)
[06:33:42] <jinxer-wm>	 FIRING: [7x] ProbeDown: Service ganeti7002:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:34:24] <icinga-wm>	 PROBLEM - ganeti-wconfd running on ganeti7001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[06:35:27] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:35:41] <wikibugs>	 (PS1) Marostegui: es2033: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1139703 (https://phabricator.wikimedia.org/T391921)
[06:37:47] <wikibugs>	 (CR) Marostegui: [C:+2] es2033: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1139703 (https://phabricator.wikimedia.org/T391921) (owner: Marostegui)
[06:38:12] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1033 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P75581 and previous config saved to /var/cache/conftool/dbconfig/20250429-063811-root.json
[06:40:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2033 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P75582 and previous config saved to /var/cache/conftool/dbconfig/20250429-064032-root.json
[06:46:04] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1246.eqiad.wmnet with OS bookworm
[06:46:13] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed again - https://phabricator.wikimedia.org/T391372#10775271 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1002 for host db1246.eqiad.wmnet with OS bookworm completed: - db1246 (**WARN**)   - Removed from Puppet...
[06:46:34] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed again - https://phabricator.wikimedia.org/T391372#10775272 (Marostegui) I've reimaged the host, I had to reset the idrac password.
[06:51:21] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.clone of db1188.eqiad.wmnet onto db1246.eqiad.wmnet
[06:51:24] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.depool db1188 - Depool db1188.eqiad.wmnet to then clone it to db1246.eqiad.wmnet - marostegui@cumin1002
[06:51:53] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1188 - Depool db1188.eqiad.wmnet to then clone it to db1246.eqiad.wmnet - marostegui@cumin1002
[06:53:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1033 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P75584 and previous config saved to /var/cache/conftool/dbconfig/20250429-065317-root.json
[06:55:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2033 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P75585 and previous config saved to /var/cache/conftool/dbconfig/20250429-065537-root.json
[06:58:10] <icinga-wm>	 PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:58:10] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:59:06] <icinga-wm>	 RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:59:06] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:00:04] <jouncebot>	 Amir1, Urbanecm, and awight: How many deployers does it take to do UTC morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250429T0700).
[07:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:01:01] <wikibugs>	 (CR) Majavah: [C:+2] P:wmcs::metricsinfra: Add instance FQDN template [puppet] - https://gerrit.wikimedia.org/r/1139511 (https://phabricator.wikimedia.org/T392570) (owner: Majavah)
[07:02:00] <wikibugs>	 Puppet: Puppet broken on db1178.eqiad.wmnet - https://phabricator.wikimedia.org/T392627#10775314 (elukey) ` elukey@db1178:~$ sudo zgrep -c "SSL_read: sslv3 alert certificate unknown" /var/log/puppet.log* /var/log/puppet.log:0 /var/log/puppet.log.1:0 /var/log/puppet.log.2.gz:0 /var/log/puppet.log.3.gz:0 /...
[07:03:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[07:06:38] <wikibugs>	 (PS2) Filippo Giunchedi: puppetdb: add tunable for maximum-pool-size [puppet] - https://gerrit.wikimedia.org/r/1139481
[07:07:33] <wikibugs>	 (CR) Filippo Giunchedi: puppetdb: add tunable for maximum-pool-size (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1139481 (owner: Filippo Giunchedi)
[07:08:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1033 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P75586 and previous config saved to /var/cache/conftool/dbconfig/20250429-070822-root.json
[07:10:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2033 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P75587 and previous config saved to /var/cache/conftool/dbconfig/20250429-071042-root.json
[07:11:04] <marostegui>	 !log Reboot all codfw dbproxy2* hosts T392806
[07:11:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:31] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbproxy[2005-2008].codfw.wmnet with reason: Maintenance
[07:13:42] <jinxer-wm>	 FIRING: [7x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:23:27] <moritzm>	 !log imported debdeploy 0.0.99.14-1+deb13u1 to apt.wikimedia.org/main for trixie-wikimedia T391083
[07:23:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1033 (re)pooling @ 30%: Repooling', diff saved to https://phabricator.wikimedia.org/P75588 and previous config saved to /var/cache/conftool/dbconfig/20250429-072328-root.json
[07:23:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:32] <stashbot>	 T391083: Prepare our custom installer and the base layer for Trixie - https://phabricator.wikimedia.org/T391083
[07:25:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2033 (re)pooling @ 30%: Repooling', diff saved to https://phabricator.wikimedia.org/P75589 and previous config saved to /var/cache/conftool/dbconfig/20250429-072548-root.json
[07:29:21] <wikibugs>	 (CR) Btullis: [C:+1] deployment_server: provision dedicated kubeconfigs for airflow PGs [puppet] - https://gerrit.wikimedia.org/r/1139659 (https://phabricator.wikimedia.org/T391348) (owner: Brouberol)
[07:30:13] <wikibugs>	 (CR) Btullis: [C:+1] airflow: separate postgresql and airflow helmfiles [deployment-charts] - https://gerrit.wikimedia.org/r/1139657 (https://phabricator.wikimedia.org/T391348) (owner: Brouberol)
[07:31:49] <wikibugs>	 (PS1) Elukey: profile::pyrra::filesystem::slos: fix citoid's latency bucket [puppet] - https://gerrit.wikimedia.org/r/1139774 (https://phabricator.wikimedia.org/T391852)
[07:32:16] <joelyrookewmde>	 Hi all, I'm planning to run a couple of maintenance scripts to add wikidata support for nupwiki (as per T390715). Let me know if that will disrupt anyone's deployment
[07:32:16] <stashbot>	 T390715: Add Wikidata support for nupwiki - https://phabricator.wikimedia.org/T390715
[07:33:02] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet
[07:34:43] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet
[07:37:28] <wikibugs>	 (CR) Elukey: [C:+2] profile::pyrra::filesystem::slos: fix citoid's latency bucket [puppet] - https://gerrit.wikimedia.org/r/1139774 (https://phabricator.wikimedia.org/T391852) (owner: Elukey)
[07:38:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1033 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P75590 and previous config saved to /var/cache/conftool/dbconfig/20250429-073833-root.json
[07:39:44] <wikibugs>	 (PS1) Slyngshede: Modern fronted [software/bitu] - https://gerrit.wikimedia.org/r/1139776 (https://phabricator.wikimedia.org/T391443)
[07:40:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2033 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P75591 and previous config saved to /var/cache/conftool/dbconfig/20250429-074053-root.json
[07:42:21] <wikibugs>	 (CR) Brouberol: [C:+2] deployment_server: provision dedicated kubeconfigs for airflow PGs [puppet] - https://gerrit.wikimedia.org/r/1139659 (https://phabricator.wikimedia.org/T391348) (owner: Brouberol)
[07:44:06] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 47, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:44:41] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet
[07:44:46] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet
[07:46:39] <jinxer-wm>	 FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[07:48:42] <jinxer-wm>	 FIRING: [7x] ProbeDown: Service ganeti7001:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:49:51] <jinxer-wm>	 FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-1/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:50:08] <moritzm>	 !log copied wmf-certificates 1~20230906-1 from bookworm-wikimedia to trixie-wikimedia T391083
[07:50:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:13] <stashbot>	 T391083: Prepare our custom installer and the base layer for Trixie - https://phabricator.wikimedia.org/T391083
[07:50:25] <wikibugs>	 (PS1) Arnaudb: gerrit: drop X-Forwarded-For received from clients [puppet] - https://gerrit.wikimedia.org/r/1139778 (https://phabricator.wikimedia.org/T388791)
[07:50:55] <wikibugs>	 (CR) Arnaudb: [C:+1] gerrit: drop X-Forwarded-For received from clients [puppet] - https://gerrit.wikimedia.org/r/1139778 (https://phabricator.wikimedia.org/T388791) (owner: Arnaudb)
[07:51:08] <wikibugs>	 (CR) Arnaudb: [C:+2] gerrit: drop X-Forwarded-For received from clients [puppet] - https://gerrit.wikimedia.org/r/1139778 (https://phabricator.wikimedia.org/T388791) (owner: Arnaudb)
[07:51:39] <jinxer-wm>	 RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[07:53:06] <moritzm>	 !log copied cadvisor 0.44.0+ds1-1~wmf1 from bookworm-wikimedia to trixie-wikimedia T391083
[07:53:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:23] <wikibugs>	 (CR) Jelto: [C:+1] "lgtm, similar to the patch discussed in Gerrit" [puppet] - https://gerrit.wikimedia.org/r/1139778 (https://phabricator.wikimedia.org/T388791) (owner: Arnaudb)
[07:53:28] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 208, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:53:28] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 114, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:53:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1033 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P75593 and previous config saved to /var/cache/conftool/dbconfig/20250429-075339-root.json
[07:53:42] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:54:19] <wikibugs>	 (PS1) Klausman: admin/data.yaml: Add dr0ptp4kt (Adam Baso) to users of ml-lab100x [puppet] - https://gerrit.wikimedia.org/r/1139779
[07:54:50] <suzannewoodWMDE2>	 As mentioned by @joelyrookewmde, we are about to run the maintenance script to add wikidata support for nupwiki
[07:54:51] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-1/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:56:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2033 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P75594 and previous config saved to /var/cache/conftool/dbconfig/20250429-075600-root.json
[07:56:14] <suzannewoodWMDE2>	 !log suzannewood@mwmaint1002:~$ foreachwikiindblist wikidataclient extensions/Wikibase/lib/maintenance/populateSitesTable.php --force-protocol https
[07:56:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:13] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7004.magru.wmnet
[07:57:52] <icinga-wm>	 PROBLEM - Dell PowerEdge RAID Controller on db2176 is CRITICAL: communication: 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[07:57:53] <icinga-wm>	 ACKNOWLEDGEMENT - Dell PowerEdge RAID Controller on db2176 is CRITICAL: communication: 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T392876 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[07:57:58] <wikibugs>	 ops-codfw, SRE, DBA, DC-Ops: Degraded RAID on db2176 - https://phabricator.wikimedia.org/T392876 (ops-monitoring-bot) NEW
[07:59:16] <wikibugs>	 ops-codfw, SRE, DBA, DC-Ops: Degraded RAID on db2176 - https://phabricator.wikimedia.org/T392876#10775406 (Marostegui) p:Triage→Medium This is a normal s1 slave - can we get a new disk for it?
[08:00:04] <jouncebot>	 hashar and dduvall: Deploy window MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250429T0800)
[08:01:19] <hashar>	 o/
[08:01:25] <hashar>	 I am running the train for group0
[08:01:37] <wikibugs>	 (PS1) TrainBranchBot: group0 to 1.44.0-wmf.27 [mediawiki-config] - https://gerrit.wikimedia.org/r/1139780 (https://phabricator.wikimedia.org/T386222)
[08:01:38] <wikibugs>	 (CR) TrainBranchBot: [C:+2] group0 to 1.44.0-wmf.27 [mediawiki-config] - https://gerrit.wikimedia.org/r/1139780 (https://phabricator.wikimedia.org/T386222) (owner: TrainBranchBot)
[08:01:49] <logmsgbot>	 jmm@cumin2002 drain-node (PID 2021306) is awaiting input
[08:01:50] <wikibugs>	 (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1136998 (https://phabricator.wikimedia.org/T392073) (owner: Fabfur)
[08:02:27] <wikibugs>	 (Merged) jenkins-bot: group0 to 1.44.0-wmf.27 [mediawiki-config] - https://gerrit.wikimedia.org/r/1139780 (https://phabricator.wikimedia.org/T386222) (owner: TrainBranchBot)
[08:03:08] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti7004.magru.wmnet
[08:03:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[08:06:44] <wikibugs>	 (CR) Fabfur: cache,haproxy: allowed methods check and set response headers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1136998 (https://phabricator.wikimedia.org/T392073) (owner: Fabfur)
[08:08:21] <wikibugs>	 (CR) Brouberol: [C:+2] airflow: separate postgresql and airflow helmfiles [deployment-charts] - https://gerrit.wikimedia.org/r/1139657 (https://phabricator.wikimedia.org/T391348) (owner: Brouberol)
[08:08:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1033 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P75595 and previous config saved to /var/cache/conftool/dbconfig/20250429-080844-root.json
[08:11:02] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7004.magru.wmnet
[08:11:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2033 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P75596 and previous config saved to /var/cache/conftool/dbconfig/20250429-081106-root.json
[08:11:07] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7004.magru.wmnet
[08:12:09] <logmsgbot>	 !log klausman@cumin1002 START - Cookbook sre.hosts.reboot-single for host ml-lab1001.eqiad.wmnet
[08:13:42] <jinxer-wm>	 FIRING: [7x] ProbeDown: Service ganeti7004:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:15:45] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1139481 (owner: Filippo Giunchedi)
[08:16:09] <wikibugs>	 (CR) Jelto: [C:+1] "change looks good to me, thanks. Commit message is a bit off" [deployment-charts] - https://gerrit.wikimedia.org/r/1138459 (https://phabricator.wikimedia.org/T350794) (owner: AOkoth)
[08:17:29] <logmsgbot>	 !log hashar@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.44.0-wmf.27  refs T386222
[08:17:30] <wikibugs>	 (CR) Fabfur: [C:+2] cache,haproxy: allowed methods check and set response headers [puppet] - https://gerrit.wikimedia.org/r/1136998 (https://phabricator.wikimedia.org/T392073) (owner: Fabfur)
[08:17:33] <stashbot>	 T386222: 1.44.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T386222
[08:18:13] <wikibugs>	 (CR) Fabfur: [C:+2] "Done" [puppet] - https://gerrit.wikimedia.org/r/1136998 (https://phabricator.wikimedia.org/T392073) (owner: Fabfur)
[08:18:42] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[08:19:19] <logmsgbot>	 !log klausman@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-lab1001.eqiad.wmnet
[08:19:28] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2034.codfw.wmnet
[08:19:29] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2034.codfw.wmnet
[08:19:32] <wikibugs>	 (PS1) Majavah: P:wmcs: toolsdb_replica_cnf: Remove HTTPS redirect [puppet] - https://gerrit.wikimedia.org/r/1139781
[08:20:35] <wikibugs>	 (PS1) Majavah: P:wmcs::proxy::static: Bind on IPv6 [puppet] - https://gerrit.wikimedia.org/r/1139782 (https://phabricator.wikimedia.org/T392826)
[08:21:57] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[08:22:17] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[08:22:38] <logmsgbot>	 marostegui@cumin1002 clone (PID 4106512) is awaiting input
[08:23:28] <wikibugs>	 (PS2) Majavah: P:wmcs::proxy::static: Bind on IPv6 [puppet] - https://gerrit.wikimedia.org/r/1139782 (https://phabricator.wikimedia.org/T392826)
[08:23:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1033 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P75597 and previous config saved to /var/cache/conftool/dbconfig/20250429-082349-root.json
[08:24:47] <wikibugs>	 (PS1) Slyngshede: Upgrade to version 7.1.6 [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1139783
[08:24:51] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 8 hosts with reason: Maintenance
[08:25:57] <fabfur>	 !log rolling restart haproxykafka on A:cp to apply new configuration https://gerrit.wikimedia.org/r/c/operations/puppet/+/1136679 (T382571)
[08:26:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:01] <stashbot>	 T382571: [HAProxy migration] HAProxy and VarnishKafka should produce compatible datasets - https://phabricator.wikimedia.org/T382571
[08:26:12] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2033 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P75598 and previous config saved to /var/cache/conftool/dbconfig/20250429-082611-root.json
[08:28:01] <moritzm>	 !log installing wget security updates
[08:28:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:59] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.pool db1188 slowly with 10 steps - Pool db1188.eqiad.wmnet in after cloning
[08:36:38] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet
[08:36:48] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin depooling P{lvs7002.magru.wmnet} and A:liberica
[08:37:09] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs7002.magru.wmnet} and A:liberica
[08:38:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1033 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P75600 and previous config saved to /var/cache/conftool/dbconfig/20250429-083855-root.json
[08:39:32] <icinga-wm>	 PROBLEM - BGP status on asw1-b4-magru.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:39:42] <godog>	 !log bounce prometheus-statsd-exporter on stat1011 - T389344
[08:39:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:46] <stashbot>	 T389344: analytics/wmde/scripts Graphite to Prometheus migration - https://phabricator.wikimedia.org/T389344
[08:40:08] <icinga-wm>	 PROBLEM - LDAP -writable server- on seaborgium is CRITICAL: Could not bind to the LDAP server https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting
[08:40:25] <vgutierrez>	 BGP alert is me
[08:40:40] <vgutierrez>	 seaborgium.. moritzm ^^
[08:41:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2033 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P75601 and previous config saved to /var/cache/conftool/dbconfig/20250429-084116-root.json
[08:42:01] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet
[08:42:08] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:42:08] <icinga-wm>	 PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:42:08] <icinga-wm>	 RECOVERY - LDAP -writable server- on seaborgium is OK: LDAP OK - 0.009 seconds response time https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting
[08:42:31] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bookworm 12.10 point update - https://phabricator.wikimedia.org/T389034#10775532 (MoritzMuehlenhoff)
[08:43:06] <icinga-wm>	 RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:43:08] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:43:20] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C:+1] "LGTM. I haven't checked the syntax though." [puppet] - https://gerrit.wikimedia.org/r/1139782 (https://phabricator.wikimedia.org/T392826) (owner: Majavah)
[08:43:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet
[08:43:44] <wikibugs>	 (CR) Majavah: [C:+2] P:wmcs::proxy::static: Bind on IPv6 [puppet] - https://gerrit.wikimedia.org/r/1139782 (https://phabricator.wikimedia.org/T392826) (owner: Majavah)
[08:45:27] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs7002.magru.wmnet
[08:46:38] <wikibugs>	 (CR) Sergio Gimeno: [C:-1] "please review target wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/1136986 (https://phabricator.wikimedia.org/T341599) (owner: Cyndywikime)
[08:48:35] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet
[08:49:32] <icinga-wm>	 RECOVERY - BGP status on asw1-b4-magru.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:51:24] <wikibugs>	 (PS1) Majavah: P:wmcs::proxy::static: Fix syntax for binding on both families [puppet] - https://gerrit.wikimedia.org/r/1139786
[08:52:05] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C:+1] "LGTM. I haven't checked the syntax myself though." [puppet] - https://gerrit.wikimedia.org/r/1139786 (owner: Majavah)
[08:52:19] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs7002.magru.wmnet
[08:54:05] <wikibugs>	 (CR) Majavah: [C:+2] P:wmcs::proxy::static: Fix syntax for binding on both families [puppet] - https://gerrit.wikimedia.org/r/1139786 (owner: Majavah)
[08:57:36] <moritzm>	 vgutierrez: thanks for the pointer, slapd gets restarted automatically every few weeks, this was just unfortunate timing, otherwise this doesn't trigger
[08:57:42] <wikibugs>	 (CR) Michael Große: [C:+1] "I think this is ok for now." [puppet] - https://gerrit.wikimedia.org/r/1139515 (https://phabricator.wikimedia.org/T392834) (owner: Gergő Tisza)
[08:59:11] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1139783 (owner: Slyngshede)
[09:00:55] <jinxer-wm>	 FIRING: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:01:56] <wikibugs>	 (PS1) Majavah: P:wmcs::proxy::static: Fix listening on IPv4 [puppet] - https://gerrit.wikimedia.org/r/1139791
[09:04:04] <wikibugs>	 (PS3) Zoe: Set flow boards readonly on fiwikimedia and gomwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1139517 (https://phabricator.wikimedia.org/T380909)
[09:04:14] <wikibugs>	 (CR) Majavah: [C:+2] P:wmcs::proxy::static: Fix listening on IPv4 [puppet] - https://gerrit.wikimedia.org/r/1139791 (owner: Majavah)
[09:09:37] <wikibugs>	 (CR) Vgutierrez: "this needs to be in sync with the racking plan" [puppet] - https://gerrit.wikimedia.org/r/1139559 (https://phabricator.wikimedia.org/T392851) (owner: BCornwall)
[09:10:41] <wikibugs>	 (CR) Slyngshede: [V:+2 C:+2] Upgrade to version 7.1.6 [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1139783 (owner: Slyngshede)
[09:10:44] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin depooling P{lvs7001.magru.wmnet} and A:liberica
[09:11:06] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs7001.magru.wmnet} and A:liberica
[09:13:20] <icinga-wm>	 PROBLEM - BGP status on asw1-b3-magru.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:18:22] <suzannewoodWMDE2>	 The populateSitesTable.php script we were running seems to have stopped; it succeeded for tswiktionary but did not proceed from ttwiki onwards
[09:18:29] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs7001.magru.wmnet
[09:19:01] <wikibugs>	 (PS1) Slyngshede: Update Debian changelog - 7.1.6 [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1139794
[09:19:14] <wikibugs>	 (CR) Slyngshede: [V:+2 C:+2] Update Debian changelog - 7.1.6 [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1139794 (owner: Slyngshede)
[09:21:50] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs7001.magru.wmnet
[09:22:20] <icinga-wm>	 RECOVERY - BGP status on asw1-b3-magru.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:24:29] <wikibugs>	 (PS1) Marostegui: wmnet: Failover m2-master [dns] - https://gerrit.wikimedia.org/r/1139795 (https://phabricator.wikimedia.org/T392806)
[09:25:20] <wikibugs>	 (CR) Marostegui: "@fceratto@wikimedia.org please confirm dbproxy1023 is the active one and dbproxy1025 has the same puppet config so I can failover to dbpro" [dns] - https://gerrit.wikimedia.org/r/1139795 (https://phabricator.wikimedia.org/T392806) (owner: Marostegui)
[09:25:29] <wikibugs>	 SRE, Observability-Metrics: Create a Pyrra template for Istio-based K8s services and apply it to Citoid - https://phabricator.wikimedia.org/T391852#10775662 (elukey) All right I think both request and latency SLOs are now looking good, way better than before. After a chat with Reuven I realized that we'l...
[09:29:53] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin depooling P{lvs5006.eqsin.wmnet} and A:liberica
[09:30:20] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs5006.eqsin.wmnet} and A:liberica
[09:31:13] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs5006.eqsin.wmnet
[09:31:20] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:34:45] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs5006.eqsin.wmnet
[09:35:28] <jinxer-wm>	 FIRING: KeyholderUnarmed: 1 unarmed Keyholder key(s) on acmechief1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[09:37:52] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin depooling P{lvs5005.eqsin.wmnet} and A:liberica
[09:39:29] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs5005.eqsin.wmnet} and A:liberica
[09:40:20] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:40:20] <wikibugs>	 (PS1) Jelto: gerrit: require user for gitiles access [puppet] - https://gerrit.wikimedia.org/r/1139798 (https://phabricator.wikimedia.org/T392467)
[09:41:29] <wikibugs>	 (PS5) Cyndywikime: Growth-Beta: Configure higher Impact Module edit limits for pilot wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1136986 (https://phabricator.wikimedia.org/T341599)
[09:41:52] <vgutierrez>	 !log re-arming keyholder in acmechief and acmechief-test instances
[09:41:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:05] <logmsgbot>	 !log tappof@cumin1002 START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet
[09:44:19] <logmsgbot>	 !log tappof@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet
[09:44:51] <TheresNoTime>	 !log Ran fixStuckGlobalRename.php for T392873 — job (re)started OK
[09:44:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:55] <stashbot>	 T392873: Unblock stuck global rename of Ikan - https://phabricator.wikimedia.org/T392873
[09:45:18] <fabfur>	 !log uploading haproxykafka 0.3.7 to reprepro (T387454)
[09:45:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:45:22] <stashbot>	 T387454: Add HAproxy termination field to webrequest - https://phabricator.wikimedia.org/T387454
[09:45:28] <jinxer-wm>	 RESOLVED: KeyholderUnarmed: 1 unarmed Keyholder key(s) on acmechief1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[09:46:14] <logmsgbot>	 !log tappof@cumin1002 START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet
[09:46:17] <logmsgbot>	 !log tappof@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet
[09:46:35] <logmsgbot>	 !log tappof@cumin1002 START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet
[09:47:10] <wikibugs>	 (03PS1) 10Federico Ceratto: sre.mysql.clone: speed up pooling in [cookbooks] - 10https://gerrit.wikimedia.org/r/1139799 (https://phabricator.wikimedia.org/T392883)
[09:47:10] <wikibugs>	 (03CR) 10Federico Ceratto: "Tiny change, just a speedup as discussed." [cookbooks] - 10https://gerrit.wikimedia.org/r/1139799 (https://phabricator.wikimedia.org/T392883) (owner: 10Federico Ceratto)
[09:47:13] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs5005.eqsin.wmnet
[09:47:30] <wikibugs>	 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 06Traffic: Spicerack's Icinga module should provide a way to skip specific services in sub-optimal but desired state - https://phabricator.wikimedia.org/T392848#10775722 (10elukey) We discussed the options on IRC, to summarize:  1) The DNS cookbook co...
[09:48:20] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+2] sre.mysql.clone: fix warnings/tests [cookbooks] - 10https://gerrit.wikimedia.org/r/1137285 (owner: 10Volans)
[09:48:30] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] sre.mysql.clone: speed up pooling in [cookbooks] - 10https://gerrit.wikimedia.org/r/1139799 (https://phabricator.wikimedia.org/T392883) (owner: 10Federico Ceratto)
[09:50:23] <logmsgbot>	 !log tappof@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet
[09:50:34] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] gerrit: prevent crawling of some URLs [puppet] - 10https://gerrit.wikimedia.org/r/1138331 (https://phabricator.wikimedia.org/T392669) (owner: 10Hashar)
[09:50:45] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs5005.eqsin.wmnet
[09:51:41] <wikibugs>	 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 06Traffic: Spicerack's Icinga module should provide a way to skip specific services in sub-optimal but desired state - https://phabricator.wikimedia.org/T392848#10775738 (10elukey) From https://icinga.com/docs/icinga-2/latest/doc/24-appendix/ it seems...
[09:56:27] <wikibugs>	 (03CR) 10Btullis: [C:03+1] admin/data.yaml: Add dr0ptp4kt (Adam Baso) to users of ml-lab100x [puppet] - 10https://gerrit.wikimedia.org/r/1139779 (owner: 10Klausman)
[09:58:54] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C:03+1] admin/data.yaml: Add dr0ptp4kt (Adam Baso) to users of ml-lab100x [puppet] - 10https://gerrit.wikimedia.org/r/1139779 (owner: 10Klausman)
[09:58:55] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin depooling P{lvs5004.eqsin.wmnet} and A:liberica
[09:59:10] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs5004.eqsin.wmnet} and A:liberica
[09:59:22] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250429T1000)
[10:00:55] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:02:55] <jinxer-wm>	 FIRING: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:11:34] <wikibugs>	 (03PS1) 10Btullis: Update clouddumps contactgroups to reflect shared ownership [puppet] - 10https://gerrit.wikimedia.org/r/1139804
[10:12:39] <wikibugs>	 (03PS2) 10Btullis: Update clouddumps contactgroups to reflect shared ownership [puppet] - 10https://gerrit.wikimedia.org/r/1139804
[10:14:22] <wikibugs>	 (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5384/console" [puppet] - 10https://gerrit.wikimedia.org/r/1139804 (owner: 10Btullis)
[10:14:57] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs5004.eqsin.wmnet
[10:18:29] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs5004.eqsin.wmnet
[10:22:59] <wikibugs>	 (03PS1) 10Hashar: gerrit: split Gerrit and Gitiles proxy pools [puppet] - 10https://gerrit.wikimedia.org/r/1139806 (https://phabricator.wikimedia.org/T392467)
[10:22:59] <wikibugs>	 (03PS1) 10Hashar: gerrit: lower connections to Gitiles from 25 to 4 [puppet] - 10https://gerrit.wikimedia.org/r/1139807 (https://phabricator.wikimedia.org/T392467)
[10:25:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet
[10:27:39] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet
[10:27:54] <wikibugs>	 (03CR) 10Hashar: "For reference: ProxyPass doc https://httpd.apache.org/docs/2.4/mod/mod_proxy.html#proxypass" [puppet] - 10https://gerrit.wikimedia.org/r/1139806 (https://phabricator.wikimedia.org/T392467) (owner: 10Hashar)
[10:28:04] <wikibugs>	 (03CR) 10Hashar: "For reference: ProxyPass doc https://httpd.apache.org/docs/2.4/mod/mod_proxy.html#proxypass" [puppet] - 10https://gerrit.wikimedia.org/r/1139807 (https://phabricator.wikimedia.org/T392467) (owner: 10Hashar)
[10:28:58] <wikibugs>	 (03PS2) 10Gergő Tisza: mediawiki: Make refreshLinkRecommendations job less verbose [puppet] - 10https://gerrit.wikimedia.org/r/1139515 (https://phabricator.wikimedia.org/T392834)
[10:29:12] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] mediawiki: Make refreshLinkRecommendations job less verbose [puppet] - 10https://gerrit.wikimedia.org/r/1139515 (https://phabricator.wikimedia.org/T392834) (owner: 10Gergő Tisza)
[10:29:15] <wikibugs>	 (03CR) 10Ladsgroup: [V:03+2 C:03+2] mediawiki: Make refreshLinkRecommendations job less verbose [puppet] - 10https://gerrit.wikimedia.org/r/1139515 (https://phabricator.wikimedia.org/T392834) (owner: 10Gergő Tisza)
[10:30:07] <logmsgbot>	 !log tappof@cumin1002 START - Cookbook sre.hosts.reboot-single for host grafana1002.eqiad.wmnet
[10:31:11] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin depooling P{lvs4010.ulsfo.wmnet} and A:liberica
[10:31:34] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs4010.ulsfo.wmnet} and A:liberica
[10:31:58] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs4010.ulsfo.wmnet
[10:32:14] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:32:18] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+2] GlobalBlocking: Migrate fixGlobalBlockWhitelist [puppet] - 10https://gerrit.wikimedia.org/r/1139078 (https://phabricator.wikimedia.org/T388542) (owner: 10Kamila Součková)
[10:32:53] <wikibugs>	 (03PS1) 10Mvolz: Change citoid config for test wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139808 (https://phabricator.wikimedia.org/T361576)
[10:33:42] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet
[10:33:42] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Change citoid config for test wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139808 (https://phabricator.wikimedia.org/T361576) (owner: 10Mvolz)
[10:33:46] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet
[10:34:02] <logmsgbot>	 !log tappof@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana1002.eqiad.wmnet
[10:34:06] <icinga-wm>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:35:22] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs4010.ulsfo.wmnet
[10:35:27] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:36:08] <icinga-wm>	 RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 95, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:36:09] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet
[10:37:18] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:37:39] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet
[10:37:44] <wikibugs>	 (03PS1) 10MVernon: Preseed: select manual setup for apus-be[1,2]004 [puppet] - 10https://gerrit.wikimedia.org/r/1139810 (https://phabricator.wikimedia.org/T392844)
[10:38:05] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin depooling P{lvs4009.ulsfo.wmnet} and A:liberica
[10:38:16] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs4009.ulsfo.wmnet} and A:liberica
[10:39:14] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:39:42] <wikibugs>	 (03PS2) 10Mvolz: Change citoid config for test wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139808 (https://phabricator.wikimedia.org/T361576)
[10:40:22] <logmsgbot>	 !log kamila@deploy1003 helmfile [codfw] START helmfile.d/services/mw-cron: apply
[10:40:27] <logmsgbot>	 !log kamila@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-cron: apply
[10:40:36] <logmsgbot>	 !log kamila@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply
[10:40:46] <wikibugs>	 (03CR) 10Hashar: [C:03+1] gerrit: require user for gitiles access [puppet] - 10https://gerrit.wikimedia.org/r/1139798 (https://phabricator.wikimedia.org/T392467) (owner: 10Jelto)
[10:41:02] <logmsgbot>	 !log kamila@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
[10:41:06] <icinga-wm>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:41:33] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:41:35] <wikibugs>	 (03CR) 10Majavah: "I don't think 429 is a good status code for this. What about a 401 (Authentication required) or just a redirect to the login page?" [puppet] - 10https://gerrit.wikimedia.org/r/1139798 (https://phabricator.wikimedia.org/T392467) (owner: 10Jelto)
[10:43:19] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] Preseed: select manual setup for apus-be[1,2]004 [puppet] - 10https://gerrit.wikimedia.org/r/1139810 (https://phabricator.wikimedia.org/T392844) (owner: 10MVernon)
[10:43:44] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet
[10:44:02] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet
[10:44:06] <jinxer-wm>	 FIRING: [7x] ProbeDown: Service ganeti4007:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:44:53] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs4009.ulsfo.wmnet
[10:46:33] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:46:34] <wikibugs>	 (03CR) 10MVernon: [C:03+2] Preseed: select manual setup for apus-be[1,2]004 [puppet] - 10https://gerrit.wikimedia.org/r/1139810 (https://phabricator.wikimedia.org/T392844) (owner: 10MVernon)
[10:47:17] <wikibugs>	 (03PS1) 10Kamila Součková: CampaignEvents: Migrate aggregateparticipantanswers-testwiki [puppet] - 10https://gerrit.wikimedia.org/r/1139811 (https://phabricator.wikimedia.org/T385867)
[10:47:52] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1188 slowly with 10 steps - Pool db1188.eqiad.wmnet in after cloning
[10:47:54] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1188.eqiad.wmnet onto db1246.eqiad.wmnet
[10:48:10] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs4009.ulsfo.wmnet
[10:48:21] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet
[10:49:06] <icinga-wm>	 RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 95, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:49:23] <wikibugs>	 (03PS1) 10Marostegui: db1246: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1139812 (https://phabricator.wikimedia.org/T392874)
[10:50:16] <wikibugs>	 (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1139811 (https://phabricator.wikimedia.org/T385867) (owner: 10Kamila Součková)
[10:51:24] <logmsgbot>	 jmm@cumin2002 drain-node (PID 2196390) is awaiting input
[10:52:09] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db1246: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1139812 (https://phabricator.wikimedia.org/T392874) (owner: 10Marostegui)
[10:52:33] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install apus-be1004 - https://phabricator.wikimedia.org/T392844#10775898 (10MatthewVernon) a:05MatthewVernon→03None (done, although with manual setup as we don't know how the boss card will present the SSDs to the OS)
[10:53:01] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install apus-be2004 - https://phabricator.wikimedia.org/T392845#10775901 (10MatthewVernon) a:05MatthewVernon→03None (done, although with manual setup as we don't know how the boss card will present the SSDs to the OS)
[10:53:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P75613 and previous config saved to /var/cache/conftool/dbconfig/20250429-105304-root.json
[10:53:39] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin depooling P{lvs4008.ulsfo.wmnet} and A:liberica
[10:54:00] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs4008.ulsfo.wmnet} and A:liberica
[10:54:14] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:55:15] <wikibugs>	 06SRE, 10Observability-Metrics: Create a Pyrra template for Istio-based K8s services and apply it to Citoid - https://phabricator.wikimedia.org/T391852#10775923 (10elukey) And the issue is known:  https://github.com/pyrra-dev/pyrra/issues/1465 https://github.com/pyrra-dev/pyrra/issues/1235
[10:55:56] <godog>	 jouncebot: now and next
[10:55:56] <jouncebot>	 For the next 0 hour(s) and 4 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250429T1000)
[10:56:06] <icinga-wm>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:02:41] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet
[11:02:46] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet
[11:03:37] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs4008.ulsfo.wmnet
[11:06:29] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet
[11:06:54] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs4008.ulsfo.wmnet
[11:07:06] <icinga-wm>	 RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 95, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:08:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P75614 and previous config saved to /var/cache/conftool/dbconfig/20250429-110809-root.json
[11:08:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet
[11:09:02] <wikibugs>	 10SRE-swift-storage, 06Commons, 10Thumbor: Error: 429, Too Many Requests - https://phabricator.wikimedia.org/T392348#10775982 (10MatthewVernon) That last URL is an `archive` URL, which I wouldn't generally expect to work (they're for deleted-by-admin objects).
[11:09:18] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet
[11:09:31] <wikibugs>	 (03PS2) 10Jelto: gerrit: require user for gitiles access [puppet] - 10https://gerrit.wikimedia.org/r/1139798 (https://phabricator.wikimedia.org/T392467)
[11:09:36] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet
[11:09:53] <wikibugs>	 (03CR) 10Jelto: "good point, I changed the response to 401 in patchset 2" [puppet] - 10https://gerrit.wikimedia.org/r/1139798 (https://phabricator.wikimedia.org/T392467) (owner: 10Jelto)
[11:10:18] <godog>	 !log bounce prometheus-statsd-exporter on stat1011 - T389344
[11:10:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:10:23] <stashbot>	 T389344: analytics/wmde/scripts Graphite to Prometheus migration - https://phabricator.wikimedia.org/T389344
[11:11:08] <wikibugs>	 (03PS1) 10Majavah: P:mariadb: packages_client: Default to 10.6 on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1139817 (https://phabricator.wikimedia.org/T380073)
[11:12:40] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet
[11:13:42] <jinxer-wm>	 FIRING: [7x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:14:29] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] P:mariadb: packages_client: Default to 10.6 on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1139817 (https://phabricator.wikimedia.org/T380073) (owner: 10Majavah)
[11:14:53] <wikibugs>	 (03CR) 10Majavah: [C:03+2] P:mariadb: packages_client: Default to 10.6 on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1139817 (https://phabricator.wikimedia.org/T380073) (owner: 10Majavah)
[11:16:00] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host krb1002.eqiad.wmnet
[11:16:49] <wikibugs>	 (03PS3) 10AOkoth: miscweb: update os-reports image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138459 (https://phabricator.wikimedia.org/T350794)
[11:16:56] <icinga-wm>	 PROBLEM - ganeti-wconfd running on ganeti4005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[11:17:37] <wikibugs>	 (03CR) 10AOkoth: "Ack. I've updated it. I started this change with a whole different idea in mind." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138459 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth)
[11:21:09] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host krb1002.eqiad.wmnet
[11:21:27] <wikibugs>	 (03CR) 10Arnaudb: gerrit: lower connections to Gitiles from 25 to 4 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1139807 (https://phabricator.wikimedia.org/T392467) (owner: 10Hashar)
[11:23:10] <icinga-wm>	 PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[11:23:10] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[11:23:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P75615 and previous config saved to /var/cache/conftool/dbconfig/20250429-112314-root.json
[11:24:50] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org
[11:26:06] <icinga-wm>	 RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[11:26:08] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[11:28:43] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org
[11:28:58] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org
[11:30:13] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] wmnet: Failover m2-master [dns] - 10https://gerrit.wikimedia.org/r/1139795 (https://phabricator.wikimedia.org/T392806) (owner: 10Marostegui)
[11:30:20] <logmsgbot>	 !log marostegui@dns1006 START - running authdns-update
[11:30:38] <marostegui>	 !log Failover m2 master from dbproxy1023 to dbproxy1025
[11:30:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:02] <wikibugs>	 (03CR) 10Arnaudb: [C:03+1] "minor question, otherwise lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1139806 (https://phabricator.wikimedia.org/T392467) (owner: 10Hashar)
[11:32:50] <logmsgbot>	 !log marostegui@dns1006 END - running authdns-update
[11:32:59] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org
[11:33:16] <wikibugs>	 (03PS1) 10Majavah: hieradata: Add new eqiad1 proxies [puppet] - 10https://gerrit.wikimedia.org/r/1139818 (https://phabricator.wikimedia.org/T379175)
[11:34:30] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet
[11:35:55] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for madalina - https://phabricator.wikimedia.org/T392893 (10Madalina) 03NEW
[11:36:09] <icinga-wm>	 PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[11:36:13] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet
[11:37:07] <icinga-wm>	 RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[11:37:55] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:38:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 30%: Repooling', diff saved to https://phabricator.wikimedia.org/P75616 and previous config saved to /var/cache/conftool/dbconfig/20250429-113820-root.json
[11:38:55] <jinxer-wm>	 FIRING: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:39:27] <moritzm>	 !log installing curl security updates
[11:39:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:41:39] <wikibugs>	 (03CR) 10Arnaudb: [C:03+1] gerrit: require user for gitiles access [puppet] - 10https://gerrit.wikimedia.org/r/1139798 (https://phabricator.wikimedia.org/T392467) (owner: 10Jelto)
[11:42:37] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet
[11:42:42] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet
[11:43:42] <jinxer-wm>	 FIRING: [7x] ProbeDown: Service ganeti4005:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:44:37] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.10 point update - https://phabricator.wikimedia.org/T389034#10776072 (10MoritzMuehlenhoff)
[11:49:26] <suzannewoodWMDE2>	 !log suzannewood@deploy1003:~$ foreachwikiindblist wikidataclient extensions/Wikibase/lib/maintenance/populateSitesTable.php --force-protocol https
[11:49:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:53:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P75617 and previous config saved to /var/cache/conftool/dbconfig/20250429-115325-root.json
[11:53:30] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host titan2001.codfw.wmnet
[11:53:42] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:56:20] <wikibugs>	 (03PS1) 10Majavah: P:wmcs: novaproxy: Add separate keepalived_peers variable [puppet] - 10https://gerrit.wikimedia.org/r/1139821 (https://phabricator.wikimedia.org/T379175)
[11:57:05] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5385/co" [puppet] - 10https://gerrit.wikimedia.org/r/1139821 (https://phabricator.wikimedia.org/T379175) (owner: 10Majavah)
[11:58:00] <wikibugs>	 (03PS2) 10Majavah: P:wmcs: novaproxy: Add separate keepalived_peers variable [puppet] - 10https://gerrit.wikimedia.org/r/1139821 (https://phabricator.wikimedia.org/T379175)
[11:58:48] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5386/co" [puppet] - 10https://gerrit.wikimedia.org/r/1139821 (https://phabricator.wikimedia.org/T379175) (owner: 10Majavah)
[12:00:04] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250429T1200)
[12:01:21] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host titan2001.codfw.wmnet
[12:01:44] <wikibugs>	 (03PS1) 10Slyngshede: Permissions: Add comments from permission managers [software/bitu] - 10https://gerrit.wikimedia.org/r/1139823 (https://phabricator.wikimedia.org/T392682)
[12:23:26] <wikibugs>	 (03PS1) 10Majavah: P:wmcs::novaproxy: Fix keepalived_peers type [puppet] - 10https://gerrit.wikimedia.org/r/1139838
[12:23:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P75619 and previous config saved to /var/cache/conftool/dbconfig/20250429-122335-root.json
[12:23:55] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:24:15] <icinga-wm>	 RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:24:29] <icinga-wm>	 RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:25:35] <wikibugs>	 (03CR) 10Majavah: [C:03+2] P:wmcs::novaproxy: Fix keepalived_peers type [puppet] - 10https://gerrit.wikimedia.org/r/1139838 (owner: 10Majavah)
[12:25:47] <suzannewoodWMDE2>	 !log Finished populateSitesTable for nupwiki (https://phabricator.wikimedia.org/T390715)
[12:25:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:26:11] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host titan1001.eqiad.wmnet
[12:28:13] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet
[12:28:42] <jinxer-wm>	 FIRING: [10x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:29:15] <icinga-wm>	 PROBLEM - BGP status on asw1-b13-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:29:55] <icinga-wm>	 PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:30:55] <icinga-wm>	 RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:30:56] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet
[12:31:15] <icinga-wm>	 RECOVERY - BGP status on asw1-b13-drmrs.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:31:39] <wikibugs>	 (PS1) Btullis: Create an SSH private key in the mediawiki-dumps-legacy namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1139840 (https://phabricator.wikimedia.org/T390738)
[12:31:59] <icinga-wm>	 PROBLEM - Host aux-k8s-etcd2003 is DOWN: PING CRITICAL - Packet loss = 100%
[12:32:35] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:32:37] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:32:54] <wikibugs>	 (PS1) Majavah: P:wmcs::novaproxy: Fix keepalived peer list definition [puppet] - https://gerrit.wikimedia.org/r/1139841
[12:33:13] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C:+1] "<3" [software/bitu] - https://gerrit.wikimedia.org/r/1139823 (https://phabricator.wikimedia.org/T392682) (owner: Slyngshede)
[12:34:19] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:34:39] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:34:53] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host titan1002.eqiad.wmnet
[12:35:35] <wikibugs>	 (CR) Majavah: [C:+2] P:wmcs::novaproxy: Fix keepalived peer list definition [puppet] - https://gerrit.wikimedia.org/r/1139841 (owner: Majavah)
[12:35:39] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:36:10] <jinxer-wm>	 FIRING: [4x] BFDdown: BFD session down between cr1-eqiad and 10.64.48.95 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[12:36:15] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [software/bitu] - https://gerrit.wikimedia.org/r/1139823 (https://phabricator.wikimedia.org/T392682) (owner: Slyngshede)
[12:36:19] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: UP: 21 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:36:25] <icinga-wm>	 RECOVERY - Host aux-k8s-etcd2003 is UP: PING OK - Packet loss = 0%, RTA = 30.47 ms
[12:37:05] <wikibugs>	 (PS1) Lucas Werkmeister (WMDE): Remove config for renaming WikibaseEntitySchema propertyType [extensions/EntitySchema] (wmf/1.44.0-wmf.25) - https://gerrit.wikimedia.org/r/1139842 (https://phabricator.wikimedia.org/T371196)
[12:37:20] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/EntitySchema] (wmf/1.44.0-wmf.25) - https://gerrit.wikimedia.org/r/1139842 (https://phabricator.wikimedia.org/T371196) (owner: Lucas Werkmeister (WMDE))
[12:37:22] <wikibugs>	 ops-magru, DC-Ops, Observability-Metrics, Patch-For-Review, SRE Observability (FY2024/2025-Q3): missing pdu infos for magru - https://phabricator.wikimedia.org/T387231#10776256 (tappof) Actually, they're defined in Puppet like this:  ` # drmrs, single phase PDUs facilities::monitor_pdu_1phase...
[12:37:37] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:37:37] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:37:42] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1134693 (https://phabricator.wikimedia.org/T371196) (owner: Lucas Werkmeister (WMDE))
[12:37:42] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet
[12:37:49] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet
[12:37:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet
[12:38:09] <wikibugs>	 (PS7) Tiziano Fogli: pdu_config_netbox: also fetch older PDUs from netbox [puppet] - https://gerrit.wikimedia.org/r/1135022 (https://phabricator.wikimedia.org/T387231)
[12:38:42] <jinxer-wm>	 FIRING: [9x] ProbeDown: Service ganeti2023:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:38:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P75620 and previous config saved to /var/cache/conftool/dbconfig/20250429-123840-root.json
[12:39:39] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:40:39] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:41:10] <jinxer-wm>	 FIRING: [8x] BFDdown: BFD session down between cr1-eqiad and 10.64.16.20 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[12:42:06] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host titan1002.eqiad.wmnet
[12:43:25] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:43:42] <jinxer-wm>	 FIRING: [9x] ProbeDown: Service ganeti2023:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:43:46] <logmsgbot>	 jmm@cumin2002 drain-node (PID 2314348) is awaiting input
[12:45:23] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:45:23] <icinga-wm>	 PROBLEM - BFD status on cr3-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:45:56] <godog>	 jouncebot: now and next
[12:45:56] <jouncebot>	 For the next 0 hour(s) and 14 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250429T1200)
[12:46:09] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet
[12:46:10] <jinxer-wm>	 FIRING: [10x] BFDdown: BFD session down between cr1-eqiad and 10.64.16.20 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[12:46:23] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:46:23] <icinga-wm>	 RECOVERY - BFD status on cr3-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:46:26] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet
[12:46:54] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet
[12:47:13] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet
[12:48:10] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet
[12:48:25] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:49:10] <wikibugs>	 SRE-tools, Infrastructure-Foundations, Spicerack, Patch-For-Review: [spicerack] python-kafka does not support python 3.12, there's a fix but there has not been any releases since 2020 - https://phabricator.wikimedia.org/T354410#10776315 (elukey) Open→Resolved a:elukey I am tentati...
[12:50:23] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:50:23] <icinga-wm>	 PROBLEM - BFD status on cr3-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:51:10] <jinxer-wm>	 RESOLVED: [10x] BFDdown: BFD session down between cr1-eqiad and 10.64.16.20 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[12:51:45] <wikibugs>	 (PS8) Tiziano Fogli: pdu_config_netbox: also fetch older PDUs from netbox [puppet] - https://gerrit.wikimedia.org/r/1135022 (https://phabricator.wikimedia.org/T387231)
[12:52:23] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:52:23] <icinga-wm>	 RECOVERY - BFD status on cr3-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:53:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job pushgateway in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:53:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P75621 and previous config saved to /var/cache/conftool/dbconfig/20250429-125347-root.json
[12:54:57] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet
[12:55:15] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet
[12:55:20] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet
[12:55:51] <icinga-wm>	 PROBLEM - BGP status on asw1-by27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:55:51] <icinga-wm>	 PROBLEM - BFD status on asw1-by27-esams.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:57:08] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet
[12:57:51] <icinga-wm>	 RECOVERY - BGP status on asw1-by27-esams.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:57:51] <icinga-wm>	 RECOVERY - BFD status on asw1-by27-esams.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:57:53] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet
[12:57:54] <wikibugs>	 (CR) Tiziano Fogli: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1135022 (https://phabricator.wikimedia.org/T387231) (owner: Tiziano Fogli)
[12:58:24] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet
[12:58:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job pushgateway in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:58:51] <jinxer-wm>	 FIRING: [15x] ProbeDown: Service ganeti2024:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250429T1300). Please do the needful.
[13:00:05] <jouncebot>	 Daimona, zip, and Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:12] <Lucas_WMDE>	 o/
[13:00:21] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin depooling P{lvs6003.drmrs.wmnet} and A:liberica
[13:00:23] <Lucas_WMDE>	 my patches are optional btw, if there’s no time I can do them later
[13:00:29] <icinga-wm>	 PROBLEM - BFD status on asw1-bw27-esams.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:00:34] <Daimona>	 o/
[13:00:43] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs6003.drmrs.wmnet} and A:liberica
[13:00:43] <zip>	 is it not 13:00 GMT
[13:01:11] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs6003.drmrs.wmnet
[13:01:15] <zip>	 oh I see
[13:01:15] <icinga-wm>	 PROBLEM - BGP status on asw1-bw27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:01:24] <zip>	 I mis-calendared that, but yes, I'm around!
[13:01:30] <Lucas_WMDE>	 ok ^^
[13:01:48] <Lucas_WMDE>	 I think technically it’s 13:00 GMT but Greenwich is not currently in GMT? or some nonsense like that
[13:02:04] <zip>	 I think I had the idea that this happened at 14:00GMT
[13:02:15] <icinga-wm>	 RECOVERY - BGP status on asw1-bw27-esams.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:02:19] <Lucas_WMDE>	 well, sometimes it does
[13:02:24] <Lucas_WMDE>	 it’s tied to the san francisco time zone
[13:02:29] <icinga-wm>	 RECOVERY - BFD status on asw1-bw27-esams.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:02:32] <Lucas_WMDE>	 so the UTC time jumps around as the US go in and out of daylight savings time
[13:02:49] * Daimona is triggered by people talking about time zones and DST
[13:03:15] <icinga-wm>	 PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:03:45] <Lucas_WMDE>	 I’ll do the changes by Daimona and zip together, should be harmless
[13:03:49] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1138405 (https://phabricator.wikimedia.org/T392240) (owner: Daimona Eaytoy)
[13:03:49] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1139517 (https://phabricator.wikimedia.org/T380909) (owner: Zoe)
[13:03:49] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs6003.drmrs.wmnet
[13:04:15] <icinga-wm>	 RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:04:49] <Lucas_WMDE>	 spiderpig go brrrrrr
[13:05:13] <Daimona>	 :D
[13:06:09] <icinga-wm>	 PROBLEM - BFD status on asw1-b3-magru.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:06:21] <icinga-wm>	 PROBLEM - BGP status on asw1-b3-magru.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:06:28] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet
[13:06:52] <wikibugs>	 (Merged) jenkins-bot: Enable the CampaignEvents extension on 43 more wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1138405 (https://phabricator.wikimedia.org/T392240) (owner: Daimona Eaytoy)
[13:06:55] <wikibugs>	 (Merged) jenkins-bot: Set flow boards readonly on fiwikimedia and gomwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1139517 (https://phabricator.wikimedia.org/T380909) (owner: Zoe)
[13:07:09] <icinga-wm>	 RECOVERY - BFD status on asw1-b3-magru.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:07:21] <icinga-wm>	 RECOVERY - BGP status on asw1-b3-magru.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:07:46] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1138405|Enable the CampaignEvents extension on 43 more wikis (T392240)]], [[gerrit:1139517|Set flow boards readonly on fiwikimedia and gomwiki (T380909)]]
[13:07:52] <stashbot>	 T392240: Release CampaignEvents extension to multiple ESEAP & SA wikis - https://phabricator.wikimedia.org/T392240
[13:07:52] <stashbot>	 T380909: [Config] Set Flow to read-only at all *Phase 2b* wikis - https://phabricator.wikimedia.org/T380909
[13:08:08] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin depooling P{lvs6002.drmrs.wmnet} and A:liberica
[13:08:31] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs6002.drmrs.wmnet} and A:liberica
[13:10:34] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet
[13:10:53] <logmsgbot>	 !log fab@deploy1003 Started deploy [airflow-dags/research@414def7]: (no justification provided)
[13:11:09] <icinga-wm>	 PROBLEM - BFD status on asw1-b4-magru.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:11:15] <icinga-wm>	 PROBLEM - BGP status on asw1-b13-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:11:31] <logmsgbot>	 !log fab@deploy1003 Finished deploy [airflow-dags/research@414def7]: (no justification provided) (duration: 00m 40s)
[13:11:33] <icinga-wm>	 PROBLEM - BGP status on asw1-b4-magru.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:12:09] <icinga-wm>	 RECOVERY - BFD status on asw1-b4-magru.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:12:24] <sukhe>	 !log reprepro include bookworm-wikimedia dnsdist_1.8.2-1+wmf12u2_amd64.changes 
[13:12:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:12:33] <icinga-wm>	 RECOVERY - BGP status on asw1-b4-magru.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:13:04] <wikibugs>	 (CR) CDanis: [C:+1] Fastnetmon: bump threshold_pps to 1.75M [puppet] - https://gerrit.wikimedia.org/r/1139503 (owner: Ayounsi)
[13:14:26] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:14:28] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 daimona, zoe, lucaswerkmeister-wmde: Backport for [[gerrit:1138405|Enable the CampaignEvents extension on 43 more wikis (T392240)]], [[gerrit:1139517|Set flow boards readonly on fiwikimedia and gomwiki (T380909)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:14:31] <wikibugs>	 (PS1) Ssingh: Revert "wikidough: add healthcheck override for doh1001 and doh2002" [puppet] - https://gerrit.wikimedia.org/r/1139849
[13:14:33] <stashbot>	 T392240: Release CampaignEvents extension to multiple ESEAP & SA wikis - https://phabricator.wikimedia.org/T392240
[13:14:34] <stashbot>	 T380909: [Config] Set Flow to read-only at all *Phase 2b* wikis - https://phabricator.wikimedia.org/T380909
[13:14:43] <Lucas_WMDE>	 Daimona, zip: please test on WikimediaDebug :)
[13:14:53] <zip>	 tested
[13:14:54] <zip>	 looking good
[13:14:57] <Lucas_WMDE>	 yay
[13:15:02] <Lucas_WMDE>	 Daimona: I expect you don’t have to test all the wikis ;)
[13:15:17] <wikibugs>	 (CR) Ssingh: "Merging since it's a revert." [puppet] - https://gerrit.wikimedia.org/r/1139849 (owner: Ssingh)
[13:15:18] <wikibugs>	 (CR) Ssingh: [C:+2] Revert "wikidough: add healthcheck override for doh1001 and doh2002" [puppet] - https://gerrit.wikimedia.org/r/1139849 (owner: Ssingh)
[13:15:21] <wikibugs>	 (PS1) Muehlenhoff: Add krb1002 to the list of KDCs presented to Kerberos clients [puppet] - https://gerrit.wikimedia.org/r/1139850 (https://phabricator.wikimedia.org/T390863)
[13:16:09] <icinga-wm>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:16:11] <icinga-wm>	 PROBLEM - BFD status on cr3-ulsfo is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:16:11] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:16:14] <sukhe>	 ^ expected, reboots
[13:16:49] <wikibugs>	 (CR) Ssingh: [C:+2] Revert "P:auth: temporarily skip returning a WARN on check_authdns_state" [puppet] - https://gerrit.wikimedia.org/r/1139529 (owner: Ssingh)
[13:17:11] <icinga-wm>	 RECOVERY - BFD status on cr3-ulsfo is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:17:11] <icinga-wm>	 RECOVERY - BFD status on cr4-ulsfo is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:17:16] <sukhe>	 !log force agent run on A:dnsbox 
[13:17:16] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet
[13:17:17] <wikibugs>	 (PS1) Elukey: k8s: rename V1beta1Eviction to support future upgrades [software/spicerack] - https://gerrit.wikimedia.org/r/1139851 (https://phabricator.wikimedia.org/T390857)
[13:17:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:17:22] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet
[13:17:34] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet
[13:18:07] <icinga-wm>	 RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 95, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:18:10] <jinxer-wm>	 FIRING: [4x] BFDdown: BFD session down between cr3-ulsfo and 10.128.0.6 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[13:18:34] <Daimona>	 Lucas_WMDE: looks good, thanks!
[13:18:42] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 daimona, zoe, lucaswerkmeister-wmde: Continuing with sync
[13:18:42] <jinxer-wm>	 FIRING: [7x] ProbeDown: Service ganeti2025:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:18:44] <Lucas_WMDE>	 great, thanks!
[13:18:58] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1139850 (https://phabricator.wikimedia.org/T390863) (owner: Muehlenhoff)
[13:19:00] <wikibugs>	 (CR) Abijeet Patro: [C:+1] Catalog ContentTranslation tables [puppet] - https://gerrit.wikimedia.org/r/1135730 (https://phabricator.wikimedia.org/T386094) (owner: Nik Gkountas)
[13:19:26] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:19:46] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs6002.drmrs.wmnet
[13:20:45] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: updating - jhancock@cumin2002"
[13:20:51] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: updating - jhancock@cumin2002"
[13:20:51] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:21:09] <icinga-wm>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:21:11] <icinga-wm>	 PROBLEM - BFD status on cr3-ulsfo is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:21:11] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:21:13] <icinga-wm>	 RECOVERY - Wikidough DoH Check -IPv4- on doh2002 is OK: HTTP OK: HTTP/1.1 200 OK - 595 bytes in 0.132 second response time https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check
[13:22:09] <icinga-wm>	 RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 95, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:22:11] <icinga-wm>	 RECOVERY - BFD status on cr3-ulsfo is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:22:11] <icinga-wm>	 RECOVERY - BFD status on cr4-ulsfo is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:22:23] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs6002.drmrs.wmnet
[13:22:46] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate wikifeeds.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[13:22:58] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum
[13:23:09] <icinga-wm>	 PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:23:10] <jinxer-wm>	 FIRING: [8x] BFDdown: BFD session down between cr3-ulsfo and 10.128.0.7 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[13:23:15] <icinga-wm>	 RECOVERY - BGP status on asw1-b13-drmrs.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:23:57] <fabfur>	 !log importing haproxykafka 0.3.8 in bullseye-wikimedia (https://gitlab.wikimedia.org/repos/sre/haproxykafka/-/merge_requests/83)
[13:24:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:24:03] <wikibugs>	 (PS1) Filippo Giunchedi: thanos: enable auto memlimit [puppet] - https://gerrit.wikimedia.org/r/1139852 (https://phabricator.wikimedia.org/T383966)
[13:24:06] <sukhe>	 !log disable puppet on A:durum to progressively roll out CR 1139542
[13:24:07] <icinga-wm>	 RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:24:09] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1155 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[13:24:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:24:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet
[13:25:01] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1156 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[13:25:26] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1138405|Enable the CampaignEvents extension on 43 more wikis (T392240)]], [[gerrit:1139517|Set flow boards readonly on fiwikimedia and gomwiki (T380909)]] (duration: 17m 39s)
[13:25:31] <stashbot>	 T392240: Release CampaignEvents extension to multiple ESEAP & SA wikis - https://phabricator.wikimedia.org/T392240
[13:25:31] <stashbot>	 T380909: [Config] Set Flow to read-only at all *Phase 2b* wikis - https://phabricator.wikimedia.org/T380909
[13:26:42] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[13:28:10] <jinxer-wm>	 RESOLVED: [8x] BFDdown: BFD session down between cr3-ulsfo and 10.128.0.7 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[13:28:50] <wikibugs>	 ops-codfw, SRE, DC-Ops, Data-Platform-SRE (2025-04-12 - 2025-05-02): cirrussearch2078 (R440 Config D, Row/Rack B2) unable to PXE boot - https://phabricator.wikimedia.org/T392644#10776482 (Papaul) p:Triage→Medium
[13:29:25] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:29:33] <wikibugs>	 ops-codfw, SRE, DC-Ops, Data-Platform-SRE (2025-04-12 - 2025-05-02): cirrussearch2078 (R440 Config D, Row/Rack B2) unable to PXE boot - https://phabricator.wikimedia.org/T392644#10776486 (Papaul) I will take a look at it when I am on site. Thank you
[13:29:46] <wikibugs>	 (PS14) Majavah: dynamicproxy: Provision AAAA records [puppet] - https://gerrit.wikimedia.org/r/1088338 (https://phabricator.wikimedia.org/T379175)
[13:29:46] <wikibugs>	 (PS1) Majavah: keepalived: Fix IPv6 support [puppet] - https://gerrit.wikimedia.org/r/1139857 (https://phabricator.wikimedia.org/T379175)
[13:30:22] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin depooling P{lvs6001.drmrs.wmnet} and A:liberica
[13:30:34] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs6001.drmrs.wmnet} and A:liberica
[13:30:56] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/EntitySchema] (wmf/1.44.0-wmf.25) - https://gerrit.wikimedia.org/r/1139842 (https://phabricator.wikimedia.org/T371196) (owner: Lucas Werkmeister (WMDE))
[13:31:24] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2047
[13:31:28] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2048
[13:31:33] <wikibugs>	 (CR) Majavah: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5390/console" [puppet] - https://gerrit.wikimedia.org/r/1139857 (https://phabricator.wikimedia.org/T379175) (owner: Majavah)
[13:31:35] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2047
[13:31:37] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2048
[13:31:50] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet
[13:31:56] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet
[13:32:23] <fabfur>	 !log updated haproxykafka on cp1112 to test version 0.3.8 
[13:32:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:32:43] <wikibugs>	 (Merged) jenkins-bot: Remove config for renaming WikibaseEntitySchema propertyType [extensions/EntitySchema] (wmf/1.44.0-wmf.25) - https://gerrit.wikimedia.org/r/1139842 (https://phabricator.wikimedia.org/T371196) (owner: Lucas Werkmeister (WMDE))
[13:33:07] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1139842|Remove config for renaming WikibaseEntitySchema propertyType (T371196)]]
[13:33:10] <wikibugs>	 (CR) Ssingh: [V:+1] "Thanks for the review :)" [puppet] - https://gerrit.wikimedia.org/r/1139542 (https://phabricator.wikimedia.org/T205378) (owner: Ssingh)
[13:33:10] <wikibugs>	 (CR) Ssingh: [V:+1 C:+2] hiera: durum: set do_ech true for all durum hosts [puppet] - https://gerrit.wikimedia.org/r/1139542 (https://phabricator.wikimedia.org/T205378) (owner: Ssingh)
[13:33:12] <stashbot>	 T371196: The EntitySchema type URI is missing from the Wikibase ontology - https://phabricator.wikimedia.org/T371196
[13:33:15] <icinga-wm>	 PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:33:42] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service ganeti2025:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:34:04] <wikibugs>	 (CR) Btullis: [C:+1] "Nice, thanks." [puppet] - https://gerrit.wikimedia.org/r/1139850 (https://phabricator.wikimedia.org/T390863) (owner: Muehlenhoff)
[13:36:47] <fabfur>	 !log depooling cp1112 to test new haproxykafka version behavior (T387454)
[13:36:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:36:52] <stashbot>	 T387454: Add HAproxy termination field to webrequest - https://phabricator.wikimedia.org/T387454
[13:37:01] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1156 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[13:38:01] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet
[13:38:09] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1155 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[13:38:17] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on durum2002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[13:38:25] <wikibugs>	 (CR) Majavah: [V:+1 C:+2] keepalived: Fix IPv6 support [puppet] - https://gerrit.wikimedia.org/r/1139857 (https://phabricator.wikimedia.org/T379175) (owner: Majavah)
[13:39:17] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on durum2002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[13:39:29] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde: Backport for [[gerrit:1139842|Remove config for renaming WikibaseEntitySchema propertyType (T371196)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:39:33] <stashbot>	 T371196: The EntitySchema type URI is missing from the Wikibase ontology - https://phabricator.wikimedia.org/T371196
[13:39:34] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde: Continuing with sync
[13:39:42] <Lucas_WMDE>	 https://www.wikidata.org/wiki/Special:EntityData/P12861.ttl still looks good on WikimediaDebug
[13:39:47] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs6001.drmrs.wmnet
[13:40:11] <wikibugs>	 (CR) Xcollazo: [C:+1] Enable the dumpsgen user to use an rsync server over ssh from dse-k8s-eqiad [puppet] - https://gerrit.wikimedia.org/r/1139835 (https://phabricator.wikimedia.org/T390738) (owner: Btullis)
[13:40:41] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet
[13:41:15] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet
[13:42:15] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[13:42:24] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs6001.drmrs.wmnet
[13:42:26] <icinga-wm>	 PROBLEM - Host kubestagemaster2005 is DOWN: PING CRITICAL - Packet loss = 100%
[13:42:45] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2048.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[13:43:15] <icinga-wm>	 RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:43:40] <awight>	 ping, quick review https://gerrit.wikimedia.org/r/c/operations/puppet/+/1139434 if possible?
[13:43:42] <jinxer-wm>	 FIRING: [9x] ProbeDown: Service ganeti2026:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:44:45] <fabfur>	 !log [correcting] cp1112 has NOT been depooled (T387454)
[13:44:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:44:50] <stashbot>	 T387454: Add HAproxy termination field to webrequest - https://phabricator.wikimedia.org/T387454
[13:45:15] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet
[13:45:39] <icinga-wm>	 RECOVERY - Host kubestagemaster2005 is UP: PING OK - Packet loss = 0%, RTA = 30.55 ms
[13:46:12] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1139842|Remove config for renaming WikibaseEntitySchema propertyType (T371196)]] (duration: 13m 04s)
[13:46:16] <stashbot>	 T371196: The EntitySchema type URI is missing from the Wikibase ontology - https://phabricator.wikimedia.org/T371196
[13:46:57] <jinxer-wm>	 FIRING: KubernetesCalicoDown: kubestagemaster2005.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestagemaster2005.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[13:47:20] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin depooling P{lvs3010.esams.wmnet} and A:liberica
[13:47:23] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet
[13:47:29] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet
[13:47:36] <wikibugs>	 (CR) Filippo Giunchedi: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5391/co" [puppet] - https://gerrit.wikimedia.org/r/1139852 (https://phabricator.wikimedia.org/T383966) (owner: Filippo Giunchedi)
[13:47:41] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs3010.esams.wmnet} and A:liberica
[13:47:47] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[13:47:56] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs3010.esams.wmnet
[13:48:14] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2048.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[13:48:36] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus1006.eqiad.wmnet
[13:48:42] <jinxer-wm>	 FIRING: [9x] ProbeDown: Service ganeti2027:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:48:43] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus2006.codfw.wmnet
[13:48:48] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus2008.codfw.wmnet
[13:48:53] <wikibugs>	 (CR) Jelto: [C:+2] gerrit: require user for gitiles access [puppet] - https://gerrit.wikimedia.org/r/1139798 (https://phabricator.wikimedia.org/T392467) (owner: Jelto)
[13:48:59] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host cirrussearch2078.codfw.wmnet
[13:49:07] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus1008.eqiad.wmnet
[13:49:09] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet
[13:49:15] <Lucas_WMDE>	 jouncebot: next
[13:49:15] <jouncebot>	 In 1 hour(s) and 10 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250429T1500)
[13:49:20] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C:+1] "Unused in wmf.25 and wmf.27:" [mediawiki-config] - https://gerrit.wikimedia.org/r/1134693 (https://phabricator.wikimedia.org/T371196) (owner: Lucas Werkmeister (WMDE))
[13:49:30] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1134693 (https://phabricator.wikimedia.org/T371196) (owner: Lucas Werkmeister (WMDE))
[13:49:38] <Lucas_WMDE>	 I might slightly overrun the window depending on how long ^ takes
[13:50:15] <icinga-wm>	 PROBLEM - BGP status on asw1-bw27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:51:13] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs3010.esams.wmnet
[13:51:15] <icinga-wm>	 RECOVERY - BGP status on asw1-bw27-esams.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:51:37] <wikibugs>	 (Merged) jenkins-bot: Remove unused EntitySchema config [mediawiki-config] - https://gerrit.wikimedia.org/r/1134693 (https://phabricator.wikimedia.org/T371196) (owner: Lucas Werkmeister (WMDE))
[13:51:57] <moritzm>	 !log installing libcap2 security updates
[13:51:57] <jinxer-wm>	 RESOLVED: KubernetesCalicoDown: kubestagemaster2005.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestagemaster2005.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[13:52:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:52:02] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1134693|Remove unused EntitySchema config (T371196)]]
[13:52:03] <logmsgbot>	 pt1979@cumin2002 dhcp (PID 2388488) is awaiting input
[13:52:06] <stashbot>	 T371196: The EntitySchema type URI is missing from the Wikibase ontology - https://phabricator.wikimedia.org/T371196
[13:53:21] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin depooling P{lvs3009.esams.wmnet} and A:liberica
[13:53:43] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs3009.esams.wmnet} and A:liberica
[13:54:37] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:54:37] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:55:16] <logmsgbot>	 jmm@cumin2002 drain-node (PID 2388691) is awaiting input
[13:55:45] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host cirrussearch2078.codfw.wmnet
[13:55:50] <fabfur>	 !log upgrading haproxykafka on A:cp (T387454)
[13:55:51] <icinga-wm>	 PROBLEM - BGP status on asw1-by27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:55:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet
[13:55:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:55] <stashbot>	 T387454: Add HAproxy termination field to webrequest - https://phabricator.wikimedia.org/T387454
[13:56:08] <awight>	 Amir1: if you have a minute, https://gerrit.wikimedia.org/r/c/operations/puppet/+/1139434
[13:56:14] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1006.eqiad.wmnet
[13:57:01] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2008.codfw.wmnet
[13:57:09] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1008.eqiad.wmnet
[13:58:25] <wikibugs>	 (PS1) Elukey: admin_ng: Update Knative on ml-serve-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1139865 (https://phabricator.wikimedia.org/T369493)
[13:58:29] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde: Backport for [[gerrit:1134693|Remove unused EntitySchema config (T371196)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:58:34] <stashbot>	 T371196: The EntitySchema type URI is missing from the Wikibase ontology - https://phabricator.wikimedia.org/T371196
[13:58:37] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:58:37] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:58:42] <jinxer-wm>	 FIRING: [15x] ProbeDown: Service ganeti2027:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:58:47] <wikibugs>	 ops-codfw, SRE, DBA, DC-Ops: Degraded RAID on db2176 - https://phabricator.wikimedia.org/T392876#10776579 (Jhancock.wm) @Marostegui we caught this one right before it went out of warranty. I put in for a new drive with dell. should be here tomorrow. But I have on hands from decommed servers if yo...
[13:58:53] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde: Continuing with sync
[13:58:55] <Lucas_WMDE>	 still works
[13:59:24] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2078.codfw.wmnet with OS bullseye
[13:59:24] <wikibugs>	 ops-codfw, SRE, DBA, DC-Ops: Degraded RAID on db2176 - https://phabricator.wikimedia.org/T392876#10776580 (Marostegui) It is fine to wait till tomorrow - no worries!.
[13:59:29] <wikibugs>	 (CR) CI reject: [V:-1] admin_ng: Update Knative on ml-serve-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1139865 (https://phabricator.wikimedia.org/T369493) (owner: Elukey)
[14:00:05] <wikibugs>	 (PS2) Elukey: admin_ng: Update Knative on ml-serve-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1139865 (https://phabricator.wikimedia.org/T369493)
[14:00:17] <wikibugs>	 ops-codfw, SRE, DBA, DC-Ops: Degraded RAID on db2176 - https://phabricator.wikimedia.org/T392876#10776582 (Jhancock.wm) can do.  Dell Service Request: 209215325
[14:00:40] <logmsgbot>	 !log tappof@cumin1002 START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet
[14:01:08] <wikibugs>	 (CR) CI reject: [V:-1] admin_ng: Update Knative on ml-serve-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1139865 (https://phabricator.wikimedia.org/T369493) (owner: Elukey)
[14:01:09] <fabfur>	 !log haproxykafka upgraded and restarted on A:cp (T387454)
[14:01:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:01:14] <stashbot>	 T387454: Add HAproxy termination field to webrequest - https://phabricator.wikimedia.org/T387454
[14:01:47] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bookworm 12.10 point update - https://phabricator.wikimedia.org/T389034#10776590 (MoritzMuehlenhoff)
[14:02:24] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2047.codfw.wmnet with OS bookworm
[14:02:28] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs3009.esams.wmnet
[14:02:30] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10776592 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2047.codfw.wmnet with OS bookworm
[14:02:32] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet
[14:02:38] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet
[14:02:46] <elukey>	 OpenURI::HTTPError: 401 Unauthorized - not great from CI :(
[14:03:09] <icinga-wm>	 PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:03:09] <icinga-wm>	 PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:03:11] <wikibugs>	 (CR) Elukey: "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/1139865 (https://phabricator.wikimedia.org/T369493) (owner: Elukey)
[14:03:25] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:03:42] <jinxer-wm>	 FIRING: [16x] ProbeDown: Service ganeti2027:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:03:44] <wikibugs>	 (PS3) Elukey: admin_ng: Update Knative on ml-serve-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1139865 (https://phabricator.wikimedia.org/T369493)
[14:03:45] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2006.codfw.wmnet
[14:04:06] <jinxer-wm>	 FIRING: [18x] ProbeDown: Service ganeti2027:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:05:04] <wikibugs>	 (CR) CI reject: [V:-1] admin_ng: Update Knative on ml-serve-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1139865 (https://phabricator.wikimedia.org/T369493) (owner: Elukey)
[14:05:24] <wikibugs>	 ops-codfw, SRE, DC-Ops, Data-Platform-SRE (2025-04-12 - 2025-05-02): cirrussearch2078 (R440 Config D, Row/Rack B2) unable to PXE boot - https://phabricator.wikimedia.org/T392644#10776602 (bking) Open→Resolved Per IRC conversation with @Papaul , he was able to get PXE booting to work w...
[14:05:43] <wikibugs>	 ops-codfw, DC-Ops, fundraising-tech-ops: Possible frdb2004 hardware failure. - https://phabricator.wikimedia.org/T392579#10776610 (Jgreen)
[14:05:45] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs3009.esams.wmnet
[14:05:45] <icinga-wm>	 PROBLEM - Host lvs3009 is DOWN: PING CRITICAL - Packet loss = 100%
[14:05:51] <icinga-wm>	 RECOVERY - BGP status on asw1-by27-esams.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:05:59] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1134693|Remove unused EntitySchema config (T371196)]] (duration: 13m 57s)
[14:06:04] <stashbot>	 T371196: The EntitySchema type URI is missing from the Wikibase ontology - https://phabricator.wikimedia.org/T371196
[14:06:10] <jinxer-wm>	 FIRING: [2x] BFDdown: BFD session down between cr2-eqsin and 10.132.0.10 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[14:06:32] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[14:06:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:37] <icinga-wm>	 RECOVERY - Host lvs3009 is UP: PING OK - Packet loss = 0%, RTA = 80.22 ms
[14:07:09] <icinga-wm>	 RECOVERY - BFD status on cr2-codfw is OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:07:09] <icinga-wm>	 RECOVERY - BFD status on cr1-codfw is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:07:58] <logmsgbot>	 !log tappof@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet
[14:08:27] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:08:37] <wikibugs>	 (CR) Slyngshede: [C:+2] Permissions: Add comments from permission managers [software/bitu] - https://gerrit.wikimedia.org/r/1139823 (https://phabricator.wikimedia.org/T392682) (owner: Slyngshede)
[14:08:42] <jinxer-wm>	 FIRING: [17x] ProbeDown: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:10:23] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:10:23] <icinga-wm>	 PROBLEM - BFD status on cr3-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:11:10] <jinxer-wm>	 RESOLVED: [2x] BFDdown: BFD session down between cr2-eqsin and 10.132.0.10 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[14:11:23] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:11:23] <icinga-wm>	 RECOVERY - BFD status on cr3-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:11:27] <wikibugs>	 (PS9) Tiziano Fogli: pdu_config_netbox: also fetch older PDUs from netbox [puppet] - https://gerrit.wikimedia.org/r/1135022 (https://phabricator.wikimedia.org/T387866)
[14:11:57] <wikibugs>	 (Merged) jenkins-bot: Permissions: Add comments from permission managers [software/bitu] - https://gerrit.wikimedia.org/r/1139823 (https://phabricator.wikimedia.org/T392682) (owner: Slyngshede)
[14:12:21] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin depooling P{lvs3008.esams.wmnet} and A:liberica
[14:12:42] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs3008.esams.wmnet} and A:liberica
[14:13:31] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2078.codfw.wmnet with reason: host reimage
[14:13:35] <wikibugs>	 ops-codfw, DC-Ops, fundraising-tech-ops: Possible frdb2004 hardware failure. - https://phabricator.wikimedia.org/T392579#10776664 (Papaul) a:Jhancock.wm @Jhancock.wm when you have a minute can you please check this host. Can you also please upgrade the CPLD.  Thank you
[14:14:34] <wikibugs>	 (PS2) AOkoth: wmnet: change active aphlict host [dns] - https://gerrit.wikimedia.org/r/1139546 (https://phabricator.wikimedia.org/T392128)
[14:15:15] <icinga-wm>	 PROBLEM - BGP status on asw1-bw27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:15:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1153:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1153 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[14:15:47] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reboot-single for host gitlab-runner1003.eqiad.wmnet
[14:16:06] <wikibugs>	 ops-eqiad, SRE, collaboration-services, DC-Ops: eno1 on gitlab-runner1003:9100 has the wrong speed: 1.25e+07. - https://phabricator.wikimedia.org/T392585#10776676 (ops-monitoring-bot) Host rebooted by jelto@cumin1002 with reason: eno1 has the wrong speed
[14:16:27] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2078.codfw.wmnet with reason: host reimage
[14:18:07] <wikibugs>	 (CR) Filippo Giunchedi: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5392/co" [puppet] - https://gerrit.wikimedia.org/r/1139852 (https://phabricator.wikimedia.org/T383966) (owner: Filippo Giunchedi)
[14:19:20] <godog>	 jouncebot: now and next
[14:19:20] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 40 minute(s)
[14:19:35] <godog>	 alright I'll reboot a bunch of prometheus hosts in pops
[14:19:41] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs3008.esams.wmnet
[14:19:45] <logmsgbot>	 !log tappof@cumin1002 START - Cookbook sre.hosts.reboot-single for host centrallog1002.eqiad.wmnet
[14:20:00] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus3003.esams.wmnet
[14:20:22] <wikibugs>	 (PS4) Máté Szabó: Unify IPInfo access levels [mediawiki-config] - https://gerrit.wikimedia.org/r/1081370 (https://phabricator.wikimedia.org/T375086)
[14:20:36] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:20:38] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:22:18] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:22:18] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:22:23] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab-runner1003.eqiad.wmnet
[14:22:38] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:22:58] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs3008.esams.wmnet
[14:23:14] <icinga-wm>	 RECOVERY - BGP status on asw1-bw27-esams.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:24:10] <jinxer-wm>	 FIRING: [2x] BFDdown: BFD session down between cr1-eqiad and 10.64.16.86 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[14:26:11] <wikibugs>	 (CR) Ladsgroup: [C:-1] Catalog ContentTranslation tables (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1135730 (https://phabricator.wikimedia.org/T386094) (owner: Nik Gkountas)
[14:26:18] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:26:18] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: UP: 21 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:26:38] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:26:56] <wikibugs>	 (03CR) 10Awight: [C:03+1] "ping: would be fantastic to have this reenabled now." [puppet] - 10https://gerrit.wikimedia.org/r/1139434 (owner: 10Awight)
[14:27:13] <logmsgbot>	 !log tappof@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog1002.eqiad.wmnet
[14:27:40] <wikibugs>	 (03CR) 10Herron: [C:03+1] thanos: enable auto memlimit [puppet] - 10https://gerrit.wikimedia.org/r/1139852 (https://phabricator.wikimedia.org/T383966) (owner: 10Filippo Giunchedi)
[14:28:42] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:29:03] <wikibugs>	 (03CR) 10Ladsgroup: "Hi, I've got your ping multiple times now. Adding back ssh key is much less straightforward than removing it. I need to confirm the identity" [puppet] - 10https://gerrit.wikimedia.org/r/1139434 (owner: 10Awight)
[14:29:06] <wikibugs>	 (03CR) 10Elukey: [C:03+1] Add krb1002 to the list of KDCs presented to Kerberos clients [puppet] - 10https://gerrit.wikimedia.org/r/1139850 (https://phabricator.wikimedia.org/T390863) (owner: 10Muehlenhoff)
[14:29:10] <jinxer-wm>	 RESOLVED: [2x] BFDdown: BFD session down between cr1-eqiad and 10.64.16.86 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[14:30:13] <wikibugs>	 (03CR) 10Elukey: "Ok to merge?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135402 (https://phabricator.wikimedia.org/T391457) (owner: 10Elukey)
[14:30:23] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2106 to cirrussearch2106
[14:30:45] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[14:32:56] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus3003.esams.wmnet
[14:32:57] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs1016.eqiad.wmnet
[14:35:27] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:35:29] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2078.codfw.wmnet with OS bullseye
[14:35:57] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus4002.ulsfo.wmnet
[14:36:19] <logmsgbot>	 bking@cumin2002 rename (PID 2431696) is awaiting input
[14:36:38] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus5002.eqsin.wmnet
[14:38:33] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1016.eqiad.wmnet
[14:38:48] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus3003.esams.wmnet
[14:39:09] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus7001.magru.wmnet
[14:39:11] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus6002.drmrs.wmnet
[14:40:13] <wikibugs>	 10ops-codfw, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-be200[6-9] - https://phabricator.wikimedia.org/T392908 (10RobH) 03NEW
[14:40:29] <wikibugs>	 10ops-codfw, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-be200[6-9] - https://phabricator.wikimedia.org/T392908#10776761 (10RobH)
[14:40:32] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2106 to cirrussearch2106 - bking@cumin2002"
[14:41:06] <wikibugs>	 10ops-codfw, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-be200[6-9] - https://phabricator.wikimedia.org/T392908#10776762 (10RobH) a:03MatthewVernon @matthewvernon,  Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet u...
[14:41:55] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs1015.eqiad.wmnet
[14:42:17] <wikibugs>	 10ops-eqiad, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-be100[6-9] - https://phabricator.wikimedia.org/T392909 (10RobH) 03NEW
[14:42:21] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus4002.ulsfo.wmnet
[14:42:30] <wikibugs>	 (03PS1) 10Ssingh: P:durum and hiera: update health check path [puppet] - 10https://gerrit.wikimedia.org/r/1139873
[14:42:35] <wikibugs>	 10ops-eqiad, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-be100[6-9] - https://phabricator.wikimedia.org/T392909#10776789 (10RobH)
[14:42:43] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus5002.eqsin.wmnet
[14:42:56] <wikibugs>	 10ops-eqiad, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-be100[6-9] - https://phabricator.wikimedia.org/T392909#10776791 (10RobH) a:03MatthewVernon
[14:43:07] <wikibugs>	 10ops-eqiad, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-be100[6-9] - https://phabricator.wikimedia.org/T392909#10776794 (10RobH) @MatthewVernon,  Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the sub-...
[14:43:34] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5393/co" [puppet] - 10https://gerrit.wikimedia.org/r/1139873 (owner: 10Ssingh)
[14:43:37] <logmsgbot>	 bking@cumin2002 rename (PID 2431696) is awaiting input
[14:43:42] <jinxer-wm>	 FIRING: [9x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:43:42] <wikibugs>	 (03PS3) 10Ssingh: wikimedia-dns.org: add TYPE65 records for check.wikimedia-dns.org [dns] - 10https://gerrit.wikimedia.org/r/1137021 (https://phabricator.wikimedia.org/T205378)
[14:45:13] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus7001.magru.wmnet
[14:45:14] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "sukhe@durum2002:~$ /usr/lib/nagios/plugins/check_http -H yesdoh.check.wikimedia-dns.org --ssl --sni -I 185.71.138.140 -u /check -t 1 && /u" [puppet] - 10https://gerrit.wikimedia.org/r/1139873 (owner: 10Ssingh)
[14:45:14] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus6002.drmrs.wmnet
[14:46:14] <wikibugs>	 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: eno1 on gitlab-runner1003:9100 has the wrong speed: 1.25e+07. - https://phabricator.wikimedia.org/T392585#10776805 (10Jelto) 05Open→03Resolved This issue resolved after a reboot. The alert is gone. I'll resolve the task optimistically.
[14:46:28] <wikibugs>	 (03CR) 10Elukey: "Ack I'll try! At the moment I have trouble cherry-picking, I see some conflicts with run_ci_locally.sh :(" [puppet] - 10https://gerrit.wikimedia.org/r/1135115 (owner: 10JHathaway)
[14:47:07] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2106 to cirrussearch2106 - bking@cumin2002"
[14:47:08] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:47:08] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2106
[14:47:19] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2106
[14:47:28] <wikibugs>	 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management: "A non-identical file already exists" - Cannot undelete [[File:Hawkmoth (Meganoton nyctiphanes) (8688240817).jpg]] - https://phabricator.wikimedia.org/T392658#10776816 (10MatthewVernon) This image exists in both swift clusters, dating back to 2021...
[14:47:31] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1015.eqiad.wmnet
[14:48:00] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2106 to cirrussearch2106
[14:49:15] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2106.codfw.wmnet on all recursors
[14:49:18] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2106.codfw.wmnet on all recursors
[14:50:12] <wikibugs>	 (03PS1) 10Jelto: Revert "gerrit: require user for gitiles access" [puppet] - 10https://gerrit.wikimedia.org/r/1139874 (https://phabricator.wikimedia.org/T392467)
[14:50:24] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: hw troubleshooting: disk failure (sdb) on coludcephmon1004 - https://phabricator.wikimedia.org/T392458#10776834 (10Jclark-ctr) Confirmed: Service Request 209219050 was successfully submitted.
[14:50:30] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs1014.eqiad.wmnet
[14:50:36] <wikibugs>	 (03CR) 10BCornwall: [C:03+1] wikimedia-dns.org: add TYPE65 records for check.wikimedia-dns.org [dns] - 10https://gerrit.wikimedia.org/r/1137021 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[14:51:28] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T392428#10776838 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr
[14:52:42] <wikibugs>	 (03CR) 10Jelto: [C:03+2] Revert "gerrit: require user for gitiles access" [puppet] - 10https://gerrit.wikimedia.org/r/1139874 (https://phabricator.wikimedia.org/T392467) (owner: 10Jelto)
[14:52:58] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T392427#10776845 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr
[14:53:20] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2106.codfw.wmnet with OS bullseye
[14:53:20] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance
[14:53:28] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2152 (T392806)', diff saved to https://phabricator.wikimedia.org/P75622 and previous config saved to /var/cache/conftool/dbconfig/20250429-145327-fceratto.json
[14:53:30] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10776852 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr
[14:53:32] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2106
[14:54:38] <wikibugs>	 (03PS1) 10Majavah: keepalived: failover: Select unicast source v6 more reliably [puppet] - 10https://gerrit.wikimedia.org/r/1139877 (https://phabricator.wikimedia.org/T379175)
[14:54:39] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[14:55:10] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for madalina - https://phabricator.wikimedia.org/T392893#10776857 (10tappof)
[14:55:11] <wikibugs>	 (03CR) 10Bking: [C:03+1] Create an SSH private key in the mediawiki-dumps-legacy namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1139840 (https://phabricator.wikimedia.org/T390738) (owner: 10Btullis)
[14:55:14] <icinga-wm>	 PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:56:17] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1014.eqiad.wmnet
[14:57:10] <icinga-wm>	 RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:57:27] <wikibugs>	 (03CR) 10Bking: [C:03+1] Enable the dumpsgen user to use an rsync server over ssh from dse-k8s-eqiad (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1139835 (https://phabricator.wikimedia.org/T390738) (owner: 10Btullis)
[14:57:46] <wikibugs>	 (03PS4) 10Ssingh: wikimedia-dns.org: add TYPE65 records for check.wikimedia-dns.org [dns] - 10https://gerrit.wikimedia.org/r/1137021 (https://phabricator.wikimedia.org/T205378)
[14:57:52] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] Enable the dumpsgen user to use an rsync server over ssh from dse-k8s-eqiad (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1139835 (https://phabricator.wikimedia.org/T390738) (owner: 10Btullis)
[14:58:15] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] Create an SSH private key in the mediawiki-dumps-legacy namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1139840 (https://phabricator.wikimedia.org/T390738) (owner: 10Btullis)
[14:58:31] <wikibugs>	 (03CR) 10Majavah: [V:03+1 C:03+2] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5394/console" [puppet] - 10https://gerrit.wikimedia.org/r/1139857 (https://phabricator.wikimedia.org/T379175) (owner: 10Majavah)
[14:58:49] <wikibugs>	 (03CR) 10JHathaway: [C:03+1] puppetdb: add tunable for maximum-pool-size [puppet] - 10https://gerrit.wikimedia.org/r/1139481 (owner: 10Filippo Giunchedi)
[14:58:55] <wikibugs>	 (03CR) 10Ssingh: "Generated with:" [dns] - 10https://gerrit.wikimedia.org/r/1137021 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[15:00:04] <jouncebot>	 jelto, arnoldokoth, and mutante: That opportune time for a SRE Collaboration Services office hours deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250429T1500).
[15:00:11] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T392806)', diff saved to https://phabricator.wikimedia.org/P75623 and previous config saved to /var/cache/conftool/dbconfig/20250429-150011-fceratto.json
[15:00:12] <logmsgbot>	 bking@cumin2002 reimage (PID 2453344) is awaiting input
[15:02:50] <wikibugs>	 (03CR) 10Elukey: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1139865 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey)
[15:03:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet
[15:05:45] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2106 - bking@cumin2002"
[15:05:49] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1139835 (https://phabricator.wikimedia.org/T390738) (owner: 10Btullis)
[15:05:50] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2106 - bking@cumin2002"
[15:05:51] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:05:51] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2106.codfw.wmnet 88.48.192.10.in-addr.arpa 8.8.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[15:05:55] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2106.codfw.wmnet 88.48.192.10.in-addr.arpa 8.8.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[15:05:55] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2106
[15:06:19] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2106
[15:06:19] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2106
[15:07:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:09:07] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet
[15:10:44] <wikibugs>	 (03PS4) 10Elukey: admin_ng: Update Knative on ml-serve-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1139865 (https://phabricator.wikimedia.org/T369493)
[15:10:51] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5395/co" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1139852 (https://phabricator.wikimedia.org/T383966) (owner: 10Filippo Giunchedi)
[15:11:13] <wikibugs>	 (03CR) 10Majavah: [V:03+1 C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1139852 (https://phabricator.wikimedia.org/T383966) (owner: 10Filippo Giunchedi)
[15:11:47] <wikibugs>	 (03CR) 10Dreamy Jazz: Unify IPInfo access levels (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081370 (https://phabricator.wikimedia.org/T375086) (owner: 10Máté Szabó)
[15:13:42] <jinxer-wm>	 FIRING: [7x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:15:19] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P75624 and previous config saved to /var/cache/conftool/dbconfig/20250429-151518-fceratto.json
[15:15:46] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet
[15:15:47] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service ml-serve-ctrl2001:6443 has failed probes (http_ml_serve_codfw_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ml-serve-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:15:54] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet
[15:15:58] <sukhe>	 !incidents
[15:15:58] <sirenbot>	 6068 (UNACKED)  [2x] ProbeDown sre (ml-serve-ctrl2001:6443 probes/custom codfw)
[15:16:00] <sukhe>	 !ack 6068
[15:16:01] <sirenbot>	 6068 (ACKED)  [2x] ProbeDown sre (ml-serve-ctrl2001:6443 probes/custom codfw)
[15:16:22] <sukhe>	 klausman: is this you?
[15:16:37] <sukhe>	 (sorry, going by SAL and a possibly related change for ml-lab1001?)
[15:18:24] <sukhe>	 elukey: maybe you as well :) I am really not sure what to do here
[15:18:42] <jinxer-wm>	 FIRING: [7x] ProbeDown: Service ganeti2029:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:19:07] <sukhe>	 the host seems up though
[15:19:07] <elukey>	 sukhe: o/ in theory no, I see that the kube-apiserver reloaded a while ago, I think it is due to a TLS cert reload
[15:19:20] <elukey>	 but I bumped vcores and memory to prevent this :D
[15:19:30] <sukhe>	 elukey: hmm I see but a cert reload can cause a probe failure?
[15:19:30] <elukey>	 (not now, some days ago)
[15:19:47] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet
[15:20:13] <sukhe>	 weirdly, a resolve has not come in
[15:20:38] <elukey>	 sukhe: yes I know it is sad, but the kube-apiserver needs to be restarted and in the ML case it may be busy doing multiple things while booting, not replying to health checks
[15:20:47] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service ml-serve-ctrl2001:6443 has failed probes (http_ml_serve_codfw_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ml-serve-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:20:50] <sukhe>	 ah no worries, was trying to understand it
[15:20:53] <sukhe>	 ok resolve came in :)
[15:20:56] <sukhe>	 thanks elukey <3
[15:21:31] <elukey>	 np! It is weird that from https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=ml-serve-ctrl2001&var-datasource=thanos&var-cluster=ml_serve I don't see the server under pressure
[15:21:46] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2108 to cirrussearch2108
[15:21:57] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[15:22:29] <sukhe>	 yeah nothing else stands out on the server itself as well, except a smallish spike on network utilization?
[15:22:33] <sukhe>	 but surely that can't be it
[15:22:58] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] Enable the dumpsgen user to use an rsync server over ssh from dse-k8s-eqiad (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1139835 (https://phabricator.wikimedia.org/T390738) (owner: 10Btullis)
[15:23:00] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2106.codfw.wmnet with reason: host reimage
[15:23:46] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet
[15:24:37] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.10 point update - https://phabricator.wikimedia.org/T389034#10776944 (10MoritzMuehlenhoff)
[15:26:13] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2108 to cirrussearch2108 - bking@cumin2002"
[15:26:35] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1139877 (https://phabricator.wikimedia.org/T379175) (owner: 10Majavah)
[15:26:36] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2106.codfw.wmnet with reason: host reimage
[15:26:39] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2108 to cirrussearch2108 - bking@cumin2002"
[15:26:39] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:26:40] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2108
[15:27:07] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2108
[15:27:47] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2108 to cirrussearch2108
[15:28:29] <elukey>	 sukhe: I think I can confirm, kube-publish-sa-cert.service ran 15 mins ago sigh
[15:28:33] <elukey>	 timing matches perfectly
[15:28:35] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2108.codfw.wmnet with OS bullseye
[15:28:38] <sukhe>	 ok thanks elukey 
[15:28:46] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2108
[15:28:47] <sukhe>	 that's good to know at least, that there is a cause
[15:29:24] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[15:30:26] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P75625 and previous config saved to /var/cache/conftool/dbconfig/20250429-153026-fceratto.json
[15:30:30] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1201 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:31:44] <jinxer-wm>	 FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[15:32:39] <wikibugs>	 (03PS1) 10David Caro: dcaro: add yubikey ssh key [puppet] - 10https://gerrit.wikimedia.org/r/1139887
[15:33:30] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2108 - bking@cumin2002"
[15:33:36] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2108 - bking@cumin2002"
[15:33:36] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:33:37] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2108.codfw.wmnet 90.48.192.10.in-addr.arpa 0.9.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[15:33:40] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2108.codfw.wmnet 90.48.192.10.in-addr.arpa 0.9.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[15:33:42] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2108
[15:34:07] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2108
[15:34:07] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2108
[15:35:40] <jinxer-wm>	 RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1153:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1153 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[15:36:44] <jinxer-wm>	 RESOLVED: RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[15:37:17] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for madalina - https://phabricator.wikimedia.org/T392893#10777021 (10tappof)
[15:37:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:37:55] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for madalina - https://phabricator.wikimedia.org/T392893#10777022 (10tappof)
[15:37:58] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Swap RAID controller on ms-be1091.eqiad.wmnet - https://phabricator.wikimedia.org/T391854#10777023 (10elukey) Today John helped me test the hot-swap behavior, and everything seems working way more nicely.  1) John swapped one...
[15:38:08] <wikibugs>	 (03PS1) 10Ebernhardson: Revert "Revert "Update opensearch-madvise call for version 0.2"" [puppet] - 10https://gerrit.wikimedia.org/r/1139888 (https://phabricator.wikimedia.org/T390592)
[15:38:24] <wikibugs>	 (03PS2) 10Ebernhardson: Revert^2 "Update opensearch-madvise call for version 0.2" [puppet] - 10https://gerrit.wikimedia.org/r/1139888 (https://phabricator.wikimedia.org/T390592)
[15:38:47] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Revert^2 "Update opensearch-madvise call for version 0.2" [puppet] - 10https://gerrit.wikimedia.org/r/1139888 (https://phabricator.wikimedia.org/T390592) (owner: 10Ebernhardson)
[15:41:36] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138921 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle)
[15:42:47] <wikibugs>	 (03Merged) 10jenkins-bot: missing.php: Simplify code to reduce abstraction and duplication [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138921 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle)
[15:43:13] <logmsgbot>	 !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1138921|missing.php: Simplify code to reduce abstraction and duplication (T113114)]]
[15:43:18] <stashbot>	 T113114: Make all wiki-facing error pages consistent - https://phabricator.wikimedia.org/T113114
[15:43:47] <wikibugs>	 (03PS3) 10Btullis: Enable the dumpsgen user to use an rsync server over ssh from dse-k8s-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1139835 (https://phabricator.wikimedia.org/T390738)
[15:44:37] <wikibugs>	 (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5396/co" [puppet] - 10https://gerrit.wikimedia.org/r/1139835 (https://phabricator.wikimedia.org/T390738) (owner: 10Btullis)
[15:45:11] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM. Confirmed identity via videocall." [puppet] - 10https://gerrit.wikimedia.org/r/1139887 (owner: 10David Caro)
[15:45:33] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T392806)', diff saved to https://phabricator.wikimedia.org/P75626 and previous config saved to /var/cache/conftool/dbconfig/20250429-154533-fceratto.json
[15:45:52] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance
[15:46:00] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2154 (T392806)', diff saved to https://phabricator.wikimedia.org/P75627 and previous config saved to /var/cache/conftool/dbconfig/20250429-154559-fceratto.json
[15:48:16] <wikibugs>	 (03PS1) 10BCornwall: slo_template: update SLO dates to current window [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1139891
[15:49:55] <logmsgbot>	 !log krinkle@deploy1003 krinkle: Backport for [[gerrit:1138921|missing.php: Simplify code to reduce abstraction and duplication (T113114)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[15:49:59] <stashbot>	 T113114: Make all wiki-facing error pages consistent - https://phabricator.wikimedia.org/T113114
[15:50:28] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for madalina - https://phabricator.wikimedia.org/T392893#10777049 (tappof)
[15:50:30] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1201 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:50:54] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2108.codfw.wmnet with reason: host reimage
[15:51:35] <wikibugs>	 (CR) Ssingh: [C:+2] wikimedia-dns.org: add TYPE65 records for check.wikimedia-dns.org [dns] - https://gerrit.wikimedia.org/r/1137021 (https://phabricator.wikimedia.org/T205378) (owner: Ssingh)
[15:51:42] <logmsgbot>	 !log sukhe@dns1004 START - running authdns-update
[15:52:01] <wikibugs>	 (CR) Herron: [C:+1] slo_template: update SLO dates to current window [grafana-grizzly] - https://gerrit.wikimedia.org/r/1139891 (owner: BCornwall)
[15:53:42] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:53:55] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for madalina - https://phabricator.wikimedia.org/T392893#10777064 (tappof)
[15:53:57] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2108.codfw.wmnet with reason: host reimage
[15:54:10] <logmsgbot>	 !log sukhe@dns1004 END - running authdns-update
[15:54:11] <wikibugs>	 (PS4) Btullis: Enable the dumpsgen user to use an rsync server over ssh from dse-k8s-eqiad [puppet] - https://gerrit.wikimedia.org/r/1139835 (https://phabricator.wikimedia.org/T390738)
[15:54:19] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T392806)', diff saved to https://phabricator.wikimedia.org/P75628 and previous config saved to /var/cache/conftool/dbconfig/20250429-155419-fceratto.json
[15:54:32] <logmsgbot>	 !log krinkle@deploy1003 krinkle: Continuing with sync
[15:54:47] <logmsgbot>	 !log ebernhardson@deploy1003 Started deploy [airflow-dags/search@5bff61a]: Update airflow-search with simplified mjolnir dag
[15:54:55] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering: Requesting access to analytics-privatedata-users for madalina - https://phabricator.wikimedia.org/T392893#10777075 (tappof)
[15:55:01] <wikibugs>	 (CR) Btullis: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5397/co" [puppet] - https://gerrit.wikimedia.org/r/1139835 (https://phabricator.wikimedia.org/T390738) (owner: Btullis)
[15:55:11] <wikibugs>	 (PS5) Btullis: Enable the dumpsgen user to use an rsync server over ssh from dse-k8s-eqiad [puppet] - https://gerrit.wikimedia.org/r/1139835 (https://phabricator.wikimedia.org/T390738)
[15:55:12] <logmsgbot>	 !log ebernhardson@deploy1003 Finished deploy [airflow-dags/search@5bff61a]: Update airflow-search with simplified mjolnir dag (duration: 00m 25s)
[15:55:19] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2106.codfw.wmnet with OS bullseye
[15:55:58] <wikibugs>	 (CR) Btullis: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5398/co" [puppet] - https://gerrit.wikimedia.org/r/1139835 (https://phabricator.wikimedia.org/T390738) (owner: Btullis)
[15:57:28] <wikibugs>	 (CR) Ilias Sarantopoulos: [C:+1] admin_ng: Update Knative on ml-serve-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1139865 (https://phabricator.wikimedia.org/T369493) (owner: Elukey)
[15:58:56] <wikibugs>	 (PS6) Btullis: Enable the dumpsgen user to use an rsync server over ssh from dse-k8s-eqiad [puppet] - https://gerrit.wikimedia.org/r/1139835 (https://phabricator.wikimedia.org/T390738)
[15:59:42] <wikibugs>	 (CR) Btullis: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5399/co" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1139835 (https://phabricator.wikimedia.org/T390738) (owner: Btullis)
[16:00:05] <jouncebot>	 jhathaway and rzl: #bothumor My software never has bugs. It just develops random features. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250429T1600).
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:00:43] <wikibugs>	 (PS2) Krinkle: missing.php: Redesign to match current error pages [mediawiki-config] - https://gerrit.wikimedia.org/r/1138922 (https://phabricator.wikimedia.org/T113114)
[16:00:50] <wikibugs>	 (CR) CI reject: [V:-1] missing.php: Redesign to match current error pages [mediawiki-config] - https://gerrit.wikimedia.org/r/1138922 (https://phabricator.wikimedia.org/T113114) (owner: Krinkle)
[16:01:11] <logmsgbot>	 !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1138921|missing.php: Simplify code to reduce abstraction and duplication (T113114)]] (duration: 17m 57s)
[16:01:16] <stashbot>	 T113114: Make all wiki-facing error pages consistent - https://phabricator.wikimedia.org/T113114
[16:01:26] <wikibugs>	 (PS3) Krinkle: missing.php: Redesign to match current error pages [mediawiki-config] - https://gerrit.wikimedia.org/r/1138922 (https://phabricator.wikimedia.org/T113114)
[16:01:48] <wikibugs>	 ops-codfw, SRE, DC-Ops, fundraising-tech-ops: Possible frdb2004 hardware failure. - https://phabricator.wikimedia.org/T392579#10777103 (Jhancock.wm) @Jgreen reseated all the connections to the backplane. server came up. I checked the firmware version of the CPLD and it is current (1.0.7).  lemme...
[16:04:22] <wikibugs>	 (CR) Btullis: [C:+2] Enable the dumpsgen user to use an rsync server over ssh from dse-k8s-eqiad (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1139835 (https://phabricator.wikimedia.org/T390738) (owner: Btullis)
[16:04:49] <wikibugs>	 (CR) Btullis: [C:+2] Enable the dumpsgen user to use an rsync server over ssh from dse-k8s-eqiad (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1139835 (https://phabricator.wikimedia.org/T390738) (owner: Btullis)
[16:05:00] <wikibugs>	 ops-codfw, SRE, DBA, DC-Ops: Degraded RAID on db2176 - https://phabricator.wikimedia.org/T392876#10777118 (Jhancock.wm) and of course they decide to give me trouble. had to resubmit it. I'll let you know when the new drive is here and been replaced.
[16:05:32] <wikibugs>	 (CR) David Caro: [C:+2] dcaro: add yubikey ssh key [puppet] - https://gerrit.wikimedia.org/r/1139887 (owner: David Caro)
[16:06:06] <wikibugs>	 (PS1) JHathaway: ferm: ignore hidden staged files created by confd [puppet] - https://gerrit.wikimedia.org/r/1139893
[16:06:51] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1139893 (owner: JHathaway)
[16:09:07] <wikibugs>	 ops-codfw, SRE, DBA, DC-Ops: Degraded RAID on db2176 - https://phabricator.wikimedia.org/T392876#10777149 (Marostegui) Thank you!
[16:09:26] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P75629 and previous config saved to /var/cache/conftool/dbconfig/20250429-160925-fceratto.json
[16:09:33] <icinga-wm>	 RECOVERY - Host ms-be1060 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms
[16:11:07] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, DC-Ops: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment - https://phabricator.wikimedia.org/T392796#10777168 (Jclark-ctr) Open→Resolved a:Jclark-ctr @MatthewVernon   I reseated the PCI RAID card and updated the BIO...
[16:11:27] <icinga-wm>	 PROBLEM - Host ms-be1060 is DOWN: PING CRITICAL - Packet loss = 100%
[16:12:54] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, DC-Ops: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment - https://phabricator.wikimedia.org/T392796#10777194 (Jclark-ctr) Resolved→Open
[16:14:10] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2108.codfw.wmnet with OS bullseye
[16:14:12] <wikibugs>	 (CR) BCornwall: [C:+1] "The commit message isn't mentioning why you're removing the templating for durum's domain/ip addresses, so I'm a little confused about tha" [puppet] - https://gerrit.wikimedia.org/r/1139873 (owner: Ssingh)
[16:14:46] <wikibugs>	 (CR) BCornwall: [V:+2 C:+2] slo_template: update SLO dates to current window [grafana-grizzly] - https://gerrit.wikimedia.org/r/1139891 (owner: BCornwall)
[16:16:29] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:16:43] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:16:57] <wikibugs>	 (CR) Btullis: [C:+2] Create an SSH private key in the mediawiki-dumps-legacy namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1139840 (https://phabricator.wikimedia.org/T390738) (owner: Btullis)
[16:17:29] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops: Q4:rack/setup/install db1258 - https://phabricator.wikimedia.org/T392493#10777280 (Jhancock.wm)
[16:18:42] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[16:18:49] <wikibugs>	 (Merged) jenkins-bot: Create an SSH private key in the mediawiki-dumps-legacy namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1139840 (https://phabricator.wikimedia.org/T390738) (owner: Btullis)
[16:18:52] <jinxer-wm>	 FIRING: [8x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:19:37] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 4.451 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:20:19] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53800 bytes in 0.125 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:22:19] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:22:33] <wikibugs>	 (CR) Ssingh: [V:+1] "Yes sorry that's on me. I will clarify it in the commit that fixes it." [puppet] - https://gerrit.wikimedia.org/r/1139873 (owner: Ssingh)
[16:23:03] <wikibugs>	 (CR) Dzahn: gerrit: have different motd banners on active/passive servers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1137840 (https://phabricator.wikimedia.org/T392212) (owner: Dzahn)
[16:23:17] <wikibugs>	 (CR) Kamila Součková: [C:+1] k8s: rename V1beta1Eviction to support future upgrades [software/spicerack] - https://gerrit.wikimedia.org/r/1139851 (https://phabricator.wikimedia.org/T390857) (owner: Elukey)
[16:23:19] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:24:33] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P75630 and previous config saved to /var/cache/conftool/dbconfig/20250429-162432-fceratto.json
[16:24:39] <wikibugs>	 (CR) Dzahn: "thanks. this is to fix a warning I got on running 'puppet lint'." [puppet] - https://gerrit.wikimedia.org/r/1137842 (owner: Dzahn)
[16:27:01] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, DC-Ops: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment - https://phabricator.wikimedia.org/T392796#10777417 (Jclark-ctr) Error came back  reopened ticket
[16:27:19] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:28:15] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:28:15] <wikibugs>	 (PS2) Kimberly Sarabia: Stream registration for article summaries [mediawiki-config] - https://gerrit.wikimedia.org/r/1129958 (https://phabricator.wikimedia.org/T389097)
[16:29:11] <wikibugs>	 (PS3) Jdlrobson: Stream registration for article summaries [mediawiki-config] - https://gerrit.wikimedia.org/r/1129958 (https://phabricator.wikimedia.org/T389097) (owner: Kimberly Sarabia)
[16:29:55] <wikibugs>	 (CR) Dzahn: "I was about to upload this and then saw it was already done. taavi, that suggestion was right." [puppet] - https://gerrit.wikimedia.org/r/1138995 (https://phabricator.wikimedia.org/T382309) (owner: Jelto)
[16:31:35] <wikibugs>	 (PS2) Dzahn: microsites/backup: remove rt-static backup fileset [puppet] - https://gerrit.wikimedia.org/r/1137486 (https://phabricator.wikimedia.org/T385777)
[16:31:54] <wikibugs>	 (CR) CI reject: [V:-1] microsites/backup: remove rt-static backup fileset [puppet] - https://gerrit.wikimedia.org/r/1137486 (https://phabricator.wikimedia.org/T385777) (owner: Dzahn)
[16:32:49] <wikibugs>	 (CR) Majavah: [C:+2] keepalived: failover: Select unicast source v6 more reliably [puppet] - https://gerrit.wikimedia.org/r/1139877 (https://phabricator.wikimedia.org/T379175) (owner: Majavah)
[16:33:04] <wikibugs>	 (PS3) Dzahn: microsites/backup: remove rt-static backup fileset [puppet] - https://gerrit.wikimedia.org/r/1137486 (https://phabricator.wikimedia.org/T385777)
[16:33:23] <wikibugs>	 (PS4) Dzahn: microsites/backup: remove rt-static backup fileset [puppet] - https://gerrit.wikimedia.org/r/1137486 (https://phabricator.wikimedia.org/T385777)
[16:35:33] <wikibugs>	 (CR) CI reject: [V:-1] microsites/backup: remove rt-static backup fileset [puppet] - https://gerrit.wikimedia.org/r/1137486 (https://phabricator.wikimedia.org/T385777) (owner: Dzahn)
[16:39:21] <wikibugs>	 (PS5) Dzahn: microsites/backup: remove rt-static backup fileset [puppet] - https://gerrit.wikimedia.org/r/1137486 (https://phabricator.wikimedia.org/T385777)
[16:39:40] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T392806)', diff saved to https://phabricator.wikimedia.org/P75631 and previous config saved to /var/cache/conftool/dbconfig/20250429-163939-fceratto.json
[16:39:59] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2162.codfw.wmnet with reason: Maintenance
[16:40:06] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2162 (T392806)', diff saved to https://phabricator.wikimedia.org/P75632 and previous config saved to /var/cache/conftool/dbconfig/20250429-164005-fceratto.json
[16:46:08] <wikibugs>	 (CR) Dzahn: [C:+2] microsites/backup: remove rt-static backup fileset [puppet] - https://gerrit.wikimedia.org/r/1137486 (https://phabricator.wikimedia.org/T385777) (owner: Dzahn)
[16:47:30] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, DC-Ops: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment - https://phabricator.wikimedia.org/T392796#10777501 (Jclark-ctr) @wiki_willy  @RobH   looks like this raid card has failed Can we get a new one ordered?
[16:48:46] <wikibugs>	 (PS1) Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1139901
[16:49:15] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, DC-Ops: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment - https://phabricator.wikimedia.org/T392796#10777503 (Dzahn) out of curiosity: are we replacing this hardware anyways since it's almost 5 years old?
[16:49:29] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T392806)', diff saved to https://phabricator.wikimedia.org/P75633 and previous config saved to /var/cache/conftool/dbconfig/20250429-164927-fceratto.json
[16:49:33] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, DC-Ops: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment - https://phabricator.wikimedia.org/T392796#10777504 (RobH) Notes: * System warranty ended on October 27, 2023 (3 years after purchase) * 5 year life projection says this sho...
[16:50:02] <wikibugs>	 (PS1) Majavah: keepalived: failover: Fix hiera key path [puppet] - https://gerrit.wikimedia.org/r/1139902
[16:50:48] <wikibugs>	 (CR) Majavah: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5400/console" [puppet] - https://gerrit.wikimedia.org/r/1139902 (owner: Majavah)
[16:51:02] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, DC-Ops: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment - https://phabricator.wikimedia.org/T392796#10777512 (RobH) >>! In T392796#10777500, @Jclark-ctr wrote: > @wiki_willy  @RobH   looks like this raid card has failed Can we get...
[16:51:28] <wikibugs>	 (CR) Majavah: [V:+1 C:+2] keepalived: failover: Fix hiera key path [puppet] - https://gerrit.wikimedia.org/r/1139902 (owner: Majavah)
[16:52:46] <wikibugs>	 (PS3) AOkoth: wmnet: change active aphlict host [dns] - https://gerrit.wikimedia.org/r/1139546 (https://phabricator.wikimedia.org/T392128)
[16:53:35] <wikibugs>	 (CR) Hashar: "The cow arts are not Apache2 licensed, they are licensed under `COWSAY`. The license is shipped by the Debian package `/usr/share/doc/cows" [puppet] - https://gerrit.wikimedia.org/r/1137840 (https://phabricator.wikimedia.org/T392212) (owner: Dzahn)
[16:55:11] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, DC-Ops: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment - https://phabricator.wikimedia.org/T392796#10777538 (RobH)
[16:56:17] <wikibugs>	 (PS6) BCornwall: cdn: Unify ats/haproxy/varnish upgrade cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/1129882
[16:56:29] <wikibugs>	 (CR) BCornwall: cdn: Unify ats/haproxy/varnish upgrade cookbooks (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1129882 (owner: BCornwall)
[16:56:43] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, DC-Ops: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment - https://phabricator.wikimedia.org/T392796#10777542 (wiki_willy) @Jclark-ctr - it looks like we refreshed ms-be105[1-9] towards the end of last year via T371389.  Can you ch...
[16:56:51] <icinga-wm>	 PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:58:21] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, DC-Ops: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment - https://phabricator.wikimedia.org/T392796#10777558 (wiki_willy) Sorry, nevermind....it looks like they're HPs  >>! In T392796#10777542, @wiki_willy wrote: > @Jclark-ctr - i...
[16:58:47] <icinga-wm>	 RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:59:03] <wikibugs>	 (CR) AOkoth: [C:+2] wmnet: change active aphlict host [dns] - https://gerrit.wikimedia.org/r/1139546 (https://phabricator.wikimedia.org/T392128) (owner: AOkoth)
[16:59:23] <wikibugs>	 (PS15) Majavah: dynamicproxy: Provision AAAA records [puppet] - https://gerrit.wikimedia.org/r/1088338 (https://phabricator.wikimedia.org/T379175)
[16:59:26] <wikibugs>	 (CR) AOkoth: [C:+2] aphlict: ensure absent on active host [puppet] - https://gerrit.wikimedia.org/r/1139534 (https://phabricator.wikimedia.org/T392128) (owner: AOkoth)
[16:59:48] <wikibugs>	 (CR) Majavah: "Weirdly enough this is up next." [puppet] - https://gerrit.wikimedia.org/r/1088338 (https://phabricator.wikimedia.org/T379175) (owner: Majavah)
[17:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250429T1700)
[17:00:09] <logmsgbot>	 !log aokoth@dns1004 START - running authdns-update
[17:02:41] <logmsgbot>	 !log aokoth@dns1004 END - running authdns-update
[17:03:49] <icinga-wm>	 PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[17:04:37] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P75634 and previous config saved to /var/cache/conftool/dbconfig/20250429-170436-fceratto.json
[17:04:49] <icinga-wm>	 RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[17:08:26] <wikibugs>	 (PS1) Btullis: mediawiki-dumps-legacy: Fix helmfile secrets path [deployment-charts] - https://gerrit.wikimedia.org/r/1139903 (https://phabricator.wikimedia.org/T390738)
[17:10:12] <sukhe>	 !log sudo cumin 'A:durum' 'disable-puppet "rolling out CR 1139873"'
[17:10:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:10:46] <wikibugs>	 (CR) Ssingh: [V:+1 C:+2] P:durum and hiera: update health check path [puppet] - https://gerrit.wikimedia.org/r/1139873 (owner: Ssingh)
[17:12:18] <wikibugs>	 (CR) Btullis: [C:+2] mediawiki-dumps-legacy: Fix helmfile secrets path [deployment-charts] - https://gerrit.wikimedia.org/r/1139903 (https://phabricator.wikimedia.org/T390738) (owner: Btullis)
[17:14:35] <wikibugs>	 (Merged) jenkins-bot: mediawiki-dumps-legacy: Fix helmfile secrets path [deployment-charts] - https://gerrit.wikimedia.org/r/1139903 (https://phabricator.wikimedia.org/T390738) (owner: Btullis)
[17:16:12] <sukhe>	 !log sudo cumin -b1 -s30 'A:durum and not P{durum2002*}' 'run-puppet-agent --enable "rolling out CR 1139873"'
[17:16:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:19:43] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P75635 and previous config saved to /var/cache/conftool/dbconfig/20250429-171943-fceratto.json
[17:22:46] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate wikifeeds.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[17:24:19] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[17:24:51] <wikibugs>	 (CR) Ssingh: [C:+1] ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1139901 (owner: Ncmonitor)
[17:25:15] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[17:25:21] <wikibugs>	 (CR) BCornwall: [C:+2] ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1139901 (owner: Ncmonitor)
[17:27:09] <logmsgbot>	 !log aokoth@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on aphlict1002.eqiad.wmnet with reason: Bookworm Re-image
[17:28:33] <logmsgbot>	 !log aokoth@cumin1002 START - Cookbook sre.hosts.reimage for host aphlict1002.eqiad.wmnet with OS bookworm
[17:28:42] <jinxer-wm>	 FIRING: [9x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:33:55] <wikibugs>	 (PS1) Btullis: mediawiki-dumps-legacy: Add private values files to resources deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1139906 (https://phabricator.wikimedia.org/T390738)
[17:34:52] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T392806)', diff saved to https://phabricator.wikimedia.org/P75636 and previous config saved to /var/cache/conftool/dbconfig/20250429-173450-fceratto.json
[17:35:11] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance
[17:35:18] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2163 (T392806)', diff saved to https://phabricator.wikimedia.org/P75637 and previous config saved to /var/cache/conftool/dbconfig/20250429-173517-fceratto.json
[17:36:05] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering: Requesting access to analytics-privatedata-users for madalina - https://phabricator.wikimedia.org/T392893#10777736 (Ahoelzl) Approved.
[17:36:35] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering, Data-Engineering-Radar: Requesting access to analytics-privatedata-users for madalina - https://phabricator.wikimedia.org/T392893#10777738 (Ahoelzl)
[17:37:38] <wikibugs>	 (CR) Andrea Denisse: [C:+1] "LGTM! I'm just curious about how was the memlimit_ratio value defined." [puppet] - https://gerrit.wikimedia.org/r/1139852 (https://phabricator.wikimedia.org/T383966) (owner: Filippo Giunchedi)
[17:37:43] <logmsgbot>	 !log aokoth@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on aphlict1002.eqiad.wmnet with reason: host reimage
[17:40:49] <logmsgbot>	 !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aphlict1002.eqiad.wmnet with reason: host reimage
[17:41:58] <wikibugs>	 (PS1) Majavah: keepalived: failover: Skip searching v6 addresses on v4-only hosts [puppet] - https://gerrit.wikimedia.org/r/1139909
[17:43:31] <wikibugs>	 (Abandoned) AOkoth: miscweb: update values-os-reports env config [deployment-charts] - https://gerrit.wikimedia.org/r/1115944 (https://phabricator.wikimedia.org/T350794) (owner: AOkoth)
[17:44:39] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T392806)', diff saved to https://phabricator.wikimedia.org/P75638 and previous config saved to /var/cache/conftool/dbconfig/20250429-174438-fceratto.json
[17:44:52] <wikibugs>	 (CR) Majavah: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5401/console" [puppet] - https://gerrit.wikimedia.org/r/1139909 (owner: Majavah)
[17:46:19] <wikibugs>	 (CR) Majavah: [V:+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5402/console" [puppet] - https://gerrit.wikimedia.org/r/1139909 (owner: Majavah)
[17:47:18] <wikibugs>	 (CR) Majavah: [V:+1 C:+2] keepalived: failover: Skip searching v6 addresses on v4-only hosts [puppet] - https://gerrit.wikimedia.org/r/1139909 (owner: Majavah)
[17:53:19] <wikibugs>	 (PS1) Ssingh: P:durum: hiera: log only ech_status [puppet] - https://gerrit.wikimedia.org/r/1139911 (https://phabricator.wikimedia.org/T205378)
[17:53:59] <wikibugs>	 (CR) Ssingh: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5403/co" [puppet] - https://gerrit.wikimedia.org/r/1139911 (https://phabricator.wikimedia.org/T205378) (owner: Ssingh)
[17:54:59] <wikibugs>	 (PS2) Ssingh: P:durum: hiera: log only ech_status [puppet] - https://gerrit.wikimedia.org/r/1139911 (https://phabricator.wikimedia.org/T205378)
[17:55:39] <wikibugs>	 (CR) Ssingh: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5404/co" [puppet] - https://gerrit.wikimedia.org/r/1139911 (https://phabricator.wikimedia.org/T205378) (owner: Ssingh)
[17:56:47] <wikibugs>	 (CR) Ssingh: [V:+1] "@bcornwall@wikimedia.org: Hopefully this should clear up the confusion I created with the earlier commit and the intent. Let me know if yo" [puppet] - https://gerrit.wikimedia.org/r/1139911 (https://phabricator.wikimedia.org/T205378) (owner: Ssingh)
[17:58:42] <jinxer-wm>	 FIRING: [10x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:59:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:59:46] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P75639 and previous config saved to /var/cache/conftool/dbconfig/20250429-175946-fceratto.json
[18:00:05] <jouncebot>	 hashar and dduvall: Deploy window MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250429T1800)
[18:01:42] <logmsgbot>	 !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aphlict1002.eqiad.wmnet with OS bookworm
[18:02:35] <wikibugs>	 (CR) Ssingh: "I am abandoning this for now. The Gitlab project is working fine so I will stick with it. For the CDN deployment, the changes should be up" [debs/nginx-ech] - https://gerrit.wikimedia.org/r/1135733 (https://phabricator.wikimedia.org/T205378) (owner: Ssingh)
[18:02:41] <wikibugs>	 (Abandoned) Ssingh: Release 1.22.1-9+deb12u1+ech1 [debs/nginx-ech] - https://gerrit.wikimedia.org/r/1135733 (https://phabricator.wikimedia.org/T205378) (owner: Ssingh)
[18:06:30] <wikibugs>	 (PS1) GergesShamon: Change Arabic Wikipedia tagline [mediawiki-config] - https://gerrit.wikimedia.org/r/1139912 (https://phabricator.wikimedia.org/T392858)
[18:11:23] <wikibugs>	 (PS2) GergesShamon: Change Arabic Wikipedia tagline [mediawiki-config] - https://gerrit.wikimedia.org/r/1139912 (https://phabricator.wikimedia.org/T392858)
[18:14:53] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P75640 and previous config saved to /var/cache/conftool/dbconfig/20250429-181453-fceratto.json
[18:16:42] <wikibugs>	 (CR) BCornwall: [C:+1] P:durum: hiera: log only ech_status [puppet] - https://gerrit.wikimedia.org/r/1139911 (https://phabricator.wikimedia.org/T205378) (owner: Ssingh)
[18:17:19] <wikibugs>	 (CR) Ssingh: [V:+1 C:+2] P:durum: hiera: log only ech_status [puppet] - https://gerrit.wikimedia.org/r/1139911 (https://phabricator.wikimedia.org/T205378) (owner: Ssingh)
[18:17:30] <wikibugs>	 (PS3) GergesShamon: Change Arabic Wikipedia tagline [mediawiki-config] - https://gerrit.wikimedia.org/r/1139912 (https://phabricator.wikimedia.org/T392858)
[18:18:22] <wikibugs>	 (PS4) GergesShamon: Change Arabic Wikipedia tagline [mediawiki-config] - https://gerrit.wikimedia.org/r/1139912 (https://phabricator.wikimedia.org/T392858)
[18:24:40] <wikibugs>	 (PS5) GergesShamon: Change Arabic Wikipedia tagline [mediawiki-config] - https://gerrit.wikimedia.org/r/1139912 (https://phabricator.wikimedia.org/T392858)
[18:27:23] <wikibugs>	 (PS6) GergesShamon: Change Arabic Wikipedia tagline [mediawiki-config] - https://gerrit.wikimedia.org/r/1139912 (https://phabricator.wikimedia.org/T392858)
[18:30:02] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T392806)', diff saved to https://phabricator.wikimedia.org/P75641 and previous config saved to /var/cache/conftool/dbconfig/20250429-183000-fceratto.json
[18:30:22] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance
[18:30:37] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance
[18:30:44] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2164 (T392806)', diff saved to https://phabricator.wikimedia.org/P75642 and previous config saved to /var/cache/conftool/dbconfig/20250429-183044-fceratto.json
[18:31:22] <wikibugs>	 (PS8) Dzahn: gerrit: have different motd banners on active/passive servers [puppet] - https://gerrit.wikimedia.org/r/1137840 (https://phabricator.wikimedia.org/T392212)
[18:31:42] <wikibugs>	 (CR) Dzahn: gerrit: have different motd banners on active/passive servers (4 comments) [puppet] - https://gerrit.wikimedia.org/r/1137840 (https://phabricator.wikimedia.org/T392212) (owner: Dzahn)
[18:31:48] <wikibugs>	 (CR) CI reject: [V:-1] Change Arabic Wikipedia tagline [mediawiki-config] - https://gerrit.wikimedia.org/r/1139912 (https://phabricator.wikimedia.org/T392858) (owner: GergesShamon)
[18:33:22] <wikibugs>	 (PS9) Dzahn: gerrit: have different motd banners on active/passive servers [puppet] - https://gerrit.wikimedia.org/r/1137840 (https://phabricator.wikimedia.org/T392212)
[18:33:43] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:33:45] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:34:27] <wikibugs>	 (PS3) Dzahn: gerrit: replace legacy fact with modern fact [puppet] - https://gerrit.wikimedia.org/r/1137842
[18:35:04] <wikibugs>	 (PS1) Ssingh: P:durum: use /health instead of /check [puppet] - https://gerrit.wikimedia.org/r/1139919
[18:35:27] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:35:44] <wikibugs>	 (CR) Ssingh: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5406/co" [puppet] - https://gerrit.wikimedia.org/r/1139919 (owner: Ssingh)
[18:36:26] <wikibugs>	 (CR) Ssingh: [V:+1] "Updating path from I64107416fdeaffbdc00c6b5481d12494d4ccfe0d." [puppet] - https://gerrit.wikimedia.org/r/1139919 (owner: Ssingh)
[18:37:23] <wikibugs>	 (CR) Ssingh: [V:+1 C:+2] P:durum: use /health instead of /check [puppet] - https://gerrit.wikimedia.org/r/1139919 (owner: Ssingh)
[18:37:53] <wikibugs>	 (PS4) Dzahn: gerrit: replace legacy fact with modern fact [puppet] - https://gerrit.wikimedia.org/r/1137842
[18:37:58] <wikibugs>	 (CR) Dzahn: [V:+1 C:+2] "https://puppet-compiler.wmflabs.org/output/1137842/5405/" [puppet] - https://gerrit.wikimedia.org/r/1137842 (owner: Dzahn)
[18:39:14] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T392806)', diff saved to https://phabricator.wikimedia.org/P75643 and previous config saved to /var/cache/conftool/dbconfig/20250429-183913-fceratto.json
[18:47:21] <wikibugs>	 (CR) GergesShamon: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1139912 (https://phabricator.wikimedia.org/T392858) (owner: GergesShamon)
[18:48:43] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:48:45] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:53:39] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10778059 (Jhancock.wm) @Papaul can you take a look at this one. 2047 is installed on 2048 and 2048 is installed on 2047. not sure where the swap happened. i checke...
[18:54:22] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P75644 and previous config saved to /var/cache/conftool/dbconfig/20250429-185421-fceratto.json
[18:58:07] <logmsgbot>	 !log fab@deploy1003 Started deploy [airflow-dags/research@414def7]: (no justification provided)
[18:58:44] <logmsgbot>	 !log fab@deploy1003 Finished deploy [airflow-dags/research@414def7]: (no justification provided) (duration: 00m 50s)
[19:01:04] <wikibugs>	 (PS7) GergesShamon: Change Arabic Wikipedia tagline [mediawiki-config] - https://gerrit.wikimedia.org/r/1139912 (https://phabricator.wikimedia.org/T392858)
[19:04:14] <wikibugs>	 (CR) CI reject: [V:-1] Change Arabic Wikipedia tagline [mediawiki-config] - https://gerrit.wikimedia.org/r/1139912 (https://phabricator.wikimedia.org/T392858) (owner: GergesShamon)
[19:04:44] <wikibugs>	 (CR) Dzahn: [C:+1] gerrit: add ports to hackathon nftables rule [puppet] - https://gerrit.wikimedia.org/r/1138995 (https://phabricator.wikimedia.org/T382309) (owner: Jelto)
[19:06:10] <wikibugs>	 (PS8) GergesShamon: Change Arabic Wikipedia tagline [mediawiki-config] - https://gerrit.wikimedia.org/r/1139912 (https://phabricator.wikimedia.org/T392858)
[19:06:21] <wikibugs>	 SRE, serviceops-radar: Cannot connect to MariaDB server from mwmaint1002 - https://phabricator.wikimedia.org/T392846#10778104 (Dzahn) Open→Resolved a:Dzahn Well, I would say this is resolved. Just needed more disk space.  And follow-ups can be done over there on the linked task.
[19:09:28] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P75645 and previous config saved to /var/cache/conftool/dbconfig/20250429-190927-fceratto.json
[19:16:19] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:17:15] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:18:42] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:24:35] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T392806)', diff saved to https://phabricator.wikimedia.org/P75646 and previous config saved to /var/cache/conftool/dbconfig/20250429-192434-fceratto.json
[19:24:55] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance
[19:25:02] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2165 (T392806)', diff saved to https://phabricator.wikimedia.org/P75647 and previous config saved to /var/cache/conftool/dbconfig/20250429-192501-fceratto.json
[19:33:17] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T392806)', diff saved to https://phabricator.wikimedia.org/P75648 and previous config saved to /var/cache/conftool/dbconfig/20250429-193316-fceratto.json
[19:48:25] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P75649 and previous config saved to /var/cache/conftool/dbconfig/20250429-194824-fceratto.json
[19:53:42] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250429T2000).
[20:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[20:02:24] <denisse>	 !log disabling Puppet on grafana2001 - T384841
[20:02:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:02:30] <stashbot>	 T384841: Upgrade to Grafana 11 - https://phabricator.wikimedia.org/T384841
[20:03:32] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P75651 and previous config saved to /var/cache/conftool/dbconfig/20250429-200331-fceratto.json
[20:05:00] <wikibugs>	 (CR) Kimberly Sarabia: Stream registration for article summaries (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1129958 (https://phabricator.wikimedia.org/T389097) (owner: Kimberly Sarabia)
[20:09:23] <wikibugs>	 (PS3) Dzahn: miscweb: remove static-rt profile from legacy miscweb role [puppet] - https://gerrit.wikimedia.org/r/1137484 (https://phabricator.wikimedia.org/T385777)
[20:14:35] <wikibugs>	 (CR) Dzahn: [C:+2] miscweb: remove static-rt profile from legacy miscweb role [puppet] - https://gerrit.wikimedia.org/r/1137484 (https://phabricator.wikimedia.org/T385777) (owner: Dzahn)
[20:17:21] <wikibugs>	 (CR) Scott French: [C:+1] mw:maintenance:updatequerypages: move all deadendpages jobs to k8s [puppet] - https://gerrit.wikimedia.org/r/1139432 (https://phabricator.wikimedia.org/T388534) (owner: Hnowlan)
[20:17:32] <wikibugs>	 (CR) Scott French: [C:+1] mw::maintenance: migrate a single updatequerypages_ancientpages shard [puppet] - https://gerrit.wikimedia.org/r/1139437 (https://phabricator.wikimedia.org/T388534) (owner: Hnowlan)
[20:18:39] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T392806)', diff saved to https://phabricator.wikimedia.org/P75652 and previous config saved to /var/cache/conftool/dbconfig/20250429-201838-fceratto.json
[20:18:42] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[20:18:58] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance
[20:19:06] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2166 (T392806)', diff saved to https://phabricator.wikimedia.org/P75653 and previous config saved to /var/cache/conftool/dbconfig/20250429-201905-fceratto.json
[20:25:36] <wikibugs>	 (PS1) Dwisehaupt: monitoring: Fix check_puppetrun for failures on bookworm [puppet] - https://gerrit.wikimedia.org/r/1139930 (https://phabricator.wikimedia.org/T392961)
[20:27:48] <wikibugs>	 (CR) Alexandros Kosiaris: [C:+2] Allow releng to resume train related systemd timers [puppet] - https://gerrit.wikimedia.org/r/1130947 (https://phabricator.wikimedia.org/T387823) (owner: Hashar)
[20:28:27] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T392806)', diff saved to https://phabricator.wikimedia.org/P75654 and previous config saved to /var/cache/conftool/dbconfig/20250429-202827-fceratto.json
[20:31:55] <wikibugs>	 (PS3) Scott French: P:mediawiki::maintenance::purge_loginnotify: migrate to k8s [puppet] - https://gerrit.wikimedia.org/r/1139923 (https://phabricator.wikimedia.org/T388536)
[20:35:58] <wikibugs>	 (PS1) GergesShamon: Change Arabic Wikipedia logo [mediawiki-config] - https://gerrit.wikimedia.org/r/1139932 (https://phabricator.wikimedia.org/T392858)
[20:43:04] <wikibugs>	 (CR) CI reject: [V:-1] Change Arabic Wikipedia logo [mediawiki-config] - https://gerrit.wikimedia.org/r/1139932 (https://phabricator.wikimedia.org/T392858) (owner: GergesShamon)
[20:43:35] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P75655 and previous config saved to /var/cache/conftool/dbconfig/20250429-204334-fceratto.json
[20:47:07] <wikibugs>	 (PS2) GergesShamon: Change Arabic Wikipedia logo [mediawiki-config] - https://gerrit.wikimedia.org/r/1139932 (https://phabricator.wikimedia.org/T392858)
[20:55:34] <wikibugs>	 (CR) CI reject: [V:-1] Change Arabic Wikipedia logo [mediawiki-config] - https://gerrit.wikimedia.org/r/1139932 (https://phabricator.wikimedia.org/T392858) (owner: GergesShamon)
[20:58:42] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P75656 and previous config saved to /var/cache/conftool/dbconfig/20250429-205841-fceratto.json
[21:00:04] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250429T2100)
[21:03:57] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, DC-Ops: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment - https://phabricator.wikimedia.org/T392796#10778426 (Jclark-ctr) @RobH  this is a 740xd2 we have not had any of these decom yet
[21:04:48] <wikibugs>	 (CR) Ryan Kemper: [C:+2] fix inconsequential typos [deployment-charts] - https://gerrit.wikimedia.org/r/1137356 (owner: Ryan Kemper)
[21:05:01] <wikibugs>	 (CR) Ebernhardson: [C:+1] wdqs-internal: remove disc records [dns] - https://gerrit.wikimedia.org/r/1136740 (https://phabricator.wikimedia.org/T376151) (owner: Ryan Kemper)
[21:06:33] <wikibugs>	 (PS3) Ryan Kemper: wdqs-internal: remove disc records [dns] - https://gerrit.wikimedia.org/r/1136740 (https://phabricator.wikimedia.org/T376151)
[21:12:46] <wikibugs>	 (PS4) Ryan Kemper: wdqs-internal: remove disc records [dns] - https://gerrit.wikimedia.org/r/1136740 (https://phabricator.wikimedia.org/T376151)
[21:12:47] <wikibugs>	 (PS1) Ryan Kemper: wdqs-internal: remove lvs VIP [dns] - https://gerrit.wikimedia.org/r/1139936 (https://phabricator.wikimedia.org/T376151)
[21:13:49] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T392806)', diff saved to https://phabricator.wikimedia.org/P75657 and previous config saved to /var/cache/conftool/dbconfig/20250429-211349-fceratto.json
[21:14:08] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance
[21:14:15] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2167 (T392806)', diff saved to https://phabricator.wikimedia.org/P75658 and previous config saved to /var/cache/conftool/dbconfig/20250429-211415-fceratto.json
[21:19:19] <icinga-wm>	 PROBLEM - BGP status on cr1-drmrs is CRITICAL: BGP CRITICAL - AS13030/IPv6: Connect - Init7 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:22:36] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T392806)', diff saved to https://phabricator.wikimedia.org/P75659 and previous config saved to /var/cache/conftool/dbconfig/20250429-212235-fceratto.json
[21:22:46] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate wikifeeds.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[21:23:39] <jinxer-wm>	 FIRING: TransitBGPDown: Transit BGP session down between cr1-drmrs and Hurricane Electric (2001:7f8:36::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=drmrs&var-device=cr1-drmrs:9804&var-bgp_group=Transit6&var-bgp_neighbor=Hurricane Electric - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[21:28:39] <jinxer-wm>	 RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr1-drmrs and Hurricane Electric (185.1.47.2) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[21:31:19] <icinga-wm>	 PROBLEM - BGP status on cr1-drmrs is CRITICAL: BGP CRITICAL - AS6939/IPv4: Idle - HE, AS13030/IPv6: Idle - Init7, AS6939/IPv6: Idle - HE, AS13030/IPv4: Idle - Init7 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:31:37] <wikibugs>	 (PS2) Ryan Kemper: query-legacy-full: set cluster in hiera [puppet] - https://gerrit.wikimedia.org/r/1139537 (https://phabricator.wikimedia.org/T384422)
[21:31:48] <wikibugs>	 (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1139537 (https://phabricator.wikimedia.org/T384422) (owner: Ryan Kemper)
[21:33:15] <icinga-wm>	 PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 58, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:34:51] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:xe-0/1/0 (Peering: DE-CIX (DXDB:NAS:173434 MAC filter) {#D0067}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[21:35:24] <wikibugs>	 (PS3) Ryan Kemper: query-legacy-full: set cluster in hiera [puppet] - https://gerrit.wikimedia.org/r/1139537 (https://phabricator.wikimedia.org/T384422)
[21:35:32] <wikibugs>	 (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1139537 (https://phabricator.wikimedia.org/T384422) (owner: Ryan Kemper)
[21:35:39] <jinxer-wm>	 FIRING: [2x] TransitBGPDown: Transit BGP session down between cr1-drmrs and Hurricane Electric (185.1.47.2) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[21:37:43] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P75660 and previous config saved to /var/cache/conftool/dbconfig/20250429-213743-fceratto.json
[21:38:15] <icinga-wm>	 RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 59, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:39:26] <wikibugs>	 (CR) Ryan Kemper: [C:+2] query-legacy-full: set cluster in hiera [puppet] - https://gerrit.wikimedia.org/r/1139537 (https://phabricator.wikimedia.org/T384422) (owner: Ryan Kemper)
[21:39:51] <jinxer-wm>	 RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:xe-0/1/0 (Peering: DE-CIX (DXDB:NAS:173434 MAC filter) {#D0067}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[21:40:39] <jinxer-wm>	 RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr1-drmrs and Hurricane Electric (185.1.47.2) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[21:49:44] <wikibugs>	 (PS7) BCornwall: varnish: Replace X-IS-ALT-DOMAIN with variable [puppet] - https://gerrit.wikimedia.org/r/1068085 (https://phabricator.wikimedia.org/T373550)
[21:52:50] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P75661 and previous config saved to /var/cache/conftool/dbconfig/20250429-215250-fceratto.json
[21:55:37] <icinga-wm>	 RECOVERY - Host ms-be1060 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms
[21:58:42] <wikibugs>	 ops-codfw, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10778576 (BCornwall)
[21:58:52] <jinxer-wm>	 FIRING: [10x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:58:58] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, DC-Ops: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment - https://phabricator.wikimedia.org/T392796#10778577 (Jclark-ctr) Removed the BBU from the RAID card. After letting the server sit for 10 minutes without the BBU, I reinstall...
[21:59:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:00:26] <brett>	 !log import ncmonitor 1.3.5 to bookworm-wikimedia
[22:00:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:04:42] <icinga-wm>	 PROBLEM - Host ms-be1060 is DOWN: PING CRITICAL - Packet loss = 100%
[22:07:57] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2167 (T392806)', diff saved to https://phabricator.wikimedia.org/P75662 and previous config saved to /var/cache/conftool/dbconfig/20250429-220757-fceratto.json
[22:08:17] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance
[22:08:23] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2181 (T392806)', diff saved to https://phabricator.wikimedia.org/P75663 and previous config saved to /var/cache/conftool/dbconfig/20250429-220823-fceratto.json
[22:14:35] <logmsgbot>	 !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[22:16:00] <logmsgbot>	 !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[22:16:33] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T392806)', diff saved to https://phabricator.wikimedia.org/P75664 and previous config saved to /var/cache/conftool/dbconfig/20250429-221633-fceratto.json
[22:21:22] <wikibugs>	 (CR) Dwisehaupt: "This code has been tested and rolled out for fr-tech. It only gets triggered if there is a puppet run failure to parse so may not have bee" [puppet] - https://gerrit.wikimedia.org/r/1139930 (https://phabricator.wikimedia.org/T392961) (owner: Dwisehaupt)
[22:31:40] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P75665 and previous config saved to /var/cache/conftool/dbconfig/20250429-223140-fceratto.json
[22:32:51] <logmsgbot>	 !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[22:33:00] <logmsgbot>	 !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[22:33:41] <logmsgbot>	 !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[22:33:58] <logmsgbot>	 !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[22:34:33] <logmsgbot>	 !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[22:34:38] <logmsgbot>	 !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[22:34:44] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10778614 (Papaul) @Jhancock.wm you have mismatch on serial number in netbox 91 is ganeti2047 and and 90 is ganeti2048
[22:35:27] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:46:49] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P75666 and previous config saved to /var/cache/conftool/dbconfig/20250429-224647-fceratto.json
[23:01:56] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T392806)', diff saved to https://phabricator.wikimedia.org/P75667 and previous config saved to /var/cache/conftool/dbconfig/20250429-230155-fceratto.json
[23:02:15] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance
[23:02:22] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2195 (T392806)', diff saved to https://phabricator.wikimedia.org/P75668 and previous config saved to /var/cache/conftool/dbconfig/20250429-230222-fceratto.json
[23:10:31] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T392806)', diff saved to https://phabricator.wikimedia.org/P75669 and previous config saved to /var/cache/conftool/dbconfig/20250429-231031-fceratto.json
[23:18:42] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:25:39] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P75670 and previous config saved to /var/cache/conftool/dbconfig/20250429-232538-fceratto.json
[23:29:06] <wikibugs>	 (CR) Ssingh: varnish: Replace X-IS-ALT-DOMAIN with variable (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1068085 (https://phabricator.wikimedia.org/T373550) (owner: BCornwall)
[23:31:36] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:32:32] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1071 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[23:38:52] <icinga-wm>	 PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:39:48] <icinga-wm>	 RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:39:58] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1139952
[23:39:58] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1139952 (owner: TrainBranchBot)
[23:40:46] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P75671 and previous config saved to /var/cache/conftool/dbconfig/20250429-234045-fceratto.json
[23:41:32] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1071 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[23:42:52] <icinga-wm>	 PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:44:48] <icinga-wm>	 RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:48:54] <zabe>	 jouncebot: nowandnext
[23:48:54] <jouncebot>	 No deployments scheduled for the next 6 hour(s) and 11 minute(s)
[23:48:54] <jouncebot>	 In 6 hour(s) and 11 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T0600)
[23:50:24] <wikibugs>	 (CR) Zabe: [C:+2] enwiki and commons: Increase revision-slots cache expiry again [mediawiki-config] - https://gerrit.wikimedia.org/r/1139577 (https://phabricator.wikimedia.org/T183490) (owner: Zabe)
[23:51:16] <wikibugs>	 (Merged) jenkins-bot: enwiki and commons: Increase revision-slots cache expiry again [mediawiki-config] - https://gerrit.wikimedia.org/r/1139577 (https://phabricator.wikimedia.org/T183490) (owner: Zabe)
[23:51:44] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1139952 (owner: TrainBranchBot)
[23:51:53] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1139577|enwiki and commons: Increase revision-slots cache expiry again (T183490)]]
[23:51:58] <stashbot>	 T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490
[23:53:42] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:55:53] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T392806)', diff saved to https://phabricator.wikimedia.org/P75672 and previous config saved to /var/cache/conftool/dbconfig/20250429-235552-fceratto.json
[23:56:12] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance
[23:58:43] <logmsgbot>	 !log zabe@deploy1003 zabe: Backport for [[gerrit:1139577|enwiki and commons: Increase revision-slots cache expiry again (T183490)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[23:58:48] <stashbot>	 T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490
[23:58:53] <logmsgbot>	 !log zabe@deploy1003 zabe: Continuing with sync
[23:59:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed