[00:00:04] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250107T0000)
[00:01:43] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1263.eqiad.wmnet with reason: host reimage
[00:05:18] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1263.eqiad.wmnet with reason: host reimage
[00:07:12] <wikibugs>	 (PS9) Scott French: trafficserver: add mw-php-migration to mapping_rules [puppet] - https://gerrit.wikimedia.org/r/1082581 (https://phabricator.wikimedia.org/T377042)
[00:09:35] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10435822 (phaultfinder)
[00:22:31] <wikibugs>	 (PS1) ZhaoFJx: cswikivoyage: Change the wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/1108508 (https://phabricator.wikimedia.org/T382779)
[00:24:06] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1263.eqiad.wmnet with OS bookworm
[00:24:09] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.roll-reimage-nodes (exit_code=0) rolling reimage on P{wikikube-worker[1257-1263].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad)
[00:25:27] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:35:46] <wikibugs>	 (CR) Hamish: [C:+1] Enable AutoModerator on zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1108459 (https://phabricator.wikimedia.org/T367306) (owner: ZhaoFJx)
[00:38:17] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1108510
[00:38:17] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1108510 (owner: TrainBranchBot)
[00:39:42] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10435848 (phaultfinder)
[01:00:30] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1108510 (owner: TrainBranchBot)
[01:08:32] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1108516
[01:08:32] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1108516 (owner: TrainBranchBot)
[01:26:34] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1108516 (owner: TrainBranchBot)
[01:29:42] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10435876 (phaultfinder)
[01:40:09] <wikibugs>	 (PS3) Scott French: trafficserver: validate production config in tests [puppet] - https://gerrit.wikimedia.org/r/1101104 (https://phabricator.wikimedia.org/T377042)
[01:40:09] <wikibugs>	 (PS10) Scott French: trafficserver: add mw-php-migration to mapping_rules [puppet] - https://gerrit.wikimedia.org/r/1082581 (https://phabricator.wikimedia.org/T377042)
[01:45:20] <wikibugs>	 (CR) Scott French: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1082581 (https://phabricator.wikimedia.org/T377042) (owner: Scott French)
[01:51:48] <wikibugs>	 (CR) Anzx: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1108508 (https://phabricator.wikimedia.org/T382779) (owner: ZhaoFJx)
[01:52:29] <wikibugs>	 (CR) CI reject: [V:-1] cswikivoyage: Change the wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/1108508 (https://phabricator.wikimedia.org/T382779) (owner: ZhaoFJx)
[02:00:14] <wikibugs>	 (CR) Anzx: [C:-1] "please follow instructions on https://gerrit.wikimedia.org/g/operations/mediawiki-config/%2B/refs/heads/master/logos/ for generating logo" [mediawiki-config] - https://gerrit.wikimedia.org/r/1108508 (https://phabricator.wikimedia.org/T382779) (owner: ZhaoFJx)
[02:08:01] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.44.0-wmf.11 [core] (wmf/1.44.0-wmf.11) - https://gerrit.wikimedia.org/r/1108521 (https://phabricator.wikimedia.org/T382362)
[02:08:03] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.44.0-wmf.11 [core] (wmf/1.44.0-wmf.11) - https://gerrit.wikimedia.org/r/1108521 (https://phabricator.wikimedia.org/T382362) (owner: TrainBranchBot)
[02:11:39] <icinga-wm>	 PROBLEM - Routinator process on rpki2003 is CRITICAL: PROCS CRITICAL: 0 processes with command name routinator https://wikitech.wikimedia.org/wiki/RPKI%23Process
[02:11:47] <icinga-wm>	 PROBLEM - RPKI Validator RTR port on rpki2003 is CRITICAL: connect to address 10.192.24.3 and port 3323: Connection refused https://wikitech.wikimedia.org/wiki/RPKI%23RPKI_to_router_port
[02:14:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:18:28] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:27:24] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.44.0-wmf.11 [core] (wmf/1.44.0-wmf.11) - https://gerrit.wikimedia.org/r/1108521 (https://phabricator.wikimedia.org/T382362) (owner: TrainBranchBot)
[02:36:33] <wikibugs>	 (PS1) Zabe: Add Apache configuration for wikipedia-zh-sysop.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1108522
[02:36:51] <wikibugs>	 (PS2) Zabe: Add Apache configuration for wikipedia-zh-arbcom.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1108522
[02:36:53] <wikibugs>	 (CR) CI reject: [V:-1] Add Apache configuration for wikipedia-zh-arbcom.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1108522 (owner: Zabe)
[02:37:10] <wikibugs>	 (CR) CI reject: [V:-1] Add Apache configuration for wikipedia-zh-arbcom.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1108522 (owner: Zabe)
[02:37:27] <wikibugs>	 (PS3) Zabe: Add Apache configuration for wikipedia-zh-arbcom.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1108522 (https://phabricator.wikimedia.org/T380119)
[02:39:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:54:03] <jinxer-wm>	 FIRING: KubernetesCalicoDown: wikikube-worker1243.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-worker1243.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[03:00:05] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250107T0300)
[03:04:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:09:44] <jinxer-wm>	 FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[03:09:44] <jinxer-wm>	 FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[03:14:15] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:15:05] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:27:58] <wikibugs>	 (Abandoned) ZhaoFJx: cswikivoyage: Change the wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/1108508 (https://phabricator.wikimedia.org/T382779) (owner: ZhaoFJx)
[03:44:44] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10436069 (phaultfinder)
[03:50:42] <wikibugs>	 (PS1) ZhaoFJx: cswikivoyage: Change the wordmark v2 [mediawiki-config] - https://gerrit.wikimedia.org/r/1108525 (https://phabricator.wikimedia.org/T382779)
[03:52:13] <wikibugs>	 (CR) ZhaoFJx: "Thanks for the important advise! I followed the document, edited config.yaml first, then run the tox command, and commit. Hope they can wo" [mediawiki-config] - https://gerrit.wikimedia.org/r/1108525 (https://phabricator.wikimedia.org/T382779) (owner: ZhaoFJx)
[03:53:27] <wikibugs>	 (CR) Diskdance: varnish: Hide X-Client-IP on error page by default (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1108485 (https://phabricator.wikimedia.org/T383062) (owner: BCornwall)
[03:53:48] <wikibugs>	 (PS2) ZhaoFJx: cswikivoyage: Change the wordmark v2 [mediawiki-config] - https://gerrit.wikimedia.org/r/1108525 (https://phabricator.wikimedia.org/T382779)
[03:59:43] <wikibugs>	 (CR) Anzx: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1108525 (https://phabricator.wikimedia.org/T382779) (owner: ZhaoFJx)
[04:00:05] <jouncebot>	 Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250107T0400)
[04:01:42] <wikibugs>	 (PS1) TrainBranchBot: testwikis to 1.44.0-wmf.11 [mediawiki-config] - https://gerrit.wikimedia.org/r/1108532 (https://phabricator.wikimedia.org/T382362)
[04:01:43] <wikibugs>	 (CR) TrainBranchBot: [C:+2] testwikis to 1.44.0-wmf.11 [mediawiki-config] - https://gerrit.wikimedia.org/r/1108532 (https://phabricator.wikimedia.org/T382362) (owner: TrainBranchBot)
[04:02:29] <wikibugs>	 (Merged) jenkins-bot: testwikis to 1.44.0-wmf.11 [mediawiki-config] - https://gerrit.wikimedia.org/r/1108532 (https://phabricator.wikimedia.org/T382362) (owner: TrainBranchBot)
[04:02:55] <logmsgbot>	 !log mwpresync@deploy2002 Started scap sync-world: testwikis to 1.44.0-wmf.11  refs T382362
[04:02:58] <stashbot>	 T382362: 1.44.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T382362
[04:09:15] <wikibugs>	 (CR) Anzx: "there are some unrelated changes in wmf-config/logos.php ,please revert it to manually" [mediawiki-config] - https://gerrit.wikimedia.org/r/1108525 (https://phabricator.wikimedia.org/T382779) (owner: ZhaoFJx)
[04:20:27] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:31:02] <wikibugs>	 (PS3) ZhaoFJx: cswikivoyage: Change the wordmark v2 [mediawiki-config] - https://gerrit.wikimedia.org/r/1108525 (https://phabricator.wikimedia.org/T382779)
[04:32:09] <wikibugs>	 (CR) ZhaoFJx: "done in patch #3, please review" [mediawiki-config] - https://gerrit.wikimedia.org/r/1108525 (https://phabricator.wikimedia.org/T382779) (owner: ZhaoFJx)
[04:36:57] <wikibugs>	 (PS4) ZhaoFJx: cswikivoyage: Change the wordmark v2 [mediawiki-config] - https://gerrit.wikimedia.org/r/1108525 (https://phabricator.wikimedia.org/T382779)
[04:39:01] <wikibugs>	 (CR) Anzx: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1108525 (https://phabricator.wikimedia.org/T382779) (owner: ZhaoFJx)
[04:41:42] <wikibugs>	 (CR) Anzx: [C:+1] "looks good, please schedule patch for deployment through any of backport windows https://wikitech.wikimedia.org/wiki/Deployments" [mediawiki-config] - https://gerrit.wikimedia.org/r/1108525 (https://phabricator.wikimedia.org/T382779) (owner: ZhaoFJx)
[04:51:02] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 07 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1108459 (https://phabricator.wikimedia.org/T367306) (owner: ZhaoFJx)
[04:51:30] <logmsgbot>	 !log mwpresync@deploy2002 Finished scap sync-world: testwikis to 1.44.0-wmf.11  refs T382362 (duration: 48m 35s)
[04:51:31] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 07 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1108525 (https://phabricator.wikimedia.org/T382779) (owner: ZhaoFJx)
[04:51:34] <stashbot>	 T382362: 1.44.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T382362
[05:00:05] <jouncebot>	 Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250107T0500)
[05:15:27] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:33:41] <wikibugs>	 (PS1) Marostegui: dbproxy1029: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1108537
[05:34:09] <wikibugs>	 (CR) Marostegui: [C:+2] dbproxy1029: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1108537 (owner: Marostegui)
[05:36:43] <wikibugs>	 (PS1) Marostegui: wmnet: Switch m5-master proxy [dns] - https://gerrit.wikimedia.org/r/1108539 (https://phabricator.wikimedia.org/T368874)
[05:38:17] <wikibugs>	 (CR) Marostegui: [C:+2] wmnet: Switch m5-master proxy [dns] - https://gerrit.wikimedia.org/r/1108539 (https://phabricator.wikimedia.org/T368874) (owner: Marostegui)
[05:40:41] <marostegui>	 !log Switchover m5 eqiad proxy dbmaint eqiad T368874
[05:40:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:40:43] <stashbot>	 T368874: Productionize dbproxy102[89] - https://phabricator.wikimedia.org/T368874
[05:52:14] <wikibugs>	 (PS1) KartikMistry: Update cxserver to 2025-01-07-045930-production [deployment-charts] - https://gerrit.wikimedia.org/r/1108544 (https://phabricator.wikimedia.org/T377966)
[06:18:28] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:19:45] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10436178 (phaultfinder)
[06:37:49] <wikibugs>	 (PS1) Marostegui: es2025: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1108667 (https://phabricator.wikimedia.org/T383029)
[06:38:13] <wikibugs>	 (CR) Marostegui: [C:+2] es2025: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1108667 (https://phabricator.wikimedia.org/T383029) (owner: Marostegui)
[06:38:28] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:43:28] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:48:22] <wikibugs>	 (PS1) Marostegui: instances.yaml: Remove es2025 from dbctl [puppet] - https://gerrit.wikimedia.org/r/1108668 (https://phabricator.wikimedia.org/T381848)
[06:48:51] <wikibugs>	 (CR) Marostegui: [C:+2] instances.yaml: Remove es2025 from dbctl [puppet] - https://gerrit.wikimedia.org/r/1108668 (https://phabricator.wikimedia.org/T381848) (owner: Marostegui)
[06:49:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove es2025 from dbctl T381848', diff saved to https://phabricator.wikimedia.org/P71820 and previous config saved to /var/cache/conftool/dbconfig/20250107-064958-marostegui.json
[06:50:02] <stashbot>	 T381848: Decommission es202[0-5] - https://phabricator.wikimedia.org/T381848
[06:54:03] <jinxer-wm>	 FIRING: KubernetesCalicoDown: wikikube-worker1243.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-worker1243.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[06:58:28] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250107T0700)
[07:00:05] <jouncebot>	 marostegui, Amir1, and arnaudb: Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250107T0700). Please do the needful.
[07:04:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:09:44] <jinxer-wm>	 FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[07:09:44] <jinxer-wm>	 FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[07:10:21] <icinga-wm>	 PROBLEM - Disk space on ml-lab1001 is CRITICAL: DISK CRITICAL - free space: /srv 12010MiB (3% inode=94%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-lab1001&var-datasource=eqiad+prometheus/ops
[07:22:37] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove obsolete puppetmaster::certmanager class [puppet] - https://gerrit.wikimedia.org/r/1090853 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[07:32:36] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2038-2039].codfw.wmnet
[07:33:49] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2038-2039].codfw.wmnet
[07:37:02] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2039.codfw.wmnet with OS bookworm
[07:37:28] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2039
[07:37:52] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[07:41:34] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2039 - jelto@cumin1002"
[07:42:59] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2039 - jelto@cumin1002"
[07:42:59] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:43:00] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2039.codfw.wmnet 150.32.192.10.in-addr.arpa 0.5.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[07:43:03] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2039.codfw.wmnet 150.32.192.10.in-addr.arpa 0.5.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[07:43:03] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2039
[07:43:33] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2039
[07:43:33] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2039
[07:44:09] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:46:01] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:46:40] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Good to merge now" [puppet] - https://gerrit.wikimedia.org/r/1108034 (owner: Slyngshede)
[07:48:17] <wikibugs>	 (CR) Muehlenhoff: [C:+2] postgresql::master: Use wmflib::debian_postgresql_version [puppet] - https://gerrit.wikimedia.org/r/1105874 (owner: Muehlenhoff)
[07:49:06] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2038.codfw.wmnet with OS bookworm
[07:49:33] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2038
[07:49:49] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[07:53:24] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2038 - jelto@cumin1002"
[07:53:29] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2038 - jelto@cumin1002"
[07:53:29] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:53:29] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2038.codfw.wmnet 147.32.192.10.in-addr.arpa 7.4.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[07:53:32] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2038.codfw.wmnet 147.32.192.10.in-addr.arpa 7.4.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[07:53:33] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2038
[07:53:54] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2038
[07:53:55] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2038
[07:57:28] <wikibugs>	 (PS1) Marostegui: production-m3.sql.erb: Replace dbproxy1020 with dbproxy1028 [puppet] - https://gerrit.wikimedia.org/r/1108706 (https://phabricator.wikimedia.org/T383025)
[07:58:15] <wikibugs>	 (CR) Marostegui: "This is a NOOP" [puppet] - https://gerrit.wikimedia.org/r/1108706 (https://phabricator.wikimedia.org/T383025) (owner: Marostegui)
[07:59:02] <wikibugs>	 (CR) Muehlenhoff: [C:+2] postgresql::slave: Use wmflib::debian_postgresql_version [puppet] - https://gerrit.wikimedia.org/r/1105877 (owner: Muehlenhoff)
[07:59:49] <wikibugs>	 (CR) Marostegui: [C:+2] production-m3.sql.erb: Replace dbproxy1020 with dbproxy1028 [puppet] - https://gerrit.wikimedia.org/r/1108706 (https://phabricator.wikimedia.org/T383025) (owner: Marostegui)
[08:00:04] <jouncebot>	 Amir1, Urbanecm, and awight: Your horoscope predicts another UTC morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250107T0800).
[08:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[08:03:10] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2039.codfw.wmnet with reason: host reimage
[08:06:05] <wikibugs>	 (PS1) Muehlenhoff: postgresql::server: Use wmflib::debian_postgresql_version [puppet] - https://gerrit.wikimedia.org/r/1108707
[08:06:58] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2039.codfw.wmnet with reason: host reimage
[08:09:23] <wikibugs>	 (PS1) Marostegui: mariadb: Remove es2025 [puppet] - https://gerrit.wikimedia.org/r/1108708 (https://phabricator.wikimedia.org/T383029)
[08:09:41] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts es2025.codfw.wmnet
[08:11:23] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Remove es2025 [puppet] - https://gerrit.wikimedia.org/r/1108708 (https://phabricator.wikimedia.org/T383029) (owner: Marostegui)
[08:12:57] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1108707 (owner: Muehlenhoff)
[08:13:32] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2038.codfw.wmnet with reason: host reimage
[08:14:37] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.dns.netbox
[08:15:38] <wikibugs>	 (PS1) Muehlenhoff: postgresql::dirs: Use wmflib::debian_postgresql_version() [puppet] - https://gerrit.wikimedia.org/r/1108710
[08:17:06] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2038.codfw.wmnet with reason: host reimage
[08:19:20] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es2025.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[08:20:19] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es2025.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[08:20:19] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:20:20] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts es2025.codfw.wmnet
[08:25:36] <wikibugs>	 ops-codfw, DBA, DC-Ops, decommission-hardware: decommission es2025.codfw.wmnet - https://phabricator.wikimedia.org/T383029#10436443 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1002 for hosts: `es2025.codfw.wmnet` - es2025.codfw.wmnet (**PASS**)   - Downtimed...
[08:25:42] <wikibugs>	 ops-codfw, DBA, DC-Ops, decommission-hardware: decommission es2025.codfw.wmnet - https://phabricator.wikimedia.org/T383029#10436444 (Marostegui) a:Marostegui→None
[08:25:48] <wikibugs>	 ops-codfw, DBA, DC-Ops, decommission-hardware: decommission es2025.codfw.wmnet - https://phabricator.wikimedia.org/T383029#10436449 (Marostegui) This is ready for #dc-ops
[08:26:56] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2039.codfw.wmnet with OS bookworm
[08:36:10] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1108710 (owner: Muehlenhoff)
[08:37:58] <icinga-wm>	 RECOVERY - Routinator process on rpki2003 is OK: PROCS OK: 1 process with command name routinator https://wikitech.wikimedia.org/wiki/RPKI%23Process
[08:38:11] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Switchover es5 eqiad master dbmaint T382569', diff saved to https://phabricator.wikimedia.org/P71821 and previous config saved to /var/cache/conftool/dbconfig/20250107-083811-marostegui.json
[08:38:14] <stashbot>	 T382569: Productionize es104[1-6] - https://phabricator.wikimedia.org/T382569
[08:38:42] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2038.codfw.wmnet with OS bookworm
[08:39:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1020 T382569', diff saved to https://phabricator.wikimedia.org/P71822 and previous config saved to /var/cache/conftool/dbconfig/20250107-083930-marostegui.json
[08:40:58] <icinga-wm>	 PROBLEM - Routinator process on rpki2003 is CRITICAL: PROCS CRITICAL: 0 processes with command name routinator https://wikitech.wikimedia.org/wiki/RPKI%23Process
[08:41:10] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on es1020.eqiad.wmnet with reason: cloning es1041
[08:41:35] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on es1020.eqiad.wmnet with reason: cloning es1041
[08:42:30] <wikibugs>	 (PS1) Marostegui: mariadb: Productionize es1041 [puppet] - https://gerrit.wikimedia.org/r/1108714 (https://phabricator.wikimedia.org/T382569)
[08:43:10] <jelto>	 !log sudo homer 'lsw1-c7-codfw*' commit 'T377877'
[08:43:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:13] <stashbot>	 T377877: Migrate wikikube-codfw to containerd - https://phabricator.wikimedia.org/T377877
[08:43:40] <wikibugs>	 (03PS2) 10Marostegui: mariadb: Productionize es1041 [puppet] - 10https://gerrit.wikimedia.org/r/1108714 (https://phabricator.wikimedia.org/T382569)
[08:44:05] <jelto>	 !log sudo homer 'lsw1-c1-codfw*' commit 'T377877'
[08:44:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:17] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] mariadb: Productionize es1041 [puppet] - 10https://gerrit.wikimedia.org/r/1108714 (https://phabricator.wikimedia.org/T382569) (owner: 10Marostegui)
[08:44:55] <jelto>	 !log sudo homer 'cr*codfw*' commit 'T377877'
[08:44:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:18] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 184, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:46:50] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] draft: scrub echostore userids [deployment-charts] - 10https://gerrit.wikimedia.org/r/1108106 (https://phabricator.wikimedia.org/T366750) (owner: 10CDanis)
[08:47:06] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] otelcol: drop service-runner healthchecks [deployment-charts] - 10https://gerrit.wikimedia.org/r/1108086 (https://phabricator.wikimedia.org/T366750) (owner: 10CDanis)
[08:47:45] <jelto>	 !log sudo homer 'lsw1-c5-codfw*' commit 'T377877'
[08:47:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:56] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] ripeatlas: remove hardcoded measurements [alerts] - 10https://gerrit.wikimedia.org/r/1105747 (https://phabricator.wikimedia.org/T370506) (owner: 10Tiziano Fogli)
[08:48:36] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2039.codfw.wmnet
[08:48:39] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2039.codfw.wmnet
[08:48:46] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2038.codfw.wmnet
[08:48:48] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2038.codfw.wmnet
[08:53:50] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2036-2037].codfw.wmnet
[08:54:59] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2036-2037].codfw.wmnet
[08:56:57] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2036.codfw.wmnet with OS bookworm
[08:57:16] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2036
[08:57:17] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2036
[08:57:45] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2037.codfw.wmnet with OS bookworm
[08:58:41] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2037
[08:59:02] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[09:01:09] <icinga-wm>	 PROBLEM - BGP status on lsw1-b8-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:02:24] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2037 - jelto@cumin1002"
[09:02:28] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2037 - jelto@cumin1002"
[09:02:28] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:02:28] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2037.codfw.wmnet 146.32.192.10.in-addr.arpa 6.4.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[09:02:31] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2037.codfw.wmnet 146.32.192.10.in-addr.arpa 6.4.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[09:02:32] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2037
[09:02:47] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2037
[09:02:47] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2037
[09:05:19] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:06:49] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.7 point update - https://phabricator.wikimedia.org/T373783#10436559 (10MoritzMuehlenhoff)
[09:08:23] <moritzm>	 !log installing Java 21.0.5 security updates
[09:08:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:17] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch to idp2004 [dns] - 10https://gerrit.wikimedia.org/r/1108716
[09:15:27] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:16:19] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.8 point update - https://phabricator.wikimedia.org/T379600#10436584 (10MoritzMuehlenhoff)
[09:16:40] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.8 point update - https://phabricator.wikimedia.org/T379600#10436585 (10MoritzMuehlenhoff)
[09:16:53] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2036.codfw.wmnet with reason: host reimage
[09:17:16] <moritzm>	 !log installing systemd bugfix updates from Bookworm point release
[09:17:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:35] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:20:39] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2036.codfw.wmnet with reason: host reimage
[09:22:21] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2037.codfw.wmnet with reason: host reimage
[09:25:21] <wikibugs>	 (03CR) 10Slyngshede: [C:03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1108716 (owner: 10Muehlenhoff)
[09:26:02] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2037.codfw.wmnet with reason: host reimage
[09:33:32] <wikibugs>	 06SRE, 10Phabricator, 06Traffic: Phabricator should cache tasks for a few minutes for logged-out users - https://phabricator.wikimedia.org/T274228#10436598 (10kostajh) Hi, wondering if there's interest to move this forward. @DLynch and I have a [[ https://github.com/kemayo/loosephabric | tool that integrates...
[09:33:35] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.8 point update - https://phabricator.wikimedia.org/T379600#10436599 (10MoritzMuehlenhoff)
[09:40:12] <icinga-wm>	 RECOVERY - BGP status on lsw1-b8-codfw.mgmt is OK: BGP OK - up: 16, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:40:18] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2036.codfw.wmnet with OS bookworm
[09:46:16] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2037.codfw.wmnet with OS bookworm
[09:47:16] <wikibugs>	 (03PS1) 10Muehlenhoff: routinator: Bump size of tmpfs by 50% [puppet] - 10https://gerrit.wikimedia.org/r/1108719
[09:47:29] <wikibugs>	 (03PS2) 10Muehlenhoff: routinator: Bump size of tmpfs by 50% [puppet] - 10https://gerrit.wikimedia.org/r/1108719
[09:50:22] <jelto>	 !log sudo homer 'lsw1-c1-codfw*' commit 'T377877'
[09:50:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:25] <stashbot>	 T377877: Migrate wikikube-codfw to containerd - https://phabricator.wikimedia.org/T377877
[09:51:06] <jelto>	 !log sudo homer 'cr*codfw*' commit 'T377877'
[09:51:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:28] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 182, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:01:09] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet
[10:01:12] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2037.codfw.wmnet
[10:01:14] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2037.codfw.wmnet
[10:01:19] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2036.codfw.wmnet
[10:01:22] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2036.codfw.wmnet
[10:02:08] <icinga-wm>	 RECOVERY - Routinator process on rpki2003 is OK: PROCS OK: 1 process with command name routinator https://wikitech.wikimedia.org/wiki/RPKI%23Process
[10:02:20] <icinga-wm>	 RECOVERY - RPKI Validator RTR port on rpki2003 is OK: TCP OK - 0.030 second response time on 10.192.24.3 port 3323 https://wikitech.wikimedia.org/wiki/RPKI%23RPKI_to_router_port
[10:02:46] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2034-2035].codfw.wmnet
[10:03:29] <wikibugs>	 (03Abandoned) 10Muehlenhoff: routinator: Bump size of tmpfs by 50% [puppet] - 10https://gerrit.wikimedia.org/r/1108719 (owner: 10Muehlenhoff)
[10:03:56] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2034-2035].codfw.wmnet
[10:05:06] <icinga-wm>	 RECOVERY - Disk space on rpki2003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=rpki2003&var-datasource=codfw+prometheus/ops
[10:05:15] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet
[10:05:37] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2035.codfw.wmnet with OS bookworm
[10:05:38] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2034.codfw.wmnet with OS bookworm
[10:05:45] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2034
[10:05:46] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2034
[10:05:56] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2035
[10:05:57] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2035
[10:08:28] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:09:14] <icinga-wm>	 PROBLEM - BGP status on lsw1-b8-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:09:30] <icinga-wm>	 PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:12:33] <wikibugs>	 (03CR) 10David Caro: "@fgiunchedi@wikimedia.org do you know if it's possible to tie this alert to the wmcs team?" [puppet] - 10https://gerrit.wikimedia.org/r/1066784 (https://phabricator.wikimedia.org/T373250) (owner: 10David Caro)
[10:14:20] <wikibugs>	 (03PS6) 10Filippo Giunchedi: prometheus: deploy instances from a single configuration [puppet] - 10https://gerrit.wikimedia.org/r/1104980 (https://phabricator.wikimedia.org/T371087)
[10:14:21] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Thank you for the reviews!" [puppet] - 10https://gerrit.wikimedia.org/r/1104980 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi)
[10:14:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:17:49] <wikibugs>	 (03CR) 10Filippo Giunchedi: "We haven't implemented team-level routing for alerts from service::catalog no. The easiest way in this case is to use prometheus::blackbox" [puppet] - 10https://gerrit.wikimedia.org/r/1066784 (https://phabricator.wikimedia.org/T373250) (owner: 10David Caro)
[10:21:45] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Routinator 0.14 causing tempfs file system to fill up - https://phabricator.wikimedia.org/T383116 (10cmooney) 03NEW p:05Triage→03Medium
[10:22:29] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Routinator 0.14 causing tempfs file system to fill up - https://phabricator.wikimedia.org/T383116#10436696 (10cmooney)
[10:23:00] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2034.codfw.wmnet with reason: host reimage
[10:25:31] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2035.codfw.wmnet with reason: host reimage
[10:26:56] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2034.codfw.wmnet with reason: host reimage
[10:29:44] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10436714 (10phaultfinder)
[10:30:54] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2035.codfw.wmnet with reason: host reimage
[10:32:26] <wikibugs>	 (03PS1) 10Urbanecm: [Growth] enwiki: Deploy Add Link to 5% of users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108724 (https://phabricator.wikimedia.org/T382382)
[10:35:17] <wikibugs>	 (03PS2) 10Urbanecm: [Growth] enwiki: Deploy Add Link to 5% of users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108724 (https://phabricator.wikimedia.org/T382382)
[10:40:12] <icinga-wm>	 PROBLEM - ganeti-confd running on ganeti4007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti
[10:40:21] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Routinator 0.14 causing tempfs file system to fill up - https://phabricator.wikimedia.org/T383116#10436738 (10cmooney)
[10:41:44] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:42:15] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service ganeti4005:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:42:30] <icinga-wm>	 PROBLEM - ganeti-noded running on ganeti4007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti
[10:42:39] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:42:53] <wikibugs>	 (03PS1) 10Btullis: Create k8s tokens for airflow-research on dse-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1108725 (https://phabricator.wikimedia.org/T380620)
[10:44:42] <wikibugs>	 (03Abandoned) 10Btullis: Create k8s tokens for airflow-research on dse-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1108725 (https://phabricator.wikimedia.org/T380620) (owner: 10Btullis)
[10:45:36] <icinga-wm>	 RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 68, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:46:44] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2034.codfw.wmnet with OS bookworm
[10:49:12] <icinga-wm>	 PROBLEM - HTTPS Ganeti RAPI ulsfo on ganeti4008 is CRITICAL: connect to address ganeti01.svc.ulsfo.wmnet and port 5080: Connection refused https://www.mediawiki.org/wiki/Ganeti%23RAPI_daemon
[10:49:12] <icinga-wm>	 PROBLEM - ganeti-confd running on ganeti4005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti
[10:49:46] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 269180
[10:49:57] <moritzm>	 ^ I'm extending the cert, adding downtime for ganeti4*
[10:50:16] <icinga-wm>	 RECOVERY - BGP status on lsw1-b8-codfw.mgmt is OK: BGP OK - up: 16, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:50:19] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 269180
[10:50:54] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 3:00:00 on ganeti[4005-4008].ulsfo.wmnet with reason: renew certs
[10:50:55] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2035.codfw.wmnet with OS bookworm
[10:51:16] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ganeti[4005-4008].ulsfo.wmnet with reason: renew certs
[10:51:21] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Ganeti expired certificate errors in ulsfo - https://phabricator.wikimedia.org/T382873#10436765 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=d605aa1f-a578-4f3f-b71b-febdf98df1fb) set by jmm@cumin2002 for 3:00:00 on 4 host(s) and their services with re...
[10:51:58] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2035.codfw.wmnet
[10:52:00] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2035.codfw.wmnet
[10:52:12] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2034.codfw.wmnet
[10:52:14] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2034.codfw.wmnet
[10:52:16] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] "Very nice. Long-term, manually truncating the job name isn't very sustainable (`mercurius-2025-01-06-031821-publish-81-webvideotranscode-5" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105821 (https://phabricator.wikimedia.org/T382630) (owner: 10Scott French)
[10:53:06] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2032-2033].codfw.wmnet
[10:54:03] <jinxer-wm>	 FIRING: KubernetesCalicoDown: wikikube-worker1243.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-worker1243.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[10:55:54] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] mediawiki: add rsyslog container to mercurius [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105800 (https://phabricator.wikimedia.org/T382517) (owner: 10Scott French)
[10:56:54] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2032-2033].codfw.wmnet
[11:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250107T1100)
[11:01:05] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2033.codfw.wmnet with OS bookworm
[11:01:12] <icinga-wm>	 RECOVERY - ganeti-wconfd running on ganeti4008 is OK: PROCS OK: 1 process with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[11:01:14] <icinga-wm>	 RECOVERY - ganeti-confd running on ganeti4005 is OK: PROCS OK: 1 process with UID = 114 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti
[11:01:25] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2033
[11:01:25] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2033
[11:01:38] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2032.codfw.wmnet with OS bookworm
[11:02:04] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2032
[11:02:14] <icinga-wm>	 RECOVERY - HTTPS Ganeti RAPI ulsfo on ganeti4008 is OK: HTTP OK: Status line output matched 401 - 308 bytes in 0.014 second response time https://www.mediawiki.org/wiki/Ganeti%23RAPI_daemon
[11:03:04] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[11:04:18] <icinga-wm>	 RECOVERY - ganeti-confd running on ganeti4007 is OK: PROCS OK: 1 process with UID = 114 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti
[11:04:30] <icinga-wm>	 RECOVERY - ganeti-noded running on ganeti4007 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti
[11:05:36] <icinga-wm>	 PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:05:49] <wikibugs>	 (03PS2) 10David Caro: alerts: add toolsadmin probe [puppet] - 10https://gerrit.wikimedia.org/r/1066784 (https://phabricator.wikimedia.org/T373250)
[11:06:19] <wikibugs>	 (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1108726
[11:06:24] <wikibugs>	 (03CR) 10CI reject: [V:04-1] alerts: add toolsadmin probe [puppet] - 10https://gerrit.wikimedia.org/r/1066784 (https://phabricator.wikimedia.org/T373250) (owner: 10David Caro)
[11:07:55] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2032 - jelto@cumin1002"
[11:07:59] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2032 - jelto@cumin1002"
[11:07:59] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:07:59] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2032.codfw.wmnet 215.32.192.10.in-addr.arpa 5.1.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[11:08:02] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2032.codfw.wmnet 215.32.192.10.in-addr.arpa 5.1.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[11:08:03] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2032
[11:08:21] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2032
[11:08:21] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2032
[11:09:28] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:09:31] <wikibugs>	 (03PS3) 10David Caro: alerts: add toolsadmin probe [puppet] - 10https://gerrit.wikimedia.org/r/1066784 (https://phabricator.wikimedia.org/T373250)
[11:09:44] <jinxer-wm>	 FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[11:09:44] <jinxer-wm>	 FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[11:11:38] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:13:37] <logmsgbot>	 !log hashar@deploy2002 Started deploy [integration/docroot@5d32766]: Remove Flow from being listed on doc.wikimedia.org frontpage - T379671
[11:13:40] <stashbot>	 T379671: Remove Flow from doc.wikimedia.org - https://phabricator.wikimedia.org/T379671
[11:13:49] <logmsgbot>	 !log hashar@deploy2002 Finished deploy [integration/docroot@5d32766]: Remove Flow from being listed on doc.wikimedia.org frontpage - T379671 (duration: 00m 11s)
[11:18:04] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] ripeatlas: remove hardcoded measurements [alerts] - 10https://gerrit.wikimedia.org/r/1105747 (https://phabricator.wikimedia.org/T370506) (owner: 10Tiziano Fogli)
[11:18:46] <wikibugs>	 (03CR) 10Sergio Gimeno: [C:03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108724 (https://phabricator.wikimedia.org/T382382) (owner: 10Urbanecm)
[11:18:53] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2033.codfw.wmnet with reason: host reimage
[11:19:37] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10436828 (10phaultfinder)
[11:19:43] <wikibugs>	 (03Merged) 10jenkins-bot: ripeatlas: remove hardcoded measurements [alerts] - 10https://gerrit.wikimedia.org/r/1105747 (https://phabricator.wikimedia.org/T370506) (owner: 10Tiziano Fogli)
[11:22:19] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2033.codfw.wmnet with reason: host reimage
[11:25:11] <moritzm>	 !log refreshed Ganeti internal cert for ulsfo (after adding a manual temp cert to unblock itself) T382873
[11:25:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:25:14] <stashbot>	 T382873: Ganeti expired certificate errors in ulsfo - https://phabricator.wikimedia.org/T382873
[11:26:30] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2032.codfw.wmnet with reason: host reimage
[11:28:28] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:30:07] <icinga-wm>	 PROBLEM - Host wikikube-worker2032 is DOWN: PING CRITICAL - Packet loss = 100%
[11:32:30] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2032.codfw.wmnet with reason: host reimage
[11:35:10] <icinga-wm>	 RECOVERY - Host wikikube-worker2032 is UP: PING OK - Packet loss = 0%, RTA = 33.36 ms
[11:41:30] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2033.codfw.wmnet with OS bookworm
[11:45:24] <icinga-wm>	 RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 68, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:46:19] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'.
[11:46:23] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'.
[11:49:44] <jinxer-wm>	 FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[11:49:44] <jinxer-wm>	 FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[11:52:09] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Remove dbproxy1020 [puppet] - 10https://gerrit.wikimedia.org/r/1108730 (https://phabricator.wikimedia.org/T383025)
[11:52:52] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2032.codfw.wmnet with OS bookworm
[11:52:57] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts dbproxy1020.eqiad.wmnet
[11:53:14] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] mariadb: Remove dbproxy1020 [puppet] - 10https://gerrit.wikimedia.org/r/1108730 (https://phabricator.wikimedia.org/T383025) (owner: 10Marostegui)
[11:54:44] <jinxer-wm>	 RESOLVED: [2x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[11:54:44] <jinxer-wm>	 RESOLVED: [2x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[11:55:43] <jelto>	 !log sudo homer 'lsw1-c6-codfw*' commit 'T377877'
[11:55:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:55:46] <stashbot>	 T377877: Migrate wikikube-codfw to containerd - https://phabricator.wikimedia.org/T377877
[11:56:16] <jelto>	 !log sudo homer 'cr*codfw*' commit 'T377877'
[11:56:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:56:47] <wikibugs>	 (PS1) Marostegui: report_users.sh: Remove 10.64.32.179 [software] - https://gerrit.wikimedia.org/r/1108731 (https://phabricator.wikimedia.org/T383025)
[11:57:48] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 180, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:57:54] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2033.codfw.wmnet
[11:57:56] <wikibugs>	 (CR) Marostegui: [C:+2] report_users.sh: Remove 10.64.32.179 [software] - https://gerrit.wikimedia.org/r/1108731 (https://phabricator.wikimedia.org/T383025) (owner: Marostegui)
[11:57:56] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2033.codfw.wmnet
[11:58:02] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2032.codfw.wmnet
[11:58:05] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2032.codfw.wmnet
[11:58:50] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.dns.netbox
[11:59:26] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2030-2031].codfw.wmnet
[11:59:56] <wikibugs>	 SRE, Infrastructure-Foundations: Ganeti expired certificate errors in ulsfo - https://phabricator.wikimedia.org/T382873#10436916 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff I created a temporary /var/lib/ganeti/server.pem certificate to unblock gnt-cluster (following https://wik...
[12:00:36] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2030-2031].codfw.wmnet
[12:01:42] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2031.codfw.wmnet with OS bookworm
[12:01:43] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2030.codfw.wmnet with OS bookworm
[12:02:02] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2031
[12:02:02] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2031
[12:02:03] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2030
[12:02:03] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2030
[12:02:14] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dbproxy1020.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[12:03:24] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dbproxy1020.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[12:03:24] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:03:25] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts dbproxy1020.eqiad.wmnet
[12:03:55] <wikibugs>	 ops-eqiad, DBA, DC-Ops, decommission-hardware: decommission dbproxy1020.eqiad.wmnet - https://phabricator.wikimedia.org/T383025#10436928 (Marostegui) a:Marostegui→None
[12:04:01] <wikibugs>	 ops-eqiad, DBA, DC-Ops, decommission-hardware: decommission dbproxy1020.eqiad.wmnet - https://phabricator.wikimedia.org/T383025#10436933 (Marostegui) This is ready for #dc-ops
[12:04:41] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10436936 (phaultfinder)
[12:05:26] <icinga-wm>	 PROBLEM - BGP status on lsw1-a6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:09:35] <wikibugs>	 (PS1) Marostegui: site.pp: Remove dbproxy1020 [puppet] - https://gerrit.wikimedia.org/r/1108735
[12:15:45] <wikibugs>	 (PS1) Hnowlan: changeprop-jobqueue: remove support for video transcoding [deployment-charts] - https://gerrit.wikimedia.org/r/1108737 (https://phabricator.wikimedia.org/T355292)
[12:15:46] <wikibugs>	 (CR) Marostegui: [C:+2] site.pp: Remove dbproxy1020 [puppet] - https://gerrit.wikimedia.org/r/1108735 (owner: Marostegui)
[12:19:52] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2031.codfw.wmnet with reason: host reimage
[12:20:17] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2030.codfw.wmnet with reason: host reimage
[12:23:23] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2031.codfw.wmnet with reason: host reimage
[12:23:41] <wikibugs>	 (CR) Michael Große: [C:+1] "followed the logic and as far as I can tell, this should work as intended" [mediawiki-config] - https://gerrit.wikimedia.org/r/1108724 (https://phabricator.wikimedia.org/T382382) (owner: Urbanecm)
[12:26:41] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2030.codfw.wmnet with reason: host reimage
[12:29:36] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10436980 (phaultfinder)
[12:31:13] <wikibugs>	 (PS1) Marostegui: check_private_data_report: Update list of emails [puppet] - https://gerrit.wikimedia.org/r/1108740
[12:34:25] <wikibugs>	 (PS2) Marostegui: check_private_data_report: Update list of emails [puppet] - https://gerrit.wikimedia.org/r/1108740
[12:36:00] <wikibugs>	 (CR) Marostegui: [C:+2] check_private_data_report: Update list of emails [puppet] - https://gerrit.wikimedia.org/r/1108740 (owner: Marostegui)
[12:42:02] <wikibugs>	 (PS1) Muehlenhoff: postgresql::user: Use wmflib::debian_postgresql_version [puppet] - https://gerrit.wikimedia.org/r/1108741
[12:42:22] <wikibugs>	 (CR) CI reject: [V:-1] postgresql::user: Use wmflib::debian_postgresql_version [puppet] - https://gerrit.wikimedia.org/r/1108741 (owner: Muehlenhoff)
[12:44:22] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2031.codfw.wmnet with OS bookworm
[12:45:32] <wikibugs>	 (PS2) Muehlenhoff: postgresql::user: Use wmflib::debian_postgresql_version [puppet] - https://gerrit.wikimedia.org/r/1108741
[12:46:35] <icinga-wm>	 RECOVERY - BGP status on lsw1-a6-codfw.mgmt is OK: BGP OK - up: 40, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:46:50] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.k8s.roll-reimage-nodes rolling reimage on P{wikikube-worker[1253-1255].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad)
[12:47:21] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2030.codfw.wmnet with OS bookworm
[12:47:40] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.k8s.roll-reimage-nodes rolling reimage on P{wikikube-worker[1264-1266].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad)
[12:48:33] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1253.eqiad.wmnet with OS bookworm
[12:49:23] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1264.eqiad.wmnet with OS bookworm
[12:50:27] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250107T1300)
[13:00:56] <wikibugs>	 (CR) Btullis: mediawiki: Add support for dumps persistent _volumes (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1104605 (https://phabricator.wikimedia.org/T352650) (owner: Giuseppe Lavagetto)
[13:01:22] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1108741 (owner: Muehlenhoff)
[13:05:10] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] alerts: add toolsadmin probe [puppet] - https://gerrit.wikimedia.org/r/1066784 (https://phabricator.wikimedia.org/T373250) (owner: David Caro)
[13:08:52] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1253.eqiad.wmnet with reason: host reimage
[13:09:49] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1264.eqiad.wmnet with reason: host reimage
[13:12:24] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1253.eqiad.wmnet with reason: host reimage
[13:14:37] <wikibugs>	 (PS1) Muehlenhoff: osm: On Bookworm create OSM users using system::sysuser [puppet] - https://gerrit.wikimedia.org/r/1108745 (https://phabricator.wikimedia.org/T381565)
[13:15:08] <wikibugs>	 (PS2) Muehlenhoff: osm: On Bookworm create OSM users using system::sysuser [puppet] - https://gerrit.wikimedia.org/r/1108745 (https://phabricator.wikimedia.org/T381565)
[13:15:59] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1264.eqiad.wmnet with reason: host reimage
[13:16:50] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1108745 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[13:17:46] <wikibugs>	 (PS2) Klausman: hiera: drop rocm installs from k8s nodes [puppet] - https://gerrit.wikimedia.org/r/1108744
[13:19:41] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10437068 (phaultfinder)
[13:20:09] <wikibugs>	 (CR) Btullis: [C:+1] "Cool, thanks." [puppet] - https://gerrit.wikimedia.org/r/1108744 (owner: Klausman)
[13:21:48] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1108745 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[13:22:19] <wikibugs>	 (CR) Klausman: [V:+2 C:+2] hiera: drop rocm installs from k8s nodes [puppet] - https://gerrit.wikimedia.org/r/1108744 (owner: Klausman)
[13:29:42] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10437112 (phaultfinder)
[13:30:17] <Amir1>	 !log running mwscript purgeParserCache.php --wiki=aawiki --tag pc5 --age=2592000 --msleep 200 in *eqiad* (T382948)
[13:30:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:20] <stashbot>	 T382948: ParserCache is not deleting old rows after three months past the expiry in the secondary datacenter - https://phabricator.wikimedia.org/T382948
[13:30:46] <moritzm>	 !log install gst-plugins-base1.0 security updates
[13:30:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:31:59] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1253.eqiad.wmnet with OS bookworm
[13:32:27] <elukey>	 !log disable puppet fleetwide to allow maintenance on puppetdb2003
[13:32:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:33:13] <wikibugs>	 (PS1) Filippo Giunchedi: prometheus: migrate ops instance to prometheus::instances [puppet] - https://gerrit.wikimedia.org/r/1108746 (https://phabricator.wikimedia.org/T371087)
[13:33:31] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1254.eqiad.wmnet with OS bookworm
[13:35:31] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1264.eqiad.wmnet with OS bookworm
[13:35:41] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on puppetdb2003.codfw.wmnet with reason: Resync postgres
[13:35:45] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on puppetdb2003.codfw.wmnet with reason: Resync postgres
[13:35:49] <wikibugs>	 (CR) Filippo Giunchedi: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4735/co" [puppet] - https://gerrit.wikimedia.org/r/1108746 (https://phabricator.wikimedia.org/T371087) (owner: Filippo Giunchedi)
[13:37:15] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:39:20] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:39:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:39:51] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1265.eqiad.wmnet with OS bookworm
[13:39:54] <elukey>	 !log stop puppetdb and postgres on puppetdb2003 - T383114
[13:39:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:39:57] <stashbot>	 T383114: Postgres Replication broken for Puppetdb - https://phabricator.wikimedia.org/T383114
[13:41:05] <wikibugs>	 (PS2) Filippo Giunchedi: prometheus: migrate ops instance to prometheus::instances [puppet] - https://gerrit.wikimedia.org/r/1108746 (https://phabricator.wikimedia.org/T371087)
[13:41:47] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.postgresql.postgres-init
[13:43:21] <wikibugs>	 (CR) Filippo Giunchedi: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4736/co" [puppet] - https://gerrit.wikimedia.org/r/1108746 (https://phabricator.wikimedia.org/T371087) (owner: Filippo Giunchedi)
[13:45:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:45:48] <wikibugs>	 (PS3) Filippo Giunchedi: prometheus: migrate ops instance to prometheus::instances [puppet] - https://gerrit.wikimedia.org/r/1108746 (https://phabricator.wikimedia.org/T371087)
[13:48:04] <wikibugs>	 (CR) Filippo Giunchedi: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4737/co" [puppet] - https://gerrit.wikimedia.org/r/1108746 (https://phabricator.wikimedia.org/T371087) (owner: Filippo Giunchedi)
[13:48:35] <wikibugs>	 (CR) David Caro: [C:+2] alerts: add toolsadmin probe [puppet] - https://gerrit.wikimedia.org/r/1066784 (https://phabricator.wikimedia.org/T373250) (owner: David Caro)
[13:49:37] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10437172 (phaultfinder)
[13:52:13] <logmsgbot>	 !log elukey@cumin1002 END (FAIL) - Cookbook sre.postgresql.postgres-init (exit_code=99)
[13:52:16] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 158 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[13:53:57] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1254.eqiad.wmnet with reason: host reimage
[13:54:01] <logmsgbot>	 !log jayme@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on wikikube-worker1254.eqiad.wmnet with reason: host reimage
[13:54:03] <jinxer-wm>	 FIRING: [2x] KubernetesCalicoDown: wikikube-worker1243.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[13:54:19] <wikibugs>	 (CR) David Caro: [C:+2] "thanks!" [puppet] - https://gerrit.wikimedia.org/r/1066784 (https://phabricator.wikimedia.org/T373250) (owner: David Caro)
[13:54:52] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s2 on db2197 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table recentchanges is corrupt: try to repair it on query. Default database: nowiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:55:58] <marostegui>	 ^ fixing
[13:57:09] <elukey>	 !log start postgres and puppetdb on puppetdb2003
[13:57:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:57:52] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: s2 on db2197 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:59:43] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1265.eqiad.wmnet with reason: host reimage
[13:59:46] <logmsgbot>	 !log jayme@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on wikikube-worker1265.eqiad.wmnet with reason: host reimage
[14:00:04] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: OwO what's this, a deployment window?? UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250107T1400). nyaa~
[14:00:05] <jouncebot>	 ihurbain and ZhaoFJx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:15] <ihurbain>	 o/
[14:00:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:01:18] <ihurbain>	 i need a deployer because a/ i don't have +2 rights b/ i've never done scap (although i PROBABLY have the rights for that) and i don't have the brain cellz to learn how to do that now ^^;
[14:02:08] <ZhaoFJx>	 I need a deployer too for my two patches
[14:02:34] <elukey>	 !log re-enable puppet fleetwide after maintenance
[14:02:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:04:03] <jinxer-wm>	 FIRING: [3x] KubernetesCalicoDown: wikikube-worker1243.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:04:23] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10437208 (Jhancock.wm) is it still safe to work on this machine today?
[14:08:03] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2030.codfw.wmnet
[14:08:05] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2030.codfw.wmnet
[14:08:12] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2031.codfw.wmnet
[14:08:14] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2031.codfw.wmnet
[14:08:15] <Amir1>	 ihurbain: let me help, fwiw, if you can ssh into deploy2002, it's just "scap backport #gerritpatchid" on a screen
[14:08:52] <ihurbain>	 you forgot the step "and scream if something go wrong" :D
[14:09:56] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1100850 (https://phabricator.wikimedia.org/T373454) (owner: Isabelle Hurbain-Palatin)
[14:10:27] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:10:38] <wikibugs>	 (Merged) jenkins-bot: Reactivate Parsoid+Kartographer on hewiki and commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1100850 (https://phabricator.wikimedia.org/T373454) (owner: Isabelle Hurbain-Palatin)
[14:10:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1254:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1254 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[14:10:53] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2028-2029].codfw.wmnet
[14:11:16] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1001.eqiad.wmnet
[14:11:43] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1100850|Reactivate Parsoid+Kartographer on hewiki and commonswiki (T373454 T373460)]]
[14:11:48] <stashbot>	 T373454: [warn/kartographer] Could not add tracking category kartographer-tracking-category - https://phabricator.wikimedia.org/T373454
[14:11:48] <stashbot>	 T373460: Wikimedia\Assert\InvariantException: Invariant failed: Bad UTF-8 at end of string (2 byte sequence) - https://phabricator.wikimedia.org/T373460
[14:12:02] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2028-2029].codfw.wmnet
[14:13:00] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2028.codfw.wmnet with OS bookworm
[14:13:01] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2029.codfw.wmnet with OS bookworm
[14:13:02] <ihurbain>	 *raises eyebrow* looking at gerrit log, is it correct to understand that if one can start scap backport, one technically doesn't need +2 rights on mediawiki-config to run a backport?
[14:13:09] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2028
[14:13:09] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2028
[14:13:18] <icinga-wm>	 RECOVERY - Backup freshness on backup1001 is OK: Fresh: 137 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[14:13:20] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2029
[14:13:20] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2029
[14:13:39] <urbanecm>	 ihurbain: yep, scap will +2 on your behalf
[14:13:45] <ihurbain>	 oh neat
[14:14:03] <jinxer-wm>	 FIRING: [3x] KubernetesCalicoDown: wikikube-worker1243.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:14:19] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1254.eqiad.wmnet with OS bookworm
[14:14:36] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10437248 (phaultfinder)
[14:15:27] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:15:40] <jinxer-wm>	 RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1254:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1254 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[14:16:06] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1255.eqiad.wmnet with OS bookworm
[14:16:10] <jinxer-wm>	 FIRING: [2x] KubernetesRsyslogDown: rsyslog on wikikube-worker1254:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[14:16:38] <icinga-wm>	 PROBLEM - BGP status on lsw1-a6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:17:07] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1001.eqiad.wmnet
[14:17:08] <icinga-wm>	 PROBLEM - BGP status on lsw1-b6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:18:57] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1265.eqiad.wmnet with OS bookworm
[14:19:03] <jinxer-wm>	 FIRING: [3x] KubernetesCalicoDown: wikikube-worker1243.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:19:54] <Lucas_WMDE>	 o_
[14:19:56] <Lucas_WMDE>	 * o/
[14:20:03] <Lucas_WMDE>	 I totally missed the beginning of the window, sorry ^^
[14:20:24] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup, ihurbain: Backport for [[gerrit:1100850|Reactivate Parsoid+Kartographer on hewiki and commonswiki (T373454 T373460)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:20:28] <stashbot>	 T373454: [warn/kartographer] Could not add tracking category kartographer-tracking-category - https://phabricator.wikimedia.org/T373454
[14:20:29] <stashbot>	 T373460: Wikimedia\Assert\InvariantException: Invariant failed: Bad UTF-8 at end of string (2 byte sequence) - https://phabricator.wikimedia.org/T373460
[14:20:34] <ihurbain>	 aha
[14:20:37] <Amir1>	 ihurbain: it's live in testwiki
[14:20:40] <Amir1>	 test and let me know
[14:20:43] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1266.eqiad.wmnet with OS bookworm
[14:20:44] <ihurbain>	 having a look.
[14:20:55] <jinxer-wm>	 RESOLVED: [2x] KubernetesRsyslogDown: rsyslog on wikikube-worker1254:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[14:20:56] <Amir1>	 Lucas_WMDE: I probably have to go after this deploy, feel free to take over after me
[14:21:24] <dbrant>	 (I'm happy to help too)
[14:21:32] <ZhaoFJx>	 Lucas_WMDE could you take a look for gerrit change 1108459 and 1108525? thanks
[14:24:30] <Lucas_WMDE>	 sure
[14:25:11] <Amir1>	 ihurbain: sorry for stupid mistake, it's not live in testwiki, it's live in mwdebug
[14:25:13] <wikibugs>	 (CR) Ssingh: varnish: Hide X-Client-IP on error page by default (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1108485 (https://phabricator.wikimedia.org/T383062) (owner: BCornwall)
[14:25:23] <ihurbain>	 Amir1: i gathered :D
[14:26:23] <Amir1>	 :D
[14:26:47] <ihurbain>	 Amir1: ship it, it doesn't seem to puke in the logs anymore, i think we're good.
[14:26:58] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup, ihurbain: Continuing with sync
[14:27:12] <ihurbain>	 thank you!!
[14:27:13] <Amir1>	 not puking is good. Shipping.
[14:27:20] <ihurbain>	 :D
[14:28:00] <Lucas_WMDE>	 isn’t puking common on ships
[14:28:31] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C:+1] cswikivoyage: Change the wordmark v2 [mediawiki-config] - https://gerrit.wikimedia.org/r/1108525 (https://phabricator.wikimedia.org/T382779) (owner: ZhaoFJx)
[14:29:22] <ihurbain>	 a valid point
[14:29:38] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10437311 (phaultfinder)
[14:31:01] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2028.codfw.wmnet with reason: host reimage
[14:31:20] <ZhaoFJx>	 Lucas_WMDE sorry for unclear, I'd like to deploy these two patches today in UTC afternoon backport window, which is now
[14:31:20] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2029.codfw.wmnet with reason: host reimage
[14:31:32] <ZhaoFJx>	 and thank you for +1 too
[14:31:54] <wikibugs>	 (CR) Elukey: [C:+2] charts: improve Kartotherian's statsd config (part 2) [deployment-charts] - https://gerrit.wikimedia.org/r/1105296 (https://phabricator.wikimedia.org/T382408) (owner: Elukey)
[14:33:55] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync
[14:34:11] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync
[14:34:52] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2028.codfw.wmnet with reason: host reimage
[14:36:09] <Lucas_WMDE>	 ZhaoFJx: you want to deploy yourself?
[14:36:12] <Lucas_WMDE>	 or you want me to deploy them?
[14:36:30] <Lucas_WMDE>	 (either way I believe we’re still waiting for Amir1  / ihurbain to finish first)
[14:36:40] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1100850|Reactivate Parsoid+Kartographer on hewiki and commonswiki (T373454 T373460)]] (duration: 24m 56s)
[14:36:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:36:44] <stashbot>	 T373454: [warn/kartographer] Could not add tracking category kartographer-tracking-category - https://phabricator.wikimedia.org/T373454
[14:36:45] <stashbot>	 T373460: Wikimedia\Assert\InvariantException: Invariant failed: Bad UTF-8 at end of string (2 byte sequence) - https://phabricator.wikimedia.org/T373460
[14:36:50] <ihurbain>	 successful invocation
[14:37:30] <ZhaoFJx>	 Lucas_WMDE I don’t have production access, so maybe you can help me to deploy them
[14:37:43] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2029.codfw.wmnet with reason: host reimage
[14:38:01] <ZhaoFJx>	 And take your time, no rush
[14:38:39] <Lucas_WMDE>	 ok, then let’s…
[14:38:43] <Lucas_WMDE>	 well
[14:39:03] <Lucas_WMDE>	 …wait for them to come back I guess
[14:40:01] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10437344 (MatthewVernon) Yes, thanks, please go ahead.
[14:40:20] <Lucas_WMDE>	 wb ZhaoFJx  ^^
[14:40:24] <Lucas_WMDE>	 let’s start with that then
[14:40:36] <Lucas_WMDE>	 I think we can deploy both of those changes at once
[14:40:47] <ZhaoFJx>	 Nice
[14:41:09] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1108459 (https://phabricator.wikimedia.org/T367306) (owner: ZhaoFJx)
[14:41:09] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1108525 (https://phabricator.wikimedia.org/T382779) (owner: ZhaoFJx)
[14:41:55] <wikibugs>	 (Merged) jenkins-bot: Enable AutoModerator on zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1108459 (https://phabricator.wikimedia.org/T367306) (owner: ZhaoFJx)
[14:41:58] <wikibugs>	 (Merged) jenkins-bot: cswikivoyage: Change the wordmark v2 [mediawiki-config] - https://gerrit.wikimedia.org/r/1108525 (https://phabricator.wikimedia.org/T382779) (owner: ZhaoFJx)
[14:42:28] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1108459|Enable AutoModerator on zhwiki (T367306)]], [[gerrit:1108525|cswikivoyage: Change the wordmark v2 (T382779)]]
[14:42:33] <stashbot>	 T367306: Enable AutoModerator on zh.wiki - https://phabricator.wikimedia.org/T367306
[14:42:34] <stashbot>	 T382779: Change wordmark of Czech Wikivoyage - https://phabricator.wikimedia.org/T382779
[14:49:13] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 zhaofjx, lucaswerkmeister-wmde: Backport for [[gerrit:1108459|Enable AutoModerator on zhwiki (T367306)]], [[gerrit:1108525|cswikivoyage: Change the wordmark v2 (T382779)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:49:17] <stashbot>	 T367306: Enable AutoModerator on zh.wiki - https://phabricator.wikimedia.org/T367306
[14:49:18] <stashbot>	 T382779: Change wordmark of Czech Wikivoyage - https://phabricator.wikimedia.org/T382779
[14:49:32] <Lucas_WMDE>	 ZhaoFJx: can you test the two changes using WikimediaDebug?
[14:51:31] <moritzm>	 !log installing intel-microcode security updates
[14:51:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:54:32] <ZhaoFJx>	 Lucas_WMDE of course, just one second
[14:54:38] <Lucas_WMDE>	 ok!
[14:54:46] <icinga-wm>	 RECOVERY - BGP status on lsw1-a6-codfw.mgmt is OK: BGP OK - up: 40, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:55:10] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2028.codfw.wmnet with OS bookworm
[14:56:00] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:56:32] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:56:40] <wikibugs>	 (CR) Herron: [C:+1] prometheus: deploy instances from a single configuration (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1104980 (https://phabricator.wikimedia.org/T371087) (owner: Filippo Giunchedi)
[14:57:36] <icinga-wm>	 RECOVERY - BGP status on lsw1-b6-codfw.mgmt is OK: BGP OK - up: 34, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:58:00] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2029.codfw.wmnet with OS bookworm
[15:02:30] <Lucas_WMDE>	 the cswiki logo change looks alright to me
[15:02:37] <Lucas_WMDE>	 not sure how to find out whether AutoModerator is enabled or not tbh
[15:03:26] <Lucas_WMDE>	 ok, I can see it on https://zh.wikipedia.org/w/index.php?title=Special:%E7%89%88%E6%9C%AC&uselang=en at least (Special:Version)
[15:04:24] <wikibugs>	 (CR) Herron: [C:+1] otelcol: drop service-runner healthchecks [deployment-charts] - https://gerrit.wikimedia.org/r/1108086 (https://phabricator.wikimedia.org/T366750) (owner: CDanis)
[15:04:27] <Lucas_WMDE>	 looks like it doesn’t come with any API modules (or hook into them)
[15:04:34] <ZhaoFJx>	 Yep
[15:04:39] <ZhaoFJx>	 Just some config change
[15:04:58] <ZhaoFJx>	 https://zh.wikipedia.org/wiki/Special:CommunityConfiguration/AutoModerator
[15:05:06] <Lucas_WMDE>	 yeah, I just found that too ^^
[15:05:19] <Lucas_WMDE>	 ok that looks reasonable
[15:05:27] <Lucas_WMDE>	 all good then? or are you still testing?
[15:05:28] <wikibugs>	 (CR) Filippo Giunchedi: [V:+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4738/co" [puppet] - https://gerrit.wikimedia.org/r/1104980 (https://phabricator.wikimedia.org/T371087) (owner: Filippo Giunchedi)
[15:05:31] <ZhaoFJx>	 Ig
[15:05:35] <ZhaoFJx>	 I guess all good
[15:05:40] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 zhaofjx, lucaswerkmeister-wmde: Continuing with sync
[15:05:42] <Lucas_WMDE>	 ok!
[15:05:50] <ZhaoFJx>	 Thanks a lot!
[15:05:53] <ZhaoFJx>	 Have a good one
[15:06:08] <wikibugs>	 (CR) Herron: [C:+1] draft: scrub echostore userids [deployment-charts] - https://gerrit.wikimedia.org/r/1108106 (https://phabricator.wikimedia.org/T366750) (owner: CDanis)
[15:06:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:07:20] <wikibugs>	 (PS1) Klausman: Revert "hiera: drop rocm installs from k8s nodes" [puppet] - https://gerrit.wikimedia.org/r/1108761
[15:07:30] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2029.codfw.wmnet
[15:07:33] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2029.codfw.wmnet
[15:07:39] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2028.codfw.wmnet
[15:07:41] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2028.codfw.wmnet
[15:08:08] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1255.eqiad.wmnet with reason: host reimage
[15:08:44] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2026-2027].codfw.wmnet
[15:08:50] <wikibugs>	 (CR) Btullis: [C:+1] "Oh OK." [puppet] - https://gerrit.wikimedia.org/r/1108761 (owner: Klausman)
[15:09:38] <wikibugs>	 (CR) Klausman: [C:+2] Revert "hiera: drop rocm installs from k8s nodes" [puppet] - https://gerrit.wikimedia.org/r/1108761 (owner: Klausman)
[15:09:54] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2026-2027].codfw.wmnet
[15:11:23] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2027.codfw.wmnet with OS bookworm
[15:11:24] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2026.codfw.wmnet with OS bookworm
[15:11:42] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1255.eqiad.wmnet with reason: host reimage
[15:11:53] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2027
[15:11:53] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2027
[15:11:53] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2026
[15:11:53] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2026
[15:11:59] <wikibugs>	 (PS1) Audrey Penven: shellbox: release image 2025-01-07-141744 [deployment-charts] - https://gerrit.wikimedia.org/r/1108762 (https://phabricator.wikimedia.org/T380751)
[15:13:07] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1108459|Enable AutoModerator on zhwiki (T367306)]], [[gerrit:1108525|cswikivoyage: Change the wordmark v2 (T382779)]] (duration: 30m 38s)
[15:13:12] <stashbot>	 T367306: Enable AutoModerator on zh.wiki - https://phabricator.wikimedia.org/T367306
[15:13:12] <stashbot>	 T382779: Change wordmark of Czech Wikivoyage - https://phabricator.wikimedia.org/T382779
[15:13:25] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[15:13:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:13:34] <Lucas_WMDE>	 I might also deploy some shellbox updates soon if that’s okay with everyone
[15:13:41] <Lucas_WMDE>	 jouncebot: now
[15:13:41] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 46 minute(s)
[15:13:47] <Lucas_WMDE>	 since there’s a bit of a break before the next window
[15:14:32] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C:+1] "LGTM, I’ll try to deploy this soon assuming nobody objects" [deployment-charts] - https://gerrit.wikimedia.org/r/1108762 (https://phabricator.wikimedia.org/T380751) (owner: Audrey Penven)
[15:14:35] <Lucas_WMDE>	 ^
[15:15:47] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service ml-serve-ctrl2001:6443 has failed probes (http_ml_serve_codfw_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ml-serve-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:15:47] <icinga-wm>	 PROBLEM - BGP status on lsw1-a6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:15:49] <icinga-wm>	 PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:16:22] <jynus>	 A network issue?
[15:16:28] <jynus>	 or something else?
[15:16:42] * Lucas_WMDE is not doing anything rn
[15:16:55] <jynus>	 yeah, please wait until it is clear
[15:17:08] <Lucas_WMDE>	 ok
[15:17:34] <jynus>	 a few probes failed on codfw
[15:17:55] <jynus>	 but it is 2 different switches
[15:17:58] <vgutierrez>	 https://grafana.wikimedia.org/goto/iTdUtWDNR?orgId=1
[15:18:07] <vgutierrez>	 something is saturating the network interface on ml-serve instances
[15:18:15] <wikibugs>	 (CR) Elukey: [C:+1] osm: On Bookworm create OSM users using system::sysuser [puppet] - https://gerrit.wikimedia.org/r/1108745 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[15:18:40] <wikibugs>	 (CR) Elukey: [C:+1] postgresql::server: Use wmflib::debian_postgresql_version [puppet] - https://gerrit.wikimedia.org/r/1108707 (owner: Muehlenhoff)
[15:18:55] <wikibugs>	 (CR) Elukey: [C:+1] postgresql::user: Use wmflib::debian_postgresql_version [puppet] - https://gerrit.wikimedia.org/r/1108741 (owner: Muehlenhoff)
[15:19:13] <wikibugs>	 (CR) Elukey: [C:+1] postgresql::dirs: Use wmflib::debian_postgresql_version() [puppet] - https://gerrit.wikimedia.org/r/1108710 (owner: Muehlenhoff)
[15:19:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:20:00] <wikibugs>	 (CR) Elukey: [C:+1] api: allow to abort before run() (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/1105351 (https://phabricator.wikimedia.org/T365454) (owner: Volans)
[15:20:47] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service ml-serve-ctrl2001:6443 has failed probes (http_ml_serve_codfw_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ml-serve-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:20:50] <wikibugs>	 (CR) Muehlenhoff: [C:+2] osm: On Bookworm create OSM users using system::sysuser [puppet] - https://gerrit.wikimedia.org/r/1108745 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[15:21:00] <jynus>	 yeah, I was seeing latencies going down
[15:21:34] <jynus>	 maybe oncall can keep an eye on ml hosts on codfw and file a ticket if it happens again
[15:22:19] <jynus>	 Lucas_WMDE: all good for me, seems it was a service overload
[15:24:16] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: sync
[15:24:19] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: sync
[15:24:33] <Lucas_WMDE>	 ok thanks
[15:24:40] <wikibugs>	 (CR) Alexandros Kosiaris: [C:+1] changeprop-jobqueue: remove support for video transcoding [deployment-charts] - https://gerrit.wikimedia.org/r/1108737 (https://phabricator.wikimedia.org/T355292) (owner: Hnowlan)
[15:26:05] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1266.eqiad.wmnet with reason: host reimage
[15:26:45] <ZhaoFJx>	 Lucas_WMDE https://zh.wikipedia.org/wiki/Special:CommunityConfiguration/AutoModerator is functional now
[15:27:28] <Lucas_WMDE>	 nice
[15:27:43] <wikibugs>	 (PS1) Elukey: role::puppetdb: increase WAL kept segments [puppet] - https://gerrit.wikimedia.org/r/1108768 (https://phabricator.wikimedia.org/T383114)
[15:29:11] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:29:27] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:29:35] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2026.codfw.wmnet with reason: host reimage
[15:29:37] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10437483 (phaultfinder)
[15:29:45] <wikibugs>	 (PS2) Elukey: role::puppetdb: increase WAL kept segments [puppet] - https://gerrit.wikimedia.org/r/1108768 (https://phabricator.wikimedia.org/T383114)
[15:29:53] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2027.codfw.wmnet with reason: host reimage
[15:31:03] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:31:09] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1255.eqiad.wmnet with OS bookworm
[15:31:12] <wikibugs>	 (CR) Elukey: [V:+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1108768 (https://phabricator.wikimedia.org/T383114) (owner: Elukey)
[15:31:12] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.roll-reimage-nodes (exit_code=0) rolling reimage on P{wikikube-worker[1253-1255].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad)
[15:31:17] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53368 bytes in 0.196 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:31:46] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1266.eqiad.wmnet with reason: host reimage
[15:33:01] <wikibugs>	 (CR) Alexandros Kosiaris: [C:+1] role::puppetdb: increase WAL kept segments [puppet] - https://gerrit.wikimedia.org/r/1108768 (https://phabricator.wikimedia.org/T383114) (owner: Elukey)
[15:33:20] <wikibugs>	 (PS1) Filippo Giunchedi: WIP: prometheus: k8s instances migration [puppet] - https://gerrit.wikimedia.org/r/1108772 (https://phabricator.wikimedia.org/T371087)
[15:33:53] <wikibugs>	 SRE, SRE-Unowned, serviceops-radar, wikitech.wikimedia.org: Redesign wikitech-static - https://phabricator.wikimedia.org/T376400#10437497 (Andrew) https://wts.wmcloud.org/wiki/Main_Page.html now has a search bar (just on that one page) which is VERY ugly but which does provide useful (to my eyes,...
[15:35:06] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2027.codfw.wmnet with reason: host reimage
[15:35:57] <wikibugs>	 (PS1) Muehlenhoff: maps::osm_master: Inline osm class [puppet] - https://gerrit.wikimedia.org/r/1108773 (https://phabricator.wikimedia.org/T381565)
[15:37:15] <wikibugs>	 (CR) Hnowlan: [C:-1] "Thanks for the change! A little tweak required but overall the idea works for me" [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/1101154 (https://phabricator.wikimedia.org/T381594) (owner: AntiCompositeNumber)
[15:38:29] <Lucas_WMDE>	 (I’m still not deploying btw, trying to figure out first how I can verify the new shellbox version)
[15:38:43] <Lucas_WMDE>	 (so if anyone else wants to deploy go right ahead)
[15:38:52] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2026.codfw.wmnet with reason: host reimage
[15:41:11] <wikibugs>	 (CR) Btullis: [C:+1] data-engineering: add alerts for dumps2 flink app. [alerts] - https://gerrit.wikimedia.org/r/1101849 (https://phabricator.wikimedia.org/T379362) (owner: Gmodena)
[15:41:24] <wikibugs>	 (PS1) Muehlenhoff: osmborder: Add .gitreview config [debs/osmborder] - https://gerrit.wikimedia.org/r/1108774
[15:41:40] <wikibugs>	 (CR) Gmodena: [C:+2] data-engineering: add alerts for dumps2 flink app. (2 comments) [alerts] - https://gerrit.wikimedia.org/r/1101849 (https://phabricator.wikimedia.org/T379362) (owner: Gmodena)
[15:42:14] <hnowlan>	 Lucas_WMDE: o/ are you trying to update the shellbox service image? 
[15:42:56] <wikibugs>	 (Merged) jenkins-bot: data-engineering: add alerts for dumps2 flink app. [alerts] - https://gerrit.wikimedia.org/r/1101849 (https://phabricator.wikimedia.org/T379362) (owner: Gmodena)
[15:43:43] <Lucas_WMDE>	 hnowlan: hi! yes!
[15:43:56] <Lucas_WMDE>	 I was hoping I could put together a call to it with curl, to test the behavior of a new version in staging
[15:44:12] <Lucas_WMDE>	 but now that I’ve discovered that an HMAC check is involved, I feel like that might not be realistically doable
[15:44:37] <Lucas_WMDE>	 and probably that’s why https://wikitech.wikimedia.org/wiki/Shellbox#Smoke_test also only has a test for the containers running and not any real functionality :S
[15:45:24] <Lucas_WMDE>	 (and AFAICT I can’t target the staging shellbox from a MediaWiki PHP shell… unless I do something horrible like override the client’s private $url variable via reflection)
[15:45:45] <hnowlan>	 yep, unfortunately there isn't a very easy way to test new releases on our infra aiui 
[15:45:57] <Lucas_WMDE>	 so I might just have to roll out the new version and test it afterwards
[15:46:10] <Lucas_WMDE>	 (at least the constraints shellbox is separate from the other ones, so if anything breaks, it only breaks that and nothing else…)
[15:46:21] <hnowlan>	 I think that's the best course of action for now 
[15:46:29] <Lucas_WMDE>	 ok thanks :)
[15:47:33] <Lucas_WMDE>	 jouncebot: next
[15:47:33] <jouncebot>	 In 0 hour(s) and 12 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250107T1600)
[15:47:51] <Lucas_WMDE>	 that’s not a huge amount of time to deploy all the new shellbox versions…
[15:48:01] <Lucas_WMDE>	 is that window usually busy? I don’t usually pay attention to it ^^
[15:48:27] <cdanis>	 first time I've seen it
[15:49:03] <hnowlan>	 I'd say you'll be fine
[15:49:44] <wikibugs>	 (PS1) Muehlenhoff: osmborder: Build for Bookworm and bump debhelper compat to 12 [debs/osmborder] - https://gerrit.wikimedia.org/r/1108776 (https://phabricator.wikimedia.org/T381565)
[15:50:04] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1108773 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[15:50:15] <hnowlan>	 generally we'd advise care and I don't think we have hard and fast rules yet, but shellbox isn't necessarily something you need to deploy in a backport window 
[15:50:18] <wikibugs>	 (CR) Filippo Giunchedi: [V:+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4743/co" [puppet] - https://gerrit.wikimedia.org/r/1104980 (https://phabricator.wikimedia.org/T371087) (owner: Filippo Giunchedi)
[15:50:42] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C:+1] "Some local investigation suggests, and hnowlan in #wikimedia-operations confirms, that it’s basically not doable to test this in productio" [deployment-charts] - https://gerrit.wikimedia.org/r/1108762 (https://phabricator.wikimedia.org/T380751) (owner: Audrey Penven)
[15:50:57] <icinga-wm>	 RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 68, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:50:59] <Lucas_WMDE>	 alright, then I’ll get started
[15:51:02] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1266.eqiad.wmnet with OS bookworm
[15:51:05] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.roll-reimage-nodes (exit_code=0) rolling reimage on P{wikikube-worker[1264-1266].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad)
[15:51:08] <Lucas_WMDE>	 worst case we’ll just have to revert the deployment-charts change
[15:51:10] <cdanis>	 hnowlan: I thought that's what the 'mediawiki infrastructure' deploy windows were for tbh but I don't know :)
[15:51:12] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C:+2] shellbox: release image 2025-01-07-141744 [deployment-charts] - https://gerrit.wikimedia.org/r/1108762 (https://phabricator.wikimedia.org/T380751) (owner: Audrey Penven)
[15:51:42] <Lucas_WMDE>	 my understanding was that deployment-charts changes are usually fine to deploy whenever nobody else is busy
[15:51:53] <icinga-wm>	 RECOVERY - BGP status on lsw1-a6-codfw.mgmt is OK: BGP OK - up: 40, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:51:53] <Lucas_WMDE>	 I just didn’t know how busy the next window was going to be ^^
[15:52:12] <hnowlan>	 cdanis: ah, you're right 
[15:52:58] <hnowlan>	 but even with that this does feel like a bit of a grey area in some senses
[15:52:59] <wikibugs>	 ops-codfw, SRE, DBA, DC-Ops, decommission-hardware: decommission es2025.codfw.wmnet - https://phabricator.wikimedia.org/T383029#10437540 (Jhancock.wm) Open→Resolved a:Jhancock.wm
[15:53:04] <wikibugs>	 (Merged) jenkins-bot: shellbox: release image 2025-01-07-141744 [deployment-charts] - https://gerrit.wikimedia.org/r/1108762 (https://phabricator.wikimedia.org/T380751) (owner: Audrey Penven)
[15:54:42] <hnowlan>	 shellbox testing is a bit of a recurrent problem, I had to hack together a rather ungodly client to do testing during the videoscaler migration but it's highly specific. we can't expect that to happen on every upgrade/for every feature
[15:54:57] <wikibugs>	 (PS2) Muehlenhoff: maps::osm_master: Inline osm class [puppet] - https://gerrit.wikimedia.org/r/1108773 (https://phabricator.wikimedia.org/T381565)
[15:54:59] <icinga-wm>	 PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:55:00] <wikibugs>	 (PS1) Brouberol: Increase the hadoop heap thresholds [alerts] - https://gerrit.wikimedia.org/r/1108779
[15:55:15] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2027.codfw.wmnet with OS bookworm
[15:55:22] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply
[15:55:46] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply
[15:55:48] <wikibugs>	 (CR) Muehlenhoff: [V:+2 C:+2] osmborder: Add .gitreview config [debs/osmborder] - https://gerrit.wikimedia.org/r/1108774 (owner: Muehlenhoff)
[15:55:59] <icinga-wm>	 RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 68, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:56:50] <Lucas_WMDE>	 `(kube_env shellbox-constraints staging; curl https://staging.svc.eqiad.wmnet:$(kubectl get service shellbox-main-tls-service -o jsonpath='{.spec.ports[0].nodePort}')/healthz)` looks good
[15:57:00] <Lucas_WMDE>	 continuing with eqiad and codfw for shellbox-constraints then
[15:57:02] <wikibugs>	 (CR) Btullis: [C:+1] "Nice, thanks." [alerts] - https://gerrit.wikimedia.org/r/1108779 (owner: Brouberol)
[15:57:06] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply
[15:57:12] <wikibugs>	 (CR) Brouberol: [C:+2] Increase the hadoop heap thresholds [alerts] - https://gerrit.wikimedia.org/r/1108779 (owner: Brouberol)
[15:57:39] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply
[15:57:47] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply
[15:57:52] <wikibugs>	 (CR) Elukey: [V:+1 C:+2] role::puppetdb: increase WAL kept segments [puppet] - https://gerrit.wikimedia.org/r/1108768 (https://phabricator.wikimedia.org/T383114) (owner: Elukey)
[15:58:14] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply
[15:58:40] <Lucas_WMDE>	 okay, I’ll try to test that in a debug shell / REPL
[15:58:44] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2026.codfw.wmnet with OS bookworm
[15:58:49] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Routinator 0.14 causing tempfs file system to fill up - https://phabricator.wikimedia.org/T383116#10437563 (cmooney)
[15:58:55] <Lucas_WMDE>	 I still need to either deploy the other shellboxes or revert the commit at some point
[15:58:56] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2027.codfw.wmnet
[15:58:59] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2027.codfw.wmnet
[15:59:05] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2026.codfw.wmnet
[15:59:08] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2026.codfw.wmnet
[16:00:05] <jouncebot>	 eoghan, jelto, arnoldokoth, and mutante: gettimeofday() says it's time for SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250107T1600)
[16:01:11] <icinga-wm>	 PROBLEM - Host ms-be2075 is DOWN: PING CRITICAL - Packet loss = 100%
[16:03:40] <wikibugs>	 (CR) Vgutierrez: [C:+1] trafficserver: explicitly specify user/group for systemd unit (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1091330 (owner: Ssingh)
[16:06:50] <elukey>	 !log reloaded postgres config on puppetdb1003 to pick up new wal size settings
[16:06:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:06:59] <Lucas_WMDE>	 hnowlan: any idea where shellbox errors might end up in logstash by any chance?
[16:07:03] <wikibugs>	 (CR) Muehlenhoff: [C:+2] crm: Restrict access to Envoy port [puppet] - https://gerrit.wikimedia.org/r/1108088 (owner: Muehlenhoff)
[16:07:32] <Lucas_WMDE>	 I don’t see them in mediawiki-errors, and the PHP side (from mw-debug-repl) just gave me a very unhelpful “ShellboxError  Shellbox server returned incorrect Content-Type.” :/
[16:09:26] <hnowlan>	 Lucas_WMDE: sorry, in a meeting atm - if there are any they'll be in the logstash for the corresponding kubernetes_namespace
[16:09:33] <Lucas_WMDE>	 ok thanks!
[16:09:38] <Lucas_WMDE>	 I’ll try to follow https://wikitech.wikimedia.org/wiki/Shellbox#Logs
[16:13:06] <Lucas_WMDE>	 hm, I don’t see any log fields there, only logsource
[16:13:30] <Lucas_WMDE>	 ok, one of the other messages has a log
[16:17:19] <Lucas_WMDE>	 I think I’m giving up on testing this via shell.php
[16:18:06] <Lucas_WMDE>	 new plan: deploy the rest of the shellboxes now, just to be in a consistent state (as it looks like nothing is more broken than before, at least); tomorrow, backport https://gerrit.wikimedia.org/r/1105786 to the current wmf branch and test it that way
[16:22:19] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [staging] START helmfile.d/services/shellbox: apply
[16:22:45] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox: apply
[16:22:51] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply
[16:22:54] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply
[16:23:00] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-media: apply
[16:23:05] <Lucas_WMDE>	 (if the SRE collaboration services office hours people want me to stop deploying, shout ^^)
[16:23:19] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply
[16:23:25] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply
[16:23:40] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10LDAP-Access-Requests: Offboard Muhammad Jazirahly from WMF systems - https://phabricator.wikimedia.org/T383056#10437705 (10Dzahn) 05Open→03In progress p:05Triage→03High a:03Dzahn
[16:24:08] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[16:24:15] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply
[16:24:38] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply
[16:24:40] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10437710 (10phaultfinder)
[16:24:41] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10LDAP-Access-Requests: Offboard Muhammad Jazirahly from WMF systems - https://phabricator.wikimedia.org/T383056#10437709 (10Dzahn)
[16:24:44] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-video: apply
[16:25:23] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply
[16:25:25] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10LDAP-Access-Requests: Offboard Muhammad Jazirahly from WMF systems - https://phabricator.wikimedia.org/T383056#10437714 (10Dzahn)
[16:25:56] <Lucas_WMDE>	 proceeding with eqiad and then codfw in a moment…
[16:26:10] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox: apply
[16:26:48] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply
[16:26:53] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply
[16:26:55] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply
[16:27:00] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply
[16:27:18] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply
[16:27:22] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply
[16:27:46] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[16:27:50] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply
[16:28:32] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply
[16:28:36] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply
[16:29:19] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply
[16:29:37] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox: apply
[16:30:18] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox: apply
[16:30:22] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply
[16:30:24] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply
[16:30:28] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply
[16:30:45] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply
[16:30:46] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10LDAP-Access-Requests: Offboard Muhammad Jazirahly from WMF systems - https://phabricator.wikimedia.org/T383056#10437768 (10Dzahn)
[16:30:49] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply
[16:31:26] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[16:31:31] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply
[16:31:59] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply
[16:32:03] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-video: apply
[16:32:45] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply
[16:35:00] * Lucas_WMDE done deploying
[16:35:29] <wikibugs>	 (03PS1) 10Dzahn: admin: offboard user muja [puppet] - 10https://gerrit.wikimedia.org/r/1108784 (https://phabricator.wikimedia.org/T383056)
[16:35:33] <icinga-wm>	 RECOVERY - Host ms-be2075 is UP: PING OK - Packet loss = 0%, RTA = 30.37 ms
[16:41:20] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10437801 (10Jhancock.wm) we didn't have any spares that would work. a lot of the power cables are directly connected to the control board of the internal PDU.  powered off, drain...
[16:43:55] <mutante>	 !log krb1001 - sudo manage_principals.py delete muja@WIKIMEDIA (T383056)
[16:43:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:43:58] <stashbot>	 T383056: Offboard Muhammad Jazirahly from WMF systems - https://phabricator.wikimedia.org/T383056
[16:44:04] <jayme>	 !log puppet ca destroy termbox.discovery.wmnet - T381474
[16:44:06] <jayme>	 !log puppet ca destroy mathoid.discovery.wmnet - T381474
[16:44:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:44:08] <jayme>	 !log puppet ca destroy citoid.discovery.wmnet - T381474
[16:44:09] <stashbot>	 T381474: Handle expiring puppet certificates - https://phabricator.wikimedia.org/T381474
[16:44:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:44:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:45:31] <claime>	 jayme: 🔥
[16:47:39] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] admin: offboard user muja [puppet] - 10https://gerrit.wikimedia.org/r/1108784 (https://phabricator.wikimedia.org/T383056) (owner: 10Dzahn)
[16:47:52] <jayme>	 claime: hn?
[16:48:10] <claime>	 jayme: just appreciating the destruction of old certs
[16:48:17] <jayme>	 ah, okay :)
[16:48:24] <jayme>	 I thought I caused a fire
[16:48:29] <claime>	 haha no sorry
[16:48:32] <dancy>	 haha
[16:48:49] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10Infrastructure Security, and 2 others: Offboard Muhammad Jazirahly from WMF systems - https://phabricator.wikimedia.org/T383056#10437813 (10Dzahn)
[16:49:58] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10Infrastructure Security, and 2 others: Offboard Muhammad Jazirahly from WMF systems - https://phabricator.wikimedia.org/T383056#10437816 (10Dzahn) ` [ldap-maint1001:~] $ offboard-user -l muja ... Is not member of any LDAP group Is not a member in...
[16:50:48] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10437819 (10MatthewVernon) Thanks for checking, I'm afraid the problem still exists: ` [Tue Jan  7 16:49:18 2025] sd 0:0:25:0: Power-on or device reset occurred [Tue Jan  7 16:49...
[16:51:29] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10437822 (10MatthewVernon) @Jhancock.wm sorry, failed to notice the request for a ping if it was still unhappy. See previous comment :)
[16:54:37] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10437841 (10Jhancock.wm) aw darn. reaching out to Dell
[16:55:27] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:58:09] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to releasers-mediawiki for MSantos - https://phabricator.wikimedia.org/T382616#10437846 (10Dzahn) a:03thcipriani
[16:58:29] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to releasers-mediawiki for MSantos - https://phabricator.wikimedia.org/T382616#10437850 (10Dzahn) 05Open→03In progress
[16:58:56] <Lucas_WMDE>	 Audrey managed to figure out how to test the shellbox change after all, and it looks like the new version is working as intended \o/
[16:59:05] <claime>	 yay
[16:59:28] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for LorenMora - https://phabricator.wikimedia.org/T382377#10437853 (10Dzahn) a:03LMora-WMF
[16:59:37] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for LorenMora - https://phabricator.wikimedia.org/T382377#10437854 (10Dzahn) 05Open→03In progress
[16:59:52] <Lucas_WMDE>	 (https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikibaseQualityConstraints/+/1105786/5#message-adbd1eacc4b02eaa681d1be52dbf7c2e6837d6ce, I was missing the <?php at the beginning of the tmp file 🤦)
[16:59:53] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for LorenMora - https://phabricator.wikimedia.org/T382377#10437857 (10Dzahn) p:05Triage→03Medium
[17:00:05] <jouncebot>	 jhathaway and rzl: OwO what's this, a deployment window?? Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250107T1700). nyaa~
[17:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[17:00:29] <Lucas_WMDE>	 jouncebot: OwO
[17:00:54] <wikibugs>	 (03CR) 10BCornwall: [C:03+1] Disable varnish handling of /beacon/event on cp1100 [puppet] - 10https://gerrit.wikimedia.org/r/1105076 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata)
[17:01:08] <claime>	 we should make jouncebot go full uwu for april 1st
[17:02:06] <Lucas_WMDE>	 aww wise fow UTC aftewnoon mediawiki depwoyment window
[17:02:49] <ottomata>	 i am trying to ping bcornwall but don't remember his IRC nick...
[17:04:34] <ottomata>	 ah ha!
[17:04:36] <ottomata>	 brett:  hello!
[17:04:58] <ottomata>	 ah I see you in -traffic, will respond there
[17:05:00] <wikibugs>	 (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4745/co" [puppet] - 10https://gerrit.wikimedia.org/r/1105078 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata)
[17:05:06] <brett>	 ah, sorry, didn't see this!
[17:06:34] <wikibugs>	 (03CR) 10Ottomata: [C:03+2] Disable varnish handling of /beacon/event on cp1100 [puppet] - 10https://gerrit.wikimedia.org/r/1105076 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata)
[17:06:39] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10Infrastructure Security, and 2 others: Offboard Muhammad Jazirahly from WMF systems - https://phabricator.wikimedia.org/T383056#10437893 (10Dzahn) @Muehlenhoff @SLyngshede-WMF @WMDE-leszek   I ...  - manually removed from wmde, nda and airflow-wm...
[17:08:41] <swfrench-wmf>	 jouncebot: nowandnext
[17:08:41] <jouncebot>	 For the next 0 hour(s) and 51 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250107T1700)
[17:08:41] <jouncebot>	 In 0 hour(s) and 51 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250107T1800)
[17:10:58] <swfrench-wmf>	 unless there are any objections, I'd like to make some changes to the mw-videoscaler deployment to fix some logging issues there. since this will take a couple of steps, it would be good to get this started earlier than the infra window would normally start.
[17:11:42] <cdanis>	 swfrench-wmf: the puppet deploy window is usually unused :)
[17:12:22] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10437924 (10MatthewVernon) Thanks! And sorry...
[17:12:23] <wikibugs>	 (03PS2) 10Ottomata: Disable varnish handling of /beacon/event to decommission eventlogging backend [puppet] - 10https://gerrit.wikimedia.org/r/1105078 (https://phabricator.wikimedia.org/T238230)
[17:14:22] <swfrench-wmf>	 cdanis: ack, indeed - and seems to be so today as well :) (though noting some other things going on concurrently, like o.ttomata's change)
[17:14:58] <wikibugs>	 (03PS1) 10Ladsgroup: Fully depool ParserCache section if load of the primary is zero [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108794 (https://phabricator.wikimedia.org/T373037)
[17:15:23] <swfrench-wmf>	 alright, let's make this happen (FYI, hnowlan)
[17:15:43] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10437928 (10phaultfinder)
[17:15:48] <wikibugs>	 (03CR) 10Scott French: "Thanks for the review, Hugh!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105821 (https://phabricator.wikimedia.org/T382630) (owner: 10Scott French)
[17:15:50] <wikibugs>	 (03CR) 10Scott French: [C:03+2] mediawiki: add mercurius release generation token [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105821 (https://phabricator.wikimedia.org/T382630) (owner: 10Scott French)
[17:17:33] <wikibugs>	 (03CR) 10CDanis: [C:03+1] Fully depool ParserCache section if load of the primary is zero [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108794 (https://phabricator.wikimedia.org/T373037) (owner: 10Ladsgroup)
[17:18:20] <wikibugs>	 (03Merged) 10jenkins-bot: mediawiki: add mercurius release generation token [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105821 (https://phabricator.wikimedia.org/T382630) (owner: 10Scott French)
[17:18:51] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.idm.logout Logging Muhammad Jaziraly out of all services on: 2313 hosts
[17:19:32] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to releasers-mediawiki for MSantos - https://phabricator.wikimedia.org/T382616#10437936 (10thcipriani) a:05thcipriani→03Dzahn Thanks for the poke @Dzahn   >>! In T382616#10421415, @Volans wrote: > Adding @thcipriani for the group approva...
[17:20:05] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Muhammad Jaziraly out of all services on: 2313 hosts
[17:20:09] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10Infrastructure Security, and 2 others: Offboard Muhammad Jazirahly from WMF systems - https://phabricator.wikimedia.org/T383056#10437944 (10Dzahn) Half an hour after the puppet patch was merged I ran the `sre.idm.logout` cookbook.
[17:23:04] <wikibugs>	 (03PS2) 10Scott French: mediawiki: add rsyslog container to mercurius [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105800 (https://phabricator.wikimedia.org/T382517)
[17:23:20] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to releasers-mediawiki for MSantos - https://phabricator.wikimedia.org/T382616#10437949 (10Dzahn) a:05Dzahn→03Bmueller Thanks Tyler! Handing over to Birgit next.
[17:24:22] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: apply
[17:24:29] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: apply
[17:25:43] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10437963 (10phaultfinder)
[17:25:44] <wikibugs>	 (03PS1) 10Majavah: hieradata: Upgrade striker-toolsbeta to 2025-01-07-172314-production [puppet] - 10https://gerrit.wikimedia.org/r/1108795
[17:26:39] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.idm.logout Logging Muhammad Jaziraly out of all services on: 4 hosts
[17:26:48] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] mediawiki: add rsyslog container to mercurius [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105800 (https://phabricator.wikimedia.org/T382517) (owner: 10Scott French)
[17:26:51] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Muhammad Jaziraly out of all services on: 4 hosts
[17:27:01] <wikibugs>	 (03CR) 10Majavah: [C:03+2] hieradata: Upgrade striker-toolsbeta to 2025-01-07-172314-production [puppet] - 10https://gerrit.wikimedia.org/r/1108795 (owner: 10Majavah)
[17:27:48] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10Infrastructure Security, and 2 others: Offboard Muhammad Jazirahly from WMF systems - https://phabricator.wikimedia.org/T383056#10437970 (10Dzahn) a:05Dzahn→03None
[17:28:04] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10Infrastructure Security, and 2 others: Offboard Muhammad Jazirahly from WMF systems - https://phabricator.wikimedia.org/T383056#10437972 (10Dzahn)
[17:28:12] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10Infrastructure Security, and 2 others: Offboard Muhammad Jazirahly from WMF systems - https://phabricator.wikimedia.org/T383056#10437973 (10Dzahn) p:05High→03Medium
[17:28:33] <wikibugs>	 (03CR) 10David Caro: "I think this broke puppet on the cloud hosts (puppetdb ones):" [puppet] - 10https://gerrit.wikimedia.org/r/1108768 (https://phabricator.wikimedia.org/T383114) (owner: 10Elukey)
[17:28:59] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-videoscaler: apply
[17:29:04] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-videoscaler: apply
[17:29:05] <wikibugs>	 (03CR) 10David Caro: role::puppetdb: increase WAL kept segments (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1108768 (https://phabricator.wikimedia.org/T383114) (owner: 10Elukey)
[17:31:25] <wikibugs>	 (03CR) 10Scott French: "Thank you both for the review!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105800 (https://phabricator.wikimedia.org/T382517) (owner: 10Scott French)
[17:34:02] <wikibugs>	 06SRE, 06Commons, 06Traffic: Backend fetch failed - https://phabricator.wikimedia.org/T383013#10438014 (10Dzahn)
[17:35:05] <wikibugs>	 (03CR) 10Scott French: [C:03+2] mediawiki: add rsyslog container to mercurius [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105800 (https://phabricator.wikimedia.org/T382517) (owner: 10Scott French)
[17:35:08] <wikibugs>	 06SRE, 06Traffic: "Backend fetch failed" on edit save - https://phabricator.wikimedia.org/T382790#10438017 (10Dzahn)
[17:37:02] <wikibugs>	 (03Merged) 10jenkins-bot: mediawiki: add rsyslog container to mercurius [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105800 (https://phabricator.wikimedia.org/T382517) (owner: 10Scott French)
[17:37:43] <wikibugs>	 (03PS1) 10Majavah: hieradata: Update striker-toolsbeta to 2025-01-07-173704-production [puppet] - 10https://gerrit.wikimedia.org/r/1108797
[17:38:17] <wikibugs>	 (03CR) 10Majavah: [C:03+2] hieradata: Update striker-toolsbeta to 2025-01-07-173704-production [puppet] - 10https://gerrit.wikimedia.org/r/1108797 (owner: 10Majavah)
[17:40:11] <wikibugs>	 (03PS1) 10David Caro: cloud.yaml: add missing wal_keep_segments [puppet] - 10https://gerrit.wikimedia.org/r/1108798 (https://phabricator.wikimedia.org/T383114)
[17:40:39] <wikibugs>	 (03CR) 10David Caro: role::puppetdb: increase WAL kept segments (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1108768 (https://phabricator.wikimedia.org/T383114) (owner: 10Elukey)
[17:43:12] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: apply
[17:43:20] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: apply
[17:45:30] <wikibugs>	 (03CR) 10Ottomata: Disable varnish handling of /beacon/event to decommission eventlogging backend [puppet] - 10https://gerrit.wikimedia.org/r/1105078 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata)
[17:45:47] <wikibugs>	 (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1105078 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata)
[17:46:03] <wikibugs>	 (03CR) 10BCornwall: [C:03+1] Disable varnish handling of /beacon/event to decommission eventlogging backend [puppet] - 10https://gerrit.wikimedia.org/r/1105078 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata)
[17:49:09] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-videoscaler: apply
[17:49:14] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-videoscaler: apply
[17:49:32] <wikibugs>	 (03CR) 10BCornwall: [C:03+1] Switch to idp2004 [dns] - 10https://gerrit.wikimedia.org/r/1108716 (owner: 10Muehlenhoff)
[17:50:12] <wikibugs>	 (03CR) 10Ottomata: [C:03+2] Disable varnish handling of /beacon/event to decommission eventlogging backend [puppet] - 10https://gerrit.wikimedia.org/r/1105078 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata)
[17:50:39] <ottomata>	 !log Disable varnish handling of /beacon/event to decommission eventlogging backend [puppet] - T238230 T353817
[17:50:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:50:43] <stashbot>	 T238230: Decommission EventLogging backend components by migrating to MEP - https://phabricator.wikimedia.org/T238230
[17:50:44] <stashbot>	 T353817: Create legacy EventLogging proxy HTTP intake (for MediaWikiPingback) endpoint to EventGate - https://phabricator.wikimedia.org/T353817
[18:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250107T1800)
[18:00:09] <wikibugs>	 (03PS1) 10Xcollazo: dse-k8s: content-history: Temporarily 10x resources for initial reconcile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1108801 (https://phabricator.wikimedia.org/T382953)
[18:00:24] <wikibugs>	 (03CR) 10BCornwall: varnish: Hide X-Client-IP on error page by default (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1108485 (https://phabricator.wikimedia.org/T383062) (owner: 10BCornwall)
[18:02:29] <wikibugs>	 (03CR) 10Ottomata: [C:03+1] dse-k8s: content-history: Temporarily 10x resources for initial reconcile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1108801 (https://phabricator.wikimedia.org/T382953) (owner: 10Xcollazo)
[18:07:22] <swfrench-wmf>	 FYI, I'll keep an eye on things for a bit, but I'm done with work on mw-videoscaler
[18:07:54] <wikibugs>	 (03CR) 10Gmodena: dse-k8s: content-history: Temporarily 10x resources for initial reconcile (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1108801 (https://phabricator.wikimedia.org/T382953) (owner: 10Xcollazo)
[18:07:59] <wikibugs>	 (03PS1) 10CDanis: group0: enable OpenTelemetry exports [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108802 (https://phabricator.wikimedia.org/T340552)
[18:08:13] <wikibugs>	 (03CR) 10CDanis: [C:04-2] group0: enable OpenTelemetry exports [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108802 (https://phabricator.wikimedia.org/T340552) (owner: 10CDanis)
[18:08:17] <wikibugs>	 (03CR) 10Ottomata: [C:03+2] dse-k8s: content-history: Temporarily 10x resources for initial reconcile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1108801 (https://phabricator.wikimedia.org/T382953) (owner: 10Xcollazo)
[18:08:40] <wikibugs>	 (03CR) 10CI reject: [V:04-1] group0: enable OpenTelemetry exports [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108802 (https://phabricator.wikimedia.org/T340552) (owner: 10CDanis)
[18:09:06] <wikibugs>	 (03PS2) 10CDanis: group0: enable OpenTelemetry exports [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108802 (https://phabricator.wikimedia.org/T340552)
[18:09:38] <wikibugs>	 (03Merged) 10jenkins-bot: dse-k8s: content-history: Temporarily 10x resources for initial reconcile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1108801 (https://phabricator.wikimedia.org/T382953) (owner: 10Xcollazo)
[18:11:36] <wikibugs>	 06SRE, 06Traffic, 10Wikidata, 06Wikidata Dev Team, 07Performance Issue: Frequent 500 Errors and Timeouts When Adding Statements to New Properties - https://phabricator.wikimedia.org/T374230#10438174 (10Dzahn)
[18:11:50] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.hosts.downtime for 8 days, 0:00:00 on ms-be2075.codfw.wmnet with reason: host is awaiting attention from Dell
[18:12:01] <wikibugs>	 (03CR) 10David Caro: "Oh, it's not failing now :/, not sure what's going on..." [puppet] - 10https://gerrit.wikimedia.org/r/1108798 (https://phabricator.wikimedia.org/T383114) (owner: 10David Caro)
[18:12:04] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8 days, 0:00:00 on ms-be2075.codfw.wmnet with reason: host is awaiting attention from Dell
[18:12:15] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10438175 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=207ed568-35e8-41d5-b367-bb9f043b91bf) set by mvernon@cumin1002 for 8 days, 0:00:00 on 1 host(s) and t...
[18:12:59] <wikibugs>	 (03CR) 10Xcollazo: dse-k8s: content-history: Temporarily 10x resources for initial reconcile (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1108801 (https://phabricator.wikimedia.org/T382953) (owner: 10Xcollazo)
[18:14:39] <jinxer-wm>	 FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1007-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[18:16:59] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, January 07 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108802 (https://phabricator.wikimedia.org/T340552) (owner: 10CDanis)
[18:17:09] <logmsgbot>	 !log xcollazo@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[18:17:35] <logmsgbot>	 !log xcollazo@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[18:25:29] <wikibugs>	 (03CR) 10Ssingh: varnish: Hide X-Client-IP on error page by default (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1108485 (https://phabricator.wikimedia.org/T383062) (owner: 10BCornwall)
[18:25:42] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10438215 (10phaultfinder)
[18:27:54] <logmsgbot>	 !log xcollazo@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: sync
[18:27:58] <logmsgbot>	 !log xcollazo@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: sync
[18:29:43] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Inbound interface errors - https://phabricator.wikimedia.org/T383076#10438227 (10phaultfinder)
[18:35:42] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10438250 (10phaultfinder)
[18:41:17] <logmsgbot>	 !log xcollazo@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[18:41:23] <logmsgbot>	 !log xcollazo@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[18:43:02] <wikibugs>	 (03PS3) 10CDanis: group0: enable OpenTelemetry exports [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108802 (https://phabricator.wikimedia.org/T340552)
[18:46:16] <wikibugs>	 (03CR) 10Elukey: [C:03+1] "LGTM! Sorry if I missed it!" [puppet] - 10https://gerrit.wikimedia.org/r/1108798 (https://phabricator.wikimedia.org/T383114) (owner: 10David Caro)
[18:47:41] <wikibugs>	 (03PS1) 10Majavah: hieradata: Use a separate cache prefix for toolsadmin-toolsbeta [puppet] - 10https://gerrit.wikimedia.org/r/1108808 (https://phabricator.wikimedia.org/T383143)
[18:48:16] <wikibugs>	 (03CR) 10Scott French: [C:03+1] group0: enable OpenTelemetry exports [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108802 (https://phabricator.wikimedia.org/T340552) (owner: 10CDanis)
[18:49:06] <wikibugs>	 (03CR) 10Ladsgroup: "Making sure Manuel is also onboard" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108794 (https://phabricator.wikimedia.org/T373037) (owner: 10Ladsgroup)
[18:49:24] <wikibugs>	 (03CR) 10Majavah: [C:03+2] hieradata: Use a separate cache prefix for toolsadmin-toolsbeta [puppet] - 10https://gerrit.wikimedia.org/r/1108808 (https://phabricator.wikimedia.org/T383143) (owner: 10Majavah)
[19:00:05] <jouncebot>	 dduvall and dancy: May I have your attention please! MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250107T1900)
[19:00:41] <icinga-wm>	 PROBLEM - SSH on prometheus5002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:00:59] <icinga-wm>	 PROBLEM - SSH on prometheus6002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:01:39] <icinga-wm>	 PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp6005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:01:41] <icinga-wm>	 PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp6016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:01:51] <icinga-wm>	 PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp6003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:01:51] <icinga-wm>	 PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp6014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:01:51] <icinga-wm>	 RECOVERY - SSH on prometheus6002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:02:31] <icinga-wm>	 RECOVERY - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp6005 is OK: HTTP OK: HTTP/1.0 200 OK - 36885 bytes in 0.307 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:02:31] <icinga-wm>	 RECOVERY - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp6016 is OK: HTTP OK: HTTP/1.0 200 OK - 37165 bytes in 0.309 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:02:43] <icinga-wm>	 RECOVERY - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp6003 is OK: HTTP OK: HTTP/1.0 200 OK - 36908 bytes in 0.309 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:02:43] <icinga-wm>	 RECOVERY - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp6014 is OK: HTTP OK: HTTP/1.0 200 OK - 37141 bytes in 0.311 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:03:43] <wikibugs>	 (03CR) 10Scott French: [C:03+1] "Nice! As discussed in the linked task, this seems like the best / simplest option." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108794 (https://phabricator.wikimedia.org/T373037) (owner: 10Ladsgroup)
[19:04:41] <icinga-wm>	 RECOVERY - SSH on prometheus5002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:04:47] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job lvs_realserver in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:09:47] <jinxer-wm>	 RESOLVED: [4x] JobUnavailable: Reduced availability for job lvs_realserver in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:12:44] <wikibugs>	 (03CR) 10CDanis: "Unscheduled, as it seems the wmf.11 deploy isn't actually happening this week" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108802 (https://phabricator.wikimedia.org/T340552) (owner: 10CDanis)
[19:20:37] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove access for aitolkyn [puppet] - 10https://gerrit.wikimedia.org/r/1108810
[19:20:49] <wikibugs>	 (03PS1) 10AOkoth: doc: change active host to doc2002 [puppet] - 10https://gerrit.wikimedia.org/r/1108812 (https://phabricator.wikimedia.org/T382610)
[19:21:59] <wikibugs>	 (03PS1) 10TrainBranchBot: group0 to 1.44.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108813 (https://phabricator.wikimedia.org/T382362)
[19:22:01] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.44.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108813 (https://phabricator.wikimedia.org/T382362) (owner: 10TrainBranchBot)
[19:22:47] <wikibugs>	 (03Merged) 10jenkins-bot: group0 to 1.44.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108813 (https://phabricator.wikimedia.org/T382362) (owner: 10TrainBranchBot)
[19:22:59] <wikibugs>	 (03PS1) 10AOkoth: wmnet: failover doc host [dns] - 10https://gerrit.wikimedia.org/r/1108814 (https://phabricator.wikimedia.org/T382610)
[19:23:12] <wikibugs>	 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (Hardware): Relocate cloudnet1007-dev and cloudnet1008-dev to new racks and rename - https://phabricator.wikimedia.org/T382412#10438432 (10Andrew) Note that these servers are not currently in service, so this move can happen anytime w/out W...
[19:30:37] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10438480 (10phaultfinder)
[19:36:47] <logmsgbot>	 !log dduvall@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.44.0-wmf.11  refs T382362
[19:36:52] <stashbot>	 T382362: 1.44.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T382362
[19:40:53] <wikibugs>	 (03CR) 10JHathaway: [C:03+1] Also add the cluster SSH key to /var/lib/ganeti/known_hosts in managed mode [puppet] - 10https://gerrit.wikimedia.org/r/1108036 (https://phabricator.wikimedia.org/T309724) (owner: 10Muehlenhoff)
[19:42:10] <wikibugs>	 (03CR) 10JHathaway: [C:03+1] Fix permissions for /var/lib/ganeti/known_hosts in managed mode [puppet] - 10https://gerrit.wikimedia.org/r/1108092 (https://phabricator.wikimedia.org/T382870) (owner: 10Muehlenhoff)
[19:43:01] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.11.30 - 2024.12.20): Q2:rack/setup/install cloudelastic101[12] - https://phabricator.wikimedia.org/T378368#10438533 (10Andrew) Looks like these are ready to go into service, should they be reassigned to @bking?  (This drive-by br...
[19:44:23] <wikibugs>	 (03CR) 10JHathaway: [C:03+1] postgresql::server: Use wmflib::debian_postgresql_version [puppet] - 10https://gerrit.wikimedia.org/r/1108707 (owner: 10Muehlenhoff)
[19:45:00] <wikibugs>	 (03CR) 10JHathaway: [C:03+1] postgresql::user: Use wmflib::debian_postgresql_version [puppet] - 10https://gerrit.wikimedia.org/r/1108741 (owner: 10Muehlenhoff)
[19:49:02] <wikibugs>	 (03CR) 10JHathaway: postgresql::dirs: Use wmflib::debian_postgresql_version() (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1108710 (owner: 10Muehlenhoff)
[19:52:13] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich: apply
[19:52:20] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich: apply
[19:55:58] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[19:56:05] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[19:56:06] <cdanis>	 jouncebot: refresh
[19:56:06] <jouncebot>	 I refreshed my knowledge about deployments.
[19:56:09] <cdanis>	 jouncebot: nowandnext
[19:56:09] <jouncebot>	 For the next 1 hour(s) and 3 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250107T1900)
[19:56:09] <jouncebot>	 In 1 hour(s) and 3 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250107T2100)
[19:56:33] <cdanis>	 dduvall: all good with the train?
[19:59:03] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cdanis@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108802 (https://phabricator.wikimedia.org/T340552) (owner: 10CDanis)
[19:59:16] <wikibugs>	 (03CR) 10CDanis: [C:03+2] group0: enable OpenTelemetry exports [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108802 (https://phabricator.wikimedia.org/T340552) (owner: 10CDanis)
[19:59:21] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cdanis@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108802 (https://phabricator.wikimedia.org/T340552) (owner: 10CDanis)
[19:59:32] <dduvall>	 cdanis: yes!
[19:59:38] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10438582 (10phaultfinder)
[19:59:39] <cdanis>	 perfect :D
[19:59:58] <wikibugs>	 (03Merged) 10jenkins-bot: group0: enable OpenTelemetry exports [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108802 (https://phabricator.wikimedia.org/T340552) (owner: 10CDanis)
[20:00:24] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[20:00:35] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[20:00:38] <logmsgbot>	 !log cdanis@deploy2002 Started scap sync-world: Backport for [[gerrit:1108802|group0: enable OpenTelemetry exports (T340552)]]
[20:00:42] <stashbot>	 T340552: Implement and wire-up minimal OpenTelemetry tracing client compatible with OTEL data model - https://phabricator.wikimedia.org/T340552
[20:07:35] <logmsgbot>	 !log cdanis@deploy2002 cdanis: Backport for [[gerrit:1108802|group0: enable OpenTelemetry exports (T340552)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:07:38] <stashbot>	 T340552: Implement and wire-up minimal OpenTelemetry tracing client compatible with OTEL data model - https://phabricator.wikimedia.org/T340552
[20:07:58] <wikibugs>	 06SRE, 10Domains, 06Traffic: Register wiki(m|p)edia.ro - https://phabricator.wikimedia.org/T222080#10438607 (10CRoslof) It took quite a while to go through the formal processes (after attempting to simply acquire them directly), but the Foundation now has control of `wikipedia.ro` and `wikimedia.ro`. They ar...
[20:08:46] <logmsgbot>	 !log cdanis@deploy2002 cdanis: Continuing with sync
[20:10:37] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10438609 (10phaultfinder)
[20:15:10] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'.
[20:15:19] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'.
[20:16:45] <logmsgbot>	 !log cdanis@deploy2002 Finished scap sync-world: Backport for [[gerrit:1108802|group0: enable OpenTelemetry exports (T340552)]] (duration: 16m 06s)
[20:16:48] <stashbot>	 T340552: Implement and wire-up minimal OpenTelemetry tracing client compatible with OTEL data model - https://phabricator.wikimedia.org/T340552
[20:18:26] <wikibugs>	 (03CR) 10AOkoth: [C:03+2] wmnet: failover doc host [dns] - 10https://gerrit.wikimedia.org/r/1108814 (https://phabricator.wikimedia.org/T382610) (owner: 10AOkoth)
[20:18:59] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[20:19:01] <wikibugs>	 (03CR) 10AOkoth: [C:03+2] doc: change active host to doc2002 [puppet] - 10https://gerrit.wikimedia.org/r/1108812 (https://phabricator.wikimedia.org/T382610) (owner: 10AOkoth)
[20:19:04] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[20:27:37] <wikibugs>	 (03PS1) 10Dzahn: add wikimedia.ro and wikipedia.ro, link to ncredir-parking [dns] - 10https://gerrit.wikimedia.org/r/1108826 (https://phabricator.wikimedia.org/T222080)
[20:29:20] <wikibugs>	 06SRE, 10Domains, 06Traffic, 13Patch-For-Review: Register wiki(m|p)edia.ro - https://phabricator.wikimedia.org/T222080#10438677 (10Dzahn) 05Stalled→03Open
[20:29:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10438678 (10phaultfinder)
[20:38:45] <icinga-wm>	 PROBLEM - Host doc1003 is DOWN: PING CRITICAL - Packet loss = 100%
[20:40:06] <hashar>	 ^ eek
[20:40:17] <jinxer-wm>	 FIRING: ProbeDown: Service doc1003.eqiad.wmnet:443 has failed probes (http_doc1003_eqiad_wmnet_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#doc1003.eqiad.wmnet:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:43:32] <wikibugs>	 (03CR) 10BryanDavis: [C:03+1] "Things in deployment-prep seem to still be fine. I think this is ready to land as soon as folks are comfortable that we have sufficient co" [puppet] - 10https://gerrit.wikimedia.org/r/1085471 (https://phabricator.wikimedia.org/T378752) (owner: 10BryanDavis)
[20:45:57] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on doc1003.eqiad.wmnet with reason: maintenance
[20:46:13] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on doc1003.eqiad.wmnet with reason: maintenance
[20:46:46] <mutante>	 doc1003 is down for maintenance work and is not serving traffic. Setting a downtime beforehand would have been ideal.
[20:47:13] <icinga-wm>	 RECOVERY - Host doc1003 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms
[20:49:52] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[20:49:54] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[20:49:57] <cdanis>	 jouncebot: refresh
[20:49:58] <jouncebot>	 I refreshed my knowledge about deployments.
[20:53:50] <wikibugs>	 (03CR) 10BCornwall: [C:03+1] add wikimedia.ro and wikipedia.ro, link to ncredir-parking [dns] - 10https://gerrit.wikimedia.org/r/1108826 (https://phabricator.wikimedia.org/T222080) (owner: 10Dzahn)
[20:54:42] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] add wikimedia.ro and wikipedia.ro, link to ncredir-parking [dns] - 10https://gerrit.wikimedia.org/r/1108826 (https://phabricator.wikimedia.org/T222080) (owner: 10Dzahn)
[20:55:28] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:56:03] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[20:56:06] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[21:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor My software never has bugs. It just develops random features. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250107T2100).
[21:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[21:10:27] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:11:05] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 938890072 and 41 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[21:12:05] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 47904 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[21:29:43] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10438872 (10phaultfinder)
[21:42:10] <cdanis>	 jouncebot: nowandnext
[21:42:10] <jouncebot>	 For the next 0 hour(s) and 17 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250107T2100)
[21:42:11] <jouncebot>	 In 0 hour(s) and 17 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250107T2200)
[21:47:28] <wikibugs>	 (03PS2) 10BCornwall: varnish: Hide X-Client-IP on error page by default [puppet] - 10https://gerrit.wikimedia.org/r/1108485 (https://phabricator.wikimedia.org/T383062)
[21:48:07] <wikibugs>	 (03CR) 10BCornwall: "PS2 includes a `color` for the text as the red clashed with the light scheme's font." [puppet] - 10https://gerrit.wikimedia.org/r/1108485 (https://phabricator.wikimedia.org/T383062) (owner: 10BCornwall)
[21:51:32] <wikibugs>	 06SRE, 10Domains, 06Traffic: Register wiki(m|p)edia.ro - https://phabricator.wikimedia.org/T222080#10438925 (10BCornwall) Thank you, @CRoslof!
[21:57:07] <wikibugs>	 (03PS1) 10Ladsgroup: Stop producing Yahoo! abstract dumps [dumps] - 10https://gerrit.wikimedia.org/r/1108844 (https://phabricator.wikimedia.org/T382069)
[21:57:17] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Stop producing Yahoo! abstract dumps [dumps] - 10https://gerrit.wikimedia.org/r/1108844 (https://phabricator.wikimedia.org/T382069) (owner: 10Ladsgroup)
[21:59:02] <wikibugs>	 (03CR) 10Ladsgroup: "recheck" [dumps] - 10https://gerrit.wikimedia.org/r/1108844 (https://phabricator.wikimedia.org/T382069) (owner: 10Ladsgroup)
[22:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250107T2200)
[22:02:10] <wikibugs>	 (03CR) 10Ladsgroup: [C:04-2] "Not until Feb 7." [dumps] - 10https://gerrit.wikimedia.org/r/1108844 (https://phabricator.wikimedia.org/T382069) (owner: 10Ladsgroup)
[22:02:40] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm
[22:07:47] <wikibugs>	 (03PS3) 10BCornwall: varnish: Hide X-Client-IP on error page by default [puppet] - 10https://gerrit.wikimedia.org/r/1108485 (https://phabricator.wikimedia.org/T383062)
[22:08:12] <wikibugs>	 (03CR) 10Hashar: "recheck after having deleted /srv/zuul/git/operations/dumps/dcat from both zuul-merger instances (T157818)" [dumps] - 10https://gerrit.wikimedia.org/r/1108844 (https://phabricator.wikimedia.org/T382069) (owner: 10Ladsgroup)
[22:08:21] <wikibugs>	 (03CR) 10BCornwall: "Okay, done playing designer now; I made the summary use the pointer cursor instead of the I beam." [puppet] - 10https://gerrit.wikimedia.org/r/1108485 (https://phabricator.wikimedia.org/T383062) (owner: 10BCornwall)
[22:09:14] <cdanis>	 jouncebot: next
[22:09:14] <jouncebot>	 In 8 hour(s) and 50 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250108T0700)
[22:09:28] <logmsgbot>	 !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2001.codfw.wmnet with OS bookworm
[22:14:39] <jinxer-wm>	 FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1007-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[22:15:32] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm
[22:22:29] <logmsgbot>	 !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2001.codfw.wmnet with OS bookworm
[22:22:58] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm
[22:24:41] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10439042 (10phaultfinder)
[22:34:30] <logmsgbot>	 !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2001.codfw.wmnet with OS bookworm
[22:35:14] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm
[22:39:41] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10439089 (10phaultfinder)
[22:45:27] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:46:39] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:47:58] <logmsgbot>	 !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest2001.codfw.wmnet with OS bookworm
[22:49:17] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:49:29] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53368 bytes in 0.191 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:59:23] <wikibugs>	 (03PS1) 10CDanis: tracing: Disable tracing in CLI mode [core] (wmf/1.44.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1108850 (https://phabricator.wikimedia.org/T340552)
[22:59:38] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10439142 (10phaultfinder)
[23:02:11] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cdanis@deploy2002 using scap backport" [core] (wmf/1.44.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1108850 (https://phabricator.wikimedia.org/T340552) (owner: 10CDanis)
[23:05:41] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2001.codfw.wmnet with OS bookworm
[23:15:28] <jinxer-wm>	 FIRING: [5x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:16:28] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage
[23:18:21] <urbanecm>	 jouncebot: nowandnext
[23:18:21] <jouncebot>	 No deployments scheduled for the next 7 hour(s) and 41 minute(s)
[23:18:21] <jouncebot>	 In 7 hour(s) and 41 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250108T0700)
[23:19:28] <logmsgbot>	 !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2001.codfw.wmnet with reason: host reimage
[23:19:43] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10439189 (10phaultfinder)
[23:19:46] <wikibugs>	 (03PS3) 10Urbanecm: [Growth] enwiki: Deploy Add Link to 5% of users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108724 (https://phabricator.wikimedia.org/T382382)
[23:19:49] <wikibugs>	 (03CR) 10Urbanecm: [C:03+2] [Growth] enwiki: Deploy Add Link to 5% of users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108724 (https://phabricator.wikimedia.org/T382382) (owner: 10Urbanecm)
[23:20:13] <wikibugs>	 (03Merged) 10jenkins-bot: tracing: Disable tracing in CLI mode [core] (wmf/1.44.0-wmf.11) - 10https://gerrit.wikimedia.org/r/1108850 (https://phabricator.wikimedia.org/T340552) (owner: 10CDanis)
[23:20:36] <wikibugs>	 (03Merged) 10jenkins-bot: [Growth] enwiki: Deploy Add Link to 5% of users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108724 (https://phabricator.wikimedia.org/T382382) (owner: 10Urbanecm)
[23:20:45] <urbanecm>	 aha, cdanis is deploying
[23:20:48] <logmsgbot>	 !log cdanis@deploy2002 Started scap sync-world: Backport for [[gerrit:1108850|tracing: Disable tracing in CLI mode (T340552)]]
[23:20:51] <stashbot>	 T340552: Implement and wire-up minimal OpenTelemetry tracing client compatible with OTEL data model - https://phabricator.wikimedia.org/T340552
[23:20:58] <cdanis>	 urbanecm: sorry, I broke maintenance scripts with wmf.11
[23:21:05] <urbanecm>	 no worries, i'll wait
[23:24:35] <cdanis>	 urbanecm: while you wait, you can peek at https://trace.wikimedia.org/trace/227ffb5c77838770d53cc352009c5851 :)
[23:25:26] <urbanecm>	 this...looks very useful!
[23:26:17] <cdanis>	 <3
[23:26:21] <cdanis>	 still more work to do
[23:26:28] <cdanis>	 but yes that's the hope :)
[23:26:30] <urbanecm>	 fingers crossed!
[23:27:02] <logmsgbot>	 !log cdanis@deploy2002 cdanis: Backport for [[gerrit:1108850|tracing: Disable tracing in CLI mode (T340552)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[23:27:05] <stashbot>	 T340552: Implement and wire-up minimal OpenTelemetry tracing client compatible with OTEL data model - https://phabricator.wikimedia.org/T340552
[23:27:12] <logmsgbot>	 !log cdanis@deploy2002 cdanis: Continuing with sync
[23:29:43] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10439216 (10phaultfinder)
[23:32:43] <wikibugs>	 (03CR) 10Scott French: "Thanks in advance for the review, Valentin. This is the first of the two patches I mentioned, which just adds a test case to be a bit more" [puppet] - 10https://gerrit.wikimedia.org/r/1101104 (https://phabricator.wikimedia.org/T377042) (owner: 10Scott French)
[23:33:54] <wikibugs>	 (03CR) 10Scott French: "Thanks again, Valentin. This is now the second, more consequential patch. Since nothing is setting this cookie yet, this "should be" a noo" [puppet] - 10https://gerrit.wikimedia.org/r/1082581 (https://phabricator.wikimedia.org/T377042) (owner: 10Scott French)
[23:33:56] <cdanis>	 23:33:45 K8s deployment progress: 93% (ok: 2284; fail: 0; left: 152)
[23:35:11] <logmsgbot>	 !log cdanis@deploy2002 Finished scap sync-world: Backport for [[gerrit:1108850|tracing: Disable tracing in CLI mode (T340552)]] (duration: 14m 23s)
[23:35:16] <stashbot>	 T340552: Implement and wire-up minimal OpenTelemetry tracing client compatible with OTEL data model - https://phabricator.wikimedia.org/T340552
[23:35:19] <cdanis>	 urbanecm: all done <3
[23:35:23] <urbanecm>	 ty!
[23:36:04] <logmsgbot>	 !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1108724|[Growth] enwiki: Deploy Add Link to 5% of users (T382382)]]
[23:36:07] <stashbot>	 T382382: Add a link (Structured task): Increase rollout on English Wikipedia to 5% - https://phabricator.wikimedia.org/T382382
[23:36:33] <logmsgbot>	 !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2001.codfw.wmnet with OS bookworm
[23:42:19] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1108724|[Growth] enwiki: Deploy Add Link to 5% of users (T382382)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[23:42:22] <stashbot>	 T382382: Add a link (Structured task): Increase rollout on English Wikipedia to 5% - https://phabricator.wikimedia.org/T382382
[23:42:24] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm: Continuing with sync
[23:49:39] <logmsgbot>	 !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1108724|[Growth] enwiki: Deploy Add Link to 5% of users (T382382)]] (duration: 13m 34s)
[23:49:42] <stashbot>	 T382382: Add a link (Structured task): Increase rollout on English Wikipedia to 5% - https://phabricator.wikimedia.org/T382382
[23:49:53] * urbanecm done