[00:05:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: rsyslog-imfile-remedy.service on wikikube-worker1148:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:08:13] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1167976
[00:08:13] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1167976 (owner: TrainBranchBot)
[00:18:38] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1167258 (owner: Krinkle)
[00:19:29] <wikibugs>	 (Merged) jenkins-bot: beta: Remove beta-specific 'http' entry for wgGraphAllowedDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/1167258 (owner: Krinkle)
[00:21:09] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1167259 (https://phabricator.wikimedia.org/T289318) (owner: Krinkle)
[00:21:52] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1037.eqiad.wmnet with reason: host reimage
[00:22:17] <wikibugs>	 (Merged) jenkins-bot: beta: Move beta wikipedia canonical to beta.wmcloud.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1167259 (https://phabricator.wikimedia.org/T289318) (owner: Krinkle)
[00:22:31] <logmsgbot>	 !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1167259|beta: Move beta wikipedia canonical to beta.wmcloud.org (T289318)]]
[00:22:35] <stashbot>	 T289318: Move *.beta.wmflabs.org to *.beta.wmcloud.org - https://phabricator.wikimedia.org/T289318
[00:24:31] <logmsgbot>	 !log krinkle@deploy1003 krinkle: Backport for [[gerrit:1167259|beta: Move beta wikipedia canonical to beta.wmcloud.org (T289318)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[00:26:24] <wikibugs>	 (PS1) Krinkle: beta: Change FileRepo zone URL to upload.wikimedia.beta.wmcloud.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1167985 (https://phabricator.wikimedia.org/T289318)
[00:27:59] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1037.eqiad.wmnet with reason: host reimage
[00:28:28] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1167976 (owner: TrainBranchBot)
[00:29:20] <logmsgbot>	 !log krinkle@deploy1003 krinkle: Continuing with sync
[00:34:45] <logmsgbot>	 !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1167259|beta: Move beta wikipedia canonical to beta.wmcloud.org (T289318)]] (duration: 12m 13s)
[00:34:49] <stashbot>	 T289318: Move *.beta.wmflabs.org to *.beta.wmcloud.org - https://phabricator.wikimedia.org/T289318
[00:39:45] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[00:42:43] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1167985 (https://phabricator.wikimedia.org/T289318) (owner: Krinkle)
[00:43:30] <wikibugs>	 (Merged) jenkins-bot: beta: Change FileRepo zone URL to upload.wikimedia.beta.wmcloud.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1167985 (https://phabricator.wikimedia.org/T289318) (owner: Krinkle)
[00:43:46] <logmsgbot>	 !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1167985|beta: Change FileRepo zone URL to upload.wikimedia.beta.wmcloud.org (T289318)]]
[00:43:53] <stashbot>	 T289318: Move *.beta.wmflabs.org to *.beta.wmcloud.org - https://phabricator.wikimedia.org/T289318
[00:45:41] <logmsgbot>	 !log krinkle@deploy1003 krinkle: Backport for [[gerrit:1167985|beta: Change FileRepo zone URL to upload.wikimedia.beta.wmcloud.org (T289318)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[00:46:40] <icinga-wm>	 PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/c0c858105d9a6d6edb9405fa560c5bfba6e11a5808e356f3fc849e196f5c4227/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[00:47:23] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1037.eqiad.wmnet with OS bookworm
[00:47:34] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware), Patch-For-Review: SSD firmware update for cloudcephosd10[35-41] - https://phabricator.wikimedia.org/T396651#10993884 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin2002 for host cloudcephosd1037.eqiad.wm...
[00:50:01] <logmsgbot>	 !log krinkle@deploy1003 krinkle: Continuing with sync
[00:53:58] <wikibugs>	 (Abandoned) Andrew Bogott: cloudcephosd1037: update nic names for Bookworm. [puppet] - https://gerrit.wikimedia.org/r/1167916 (https://phabricator.wikimedia.org/T396651) (owner: Andrew Bogott)
[00:55:16] <logmsgbot>	 !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1167985|beta: Change FileRepo zone URL to upload.wikimedia.beta.wmcloud.org (T289318)]] (duration: 11m 30s)
[00:55:20] <stashbot>	 T289318: Move *.beta.wmflabs.org to *.beta.wmcloud.org - https://phabricator.wikimedia.org/T289318
[00:57:34] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1038.eqiad.wmnet
[01:03:43] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1038.eqiad.wmnet
[01:06:40] <icinga-wm>	 RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[01:17:28] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1038.eqiad.wmnet
[01:17:31] <logmsgbot>	 !log andrew@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cloudcephosd1038.eqiad.wmnet
[01:50:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[01:55:06] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[02:01:42] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1039.eqiad.wmnet
[02:09:34] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1039.eqiad.wmnet
[02:10:05] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1039.eqiad.wmnet
[02:10:43] <logmsgbot>	 !log andrew@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts cloudcephosd1039.eqiad.wmnet
[02:11:29] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1039.eqiad.wmnet
[02:11:35] <logmsgbot>	 !log andrew@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts cloudcephosd1039.eqiad.wmnet
[02:13:52] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1040.eqiad.wmnet
[02:17:03] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1040.eqiad.wmnet
[02:30:25] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1040.eqiad.wmnet
[02:30:28] <logmsgbot>	 !log andrew@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cloudcephosd1040.eqiad.wmnet
[03:05:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: rsyslog-imfile-remedy.service on wikikube-worker1148:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:13:38] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1041.eqiad.wmnet
[03:15:12] <jinxer-wm>	 FIRING: [21x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[03:21:23] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1041.eqiad.wmnet
[03:35:02] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1041.eqiad.wmnet
[03:35:05] <logmsgbot>	 !log andrew@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cloudcephosd1041.eqiad.wmnet
[03:54:00] <wikibugs>	 SRE, serviceops, Wikimedia-Site-requests, Patch-For-Review: Change $wgMaxArticleSize limit from byte-based to character-based - https://phabricator.wikimedia.org/T275319#10993982 (cscott) My current position is still outlined in T275319#9826396 above, and I'd love to help get some traction on tho...
[03:54:40] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware), Patch-For-Review: SSD firmware update for cloudcephosd10[35-41] - https://phabricator.wikimedia.org/T396651#10993983 (Andrew) Open→Resolved I upgraded the firmware on all of these. My attempts to get them to bookworm at the s...
[05:03:43] <jinxer-wm>	 FIRING: CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[05:03:43] <jinxer-wm>	 FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[05:03:58] <jinxer-wm>	 FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[05:04:08] <jinxer-wm>	 FIRING: [19x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[05:06:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:16:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:20:27] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:25:27] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:35:27] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:38:58] <jinxer-wm>	 FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[05:39:26] <wikibugs>	 (CR) Ayounsi: [C:+1] "nice! to be fully tested but the approach lgtm!" [puppet] - https://gerrit.wikimedia.org/r/1167883 (owner: JHathaway)
[05:40:27] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:48:58] <jinxer-wm>	 RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[05:51:00] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[05:55:06] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:57:59] <wikibugs>	 (PS2) Giuseppe Lavagetto: cache-text: remove static rate limiting [puppet] - https://gerrit.wikimedia.org/r/1166798 (https://phabricator.wikimedia.org/T398668)
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250711T0600)
[06:00:56] <icinga-wm>	 PROBLEM - Host rpki2003 is DOWN: PING CRITICAL - Packet loss = 100%
[06:01:29] <XioNoX>	 that should come back up ^
[06:04:10] <jinxer-wm>	 FIRING: GanetiBGPDown: BGP session down between ganeti2034 and lsw1-a4-codfw - group Ganeti4 - https://wikitech.wikimedia.org/wiki/Ganeti#GanetiBGPDown - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=lsw1-a4-codfw:9804&var-bgp_group=Ganeti4&var-bgp_neighbor=ganeti2034 - https://alerts.wikimedia.org/?q=alertname%3DGanetiBGPDown
[06:05:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job jmx_idp in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:05:45] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[06:06:04] <icinga-wm>	 RECOVERY - Host rpki2003 is UP: PING WARNING - Packet loss = 75%, RTA = 33.79 ms
[06:09:10] <jinxer-wm>	 RESOLVED: GanetiBGPDown: BGP session down between ganeti2034 and lsw1-a4-codfw - group Ganeti4 - https://wikitech.wikimedia.org/wiki/Ganeti#GanetiBGPDown - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=lsw1-a4-codfw:9804&var-bgp_group=Ganeti4&var-bgp_neighbor=ganeti2034 - https://alerts.wikimedia.org/?q=alertname%3DGanetiBGPDown
[06:10:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job jmx_idp in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:11:16] <icinga-wm>	 PROBLEM - Host rpki2003 is DOWN: PING CRITICAL - Packet loss = 100%
[06:11:30] <icinga-wm>	 RECOVERY - Host rpki2003 is UP: PING OK - Packet loss = 0%, RTA = 33.94 ms
[06:25:25] <wikibugs>	 (PS1) Marostegui: mariadb: Productionize es1048 [puppet] - https://gerrit.wikimedia.org/r/1168036 (https://phabricator.wikimedia.org/T395771)
[06:26:31] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Productionize es1048 [puppet] - https://gerrit.wikimedia.org/r/1168036 (https://phabricator.wikimedia.org/T395771) (owner: Marostegui)
[06:30:10] <wikibugs>	 (PS1) Marostegui: db2213: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1168037 (https://phabricator.wikimedia.org/T398928)
[06:30:47] <wikibugs>	 (CR) Marostegui: [C:+2] db2213: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1168037 (https://phabricator.wikimedia.org/T398928) (owner: Marostegui)
[06:31:53] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2213.codfw.wmnet with reason: Maintenance
[06:31:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2213 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P78892 and previous config saved to /var/cache/conftool/dbconfig/20250711-063156-marostegui.json
[06:39:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2213 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P78893 and previous config saved to /var/cache/conftool/dbconfig/20250711-063922-root.json
[06:54:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2213 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P78894 and previous config saved to /var/cache/conftool/dbconfig/20250711-065428-root.json
[06:55:40] <wikibugs>	 (PS1) Krinkle: varnish: Improve GeoIP to use cookie domain similar to prod [puppet] - https://gerrit.wikimedia.org/r/1168038 (https://phabricator.wikimedia.org/T99226)
[06:56:52] <wikibugs>	 (PS2) Krinkle: varnish: Improve GeoIP to use cookie domain similar to prod [puppet] - https://gerrit.wikimedia.org/r/1168038 (https://phabricator.wikimedia.org/T99226)
[06:56:54] <wikibugs>	 (CR) Krinkle: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1168038 (https://phabricator.wikimedia.org/T99226) (owner: Krinkle)
[07:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250711T0700)
[07:08:43] <jinxer-wm>	 FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:08:48] <jinxer-wm>	 FIRING: [19x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:09:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2213 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P78895 and previous config saved to /var/cache/conftool/dbconfig/20250711-070933-root.json
[07:18:42] <jinxer-wm>	 FIRING: [21x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[07:18:43] <jinxer-wm>	 RESOLVED: CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:18:43] <jinxer-wm>	 FIRING: [19x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:23:31] <wikibugs>	 ops-codfw, SRE, DC-Ops: Inbound errors on interface cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://phabricator.wikimedia.org/T399097#10994182 (cmooney) The link remains down, Arelion are awaiting a replacement card for an optical system in Atlanta it seems:...
[07:23:43] <jinxer-wm>	 FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:23:48] <jinxer-wm>	 FIRING: [19x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:24:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2213 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P78896 and previous config saved to /var/cache/conftool/dbconfig/20250711-072439-root.json
[07:28:43] <jinxer-wm>	 FIRING: [4x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:28:48] <jinxer-wm>	 FIRING: [18x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:28:52] <wikibugs>	 ops-codfw, SRE, DC-Ops: Arelion IC-374549 100G Transport outage (cr1-codfw -> cr1-eqiad) July 2025 - https://phabricator.wikimedia.org/T399097#10994194 (cmooney)
[07:33:43] <jinxer-wm>	 FIRING: [18x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:34:44] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for addshore - https://phabricator.wikimedia.org/T399152#10994199 (MoritzMuehlenhoff)
[07:34:51] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for addshore - https://phabricator.wikimedia.org/T399152#10994201 (MoritzMuehlenhoff) @Milimetric @Ahoelzl @Ottomata This needs your approval
[07:36:48] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1003.eqiad.wmnet with OS trixie
[07:36:58] <wikibugs>	 SRE, Infrastructure-Foundations: Prepare our custom installer and the base layer for Trixie - https://phabricator.wikimedia.org/T391083#10994206 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin1003 for host sretest1003.eqiad.wmnet with OS trixie
[07:37:10] <wikibugs>	 (PS1) Jgiannelos: changeprop: Ignore more commons NS on pcs rules [deployment-charts] - https://gerrit.wikimedia.org/r/1168041
[07:38:43] <jinxer-wm>	 FIRING: [17x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:43:43] <jinxer-wm>	 FIRING: [16x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:43:48] <jinxer-wm>	 FIRING: [3x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:43:58] <jinxer-wm>	 FIRING: [4x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:48:43] <jinxer-wm>	 FIRING: [10x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:48:43] <jinxer-wm>	 RESOLVED: [3x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:48:53] <jinxer-wm>	 RESOLVED: [3x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:50:09] <wikibugs>	 (PS3) Krinkle: varnish: Improve GeoIP to use cookie domain similar to prod [puppet] - https://gerrit.wikimedia.org/r/1168038 (https://phabricator.wikimedia.org/T99226)
[07:50:58] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users group (LDAP and kerberos), for aprum - https://phabricator.wikimedia.org/T398650#10994228 (cmooney)
[07:53:43] <jinxer-wm>	 RESOLVED: [9x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:56:22] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1003.eqiad.wmnet with reason: host reimage
[08:02:21] <wikibugs>	 (CR) Krinkle: "I'm looking at `./modules/varnish/files/tests/docker_run.sh cp1110.eqiad.wmnet 1168038` (after simulating a nearby failure on PS2) to look" [puppet] - https://gerrit.wikimedia.org/r/1168038 (https://phabricator.wikimedia.org/T99226) (owner: Krinkle)
[08:02:34] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1003.eqiad.wmnet with reason: host reimage
[08:06:45] <wikibugs>	 SRE, decommission-hardware, Infrastructure-Foundations: decommission puppetserver2003 - https://phabricator.wikimedia.org/T398607#10994286 (MoritzMuehlenhoff)
[08:18:05] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.decommission for hosts puppetserver2003.codfw.wmnet
[08:19:50] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2223.codfw.wmnet with reason: Maintenance
[08:19:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2223 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P78897 and previous config saved to /var/cache/conftool/dbconfig/20250711-081953-marostegui.json
[08:20:47] <logmsgbot>	 jmm@cumin1003 decommission (PID 1245901) is awaiting input
[08:26:45] <wikibugs>	 (PS2) Jgiannelos: changeprop: Ignore more namespace on pcs transclusion rules [deployment-charts] - https://gerrit.wikimedia.org/r/1168041
[08:27:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2223 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P78898 and previous config saved to /var/cache/conftool/dbconfig/20250711-082725-root.json
[08:27:36] <wikibugs>	 (PS3) Jgiannelos: changeprop: Ignore more namespace on pcs transclusion rules [deployment-charts] - https://gerrit.wikimedia.org/r/1168041 (https://phabricator.wikimedia.org/T397072)
[08:30:46] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.dns.netbox
[08:33:13] <wikibugs>	 SRE, collaboration-services, Infrastructure-Foundations, Mail, and 2 others: Replace Exim on VRTS servers with Postfix - https://phabricator.wikimedia.org/T378028#10994459 (MoritzMuehlenhoff) >>! In T378028#10993697, @Dzahn wrote: > But another question comes to mind.. and that is.. do VRTS machi...
[08:33:58] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: puppetserver2003.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003"
[08:34:17] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: puppetserver2003.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003"
[08:34:17] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:34:18] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts puppetserver2003.codfw.wmnet
[08:34:27] <wikibugs>	 06SRE, 10decommission-hardware, 06Infrastructure-Foundations: decommission puppetserver2003 - https://phabricator.wikimedia.org/T398607#10994460 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin1003 for hosts: `puppetserver2003.codfw.wmnet` - puppetserver2003.codfw.wmnet (**PASS**...
[08:36:08] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove puppetserver2003 [puppet] - 10https://gerrit.wikimedia.org/r/1168121 (https://phabricator.wikimedia.org/T398607)
[08:37:36] <wikibugs>	 (03CR) 10Arnaudb: [C:03+1] "lgtm! thanks for the amend! let's merge and move on to the next thing!" [puppet] - 10https://gerrit.wikimedia.org/r/1129920 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn)
[08:41:28] <wikibugs>	 (03PS1) 10Ayounsi: magru: add Ufinet transit [homer/public] - 10https://gerrit.wikimedia.org/r/1168122
[08:41:45] <wikibugs>	 (03PS2) 10Ayounsi: magru: add Ufinet transit [homer/public] - 10https://gerrit.wikimedia.org/r/1168122 (https://phabricator.wikimedia.org/T389767)
[08:41:47] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: sync
[08:42:06] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync
[08:42:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2223 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P78899 and previous config saved to /var/cache/conftool/dbconfig/20250711-084230-root.json
[08:45:37] <wikibugs>	 (03CR) 10Elukey: [C:03+2] profile::docker::reporter: add Wikikube and ML serve prod clusters [puppet] - 10https://gerrit.wikimedia.org/r/1167885 (https://phabricator.wikimedia.org/T397696) (owner: 10Elukey)
[08:51:28] <logmsgbot>	 !log elukey@deploy1003 helmfile [eqiad] START helmfile.d/admin 'sync'.
[08:51:29] <logmsgbot>	 !log elukey@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'sync'.
[08:51:43] <wikibugs>	 (03Abandoned) 10Slyngshede: data.yaml offboarding trokhymovych [puppet] - 10https://gerrit.wikimedia.org/r/1164707 (owner: 10Slyngshede)
[08:51:51] <logmsgbot>	 !log elukey@deploy1003 helmfile [codfw] START helmfile.d/admin 'sync'.
[08:51:55] <logmsgbot>	 !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'sync'.
[08:54:25] <wikibugs>	 (03PS4) 10Jgiannelos: changeprop: Ignore more namespace on pcs transclusion rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1168041 (https://phabricator.wikimedia.org/T397072)
[08:57:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2223 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P78900 and previous config saved to /var/cache/conftool/dbconfig/20250711-085736-root.json
[09:00:14] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2213 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/1168124 (https://phabricator.wikimedia.org/T399280)
[09:01:09] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Thank you for the patch -- LGTM, I'm thinking we can add raises=False to task_comment() since I don't think it is fatal if we don't commen" [cookbooks] - 10https://gerrit.wikimedia.org/r/1167887 (owner: 10Volans)
[09:02:25] <wikibugs>	 (03CR) 10Elukey: [C:03+2] "I think it is worth to try it, let's see how it goes and if we have to follow up or not!" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1156315 (owner: 10Hashar)
[09:03:29] <wikibugs>	 (03CR) 10Elukey: [C:03+2] "Done" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1156733 (owner: 10Hashar)
[09:03:39] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Remove puppetserver2003 [puppet] - 10https://gerrit.wikimedia.org/r/1168121 (https://phabricator.wikimedia.org/T398607) (owner: 10Muehlenhoff)
[09:03:55] <wikibugs>	 (03Abandoned) 10Elukey: TEST - fix http_boot_once for reimage [cookbooks] - 10https://gerrit.wikimedia.org/r/1166378 (owner: 10Elukey)
[09:04:16] <logmsgbot>	 !log jmm@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1003.eqiad.wmnet with OS trixie
[09:04:22] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Prepare our custom installer and the base layer for Trixie - https://phabricator.wikimedia.org/T391083#10994543 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin1003 for host sretest1003.eqiad.wmnet with OS trixie executed with errors: - srete...
[09:05:27] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, and 2 others: decommission puppetserver2003 - https://phabricator.wikimedia.org/T398607#10994546 (10MoritzMuehlenhoff)
[09:06:42] <wikibugs>	 (03CR) 10Elukey: [C:03+1] I/F: simplify Phabricator usage [cookbooks] - 10https://gerrit.wikimedia.org/r/1167886 (owner: 10Volans)
[09:07:53] <wikibugs>	 (03PS5) 10Jgiannelos: changeprop: Ignore more namespace on pcs transclusion rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1168041 (https://phabricator.wikimedia.org/T397072)
[09:09:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[09:12:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2223 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P78901 and previous config saved to /var/cache/conftool/dbconfig/20250711-091242-root.json
[09:12:47] <wikibugs>	 (03PS6) 10Jgiannelos: changeprop: Ignore more namespace on pcs transclusion rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1168041 (https://phabricator.wikimedia.org/T397072)
[09:13:25] <wikibugs>	 (03PS7) 10Jgiannelos: changeprop: Ignore more namespace on pcs transclusion rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1168041 (https://phabricator.wikimedia.org/T397072)
[09:15:19] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 22 hosts with reason: Primary switchover s5 T399280
[09:15:22] <stashbot>	 T399280: Switchover s5 master (db2192 -> db2213) - https://phabricator.wikimedia.org/T399280
[09:15:59] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: Replace Exim on VRTS servers with Postfix - https://phabricator.wikimedia.org/T378028#10994585 (10Arnoldokoth) @Dzahn We used to run it on VMs but we kept running into resource issues (especially with `clamav`) even after...
[09:16:39] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Netbox: remove old cr2-codfw Switch Control Board inventory items - https://phabricator.wikimedia.org/T398940#10994586 (10ayounsi) We can remove them from Netbox if they're not in the device anymore. and add them to the spare tracking...
[09:18:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db2213 from API/vslow/dump T399280', diff saved to https://phabricator.wikimedia.org/P78902 and previous config saved to /var/cache/conftool/dbconfig/20250711-091812-root.json
[09:20:35] <wikibugs>	 (03CR) 10Gmodena: "LGTM. Just left two nit/questions." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167438 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey)
[09:21:57] <logmsgbot>	 jmm@cumin1003 reimage (PID 1251864) is awaiting input
[09:22:14] <wikibugs>	 (03PS2) 10Elukey: EventStreamConfig: add the maps.tiles_change_bookworm stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167438 (https://phabricator.wikimedia.org/T381565)
[09:22:36] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] mariadb: Promote db2213 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/1168124 (https://phabricator.wikimedia.org/T399280) (owner: 10Gerrit maintenance bot)
[09:22:52] <wikibugs>	 (03CR) 10Filippo Giunchedi: "LGTM overall, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/1167157 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli)
[09:23:29] <wikibugs>	 (03PS8) 10Jgiannelos: changeprop: Ignore more namespaces on pcs transclusion rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1168041 (https://phabricator.wikimedia.org/T397072)
[09:24:35] <wikibugs>	 (03CR) 10Elukey: "Thanks Gabriele!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167438 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey)
[09:24:45] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[09:25:15] <moritzm>	 !log imported perccli for trixie-wikimedia T391083
[09:25:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:19] <stashbot>	 T391083: Prepare our custom installer and the base layer for Trixie - https://phabricator.wikimedia.org/T391083
[09:27:49] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1003.eqiad.wmnet with OS trixie
[09:27:59] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Prepare our custom installer and the base layer for Trixie - https://phabricator.wikimedia.org/T391083#10994632 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin1003 for host sretest1003.eqiad.wmnet with OS trixie
[09:28:06] <wikibugs>	 (03PS9) 10Jgiannelos: changeprop: Ignore more namespaces on pcs transclusion rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1168041 (https://phabricator.wikimedia.org/T397072)
[09:29:27] <marostegui>	 !log Starting s5 codfw failover from db2192 to db2213 - T399280
[09:29:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:30] <stashbot>	 T399280: Switchover s5 master (db2192 -> db2213) - https://phabricator.wikimedia.org/T399280
[09:30:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2213 to s5 primary T399280', diff saved to https://phabricator.wikimedia.org/P78903 and previous config saved to /var/cache/conftool/dbconfig/20250711-093006-marostegui.json
[09:31:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2192 T399280', diff saved to https://phabricator.wikimedia.org/P78904 and previous config saved to /var/cache/conftool/dbconfig/20250711-093115-root.json
[09:33:32] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] changeprop: Ignore more namespaces on pcs transclusion rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1168041 (https://phabricator.wikimedia.org/T397072) (owner: 10Jgiannelos)
[09:34:26] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1167886 (owner: 10Volans)
[09:37:03] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: Replace Exim on VRTS servers with Postfix - https://phabricator.wikimedia.org/T378028#10994674 (10MoritzMuehlenhoff) That said, if you strictly need a physical host for the tests, you could use puppetserver2003. I decommed...
[09:38:03] <wikibugs>	 (03PS1) 10Marostegui: db2192: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1168131 (https://phabricator.wikimedia.org/T398928)
[09:38:44] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db2192: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1168131 (https://phabricator.wikimedia.org/T398928) (owner: 10Marostegui)
[09:39:14] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2192.codfw.wmnet with reason: Maintenance
[09:42:51] <wikibugs>	 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10994702 (10Ladsgroup)
[09:43:13] <wikibugs>	 (03PS1) 10Cathal Mooney: admin: grant aranyap access and add to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1168132 (https://phabricator.wikimedia.org/T398650)
[09:44:11] <wikibugs>	 (03CR) 10CI reject: [V:04-1] admin: grant aranyap access and add to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1168132 (https://phabricator.wikimedia.org/T398650) (owner: 10Cathal Mooney)
[09:45:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2192 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P78905 and previous config saved to /var/cache/conftool/dbconfig/20250711-094527-root.json
[09:45:40] <wikibugs>	 (03PS2) 10Cathal Mooney: admin: grant aranyap access and add to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1168132 (https://phabricator.wikimedia.org/T398650)
[09:46:59] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1168132 (https://phabricator.wikimedia.org/T398650) (owner: 10Cathal Mooney)
[09:55:06] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[09:55:20] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] admin: grant aranyap access and add to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1168132 (https://phabricator.wikimedia.org/T398650) (owner: 10Cathal Mooney)
[09:55:36] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] changeprop: Ignore more namespaces on pcs transclusion rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1168041 (https://phabricator.wikimedia.org/T397072) (owner: 10Jgiannelos)
[09:57:11] <wikibugs>	 (03CR) 10Ladsgroup: "None of the tables being dropped exist in production but maybe we should reload the triggers or something but it's noop AFAIK" [puppet] - 10https://gerrit.wikimedia.org/r/1167576 (https://phabricator.wikimedia.org/T398946) (owner: 10Ladsgroup)
[09:57:29] <wikibugs>	 (03Merged) 10jenkins-bot: changeprop: Ignore more namespaces on pcs transclusion rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1168041 (https://phabricator.wikimedia.org/T397072) (owner: 10Jgiannelos)
[09:58:41] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+1] "I try to merge and deploy it next week if no one beats me to it." [dumps] - 10https://gerrit.wikimedia.org/r/1149464 (owner: 10Amire80)
[09:59:26] <wikibugs>	 (03PS2) 10Ayounsi: WIP: use Homer to configure the network [cookbooks] - 10https://gerrit.wikimedia.org/r/1166407
[10:00:05] <wikibugs>	 (03CR) 10Btullis: [C:03+1] Make functionally identical descriptions the same [dumps] - 10https://gerrit.wikimedia.org/r/1149464 (owner: 10Amire80)
[10:00:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2192 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P78906 and previous config saved to /var/cache/conftool/dbconfig/20250711-100033-root.json
[10:01:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2192', diff saved to https://phabricator.wikimedia.org/P78907 and previous config saved to /var/cache/conftool/dbconfig/20250711-100106-root.json
[10:03:04] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "When merged, this will be automatically added to the mediawiki-cli image during its next full build, due to this: https://gitlab.wikimedia" [dumps] - 10https://gerrit.wikimedia.org/r/1149464 (owner: 10Amire80)
[10:03:58] <wikibugs>	 (03CR) 10Gmodena: [C:03+1] EventStreamConfig: add the maps.tiles_change_bookworm stream (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167438 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey)
[10:05:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2192 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P78908 and previous config saved to /var/cache/conftool/dbconfig/20250711-100522-root.json
[10:07:00] <wikibugs>	 (03CR) 10CI reject: [V:04-1] WIP: use Homer to configure the network [cookbooks] - 10https://gerrit.wikimedia.org/r/1166407 (owner: 10Ayounsi)
[10:09:43] <wikibugs>	 (03CR) 10Elukey: EventStreamConfig: add the maps.tiles_change_bookworm stream (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167438 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey)
[10:10:41] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users group (LDAP and kerberos), for aprum - https://phabricator.wikimedia.org/T398650#10994757 (10cmooney) Hi @aranyap.    I have added your public ssh key and username 'aranyap' to the //analytics-privatedata-users...
[10:10:45] <wikibugs>	 (03PS3) 10Ayounsi: WIP: use Homer to configure the network [cookbooks] - 10https://gerrit.wikimedia.org/r/1166407
[10:11:39] <wikibugs>	 (03PS1) 10Hnowlan: changeprop: correct amended regex [deployment-charts] - 10https://gerrit.wikimedia.org/r/1168144 (https://phabricator.wikimedia.org/T397072)
[10:11:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[10:12:30] <wikibugs>	 (03CR) 10Jgiannelos: [C:03+1] changeprop: correct amended regex [deployment-charts] - 10https://gerrit.wikimedia.org/r/1168144 (https://phabricator.wikimedia.org/T397072) (owner: 10Hnowlan)
[10:15:59] <wikibugs>	 (03PS2) 10Hnowlan: changeprop: correct amended regex [deployment-charts] - 10https://gerrit.wikimedia.org/r/1168144 (https://phabricator.wikimedia.org/T397072)
[10:17:25] <wikibugs>	 (03CR) 10CI reject: [V:04-1] WIP: use Homer to configure the network [cookbooks] - 10https://gerrit.wikimedia.org/r/1166407 (owner: 10Ayounsi)
[10:18:59] <wikibugs>	 (03PS4) 10Ayounsi: WIP: use Homer to configure the network [cookbooks] - 10https://gerrit.wikimedia.org/r/1166407
[10:19:00] <logmsgbot>	 jmm@cumin1003 reimage (PID 1251864) is awaiting input
[10:19:58] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] changeprop: correct amended regex [deployment-charts] - 10https://gerrit.wikimedia.org/r/1168144 (https://phabricator.wikimedia.org/T397072) (owner: 10Hnowlan)
[10:20:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2192 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P78909 and previous config saved to /var/cache/conftool/dbconfig/20250711-102027-root.json
[10:25:40] <wikibugs>	 (03PS1) 10Muehlenhoff: late-command: Check whether qemu_fw_cfg.ko is present [puppet] - 10https://gerrit.wikimedia.org/r/1168145 (https://phabricator.wikimedia.org/T391083)
[10:26:27] <logmsgbot>	 !log jmm@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1003.eqiad.wmnet with OS trixie
[10:27:44] <wikibugs>	 (03PS2) 10Elukey: services: configure tegola in codfw to use maps-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165550 (https://phabricator.wikimedia.org/T381565)
[10:27:44] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Prepare our custom installer and the base layer for Trixie - https://phabricator.wikimedia.org/T391083#10994794 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin1003 for host sretest1003.eqiad.wmnet with OS trixie execute...
[10:27:44] <wikibugs>	 (03CR) 10Elukey: services: configure tegola in codfw to use maps-test (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165550 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey)
[10:27:55] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180#10994796 (10cmooney)
[10:29:23] <wikibugs>	 (03Merged) 10jenkins-bot: changeprop: correct amended regex [deployment-charts] - 10https://gerrit.wikimedia.org/r/1168144 (https://phabricator.wikimedia.org/T397072) (owner: 10Hnowlan)
[10:30:40] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply
[10:30:48] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: apply
[10:31:26] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply
[10:31:52] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: apply
[10:32:05] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: apply
[10:32:15] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply
[10:34:35] <wikibugs>	 (03PS1) 10Ladsgroup: private_tables: Drop private tables that don't exist in production [puppet] - 10https://gerrit.wikimedia.org/r/1168148 (https://phabricator.wikimedia.org/T398945)
[10:35:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2192 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P78910 and previous config saved to /var/cache/conftool/dbconfig/20250711-103533-root.json
[10:36:35] <wikibugs>	 (03CR) 10CI reject: [V:04-1] WIP: use Homer to configure the network [cookbooks] - 10https://gerrit.wikimedia.org/r/1166407 (owner: 10Ayounsi)
[10:37:58] <wikibugs>	 (03PS1) 10Hnowlan: profile::hcaptcha: don't serve / or robots.txt [puppet] - 10https://gerrit.wikimedia.org/r/1168149 (https://phabricator.wikimedia.org/T397841)
[10:39:51] <wikibugs>	 (03PS1) 10Tiziano Fogli: nrpe wrapper: add wrapper to be invoked a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/1168150 (https://phabricator.wikimedia.org/T395446)
[10:41:01] <wikibugs>	 (03PS5) 10Ayounsi: WIP: use Homer to configure the network [cookbooks] - 10https://gerrit.wikimedia.org/r/1166407
[10:41:08] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users group (shell membership, ssh key) for STran - https://phabricator.wikimedia.org/T399107#10994846 (10STran)
[10:49:32] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users group (shell membership, ssh key) for STran - https://phabricator.wikimedia.org/T399107#10994883 (10cmooney)
[10:50:37] <wikibugs>	 (03PS1) 10Cathal Mooney: admin: add user 'stran' to analytics-privatedata-users and enable kerberos [puppet] - 10https://gerrit.wikimedia.org/r/1168152 (https://phabricator.wikimedia.org/T399107)
[10:50:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2192 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P78911 and previous config saved to /var/cache/conftool/dbconfig/20250711-105039-root.json
[10:50:43] <icinga-wm>	 PROBLEM - ganeti-noded running on ganeti1036 is CRITICAL: PROCS CRITICAL: 3 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti
[10:51:43] <icinga-wm>	 RECOVERY - ganeti-noded running on ganeti1036 is OK: PROCS OK: 2 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti
[10:51:57] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users group (shell membership, ssh key) for STran - https://phabricator.wikimedia.org/T399107#10994891 (10cmooney) I verified the above key over slack and can confirm that it is not in use for WMCS access.
[10:58:34] <wikibugs>	 (03PS1) 10Btullis: Increase the limitranges for the spark-history service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1168153 (https://phabricator.wikimedia.org/T396617)
[10:59:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1039 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P78912 and previous config saved to /var/cache/conftool/dbconfig/20250711-105922-root.json
[11:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250711T0700)
[11:00:04] <jouncebot>	 jelto, arnoldokoth, and mutante: GitLab version upgrades (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250711T1100). Please do the needful.
[11:03:41] <wikibugs>	 (03CR) 10Marostegui: "thanks, fine to merge from my side" [puppet] - 10https://gerrit.wikimedia.org/r/1167576 (https://phabricator.wikimedia.org/T398946) (owner: 10Ladsgroup)
[11:14:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1039 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P78913 and previous config saved to /var/cache/conftool/dbconfig/20250711-111428-root.json
[11:15:13] <jinxer-wm>	 FIRING: [21x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[11:20:55] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] late-command: Check whether qemu_fw_cfg.ko is present [puppet] - 10https://gerrit.wikimedia.org/r/1168145 (https://phabricator.wikimedia.org/T391083) (owner: 10Muehlenhoff)
[11:26:32] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es1031.eqiad.wmnet with reason: Maintenance
[11:26:45] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1037.eqiad.wmnet with OS bullseye
[11:29:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P78914 and previous config saved to /var/cache/conftool/dbconfig/20250711-112933-root.json
[11:29:42] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2192.codfw.wmnet with reason: Maintenance
[11:30:13] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.hosts.remove-downtime for es1031.eqiad.wmnet
[11:30:13] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for es1031.eqiad.wmnet
[11:31:04] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es1032.eqiad.wmnet with reason: Maintenance
[11:34:58] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es1034.eqiad.wmnet with reason: Maintenance
[11:35:32] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depool es1034 for upgrade', diff saved to https://phabricator.wikimedia.org/P78915 and previous config saved to /var/cache/conftool/dbconfig/20250711-113532-fceratto.json
[11:44:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P78916 and previous config saved to /var/cache/conftool/dbconfig/20250711-114439-root.json
[11:45:00] <wikibugs>	 (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1168152 (https://phabricator.wikimedia.org/T399107) (owner: 10Cathal Mooney)
[11:45:51] <wikibugs>	 (03PS1) 10Btullis: Enable greater timeouts and rewriting for the spark-history service [puppet] - 10https://gerrit.wikimedia.org/r/1168165 (https://phabricator.wikimedia.org/T396617)
[11:46:35] <wikibugs>	 (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6242/co" [puppet] - 10https://gerrit.wikimedia.org/r/1168165 (https://phabricator.wikimedia.org/T396617) (owner: 10Btullis)
[11:46:45] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[11:52:26] <wikibugs>	 06SRE, 10SRE-Access-Requests: Add Sowmya Guru to list of "WMDE group" approvers on Wikitech - https://phabricator.wikimedia.org/T398686#10995078 (10sowmya.guru) Hey folks the NDA is signed by me!
[11:52:46] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.hosts.reboot-single for host es1034.eqiad.wmnet
[11:56:26] <icinga-wm>	 PROBLEM - Ensure trafficserver_exporter is running for instance backend on cp7006 is CRITICAL: PROCS CRITICAL: 3 processes with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint http://127.0.0.1:3128/_stats --port 9122 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[11:57:26] <icinga-wm>	 RECOVERY - Ensure trafficserver_exporter is running for instance backend on cp7006 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint http://127.0.0.1:3128/_stats --port 9122 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[11:58:00] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Upgrade db2200 to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1168166 (https://phabricator.wikimedia.org/T399298)
[12:01:29] <logmsgbot>	 !log gmodena@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply
[12:01:41] <logmsgbot>	 !log gmodena@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[12:02:00] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] late-command: Check whether qemu_fw_cfg.ko is present [puppet] - 10https://gerrit.wikimedia.org/r/1168145 (https://phabricator.wikimedia.org/T391083) (owner: 10Muehlenhoff)
[12:03:14] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host es1034.eqiad.wmnet
[12:04:10] <logmsgbot>	 !log gmodena@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply
[12:04:15] <logmsgbot>	 !log gmodena@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[12:06:51] <logmsgbot>	 !log gmodena@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply
[12:06:53] <logmsgbot>	 !log gmodena@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[12:08:09] <wikibugs>	 (03CR) 10Daimona Eaytoy: mariadb: Remove tables that are not cataloged from filtered_tables.txt (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1167576 (https://phabricator.wikimedia.org/T398946) (owner: 10Ladsgroup)
[12:09:58] <wikibugs>	 (03CR) 10Muehlenhoff: openstack: nova: Load nf_conntrack module at boot (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1167899 (https://phabricator.wikimedia.org/T399212) (owner: 10FNegri)
[12:16:45] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.hosts.remove-downtime for es1034.eqiad.wmnet
[12:16:45] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for es1034.eqiad.wmnet
[12:17:11] <logmsgbot>	 !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1037.eqiad.wmnet with OS bullseye
[12:17:32] <wikibugs>	 (03PS1) 10Marostegui: db2187: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1168167 (https://phabricator.wikimedia.org/T399298)
[12:17:34] <logmsgbot>	 !log gmodena@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply
[12:17:38] <logmsgbot>	 !log gmodena@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[12:17:52] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.pool es1034 gradually with 4 steps - Pooling in
[12:17:54] <logmsgbot>	 !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es1034 gradually with 4 steps - Pooling in
[12:18:15] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.pool es1034 gradually with 4 steps - Pooling in
[12:18:18] <logmsgbot>	 !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es1034 gradually with 4 steps - Pooling in
[12:18:30] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.pool es1034 gradually with 4 steps - Pooling in
[12:18:38] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1003.eqiad.wmnet with OS trixie
[12:20:46] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1037.eqiad.wmnet with OS bullseye
[12:22:26] <logmsgbot>	 !log gmodena@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply
[12:22:31] <logmsgbot>	 !log gmodena@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[12:24:49] <logmsgbot>	 !log gmodena@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply
[12:24:53] <logmsgbot>	 !log gmodena@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[12:28:09] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es1032.eqiad.wmnet with reason: Maintenance
[12:28:47] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.depool es1032 - Depooling RO host
[12:28:51] <wikibugs>	 (03PS1) 10Daimona Eaytoy: Clean up some settings for special wikis no longer in wikipedia group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1168169 (https://phabricator.wikimedia.org/T183549)
[12:28:56] <logmsgbot>	 !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.depool (exit_code=99) es1032 - Depooling RO host
[12:29:40] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Clean up some settings for special wikis no longer in wikipedia group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1168169 (https://phabricator.wikimedia.org/T183549) (owner: 10Daimona Eaytoy)
[12:30:12] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bookworm
[12:30:14] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bookworm
[12:30:14] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bookworm
[12:30:20] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.depool es1032 - Depooling RO host
[12:30:24] <logmsgbot>	 !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.depool (exit_code=99) es1032 - Depooling RO host
[12:30:27] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10995162 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1049.eq...
[12:30:28] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10995163 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1051.eq...
[12:30:31] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10995164 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1050.eq...
[12:31:11] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] admin: add user 'stran' to analytics-privatedata-users and enable kerberos [puppet] - 10https://gerrit.wikimedia.org/r/1168152 (https://phabricator.wikimedia.org/T399107) (owner: 10Cathal Mooney)
[12:33:12] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#10995177 (10Jclark-ctr)
[12:33:15] <logmsgbot>	 !log jmm@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1003.eqiad.wmnet with reason: host reimage
[12:34:38] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users group (shell membership, ssh key) for STran - https://phabricator.wikimedia.org/T399107#10995186 (10cmooney) Ok @STran I think you should be good to go now if you want to test the access.
[12:38:22] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1003.eqiad.wmnet with reason: host reimage
[12:39:08] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Arelion IC-374549 100G Transport outage (cr1-codfw -> cr1-eqiad) July 2025 - https://phabricator.wikimedia.org/T399097#10995188 (10cmooney) Still no fix ` 7/11/2025 10:22:47 AM   ETA is 1:00 PM UTC 7/11/2025 9:23:02  AM   We have escalated with our vendor to ensure the testing a...
[12:39:26] <jakob_WMDE>	 hello, is T399297 the task to keep an eye on regarding the beta sites being down?
[12:39:26] <stashbot>	 T399297: Widespread instances down in project deployment-prep - https://phabricator.wikimedia.org/T399297
[12:40:28] <denisse>	 !incidents
[12:40:28] <sirenbot>	 No incidents occurred in the past 24 hours for team SRE
[12:40:35] <denisse>	 🥳
[12:41:33] <wikibugs>	 (03PS4) 10Dreamy Jazz: WIP: Prep hCaptcha config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148390 (https://phabricator.wikimedia.org/T382148) (owner: 10Reedy)
[12:42:08] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db2187: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1168167 (https://phabricator.wikimedia.org/T399298) (owner: 10Marostegui)
[12:42:45] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2187.codfw.wmnet with reason: Maintenance
[12:42:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2187 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P78919 and previous config saved to /var/cache/conftool/dbconfig/20250711-124249-marostegui.json
[12:49:14] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage
[12:49:15] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage
[12:49:29] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage
[12:50:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2187 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P78921 and previous config saved to /var/cache/conftool/dbconfig/20250711-125022-root.json
[12:52:35] <wikibugs>	 (03PS1) 10Btullis: Increase the CPU and memory limits for the spark-history service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1168171 (https://phabricator.wikimedia.org/T396617)
[12:52:55] <logmsgbot>	 !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1003.eqiad.wmnet with OS trixie
[12:52:57] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage
[12:53:02] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Prepare our custom installer and the base layer for Trixie - https://phabricator.wikimedia.org/T391083#10995233 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin1003 for host sretest1003.eqiad.wmnet with OS trixie completed: - sretest1003 (**P...
[12:55:29] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove obsolete Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1168172
[12:56:22] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage
[12:57:15] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Prepare our custom installer and the base layer for Trixie - https://phabricator.wikimedia.org/T391083#10995239 (10MoritzMuehlenhoff) Installations with Trixie are now possible, which directly install the backport of Puppet 7, all known issues affecting the Puppet base clas...
[12:57:46] <logmsgbot>	 !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
[12:57:53] <logmsgbot>	 !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
[13:00:22] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10995251 (10VRiley-WMF) I will look into this. I believe it may be due to lvs1017's nic being misconfigured. I will update it and test it out
[13:03:11] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage
[13:03:56] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es1034 gradually with 4 steps - Pooling in
[13:05:01] <wikibugs>	 (03PS6) 10Ayounsi: WIP: use Homer to configure the network [cookbooks] - 10https://gerrit.wikimedia.org/r/1166407
[13:05:29] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2187 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P78924 and previous config saved to /var/cache/conftool/dbconfig/20250711-130528-root.json
[13:10:17] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bookworm
[13:10:30] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10995338 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1051.eqiad....
[13:11:35] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd1037.eqiad.wmnet with OS bullseye
[13:11:50] <wikibugs>	 (03PS3) 10Ayounsi: magru: add Ufinet transit [homer/public] - 10https://gerrit.wikimedia.org/r/1168122 (https://phabricator.wikimedia.org/T389767)
[13:12:57] <wikibugs>	 (03PS5) 10Dreamy Jazz: WIP: Prep hCaptcha config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148390 (https://phabricator.wikimedia.org/T382148) (owner: 10Reedy)
[13:14:04] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bookworm
[13:14:16] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10995347 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1049.eqiad....
[13:14:56] <wikibugs>	 (03PS1) 10Jcrespo: raid: Do not use the pipe symbol '|' as a separator for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/1168176 (https://phabricator.wikimedia.org/T395446)
[13:15:10] <wikibugs>	 (03PS2) 10Jcrespo: raid: Do not use the pipe symbol '|' as a separator for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/1168176 (https://phabricator.wikimedia.org/T395446)
[13:16:50] <wikibugs>	 (03CR) 10Jcrespo: "This is a draft so I do not forget over the weekend. This is (I believe) a bug on raid output." [puppet] - 10https://gerrit.wikimedia.org/r/1168176 (https://phabricator.wikimedia.org/T395446) (owner: 10Jcrespo)
[13:17:20] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Remove obsolete Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1168172 (owner: 10Muehlenhoff)
[13:18:14] <wikibugs>	 (03PS2) 10Gmodena: eventbus: register with team-data-engineering. [alerts] - 10https://gerrit.wikimedia.org/r/1168119 (https://phabricator.wikimedia.org/T398437)
[13:18:39] <wikibugs>	 (03PS3) 10Gmodena: eventgate: alert on traffic deviation. [alerts] - 10https://gerrit.wikimedia.org/r/1167620 (https://phabricator.wikimedia.org/T398437)
[13:19:43] <wikibugs>	 (03PS3) 10Jcrespo: raid: Do not use the pipe symbol '|' as a separator for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/1168176 (https://phabricator.wikimedia.org/T395446)
[13:20:28] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1050.eqiad.wmnet with OS bookworm
[13:20:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2187 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P78925 and previous config saved to /var/cache/conftool/dbconfig/20250711-132034-root.json
[13:20:41] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10995381 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1050.eqiad....
[13:20:58] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#10995383 (10hashar)
[13:21:39] <wikibugs>	 (03PS1) 10Dreamy Jazz: Enable hCaptcha on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1168178 (https://phabricator.wikimedia.org/T382148)
[13:22:17] <wikibugs>	 (03PS6) 10Dreamy Jazz: WIP: Prep hCaptcha config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148390 (https://phabricator.wikimedia.org/T382148) (owner: 10Reedy)
[13:22:30] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: Replace Exim on VRTS servers with Postfix - https://phabricator.wikimedia.org/T378028#10995390 (10Arnoldokoth) Thanks @MoritzMuehlenhoff We'll consider that... But I'm doubtful we "strictly" need to test this on hardware....
[13:22:54] <wikibugs>	 06SRE, 10SRE-Access-Requests: Add Sowmya Guru to list of "WMDE group" approvers on Wikitech - https://phabricator.wikimedia.org/T398686#10995393 (10cmooney) >>! In T398686#10995078, @sowmya.guru wrote: > The NDA is signed by me!   Thanks!  Once we get confirmation it's on file I will get going on the access.
[13:23:16] <wikibugs>	 (03PS7) 10Ayounsi: WIP: use Homer to configure the network [cookbooks] - 10https://gerrit.wikimedia.org/r/1166407
[13:24:36] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service install1004:8080 has failed probes (http_squid_ip4) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:24:45] <icinga-wm>	 PROBLEM - SSH on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:25:53] <icinga-wm>	 PROBLEM - Squid on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/HTTP_proxy
[13:28:21] <wikibugs>	 (03CR) 10Ayounsi: "`configure-switch-interfaces` tested in https://phabricator.wikimedia.org/P78926" [cookbooks] - 10https://gerrit.wikimedia.org/r/1166407 (owner: 10Ayounsi)
[13:28:34] <jinxer-wm>	 FIRING: [4x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:28:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[13:29:50] <wikibugs>	 (03PS1) 10Muehlenhoff: icinga: Use systemd::sysuser to create the metamonitor system user [puppet] - 10https://gerrit.wikimedia.org/r/1168179
[13:34:00] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1168179 (owner: 10Muehlenhoff)
[13:35:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2187 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P78927 and previous config saved to /var/cache/conftool/dbconfig/20250711-133539-root.json
[13:36:45] <jinxer-wm>	 FIRING: Traffic bill over quota: Alert for device cr4-ulsfo.wikimedia.org - Traffic bill over quota   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[13:37:54] <sukhe>	 hmmm
[13:39:21] <wikibugs>	 (03PS1) 10Btullis: Use sed to identify any md based swaps during cephosd server reimage [puppet] - 10https://gerrit.wikimedia.org/r/1168181 (https://phabricator.wikimedia.org/T399281)
[13:40:30] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+1] Use sed to identify any md based swaps during cephosd server reimage [puppet] - 10https://gerrit.wikimedia.org/r/1168181 (https://phabricator.wikimedia.org/T399281) (owner: 10Btullis)
[13:40:58] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Use sed to identify any md based swaps during cephosd server reimage [puppet] - 10https://gerrit.wikimedia.org/r/1168181 (https://phabricator.wikimedia.org/T399281) (owner: 10Btullis)
[13:41:11] <wikibugs>	 (03PS2) 10Muehlenhoff: icinga: Use systemd::sysuser to create the metamonitor system user [puppet] - 10https://gerrit.wikimedia.org/r/1168179
[13:41:33] <logmsgbot>	 !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1037.eqiad.wmnet with OS bullseye
[13:45:25] <icinga-wm>	 PROBLEM - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[13:45:29] <icinga-wm>	 PROBLEM - grafana-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[13:48:11] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd1037.eqiad.wmnet with OS bullseye
[13:48:31] <logmsgbot>	 !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1037.eqiad.wmnet with OS bullseye
[13:48:33] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1168179 (owner: 10Muehlenhoff)
[13:48:39] <icinga-wm>	 RECOVERY - SSH on install1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:48:45] <icinga-wm>	 RECOVERY - Squid on install1004 is OK: TCP OK - 0.001 second response time on 208.80.154.74 port 8080 https://wikitech.wikimedia.org/wiki/HTTP_proxy
[13:49:15] <icinga-wm>	 RECOVERY - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 562 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[13:49:19] <icinga-wm>	 RECOVERY - grafana-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 552 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[13:49:36] <jinxer-wm>	 FIRING: [4x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:53:34] <jinxer-wm>	 RESOLVED: [4x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:55:06] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd1037.eqiad.wmnet with OS bullseye
[13:55:06] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[13:56:45] <jinxer-wm>	 RESOLVED: Traffic bill over quota: Alert for device cr4-ulsfo.wikimedia.org - Traffic bill over quota   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[13:58:14] <wikibugs>	 (03PS1) 10Marostegui: db2242: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1168184 (https://phabricator.wikimedia.org/T399298)
[13:58:45] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[13:58:51] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db2242: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1168184 (https://phabricator.wikimedia.org/T399298) (owner: 10Marostegui)
[13:59:15] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2242.codfw.wmnet with reason: Maintenance
[13:59:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2242 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P78929 and previous config saved to /var/cache/conftool/dbconfig/20250711-135919-marostegui.json
[14:00:00] <sukhe>	 that's the NTT link (eqsin -> ulsfo)
[14:00:05] <sukhe>	 09:56:45 <+jinxer-wm> RESOLVED: Traffic bill over quota: Alert for device cr4-ulsfo.wikimedia.org - Traffic bill over quota   - 
[14:00:39] <wikibugs>	 (03PS2) 10Jforrester: Add phan and use it to detect duplicated array keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167941 (owner: 10Daimona Eaytoy)
[14:03:55] <wikibugs>	 (03CR) 10Jforrester: [C:03+1] "Neat!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167941 (owner: 10Daimona Eaytoy)
[14:06:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2242 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P78930 and previous config saved to /var/cache/conftool/dbconfig/20250711-140648-root.json
[14:09:58] <akosiaris>	 !log sudo swapoff /dev/md1 on cloudcephosd1036 T399281
[14:10:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:10] <stashbot>	 T399281: 2025-07-11 Toolforge tools not responding - https://phabricator.wikimedia.org/T399281
[14:13:08] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] magru: add Ufinet transit [homer/public] - 10https://gerrit.wikimedia.org/r/1168122 (https://phabricator.wikimedia.org/T389767) (owner: 10Ayounsi)
[14:13:13] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "Nice!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167941 (owner: 10Daimona Eaytoy)
[14:13:46] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] Add phan and use it to detect duplicated array keys (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167941 (owner: 10Daimona Eaytoy)
[14:13:50] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: Replace Exim on VRTS servers with Postfix - https://phabricator.wikimedia.org/T378028#10995529 (10Dzahn) Thanks all. I am not sure though if the request was for "temp testing setup" or just for "a new system to replace the...
[14:16:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:21:45] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:21:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2242 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P78931 and previous config saved to /var/cache/conftool/dbconfig/20250711-142154-root.json
[14:24:00] <logmsgbot>	 !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1037.eqiad.wmnet with OS bullseye
[14:25:29] <logmsgbot>	 !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
[14:25:34] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd1037.eqiad.wmnet with OS bullseye
[14:27:56] <logmsgbot>	 !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
[14:37:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2242 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P78932 and previous config saved to /var/cache/conftool/dbconfig/20250711-143659-root.json
[14:44:29] <logmsgbot>	 !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1037.eqiad.wmnet with reason: host reimage
[14:48:52] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1037.eqiad.wmnet with reason: host reimage
[14:52:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2242 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P78933 and previous config saved to /var/cache/conftool/dbconfig/20250711-145205-root.json
[15:06:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:06:49] <logmsgbot>	 !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1037.eqiad.wmnet with OS bullseye
[15:15:13] <jinxer-wm>	 FIRING: [21x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[15:16:14] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T396970#10995789 (10Eevans) >>! In T396970#10989045, @Eevans wrote: >>>! In T396970#10965457, @VRiley-WMF wrote: >> Is there a time when we can plan for me to look and try to swap at least one of those drives? I'll nee...
[15:16:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:16:45] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Netbox: remove old cr2-codfw Switch Control Board inventory items - https://phabricator.wikimedia.org/T398940#10995790 (10RobH) >>! In T398940#10994586, @ayounsi wrote: > We can remove them from Netbox if they're not in the device anym...
[15:22:26] <wikibugs>	 (03Merged) 10jenkins-bot: Increase the limitranges for the spark-history service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1168153 (https://phabricator.wikimedia.org/T396617) (owner: 10Btullis)
[15:28:09] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.dns.admin DNS admin: depool site eqsin [reason: testing issues with primary arelion link, T399221]
[15:28:09] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site eqsin [reason: testing issues with primary arelion link, T399221]
[15:28:09] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#10995827 (10elukey) @Jclark-ctr I have the feeling that we'll have to pause this work for a bit of time, I'll need to set some time off to figure out what's different a...
[15:28:09] <stashbot>	 T399221: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221
[15:29:03] <wikibugs>	 (03PS1) 10Fabfur: cache::haproxy: add x_analytics log variable to http frontend too [puppet] - 10https://gerrit.wikimedia.org/r/1168198 (https://phabricator.wikimedia.org/T399167)
[15:32:38] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1168198 (https://phabricator.wikimedia.org/T399167) (owner: 10Fabfur)
[15:32:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[15:35:22] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[15:36:59] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[15:37:44] <wikibugs>	 (03PS1) 10Cwhite: logstash: convert numerics - remove field removal and tracking [puppet] - 10https://gerrit.wikimedia.org/r/1168201 (https://phabricator.wikimedia.org/T234565)
[15:37:45] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[15:38:00] <logmsgbot>	 !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[15:38:15] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[15:38:43] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1036.eqiad.wmnet with OS bullseye
[15:39:37] <logmsgbot>	 !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[15:41:48] <jinxer-wm>	 FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[15:43:15] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[15:43:45] <wikibugs>	 (CR) CI reject: [V:-1] logstash: convert numerics - remove field removal and tracking [puppet] - https://gerrit.wikimedia.org/r/1168201 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[15:44:16] <logmsgbot>	 !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
[15:44:31] <logmsgbot>	 !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
[15:51:05] <wikibugs>	 (PS2) Cwhite: logstash: convert numerics - remove field removal and tracking [puppet] - https://gerrit.wikimedia.org/r/1168201 (https://phabricator.wikimedia.org/T234565)
[15:54:03] <topranks>	 !log un-drain Arelion CCT from codfw to eqsin T399221
[15:54:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:54:08] <stashbot>	 T399221: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221
[15:54:49] <wikibugs>	 ops-codfw, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10995939 (elukey) @Jhancock.wm I tried with 2045 since I wasn't able to log in on 2044, I get the same failures in provisioning: no nics reported. As Riccardo pointed...
[15:55:08] <wikibugs>	 (CR) Cwhite: [C:+2] logstash: convert numerics - remove field removal and tracking [puppet] - https://gerrit.wikimedia.org/r/1168201 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[15:56:01] <wikibugs>	 (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1168209
[15:56:41] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1036.eqiad.wmnet with OS bullseye
[15:56:48] <jinxer-wm>	 RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[16:06:59] <wikibugs>	 ops-codfw, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10996028 (wiki_willy) Hi @elukey - can you or @Volans send me an email summarizing everything you need from Dell?  I'll add the Technical Account Rep to the email thre...
[16:16:04] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1036.eqiad.wmnet with reason: host reimage
[16:19:53] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1036.eqiad.wmnet with reason: host reimage
[16:21:24] <wikibugs>	 (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1168214
[16:22:05] <wikibugs>	 (CR) Dzahn: [V:+1 C:+2] "thanks!  no diff in compiler https://puppet-compiler.wmflabs.org/output/1129920/6245/" [puppet] - https://gerrit.wikimedia.org/r/1129920 (https://phabricator.wikimedia.org/T387833) (owner: Dzahn)
[16:22:26] <wikibugs>	 (Abandoned) Dbrant: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1168209 (owner: PipelineBot)
[16:29:31] <wikibugs>	 (PS1) Dzahn: gerrit: also rename "passive" to "spare" server in MOTD [puppet] - https://gerrit.wikimedia.org/r/1168216 (https://phabricator.wikimedia.org/T387833)
[16:30:01] <wikibugs>	 (CR) Dzahn: [V:+1 C:+2] "noop in prod confirmed" [puppet] - https://gerrit.wikimedia.org/r/1129920 (https://phabricator.wikimedia.org/T387833) (owner: Dzahn)
[16:31:55] <topranks>	 !log drain Arelion CCT from codfw to eqsin - still see minor packet loss which is affecting purged T399221
[16:31:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:31:59] <stashbot>	 T399221: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221
[16:38:08] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1036.eqiad.wmnet with OS bullseye
[16:39:47] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1035.eqiad.wmnet with OS bullseye
[16:51:07] <logmsgbot>	 !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1035.eqiad.wmnet with OS bullseye
[16:51:47] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1035.eqiad.wmnet with OS bullseye
[17:10:17] <logmsgbot>	 !log sukhe@cumin1003 START - Cookbook sre.dns.admin DNS admin: pool site eqsin [reason: done testing issues with primary arelion link, T399221]
[17:10:21] <stashbot>	 T399221: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221
[17:10:22] <logmsgbot>	 !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site eqsin [reason: done testing issues with primary arelion link, T399221]
[17:11:01] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1035.eqiad.wmnet with reason: host reimage
[17:14:03] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1035.eqiad.wmnet with reason: host reimage
[17:16:43] <wikibugs>	 (Abandoned) Cwhite: add docs for string_to_numeric_conversion_failure [software/ecs] - https://gerrit.wikimedia.org/r/1166008 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[17:22:58] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10996311 (VRiley-WMF) @Marostegui thanks! I will be installing this as a "new" unit of db1259
[17:23:14] <wikibugs>	 (CR) Krinkle: "Yes." [puppet] - https://gerrit.wikimedia.org/r/1167266 (https://phabricator.wikimedia.org/T289318) (owner: Krinkle)
[17:26:45] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10996337 (Marostegui) >>! In T393296#10996311, @VRiley-WMF wrote: > @Marostegui thanks! I will be installing this as a "new" unit of db1259  <3
[17:27:09] <wikibugs>	 (CR) Daimona Eaytoy: Add phan and use it to detect duplicated array keys (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1167941 (owner: Daimona Eaytoy)
[17:28:07] <wikibugs>	 (CR) Cwhite: [C:+2] logstash: remove filter_on_templates v1 [puppet] - https://gerrit.wikimedia.org/r/1167942 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[17:28:26] <wikibugs>	 (PS2) Cwhite: logstash: rename filter-on-templates.rb [puppet] - https://gerrit.wikimedia.org/r/1167943 (https://phabricator.wikimedia.org/T234565)
[17:32:32] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1035.eqiad.wmnet with OS bullseye
[17:39:03] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1006.eqiad.wmnet with OS bullseye
[17:44:18] <wikibugs>	 (CR) Dzahn: [C:+2] gerrit: also rename "passive" to "spare" server in MOTD [puppet] - https://gerrit.wikimedia.org/r/1168216 (https://phabricator.wikimedia.org/T387833) (owner: Dzahn)
[17:44:24] <wikibugs>	 (PS2) Dzahn: gerrit: also rename "passive" to "spare" server in MOTD [puppet] - https://gerrit.wikimedia.org/r/1168216 (https://phabricator.wikimedia.org/T387833)
[17:45:12] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1007.eqiad.wmnet with OS bullseye
[17:48:24] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1008.eqiad.wmnet with OS bullseye
[17:48:34] <wikibugs>	 (CR) Dzahn: [C:+2] gerrit: also rename "passive" to "spare" server in MOTD [puppet] - https://gerrit.wikimedia.org/r/1168216 (https://phabricator.wikimedia.org/T387833) (owner: Dzahn)
[17:54:33] <wikibugs>	 (PS1) Andrew Bogott: Revert "cloudceph osd.yaml: update some nic names for Bookworm reimages" [puppet] - https://gerrit.wikimedia.org/r/1168227 (https://phabricator.wikimedia.org/T399281)
[17:55:06] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[17:57:04] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1006.eqiad.wmnet with reason: host reimage
[18:03:27] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1006.eqiad.wmnet with reason: host reimage
[18:03:34] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1007.eqiad.wmnet with reason: host reimage
[18:05:27] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:06:55] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1008.eqiad.wmnet with reason: host reimage
[18:07:01] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1007.eqiad.wmnet with reason: host reimage
[18:07:58] <jinxer-wm>	 FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[18:09:40] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1008.eqiad.wmnet with reason: host reimage
[18:10:27] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:12:58] <jinxer-wm>	 RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[18:13:38] <wikibugs>	 (CR) Andrew Bogott: [C:+2] Revert "cloudceph osd.yaml: update some nic names for Bookworm reimages" [puppet] - https://gerrit.wikimedia.org/r/1168227 (https://phabricator.wikimedia.org/T399281) (owner: Andrew Bogott)
[18:20:03] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1006.eqiad.wmnet with OS bullseye
[18:23:11] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1007.eqiad.wmnet with OS bullseye
[18:24:45] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1008.eqiad.wmnet with OS bullseye
[18:36:38] <wikibugs>	 (PS2) Dreamy Jazz: Enable hCaptcha on test2wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1168178 (https://phabricator.wikimedia.org/T382148)
[18:49:53] <icinga-wm>	 PROBLEM - Host cloudnet2006-dev is DOWN: PING CRITICAL - Packet loss = 100%
[18:52:04] <wikibugs>	 SRE, SRE-Access-Requests: Add Sowmya Guru to list of "WMDE group" approvers on Wikitech - https://phabricator.wikimedia.org/T398686#10996416 (KFrancis) Hello all, I am confirming the NDA is fully signed.  Thanks!
[18:52:23] <icinga-wm>	 RECOVERY - Host cloudnet2006-dev is UP: PING OK - Packet loss = 0%, RTA = 33.27 ms
[19:10:44] <wikibugs>	 (PS1) Cwhite: Revert "logstash: remove event.duration when value is hyphen" [puppet] - https://gerrit.wikimedia.org/r/1168234
[19:11:07] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10996452 (VRiley-WMF) Just to verify with you @Marostegui the server is now in netbox. However, this seed server only has a single 1.92TB drive, while the other server has ten 1.92 drives. Is it safe...
[19:15:13] <jinxer-wm>	 FIRING: [21x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[19:28:32] <wikibugs>	 (PS1) Dreamy Jazz: Document Trust and Safety Product Team database tables [puppet] - https://gerrit.wikimedia.org/r/1168235 (https://phabricator.wikimedia.org/T399302)
[19:33:04] <wikibugs>	 (CR) Dreamy Jazz: Document Trust and Safety Product Team database tables (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1168235 (https://phabricator.wikimedia.org/T399302) (owner: Dreamy Jazz)
[19:34:00] <wikibugs>	 (PS2) Dreamy Jazz: mariadb: Document Trust and Safety Product Team database tables [puppet] - https://gerrit.wikimedia.org/r/1168235 (https://phabricator.wikimedia.org/T399302)
[19:36:12] <wikibugs>	 (CR) Dreamy Jazz: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1168235 (https://phabricator.wikimedia.org/T399302) (owner: Dreamy Jazz)
[20:00:27] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10996515 (Marostegui) Yes, absolutely! Go for it
[21:07:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:08:39] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:10:13] <icinga-wm>	 PROBLEM - Postfix SMTP on crm2001 is CRITICAL: CRITICAL - Certificate crm2001.codfw.wmnet expires in 15 day(s) (Sun 27 Jul 2025 09:10:00 PM GMT +0000). https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting
[21:18:56] <wikibugs>	 (CR) BryanDavis: "Cause of T399216. The `hieradata/common/profile/*` files are not loaded by any codepath for a Cloud VPS instance as far as I can tell." [labs/private] - https://gerrit.wikimedia.org/r/1155221 (https://phabricator.wikimedia.org/T397841) (owner: Kamila Součková)
[21:43:23] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.dns.netbox
[21:43:56] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10996704 (VRiley-WMF) Provisioning now...
[21:46:36] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt db1259 - vriley@cumin1002"
[21:46:41] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt db1259 - vriley@cumin1002"
[21:46:41] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:48:17] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host db1259
[21:48:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[21:49:03] <wikibugs>	 ops-codfw, DC-Ops, Infrastructure-Foundations, netops, Traffic: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221#10996713 (cmooney) Arelion came back to say they did move a path but that they see CRC errors inbound from us in codfw: ` 2025-07-11 19:48  Hello Team,  We ha...
[21:49:14] <wikibugs>	 ops-codfw, DC-Ops, Infrastructure-Foundations, netops, Traffic: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221#10996717 (cmooney) p:Triage→High
[21:49:30] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1259
[21:50:23] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host db1259.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:55:06] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[22:07:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:08:39] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:10:27] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1259.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[22:26:23] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host db1259.eqiad.wmnet with OS bookworm
[22:26:32] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10996761 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host db1259.eqiad.wmnet with OS bookworm
[22:47:49] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10996821 (VRiley-WMF) Before proceeding with the imaging, I wanted to make sure, it's okay for me to wipe these drives, correct? I think that's why it may fail on the reimage
[23:15:13] <jinxer-wm>	 FIRING: [21x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire  - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[23:16:10] <logmsgbot>	 vriley@cumin1002 reimage (PID 2981850) is awaiting input
[23:38:17] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1168275
[23:38:17] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1168275 (owner: TrainBranchBot)
[23:50:36] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1168275 (owner: TrainBranchBot)
[23:57:00] <wikibugs>	 (CR) Ladsgroup: "If you don't set Hosts: footer, the check experimental trigger PCC on all production hosts which is an extremely expensive operation and s" [puppet] - https://gerrit.wikimedia.org/r/1168235 (https://phabricator.wikimedia.org/T399302) (owner: Dreamy Jazz)