[00:01:41] <wikibugs>	 (PS1) Dzahn: devtools: update host name for new gerrit test instance [puppet] - https://gerrit.wikimedia.org/r/1036767 (https://phabricator.wikimedia.org/T363196)
[00:02:13] <wikibugs>	 (CR) Dzahn: [C:+2] devtools: update host name for new gerrit test instance [puppet] - https://gerrit.wikimedia.org/r/1036767 (https://phabricator.wikimedia.org/T363196) (owner: Dzahn)
[00:03:08] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1036596 (owner: TrainBranchBot)
[00:13:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P63490 and previous config saved to /var/cache/conftool/dbconfig/20240529-001303-marostegui.json
[00:18:20] <wikibugs>	 (PS4) Aaron Schulz: Set "s3" as the default section name [mediawiki-config] - https://gerrit.wikimedia.org/r/909763
[00:18:57] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:20:49] <wikibugs>	 (PS1) Dzahn: gerrit: add parameter to toggle lfs_replica_sync ensure [puppet] - https://gerrit.wikimedia.org/r/1036771
[00:21:10] <wikibugs>	 (CR) CI reject: [V:-1] gerrit: add parameter to toggle lfs_replica_sync ensure [puppet] - https://gerrit.wikimedia.org/r/1036771 (owner: Dzahn)
[00:28:12] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P63491 and previous config saved to /var/cache/conftool/dbconfig/20240529-002811-marostegui.json
[00:43:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T364299)', diff saved to https://phabricator.wikimedia.org/P63492 and previous config saved to /var/cache/conftool/dbconfig/20240529-004319-marostegui.json
[00:43:22] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1194.eqiad.wmnet with reason: Maintenance
[00:43:27] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[00:43:35] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1194.eqiad.wmnet with reason: Maintenance
[00:43:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1194 (T364299)', diff saved to https://phabricator.wikimedia.org/P63493 and previous config saved to /var/cache/conftool/dbconfig/20240529-004343-marostegui.json
[01:48:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T364299)', diff saved to https://phabricator.wikimedia.org/P63494 and previous config saved to /var/cache/conftool/dbconfig/20240529-014845-marostegui.json
[01:48:53] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[01:58:07] <jinxer-wm>	 FIRING: KubernetesCalicoDown: kubernetes2032.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubernetes2032.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[02:03:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P63495 and previous config saved to /var/cache/conftool/dbconfig/20240529-020353-marostegui.json
[02:19:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P63496 and previous config saved to /var/cache/conftool/dbconfig/20240529-021901-marostegui.json
[02:34:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T364299)', diff saved to https://phabricator.wikimedia.org/P63497 and previous config saved to /var/cache/conftool/dbconfig/20240529-023409-marostegui.json
[02:34:12] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[02:34:16] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[02:34:25] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[02:34:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1202 (T364299)', diff saved to https://phabricator.wikimedia.org/P63498 and previous config saved to /var/cache/conftool/dbconfig/20240529-023432-marostegui.json
[02:36:48] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:42:47] <wikibugs>	 SRE, serviceops, Patch-For-Review: k8s master capacity issues - https://phabricator.wikimedia.org/T366094#9840665 (CDanis) = tldr: * Adding the new control plane workers in eqiad turned what was a CPU saturation issue (causing blackbox probes to be slow but still within timeouts), into a simultaneous...
[02:56:48] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:04:27] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:17:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T364069)', diff saved to https://phabricator.wikimedia.org/P63499 and previous config saved to /var/cache/conftool/dbconfig/20240529-031710-marostegui.json
[03:17:20] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[03:18:57] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job wmf_gitlab_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:29:56] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to nda/logstash for Sohom Datta - https://phabricator.wikimedia.org/T366032#9840685 (Soda) a:Soda→None Sent the information. (in an email titled `Re: Information for T366032 (Sohom Datta)`)
[03:32:21] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P63500 and previous config saved to /var/cache/conftool/dbconfig/20240529-033221-marostegui.json
[03:38:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T364299)', diff saved to https://phabricator.wikimedia.org/P63501 and previous config saved to /var/cache/conftool/dbconfig/20240529-033814-marostegui.json
[03:38:22] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[03:47:29] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P63502 and previous config saved to /var/cache/conftool/dbconfig/20240529-034728-marostegui.json
[03:53:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P63503 and previous config saved to /var/cache/conftool/dbconfig/20240529-035323-marostegui.json
[03:55:17] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2125.codfw.wmnet with reason: Maintenance
[03:55:30] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2125.codfw.wmnet with reason: Maintenance
[03:55:39] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2125 (T352010)', diff saved to https://phabricator.wikimedia.org/P63504 and previous config saved to /var/cache/conftool/dbconfig/20240529-035538-ladsgroup.json
[03:55:46] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[04:02:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T364069)', diff saved to https://phabricator.wikimedia.org/P63505 and previous config saved to /var/cache/conftool/dbconfig/20240529-040236-marostegui.json
[04:02:39] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2137.codfw.wmnet with reason: Maintenance
[04:02:43] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[04:02:52] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2137.codfw.wmnet with reason: Maintenance
[04:03:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2137 (T364069)', diff saved to https://phabricator.wikimedia.org/P63506 and previous config saved to /var/cache/conftool/dbconfig/20240529-040259-marostegui.json
[04:08:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P63507 and previous config saved to /var/cache/conftool/dbconfig/20240529-040831-marostegui.json
[04:21:48] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:23:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T364299)', diff saved to https://phabricator.wikimedia.org/P63508 and previous config saved to /var/cache/conftool/dbconfig/20240529-042339-marostegui.json
[04:23:41] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1227.eqiad.wmnet with reason: Maintenance
[04:23:44] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[04:23:55] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1227.eqiad.wmnet with reason: Maintenance
[04:24:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1227 (T364299)', diff saved to https://phabricator.wikimedia.org/P63509 and previous config saved to /var/cache/conftool/dbconfig/20240529-042402-marostegui.json
[04:36:24] <wikibugs>	 ops-codfw, DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T366134 (phaultfinder) NEW
[04:41:27] <wikibugs>	 ops-codfw, DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T366134#9840740 (phaultfinder)
[04:42:56] <icinga-wm_>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 137, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:43:26] <wikibugs>	 (CR) AOkoth: [C:+1] vrts: add missing comma to vrts_aliases.py [puppet] - https://gerrit.wikimedia.org/r/1036760 (https://phabricator.wikimedia.org/T284145) (owner: Dzahn)
[04:43:30] <icinga-wm_>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:46:29] <wikibugs>	 ops-codfw, DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T366134#9840741 (phaultfinder)
[05:21:07] <wikibugs>	 (PS1) Marostegui: db1211: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1036782
[05:21:51] <wikibugs>	 (CR) Marostegui: [C:+2] db1211: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1036782 (owner: Marostegui)
[05:39:27] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:58:07] <jinxer-wm>	 FIRING: KubernetesCalicoDown: kubernetes2032.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubernetes2032.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240529T0600)
[06:22:32] <icinga-wm_>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:22:40] <icinga-wm_>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 138, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:44:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T364299)', diff saved to https://phabricator.wikimedia.org/P63510 and previous config saved to /var/cache/conftool/dbconfig/20240529-064453-marostegui.json
[06:45:00] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[06:47:34] <wikibugs>	 (PS4) Elukey: redfish: expand support for Supermicro hosts [software/spicerack] - https://gerrit.wikimedia.org/r/1036704 (https://phabricator.wikimedia.org/T365372)
[06:49:08] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1218.eqiad.wmnet
[06:52:52] <wikibugs>	 (PS1) Muehlenhoff: Switch db1218 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1036910 (https://phabricator.wikimedia.org/T349619)
[06:55:50] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Switch db1218 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1036910 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[06:59:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1218.eqiad.wmnet
[07:00:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P63511 and previous config saved to /var/cache/conftool/dbconfig/20240529-070001-marostegui.json
[07:00:05] <jouncebot>	 Amir1 and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240529T0700).
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:15:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P63512 and previous config saved to /var/cache/conftool/dbconfig/20240529-071509-marostegui.json
[07:16:06] <wikibugs>	 (PS1) Marostegui: db2170: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1036912
[07:16:48] <wikibugs>	 (CR) Marostegui: [C:+2] db2170: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1036912 (owner: Marostegui)
[07:29:52] <wikibugs>	 (PS1) Marostegui: core_test.pp: Add MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1036916 (https://phabricator.wikimedia.org/T365805)
[07:30:13] <wikibugs>	 (CR) Marostegui: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1036916 (https://phabricator.wikimedia.org/T365805) (owner: Marostegui)
[07:30:19] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T364299)', diff saved to https://phabricator.wikimedia.org/P63513 and previous config saved to /var/cache/conftool/dbconfig/20240529-073017-marostegui.json
[07:30:21] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[07:30:24] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[07:30:34] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[07:31:21] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Deprecate system::role for backup roles [puppet] - https://gerrit.wikimedia.org/r/1032636 (owner: Muehlenhoff)
[07:32:12] <icinga-wm_>	 RECOVERY - Categories update lag on wdqs1018 is OK: OK - Categories lag: 2:32:11.288501 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag
[07:35:16] <icinga-wm_>	 RECOVERY - Categories update lag on wdqs2013 is OK: OK - Categories lag: 2:35:15.453618 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag
[07:35:16] <icinga-wm_>	 RECOVERY - Categories update lag on wdqs2025 is OK: OK - Categories lag: 2:35:15.479951 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag
[07:35:16] <icinga-wm_>	 RECOVERY - Categories update lag on wdqs2011 is OK: OK - Categories lag: 2:35:15.489814 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag
[07:35:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1219.eqiad.wmnet
[07:37:10] <wikibugs>	 (PS1) Jelto: gitlab: bump exporter version to v1.0.10 [puppet] - https://gerrit.wikimedia.org/r/1036987 (https://phabricator.wikimedia.org/T354656)
[07:38:16] <icinga-wm_>	 RECOVERY - Categories update lag on wdqs2007 is OK: OK - Categories lag: 2:38:14.638224 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag
[07:38:38] <wikibugs>	 (CR) Jelto: [C:+2] gitlab: bump exporter version to v1.0.10 [puppet] - https://gerrit.wikimedia.org/r/1036987 (https://phabricator.wikimedia.org/T354656) (owner: Jelto)
[07:39:05] <wikibugs>	 (PS1) Muehlenhoff: Switch db1219 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1036989 (https://phabricator.wikimedia.org/T349619)
[07:41:12] <wikibugs>	 (PS1) DCausse: cirrus-streaming-updater: use latest image [deployment-charts] - https://gerrit.wikimedia.org/r/1036992 (https://phabricator.wikimedia.org/T365692)
[07:41:16] <icinga-wm_>	 RECOVERY - Categories update lag on wdqs2009 is OK: OK - Categories lag: 2:41:14.574669 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag
[07:41:18] <dcausse>	 jouncebot: nowandnext
[07:41:18] <jouncebot>	 For the next 0 hour(s) and 18 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240529T0700)
[07:41:18] <jouncebot>	 In 0 hour(s) and 18 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240529T0800)
[07:41:19] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Switch db1219 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1036989 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[07:47:03] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1219.eqiad.wmnet
[07:47:12] <icinga-wm_>	 RECOVERY - Categories update lag on wdqs1017 is OK: OK - Categories lag: 2:47:10.428264 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag
[07:47:12] <icinga-wm_>	 RECOVERY - Categories update lag on wdqs1015 is OK: OK - Categories lag: 2:47:10.489025 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag
[07:47:12] <icinga-wm_>	 RECOVERY - Categories update lag on wdqs1019 is OK: OK - Categories lag: 2:47:11.314602 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag
[07:47:18] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1228.eqiad.wmnet
[07:48:39] <wikibugs>	 (PS1) Muehlenhoff: Switch db1228 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1036993 (https://phabricator.wikimedia.org/T349619)
[07:49:25] <wikibugs>	 (PS1) Stevemunene: Remove datahub from LVS [puppet] - https://gerrit.wikimedia.org/r/1036994 (https://phabricator.wikimedia.org/T366137)
[07:49:44] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Switch db1228 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1036993 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[07:50:12] <icinga-wm_>	 RECOVERY - Categories update lag on wdqs2008 is OK: OK - Categories lag: 2:50:11.301393 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag
[07:50:14] <icinga-wm_>	 RECOVERY - Categories update lag on wdqs2014 is OK: OK - Categories lag: 2:50:12.729355 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag
[07:50:14] <icinga-wm_>	 RECOVERY - Categories update lag on wdqs2010 is OK: OK - Categories lag: 2:50:12.746571 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag
[07:50:14] <icinga-wm_>	 RECOVERY - Categories update lag on wdqs2012 is OK: OK - Categories lag: 2:50:12.759660 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag
[07:50:15] <icinga-wm_>	 RECOVERY - Categories update lag on wdqs2024 is OK: OK - Categories lag: 2:50:13.429869 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag
[07:50:15] <icinga-wm_>	 RECOVERY - Categories update lag on wdqs2022 is OK: OK - Categories lag: 2:50:13.427291 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag
[07:51:13] <wikibugs>	 (CR) DCausse: [C:+2] cirrus-streaming-updater: use latest image [deployment-charts] - https://gerrit.wikimedia.org/r/1036992 (https://phabricator.wikimedia.org/T365692) (owner: DCausse)
[07:52:12] <wikibugs>	 (Merged) jenkins-bot: cirrus-streaming-updater: use latest image [deployment-charts] - https://gerrit.wikimedia.org/r/1036992 (https://phabricator.wikimedia.org/T365692) (owner: DCausse)
[07:54:10] <logmsgbot>	 !log dcausse@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[07:54:37] <logmsgbot>	 !log dcausse@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[07:55:52] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1228.eqiad.wmnet
[07:56:20] <wikibugs>	 Puppet, Wikidata, Wikidata Dev Team, wmde-wikidata-tech, and 2 others: Remove the WDCM clone (stats1007) - https://phabricator.wikimedia.org/T351072#9840949 (AndrewTavis_WMDE) Moving this to verification given the work in T364965. Thanks for all of this, @Lucas_Werkmeister_WMDE! Maybe we can reso...
[08:00:05] <jouncebot>	 dancy and andre: Your horoscope predicts another MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240529T0800).
[08:00:10] <wikibugs>	 (CR) Muehlenhoff: vrts: add missing comma to vrts_aliases.py (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1036760 (https://phabricator.wikimedia.org/T284145) (owner: Dzahn)
[08:00:45] <logmsgbot>	 !log brouberol@deploy1002 helmfile [staging] START helmfile.d/services/edit-analytics: sync
[08:01:06] <wikibugs>	 (CR) Muehlenhoff: [C:+2] ml/etcd: remove obsolete certificites [puppet] - https://gerrit.wikimedia.org/r/1036619 (owner: Muehlenhoff)
[08:05:19] <wikibugs>	 (PS2) Dzahn: vrts: add missing comma to vrts_aliases.py [puppet] - https://gerrit.wikimedia.org/r/1036760 (https://phabricator.wikimedia.org/T284145)
[08:05:19] <wikibugs>	 (CR) Dzahn: vrts: add missing comma to vrts_aliases.py (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1036760 (https://phabricator.wikimedia.org/T284145) (owner: Dzahn)
[08:05:45] <jinxer-wm>	 FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnsta
[08:06:29] <wikibugs>	 (PS1) Effie Mouzeli: memcached: minor fixes in class and profile [puppet] - https://gerrit.wikimedia.org/r/1036995
[08:06:57] <wikibugs>	 (PS2) Effie Mouzeli: memcached: minor fixes in class and profile [puppet] - https://gerrit.wikimedia.org/r/1036995
[08:07:20] <wikibugs>	 (CR) Effie Mouzeli: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1036995 (owner: Effie Mouzeli)
[08:09:03] <wikibugs>	 (PS1) Muehlenhoff: Remove obsolete wikikube/staging etcd certificates [puppet] - https://gerrit.wikimedia.org/r/1036998 (https://phabricator.wikimedia.org/T357750)
[08:10:54] <logmsgbot>	 !log brouberol@deploy1002 helmfile [staging] DONE helmfile.d/services/edit-analytics: sync
[08:11:33] <wikibugs>	 (PS1) Hashar: Merge tag 'v3.9.5' into wmf/stable-3.9 [software/gerrit] (wmf/stable-3.9) - https://gerrit.wikimedia.org/r/1036999 (https://phabricator.wikimedia.org/T354887)
[08:11:47] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1036998 (https://phabricator.wikimedia.org/T357750) (owner: Muehlenhoff)
[08:12:48] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1036760 (https://phabricator.wikimedia.org/T284145) (owner: Dzahn)
[08:14:07] <wikibugs>	 (CR) Effie Mouzeli: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1036995 (owner: Effie Mouzeli)
[08:15:09] <wikibugs>	 (PS2) Mvolz: Update user-agent string in citoid to be like Zot [deployment-charts] - https://gerrit.wikimedia.org/r/1034860 (https://phabricator.wikimedia.org/T366093)
[08:15:19] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Thanks!" [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1013156 (https://phabricator.wikimedia.org/T350129) (owner: Pppery)
[08:15:49] <wikibugs>	 (CR) Slyngshede: [C:+2] Update links to point to non-wiki privacy policy and bypass redirects [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1013156 (https://phabricator.wikimedia.org/T350129) (owner: Pppery)
[08:15:57] <wikibugs>	 (CR) Slyngshede: [V:+2 C:+2] Update links to point to non-wiki privacy policy and bypass redirects [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1013156 (https://phabricator.wikimedia.org/T350129) (owner: Pppery)
[08:18:19] <wikibugs>	 SRE, CAS-SSO, Infrastructure-Foundations, Patch-For-Review, Release-Engineering-Team (Priority Backlog 📥): Correct IDP login page Privacy Policy - https://phabricator.wikimedia.org/T350129#9840992 (SLyngshede-WMF) The updated template will be rolled out with the next version bump of CAS.
[08:18:36] <wikibugs>	 (CR) Effie Mouzeli: [C:+1] role::thanos::frontend: move all envoy TLS certs to PKI [puppet] - https://gerrit.wikimedia.org/r/1036643 (https://phabricator.wikimedia.org/T344324) (owner: Elukey)
[08:20:16] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9841008 (akosiaris) >>! In T363212#9839469, @Dzahn wrote: > @akosiaris re: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1035769/1/modules/profile/da...
[08:21:48] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:22:59] <logmsgbot>	 !log brouberol@deploy1002 helmfile [staging] START helmfile.d/services/edit-analytics: sync
[08:23:14] <wikibugs>	 (CR) Ayounsi: "codfw/eqiad IPs lgtm, I can't vouch for the SPF settings though." [dns] - https://gerrit.wikimedia.org/r/1036739 (https://phabricator.wikimedia.org/T366113) (owner: JHathaway)
[08:23:24] <wikibugs>	 (CR) Effie Mouzeli: "PCC OK  https://puppet-compiler.wmflabs.org/output/1036995/2671/" [puppet] - https://gerrit.wikimedia.org/r/1036995 (owner: Effie Mouzeli)
[08:23:29] <wikibugs>	 (CR) Hashar: [C:+2] Merge tag 'v3.9.5' into wmf/stable-3.9 [software/gerrit] (wmf/stable-3.9) - https://gerrit.wikimedia.org/r/1036999 (https://phabricator.wikimedia.org/T354887) (owner: Hashar)
[08:24:52] <wikibugs>	 Puppet: Repeated Puppet failures for PetScan - https://phabricator.wikimedia.org/T366141 (Magnus) NEW
[08:27:05] <wikibugs>	 (PS1) Slyngshede: Bump to version 6.6.15.1 [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1037000 (https://phabricator.wikimedia.org/T366140)
[08:28:28] <wikibugs>	 (PS1) Muehlenhoff: Remove obsolete wikikube etcd certificates [puppet] - https://gerrit.wikimedia.org/r/1037002 (https://phabricator.wikimedia.org/T357750)
[08:29:08] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 49666
[08:29:32] <wikibugs>	 (Merged) jenkins-bot: Merge tag 'v3.9.5' into wmf/stable-3.9 [software/gerrit] (wmf/stable-3.9) - https://gerrit.wikimedia.org/r/1036999 (https://phabricator.wikimedia.org/T354887) (owner: Hashar)
[08:31:09] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 49666
[08:33:08] <logmsgbot>	 !log brouberol@deploy1002 helmfile [staging] DONE helmfile.d/services/edit-analytics: sync
[08:35:30] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 8674
[08:36:20] <wikibugs>	 (CR) Marostegui: [C:+2] core_test.pp: Add MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1036916 (https://phabricator.wikimedia.org/T365805) (owner: Marostegui)
[08:39:22] <logmsgbot>	 !log brouberol@deploy1002 helmfile [staging] START helmfile.d/services/edit-analytics: apply
[08:40:09] <logmsgbot>	 !log brouberol@deploy1002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply
[08:42:16] <icinga-wm_>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 45 probes of 791 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:47:32] <icinga-wm_>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 14 probes of 791 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:48:37] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1037002 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff)
[08:51:37] <wikibugs>	 (03CR) 10Muehlenhoff: "Looks good, one nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/1036995 (owner: 10Effie Mouzeli)
[08:54:56] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] profile::elasticsearch::cirrus: Remove obsolete http2 parameter [puppet] - 10https://gerrit.wikimedia.org/r/1036556 (owner: 10Muehlenhoff)
[08:58:34] <wikibugs>	 (03PS1) 10Aklapper: Remove FIXME comment for waxing and waning moon phases [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1037026 (https://phabricator.wikimedia.org/T365853)
[08:58:34] <wikibugs>	 (03PS3) 10Effie Mouzeli: memcached: minor fixes in class and profile [puppet] - 10https://gerrit.wikimedia.org/r/1036995
[08:59:17] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2098.codfw.wmnet with reason: Maintenance
[08:59:30] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2098.codfw.wmnet with reason: Maintenance
[09:00:12] <wikibugs>	 (03PS4) 10Effie Mouzeli: memcached: minor fixes in class and profile [puppet] - 10https://gerrit.wikimedia.org/r/1036995
[09:05:30] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 8674
[09:05:59] <wikibugs>	 (03CR) 10Volans: "Nice addition! Couple of suggestions inline, looks already good." [software/spicerack] - 10https://gerrit.wikimedia.org/r/1036704 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey)
[09:06:53] <wikibugs>	 (03PS1) 10Muehlenhoff: tlsproxy::localssl: Remove support for HTTP2 [puppet] - 10https://gerrit.wikimedia.org/r/1037029
[09:07:58] <wikibugs>	 (03PS5) 10Effie Mouzeli: memcached: minor fixes in class and profile [puppet] - 10https://gerrit.wikimedia.org/r/1036995
[09:09:41] <wikibugs>	 (03CR) 10Muehlenhoff: memcached: minor fixes in class and profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1036995 (owner: 10Effie Mouzeli)
[09:10:46] <wikibugs>	 06SRE-OnFire, 14SRE-Sprint-Week-Sustainability-March2023, 06DBA, 07Schema-change-in-production, 10Sustainability (Incident Followup): Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501#9841121 (10Marostegui) For the record (in...
[09:11:30] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Joely Rooke WMDE - https://phabricator.wikimedia.org/T366145 (10JoelyRooke-WMDE) 03NEW
[09:11:58] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1037029 (owner: 10Muehlenhoff)
[09:12:05] <marostegui>	 !log Deploy schema change on s7 eqiad dbmaint T307501
[09:12:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:11] <stashbot>	 T307501: Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501
[09:12:26] <wikibugs>	 06SRE-OnFire, 14SRE-Sprint-Week-Sustainability-March2023, 06DBA, 07Schema-change-in-production, 10Sustainability (Incident Followup): Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501#9841147 (10Marostegui)
[09:13:11] <wikibugs>	 (03CR) 10Effie Mouzeli: memcached: minor fixes in class and profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1036995 (owner: 10Effie Mouzeli)
[09:14:29] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1036995 (owner: 10Effie Mouzeli)
[09:15:56] <wikibugs>	 (03PS2) 10Muehlenhoff: tlsproxy::localssl: Remove support for HTTP2 [puppet] - 10https://gerrit.wikimedia.org/r/1037029
[09:16:12] <wikibugs>	 (03PS1) 10Santiago Faci: aqs-http-gateway chart and edit-analytic service k8s configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037033 (https://phabricator.wikimedia.org/T355408)
[09:16:30] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1232.eqiad.wmnet
[09:17:03] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Joely Rooke WMDE - https://phabricator.wikimedia.org/T366145#9841151 (10WMDECyn) I approve the request on WMDE's behalf
[09:17:15] <akosiaris>	 FYI, doing some pod rolling restarts in eqiad trying to reproduce https://phabricator.wikimedia.org/T366094
[09:18:15] <wikibugs>	 (03PS2) 10Santiago Faci: aqs-http-gateway chart and edit-analytic service k8s configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037033 (https://phabricator.wikimedia.org/T355408)
[09:19:22] <wikibugs>	 (03PS2) 10Slyngshede: Bump to version 6.6.15.1 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1037000 (https://phabricator.wikimedia.org/T366140)
[09:20:40] <wikibugs>	 (03PS6) 10Effie Mouzeli: memcached: minor fixes in class and profile [puppet] - 10https://gerrit.wikimedia.org/r/1036995
[09:22:31] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch db1232 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1037035 (https://phabricator.wikimedia.org/T349619)
[09:23:49] <wikibugs>	 (03CR) 10Muehlenhoff: Bump to version 6.6.15.1 (031 comment) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1037000 (https://phabricator.wikimedia.org/T366140) (owner: 10Slyngshede)
[09:24:25] <wikibugs>	 (03PS3) 10Santiago Faci: aqs-http-gateway chart and edit-analytic service k8s configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037033 (https://phabricator.wikimedia.org/T355408)
[09:25:03] <wikibugs>	 (03CR) 10Muehlenhoff: Bump to version 6.6.15.1 (031 comment) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1037000 (https://phabricator.wikimedia.org/T366140) (owner: 10Slyngshede)
[09:26:41] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Switch db1232 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1037035 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[09:27:33] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: sync
[09:27:54] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+2] memcached: minor fixes in class and profile [puppet] - 10https://gerrit.wikimedia.org/r/1036995 (owner: 10Effie Mouzeli)
[09:28:53] <wikibugs>	 (03PS6) 10Santiago Faci: editor-analytics deployment: big refactoring and snapshot automation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036191 (https://phabricator.wikimedia.org/T355408)
[09:28:54] <wikibugs>	 (03PS3) 10Slyngshede: Bump to version 6.6.15.1 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1037000 (https://phabricator.wikimedia.org/T366140)
[09:29:00] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1037029 (owner: 10Muehlenhoff)
[09:29:09] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: sync
[09:30:38] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] aqs-http-gateway chart and edit-analytic service k8s configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037033 (https://phabricator.wikimedia.org/T355408) (owner: 10Santiago Faci)
[09:31:22] <wikibugs>	 06SRE-OnFire, 14SRE-Sprint-Week-Sustainability-March2023, 06DBA, 07Schema-change-in-production, 10Sustainability (Incident Followup): Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501#9841191 (10Marostegui)
[09:31:40] <wikibugs>	 (03CR) 10Santiago Faci: [C:03+2] aqs-http-gateway chart and edit-analytic service k8s configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037033 (https://phabricator.wikimedia.org/T355408) (owner: 10Santiago Faci)
[09:31:52] <wikibugs>	 (03PS4) 10Slyngshede: Bump to version 6.6.15.1 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1037000 (https://phabricator.wikimedia.org/T366140)
[09:32:03] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1232.eqiad.wmnet
[09:32:09] <wikibugs>	 06SRE-OnFire, 14SRE-Sprint-Week-Sustainability-March2023, 06DBA, 07Schema-change-in-production, 10Sustainability (Incident Followup): Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501#9841192 (10Marostegui)
[09:33:03] <wikibugs>	 06SRE-OnFire, 14SRE-Sprint-Week-Sustainability-March2023, 06DBA, 07Schema-change-in-production, 10Sustainability (Incident Followup): Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501#9841193 (10Marostegui) 05Open→03Res...
[09:33:15] <wikibugs>	 (03Merged) 10jenkins-bot: aqs-http-gateway chart and edit-analytic service k8s configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037033 (https://phabricator.wikimedia.org/T355408) (owner: 10Santiago Faci)
[09:33:57] <wikibugs>	 (03CR) 10Slyngshede: Bump to version 6.6.15.1 (032 comments) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1037000 (https://phabricator.wikimedia.org/T366140) (owner: 10Slyngshede)
[09:35:31] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] editor-analytics deployment: big refactoring and snapshot automation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036191 (https://phabricator.wikimedia.org/T355408) (owner: 10Santiago Faci)
[09:36:26] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: sync
[09:36:26] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: sync
[09:37:29] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: sync
[09:38:07] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: sync
[09:38:46] <logmsgbot>	 !log sfaci@deploy1002 helmfile [staging] START helmfile.d/services/edit-analytics: apply
[09:39:33] <logmsgbot>	 !log sfaci@deploy1002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply
[09:39:42] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:39:56] <icinga-wm_>	 RECOVERY - Memcached on mc2049 is OK: TCP OK - 0.031 second response time on 10.192.32.81 port 11214 https://wikitech.wikimedia.org/wiki/Memcached
[09:41:42] <wikibugs>	 (03PS1) 10Effie Mouzeli: memcached::instance: add the actual datafile in the options [puppet] - 10https://gerrit.wikimedia.org/r/1037037
[09:43:11] <wikibugs>	 (03PS1) 10DCausse: cirrus-streaming-updater: use latest image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037038
[09:43:49] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1037000 (https://phabricator.wikimedia.org/T366140) (owner: 10Slyngshede)
[09:44:12] <wikibugs>	 (03PS5) 10Slyngshede: Bump to version 6.6.15.1 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1037000 (https://phabricator.wikimedia.org/T366140)
[09:44:27] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:47:33] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1234.eqiad.wmnet
[09:48:58] <wikibugs>	 (03CR) 10DCausse: [C:03+2] cirrus-streaming-updater: use latest image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037038 (owner: 10DCausse)
[09:49:55] <wikibugs>	 (03Merged) 10jenkins-bot: cirrus-streaming-updater: use latest image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037038 (owner: 10DCausse)
[09:50:40] <logmsgbot>	 !log dcausse@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[09:51:02] <logmsgbot>	 !log dcausse@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[09:51:28] <wikibugs>	 (03PS2) 10Effie Mouzeli: memcached::instance: add the actual datafile in the options [puppet] - 10https://gerrit.wikimedia.org/r/1037037
[09:52:10] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch db1234 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1037039 (https://phabricator.wikimedia.org/T349619)
[09:54:16] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance
[09:54:30] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance
[09:54:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2121 (T364299)', diff saved to https://phabricator.wikimedia.org/P63514 and previous config saved to /var/cache/conftool/dbconfig/20240529-095437-marostegui.json
[09:54:43] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[09:55:17] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Switch db1234 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1037039 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[09:57:34] <wikibugs>	 (03PS1) 10Hashar: Gerrit 3.9.5, rebuild plugins and update TypeScript API [software/gerrit] (deploy/wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1037041 (https://phabricator.wikimedia.org/T354887)
[09:57:46] <logmsgbot>	 !log sfaci@deploy1002 helmfile [codfw] START helmfile.d/services/edit-analytics: apply
[09:58:08] <jinxer-wm>	 FIRING: KubernetesCalicoDown: kubernetes2032.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubernetes2032.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[09:59:03] <wikibugs>	 (03Abandoned) 10Hnowlan: cassandra-http-gateway: use cassandra module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015085 (owner: 10Hnowlan)
[09:59:15] <logmsgbot>	 !log sfaci@deploy1002 helmfile [codfw] DONE helmfile.d/services/edit-analytics: apply
[09:59:45] <jinxer-wm>	 FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240529T1000)
[10:00:41] <logmsgbot>	 !log sfaci@deploy1002 helmfile [eqiad] START helmfile.d/services/edit-analytics: apply
[10:00:46] <jinxer-wm>	 RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUns
[10:00:52] <logmsgbot>	 !log dcausse@deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[10:00:53] <logmsgbot>	 !log dcausse@deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[10:01:06] <logmsgbot>	 !log dcausse@deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[10:01:18] <logmsgbot>	 !log dcausse@deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[10:02:18] <logmsgbot>	 !log sfaci@deploy1002 helmfile [eqiad] DONE helmfile.d/services/edit-analytics: apply
[10:04:10] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1234.eqiad.wmnet
[10:04:43] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-ctrl1001.eqiad.wmnet with reason: disable puppet and k8s controlplane
[10:04:57] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-ctrl1001.eqiad.wmnet with reason: disable puppet and k8s controlplane
[10:05:07] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-ctrl1002.eqiad.wmnet with reason: disable puppet and k8s controlplane
[10:05:13] <wikibugs>	 06SRE, 06serviceops, 13Patch-For-Review: k8s master capacity issues - https://phabricator.wikimedia.org/T366094#9841272 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b65d2df8-871b-4064-b329-026af4d7ec1d) set by akosiaris@cumin1002 for 2:00:00 on 1 host(s) and their services with reason:...
[10:05:21] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-ctrl1002.eqiad.wmnet with reason: disable puppet and k8s controlplane
[10:05:32] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: sync
[10:05:32] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: sync
[10:05:34] <wikibugs>	 06SRE, 06serviceops, 13Patch-For-Review: k8s master capacity issues - https://phabricator.wikimedia.org/T366094#9841277 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=8fa8366a-d3f2-4a77-8e2b-45de66551026) set by akosiaris@cumin1002 for 2:00:00 on 1 host(s) and their services with reason:...
[10:05:36] <wikibugs>	 (03CR) 10Santiago Faci: [C:03+2] editor-analytics deployment: big refactoring and snapshot automation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036191 (https://phabricator.wikimedia.org/T355408) (owner: 10Santiago Faci)
[10:05:51] <logmsgbot>	 !log dcausse@deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[10:05:54] <logmsgbot>	 !log dcausse@deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[10:06:01] <logmsgbot>	 !log dcausse@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[10:06:27] <logmsgbot>	 !log dcausse@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[10:06:31] <wikibugs>	 (03Merged) 10jenkins-bot: editor-analytics deployment: big refactoring and snapshot automation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036191 (https://phabricator.wikimedia.org/T355408) (owner: 10Santiago Faci)
[10:06:55] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: sync
[10:07:03] <moritzm>	 !log installing systemd security updates
[10:07:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:07:07] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: sync
[10:07:28] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+2] memcached::instance: add the actual datafile in the options [puppet] - 10https://gerrit.wikimedia.org/r/1037037 (owner: 10Effie Mouzeli)
[10:08:56] <wikibugs>	 (03PS4) 10Klausman: install/partman: Tweak kubelet partition size for ML workers [puppet] - 10https://gerrit.wikimedia.org/r/1036195 (https://phabricator.wikimedia.org/T365971)
[10:09:33] <logmsgbot>	 !log sfaci@deploy1002 helmfile [staging] START helmfile.d/services/editor-analytics: apply
[10:10:19] <logmsgbot>	 !log sfaci@deploy1002 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply
[10:10:24] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1235.eqiad.wmnet
[10:10:55] <wikibugs>	 (03CR) 10Klausman: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2678/console" [puppet] - 10https://gerrit.wikimedia.org/r/1036195 (https://phabricator.wikimedia.org/T365971) (owner: 10Klausman)
[10:12:48] <logmsgbot>	 !log sfaci@deploy1002 helmfile [codfw] START helmfile.d/services/editor-analytics: apply
[10:13:03] <wikibugs>	 (03CR) 10Klausman: [V:03+1 C:03+2] install/partman: Tweak kubelet partition size for ML workers [puppet] - 10https://gerrit.wikimedia.org/r/1036195 (https://phabricator.wikimedia.org/T365971) (owner: 10Klausman)
[10:14:26] <logmsgbot>	 !log sfaci@deploy1002 helmfile [codfw] DONE helmfile.d/services/editor-analytics: apply
[10:14:45] <jinxer-wm>	 FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ...
[10:14:45] <jinxer-wm>	 fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[10:15:06] <logmsgbot>	 !log sfaci@deploy1002 helmfile [eqiad] START helmfile.d/services/editor-analytics: apply
[10:16:33] <logmsgbot>	 !log sfaci@deploy1002 helmfile [eqiad] DONE helmfile.d/services/editor-analytics: apply
[10:16:44] <moritzm>	 !log installing python-idna security updates
[10:16:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:50] <logmsgbot>	 !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubemaster1002.eqiad.wmnet with reason: disable puppet and k8s controlplane
[10:17:04] <logmsgbot>	 !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubemaster1002.eqiad.wmnet with reason: disable puppet and k8s controlplane
[10:17:15] <wikibugs>	 06SRE, 06serviceops, 13Patch-For-Review: k8s master capacity issues - https://phabricator.wikimedia.org/T366094#9841311 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=2f1b90d9-2cd4-4705-bbf1-70fdacf169cd) set by akosiaris@cumin1002 for 2:00:00 on 1 host(s) and their services with reason:...
[10:17:30] <wikibugs>	 (03PS1) 10Klausman: install/partman: Separate out DSE cluster partman recipe from ML [puppet] - 10https://gerrit.wikimedia.org/r/1037042 (https://phabricator.wikimedia.org/T365971)
[10:18:03] <wikibugs>	 (03PS2) 10Klausman: install/partman: Separate out DSE cluster partman recipe from ML [puppet] - 10https://gerrit.wikimedia.org/r/1037042 (https://phabricator.wikimedia.org/T365971)
[10:19:50] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: sync
[10:19:51] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: sync
[10:20:22] <icinga-wm_>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - kubemaster_6443: Servers kubemaster1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:20:22] <icinga-wm_>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - kubemaster_6443: Servers kubemaster1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:21:03] <akosiaris>	 ah dammit
[10:21:27] <akosiaris>	 but it shouldn't reply anyway
[10:22:39] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+1] otelcol: add three new k8s ctrl IPs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036708 (https://phabricator.wikimedia.org/T366094) (owner: 10CDanis)
[10:24:35] <logmsgbot>	 !log akosiaris@cumin1002 conftool action : set/pooled=inactive; selector: service=kubemaster,dc=eqiad,cluster=kubernetes,name=kubemaster1002.eqiad.wmnet
[10:24:41] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: sync
[10:24:41] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: sync
[10:24:46] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: sync
[10:24:57] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: sync
[10:26:01] <wikibugs>	 (03CR) 10Slyngshede: [V:03+2 C:03+2] Bump to version 6.6.15.1 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1037000 (https://phabricator.wikimedia.org/T366140) (owner: 10Slyngshede)
[10:26:32] <moritzm>	 !log installing intel-microcode security updates
[10:26:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:41] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: sync
[10:26:43] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: sync
[10:28:06] <wikibugs>	 06SRE, 10Wikimedia-SVG-rendering: Install 'ttf-ubuntu-font-family' on clusters rendering SVG to PNG - https://phabricator.wikimedia.org/T32288#9841338 (10Arthur2e5) Undone by https://phabricator.wikimedia.org/rOPUP33b0f4f1308bd03d1422f34e23c0ac8794ab86bf because Ubuntu is non-free. Welp, there goes my fanc...
[10:28:17] <wikibugs>	 (03PS1) 10Jelto: docker_registry_ha: replace deprecated /-/jwks endpoint on gitlab [puppet] - 10https://gerrit.wikimedia.org/r/1037043 (https://phabricator.wikimedia.org/T365675)
[10:29:06] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdl) failed in moss-be1002 - https://phabricator.wikimedia.org/T366153 (10MatthewVernon) 03NEW
[10:29:52] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.reimage for host mc1049.eqiad.wmnet with OS bookworm
[10:30:19] <wikibugs>	 (03PS1) 10Slyngshede: P:idp::build remove duplicate rsync restart. [puppet] - 10https://gerrit.wikimedia.org/r/1037044
[10:30:53] <wikibugs>	 (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2679/co" [puppet] - 10https://gerrit.wikimedia.org/r/1037043 (https://phabricator.wikimedia.org/T365675) (owner: 10Jelto)
[10:30:58] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1037044 (owner: 10Slyngshede)
[10:35:22] <icinga-wm_>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:35:31] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] P:idp::build remove duplicate rsync restart. [puppet] - 10https://gerrit.wikimedia.org/r/1037044 (owner: 10Slyngshede)
[10:35:45] <logmsgbot>	 !log akosiaris@cumin1002 conftool action : set/pooled=inactive; selector: name=parse1002.eqiad.wmnet
[10:36:48] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: sync
[10:36:48] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: sync
[10:38:52] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: sync
[10:38:53] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: sync
[10:39:45] <jinxer-wm>	 RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate
[10:43:20] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: sync
[10:43:20] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-wikifunctions: sync
[10:43:20] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: sync
[10:43:20] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: sync
[10:43:24] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1049.eqiad.wmnet with reason: host reimage
[10:43:35] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-wikifunctions: sync
[10:43:40] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: sync
[10:44:45] <jinxer-wm>	 RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ...
[10:44:45] <jinxer-wm>	 fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[10:45:02] <wikibugs>	 (03PS1) 10Marostegui: filtered_tables.txt: Remove gu_salt [puppet] - 10https://gerrit.wikimedia.org/r/1037046 (https://phabricator.wikimedia.org/T366123)
[10:45:27] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: sync
[10:45:28] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: sync
[10:46:25] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1049.eqiad.wmnet with reason: host reimage
[10:49:47] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2098.codfw.wmnet with reason: Maintenance
[10:49:49] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2098.codfw.wmnet with reason: Maintenance
[10:50:24] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2098.codfw.wmnet with reason: Maintenance
[10:50:27] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2098.codfw.wmnet with reason: Maintenance
[10:51:09] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-wikifunctions: sync
[10:51:10] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: sync
[10:51:10] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: sync
[10:51:10] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: sync
[10:51:10] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: sync
[10:51:10] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: sync
[10:51:44] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-wikifunctions: sync
[10:52:23] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: sync
[10:54:17] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[10:54:18] <wikibugs>	 ops-eqiad, SRE, cloud-services-team, Cloud-VPS, and 2 others: Degraded RAID on cloudcephosd1031 - https://phabricator.wikimedia.org/T364060#9841481 (dcaro) Resolved→In progress Thank @Jclark-ctr, I don't see the drive on the host (sda) though: ` root@cloudcephosd1031:~# ls -la /dev/sd? br...
[10:54:30] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[10:54:32] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[10:54:47] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[10:54:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1158 (T366123)', diff saved to https://phabricator.wikimedia.org/P63515 and previous config saved to /var/cache/conftool/dbconfig/20240529-105454-marostegui.json
[10:55:01] <stashbot>	 T366123: Drop gu_salt from globaluser in WMF prod - https://phabricator.wikimedia.org/T366123
[10:55:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 21.69% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:55:26] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: sync
[10:55:29] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: sync
[10:55:42] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-wikifunctions: sync
[10:55:43] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: sync
[10:55:43] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: sync
[10:55:43] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: sync
[10:55:43] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: sync
[10:55:43] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: sync
[10:55:44] <jinxer-wm>	 FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[10:55:45] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: sync
[10:56:03] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: sync
[10:56:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T366123)', diff saved to https://phabricator.wikimedia.org/P63516 and previous config saved to /var/cache/conftool/dbconfig/20240529-105604-marostegui.json
[10:56:08] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-wikifunctions: sync
[10:56:19] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: sync
[10:56:32] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: sync
[10:56:43] <akosiaris>	 !incidents
[10:56:44] <sirenbot>	 4709 (ACKED)  HaproxyUnavailable cache_text global sre (thanos-rule)
[10:56:44] <sirenbot>	 4708 (RESOLVED)  [2x] ProbeDown sre (kubemaster1002:6443 probes/custom eqiad)
[10:56:44] <sirenbot>	 4707 (RESOLVED)  [2x] ProbeDown sre (kubemaster1001:6443 probes/custom eqiad)
[10:56:44] <sirenbot>	 4706 (RESOLVED)  ProbeDown sre (2620:0:861:103:10:64:32:116 ip6 kubemaster1002:6443 probes/custom http_eqiad_kube_apiserver_ip6 eqiad)
[10:56:44] <sirenbot>	 4705 (RESOLVED)  ProbeDown sre (2620:0:861:103:10:64:32:116 ip6 kubemaster1002:6443 probes/custom http_eqiad_kube_apiserver_ip6 eqiad)
[10:56:44] <sirenbot>	 4703 (RESOLVED)  ProbeDown sre (2620:0:861:103:10:64:32:116 ip6 kubemaster1002:6443 probes/custom http_eqiad_kube_apiserver_ip6 eqiad)
[10:56:56] <akosiaris>	 what's cache_text about?
[10:57:15] <kamila_>	 akosiaris: I assume unrelated to the k8s thing, looking
[10:57:34] <jelto>	 at least the linked metric in https://grafana.wikimedia.org/d/000000479/cdn-frontend-traffic?viewPanel=13&orgId=1&from=now-30m&to=now is recovering again
[10:57:54] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: sync
[10:57:54] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: sync
[10:58:16] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 1.702s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:58:21] <akosiaris>	 I have 1 rollback btw
[10:58:23] <akosiaris>	 had*
[10:58:32] <akosiaris>	 which explains some of the high latencies etc
[10:58:51] <kamila_>	 oh, okay
[10:59:51] <wikibugs>	 (PS1) Santiago Faci: device-analytics deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1037048 (https://phabricator.wikimedia.org/T360524)
[10:59:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:00:04] <jouncebot>	 mvolz: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240529T1100).
[11:00:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 21.69% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[11:00:44] <jinxer-wm>	 RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[11:02:50] <wikibugs>	 (CR) Brouberol: [C:+1] device-analytics deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1037048 (https://phabricator.wikimedia.org/T360524) (owner: Santiago Faci)
[11:03:13] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: sync
[11:03:16] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 1.067s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[11:03:20] <wikibugs>	 (CR) Santiago Faci: [C:+2] device-analytics deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1037048 (https://phabricator.wikimedia.org/T360524) (owner: Santiago Faci)
[11:03:20] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1049.eqiad.wmnet with OS bookworm
[11:03:42] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'sync'.
[11:03:43] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'sync'.
[11:03:56] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'sync'.
[11:03:57] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'sync'.
[11:04:12] <akosiaris>	 !log redeploy opentelemetry collector T366094
[11:04:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:17] <stashbot>	 T366094: k8s master capacity issues - https://phabricator.wikimedia.org/T366094
[11:04:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:05:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T364299)', diff saved to https://phabricator.wikimedia.org/P63517 and previous config saved to /var/cache/conftool/dbconfig/20240529-110501-marostegui.json
[11:05:07] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[11:05:31] <wikibugs>	 (Merged) jenkins-bot: device-analytics deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1037048 (https://phabricator.wikimedia.org/T360524) (owner: Santiago Faci)
[11:06:39] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'sync'.
[11:06:42] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'sync'.
[11:07:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:10:08] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-wikifunctions: sync
[11:10:08] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: sync
[11:10:08] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: sync
[11:10:08] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: sync
[11:10:08] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: sync
[11:10:09] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: sync
[11:10:56] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-wikifunctions: sync
[11:11:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P63518 and previous config saved to /var/cache/conftool/dbconfig/20240529-111112-marostegui.json
[11:11:15] <akosiaris>	 !incidents
[11:11:15] <sirenbot>	 4709 (RESOLVED)  HaproxyUnavailable cache_text global sre (thanos-rule)
[11:11:16] <sirenbot>	 4708 (RESOLVED)  [2x] ProbeDown sre (kubemaster1002:6443 probes/custom eqiad)
[11:11:16] <sirenbot>	 4707 (RESOLVED)  [2x] ProbeDown sre (kubemaster1001:6443 probes/custom eqiad)
[11:11:16] <sirenbot>	 4706 (RESOLVED)  ProbeDown sre (2620:0:861:103:10:64:32:116 ip6 kubemaster1002:6443 probes/custom http_eqiad_kube_apiserver_ip6 eqiad)
[11:11:16] <sirenbot>	 4705 (RESOLVED)  ProbeDown sre (2620:0:861:103:10:64:32:116 ip6 kubemaster1002:6443 probes/custom http_eqiad_kube_apiserver_ip6 eqiad)
[11:11:16] <sirenbot>	 4703 (RESOLVED)  ProbeDown sre (2620:0:861:103:10:64:32:116 ip6 kubemaster1002:6443 probes/custom http_eqiad_kube_apiserver_ip6 eqiad)
[11:12:03] <wikibugs>	 (CR) Ladsgroup: Use pt-heartbeat for all non-static external clusters (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/893835 (https://phabricator.wikimedia.org/T129093) (owner: Aaron Schulz)
[11:12:04] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: sync
[11:14:44] <jinxer-wm>	 FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[11:14:45] <jinxer-wm>	 FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate
[11:14:52] <marostegui>	 here we go again
[11:15:01] <akosiaris>	 yeah, that one was expected 
[11:15:08] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: sync
[11:15:10] <akosiaris>	 it was my last test, pinky promise
[11:15:13] <marostegui>	 haha
[11:15:14] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: sync
[11:15:19] <mvolz>	 thinking of deploying... should I hold off? 
[11:15:38] <akosiaris>	 mvolz: yeah, wait like 5-10 m
[11:15:48] <mvolz>	 gotcha
[11:16:07] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: sync
[11:17:16] <jinxer-wm>	 FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 1.593s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[11:18:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:18:50] <akosiaris>	 gerrit isn't related to my tests btw
[11:18:56] <akosiaris>	 !incidents
[11:18:56] <sirenbot>	 4710 (ACKED)  HaproxyUnavailable cache_text global sre (thanos-rule)
[11:18:56] <sirenbot>	 4709 (RESOLVED)  HaproxyUnavailable cache_text global sre (thanos-rule)
[11:18:57] <sirenbot>	 4708 (RESOLVED)  [2x] ProbeDown sre (kubemaster1002:6443 probes/custom eqiad)
[11:18:57] <sirenbot>	 4707 (RESOLVED)  [2x] ProbeDown sre (kubemaster1001:6443 probes/custom eqiad)
[11:18:57] <sirenbot>	 4706 (RESOLVED)  ProbeDown sre (2620:0:861:103:10:64:32:116 ip6 kubemaster1002:6443 probes/custom http_eqiad_kube_apiserver_ip6 eqiad)
[11:18:57] <sirenbot>	 4705 (RESOLVED)  ProbeDown sre (2620:0:861:103:10:64:32:116 ip6 kubemaster1002:6443 probes/custom http_eqiad_kube_apiserver_ip6 eqiad)
[11:18:57] <sirenbot>	 4703 (RESOLVED)  ProbeDown sre (2620:0:861:103:10:64:32:116 ip6 kubemaster1002:6443 probes/custom http_eqiad_kube_apiserver_ip6 eqiad)
[11:19:29] <Kizule>	 Hi, is Gerrit working for you?
[11:19:33] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: sync
[11:19:44] <jinxer-wm>	 RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[11:19:45] <jinxer-wm>	 FIRING: [2x] CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate
[11:20:07] <akosiaris>	 Kizule: we have an active alert for gerrit that fired 1 minute ago
[11:20:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P63519 and previous config saved to /var/cache/conftool/dbconfig/20240529-112009-marostegui.json
[11:20:21] <Kizule>	 akosiaris: I haven't seen it, sorry for asking then.
[11:20:36] <akosiaris>	 no worries, just letting you know we are aware of the problem
[11:20:45] <jinxer-wm>	 FIRING: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[11:20:45] <wikibugs>	 (PS1) Muehlenhoff: Remove access for mabualruz [puppet] - https://gerrit.wikimedia.org/r/1037051
[11:20:51] <jelto>	 I can take a look at gerrit
[11:20:57] <wikibugs>	 (PS1) Santiago Faci: device-analytics deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1037052 (https://phabricator.wikimedia.org/T360524)
[11:21:11] <akosiaris>	 thanks jelto!
[11:21:48] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job gerrit-metrics in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:22:16] <jinxer-wm>	 FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 1.297s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[11:22:51] <wikibugs>	 (CR) Santiago Faci: [C:+2] device-analytics deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1037052 (https://phabricator.wikimedia.org/T360524) (owner: Santiago Faci)
[11:22:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:23:27] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'sync'.
[11:23:29] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'sync'.
[11:23:57] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job gerrit-metrics in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:24:14] <wikibugs>	 (Merged) jenkins-bot: device-analytics deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1037052 (https://phabricator.wikimedia.org/T360524) (owner: Santiago Faci)
[11:24:45] <jinxer-wm>	 RESOLVED: [2x] CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate
[11:25:45] <jinxer-wm>	 RESOLVED: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[11:25:51] <logmsgbot>	 !log sfaci@deploy1002 helmfile [staging] START helmfile.d/services/device-analytics: apply
[11:25:53] <Kizule>	 Gerrit is back for me. :)
[11:25:57] <Kizule>	 Thanks!
[11:26:17] <logmsgbot>	 !log akosiaris@cumin1002 conftool action : set/pooled=yes; selector: service=kubemaster,dc=eqiad,cluster=kubernetes,name=kubemaster1002.eqiad.wmnet
[11:26:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P63520 and previous config saved to /var/cache/conftool/dbconfig/20240529-112621-marostegui.json
[11:26:45] <logmsgbot>	 !log akosiaris@cumin1002 conftool action : set/pooled=yes; selector: service=kubemaster,dc=eqiad,cluster=kubernetes,name=wikikube-ctrl1001.eqiad.wmnet
[11:26:48] <logmsgbot>	 !log sfaci@deploy1002 helmfile [staging] DONE helmfile.d/services/device-analytics: apply
[11:27:16] <jinxer-wm>	 RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 1.08s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[11:28:31] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:31:56] <wikibugs>	 (PS1) Effie Mouzeli: memcached: test extstore on 10 servers [puppet] - https://gerrit.wikimedia.org/r/1037053 (https://phabricator.wikimedia.org/T352885)
[11:32:32] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove access for mabualruz [puppet] - https://gerrit.wikimedia.org/r/1037051 (owner: Muehlenhoff)
[11:32:50] <wikibugs>	 (PS2) Effie Mouzeli: memcached: test extstore on 10 servers [puppet] - https://gerrit.wikimedia.org/r/1037053 (https://phabricator.wikimedia.org/T352885)
[11:35:13] <jelto>	 yeah gerrit should be back :)
[11:35:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P63521 and previous config saved to /var/cache/conftool/dbconfig/20240529-113517-marostegui.json
[11:35:44] <wikibugs>	 (PS1) Muehlenhoff: Switch db1235 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1037054 (https://phabricator.wikimedia.org/T349619)
[11:38:35] <logmsgbot>	 !log sfaci@deploy1002 helmfile [codfw] START helmfile.d/services/device-analytics: apply
[11:38:47] <wikibugs>	 (Abandoned) Zabe: filtered_tables: Remove gu_salt [puppet] - https://gerrit.wikimedia.org/r/1031608 (https://phabricator.wikimedia.org/T364435) (owner: Zabe)
[11:39:55] <wikibugs>	 (PS3) Santiago Faci: geo-analytics deployment: AQS 2.0 refactoring [deployment-charts] - https://gerrit.wikimedia.org/r/1036192 (https://phabricator.wikimedia.org/T360525)
[11:40:23] <logmsgbot>	 !log sfaci@deploy1002 helmfile [codfw] DONE helmfile.d/services/device-analytics: apply
[11:40:42] <wikibugs>	 (PS1) Kosta Harlan: alertmanager: route Trust and Safety Product team alerts [puppet] - https://gerrit.wikimedia.org/r/1037056 (https://phabricator.wikimedia.org/T366165)
[11:41:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T366123)', diff saved to https://phabricator.wikimedia.org/P63522 and previous config saved to /var/cache/conftool/dbconfig/20240529-114129-marostegui.json
[11:41:32] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[11:41:35] <stashbot>	 T366123: Drop gu_salt from globaluser in WMF prod - https://phabricator.wikimedia.org/T366123
[11:41:45] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[11:41:47] <hnowlan>	 !log homer "cr*eqiad*" commit 'adding bgp state for wikikube-ctrl1002' 
[11:41:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:41:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1170 (T366123)', diff saved to https://phabricator.wikimedia.org/P63523 and previous config saved to /var/cache/conftool/dbconfig/20240529-114153-marostegui.json
[11:42:18] <wikibugs>	 (PS4) Santiago Faci: media-analytics deployment: AQS 2.0 refactoring [deployment-charts] - https://gerrit.wikimedia.org/r/1036194 (https://phabricator.wikimedia.org/T360526)
[11:42:43] <marostegui>	 !log recreate triggers on s7 eqiad db maint db1155:3317 T366167 
[11:42:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:42:48] <stashbot>	 T366167: Update centralauth triggers - https://phabricator.wikimedia.org/T366167
[11:42:58] <logmsgbot>	 !log sfaci@deploy1002 helmfile [eqiad] START helmfile.d/services/device-analytics: apply
[11:44:45] <logmsgbot>	 !log sfaci@deploy1002 helmfile [eqiad] DONE helmfile.d/services/device-analytics: apply
[11:44:53] <wikibugs>	 (CR) Brouberol: [C:+1] media-analytics deployment: AQS 2.0 refactoring [deployment-charts] - https://gerrit.wikimedia.org/r/1036194 (https://phabricator.wikimedia.org/T360526) (owner: Santiago Faci)
[11:44:59] <wikibugs>	 (CR) Brouberol: [C:+1] geo-analytics deployment: AQS 2.0 refactoring [deployment-charts] - https://gerrit.wikimedia.org/r/1036192 (https://phabricator.wikimedia.org/T360525) (owner: Santiago Faci)
[11:45:48] <wikibugs>	 (CR) Kosta Harlan: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1037056 (https://phabricator.wikimedia.org/T366165) (owner: Kosta Harlan)
[11:45:52] <wikibugs>	 (CR) Dreamy Jazz: [C:+1] alertmanager: route Trust and Safety Product team alerts [puppet] - https://gerrit.wikimedia.org/r/1037056 (https://phabricator.wikimedia.org/T366165) (owner: Kosta Harlan)
[11:46:11] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Mabualruz out of all services on: 2198 hosts
[11:46:14] <wikibugs>	 (PS5) Santiago Faci: page-analytics deployment: AQS 2.0 refactoring [deployment-charts] - https://gerrit.wikimedia.org/r/1036261 (https://phabricator.wikimedia.org/T360523)
[11:46:40] <wikibugs>	 (CR) Marostegui: "Sorry, I forgot you also created the patch!" [puppet] - https://gerrit.wikimedia.org/r/1031608 (https://phabricator.wikimedia.org/T364435) (owner: Zabe)
[11:46:54] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Mabualruz out of all services on: 2198 hosts
[11:47:45] <wikibugs>	 (CR) Brouberol: [C:+1] page-analytics deployment: AQS 2.0 refactoring [deployment-charts] - https://gerrit.wikimedia.org/r/1036261 (https://phabricator.wikimedia.org/T360523) (owner: Santiago Faci)
[11:49:23] <wikibugs>	 (CR) Santiago Faci: [C:+2] page-analytics deployment: AQS 2.0 refactoring [deployment-charts] - https://gerrit.wikimedia.org/r/1036261 (https://phabricator.wikimedia.org/T360523) (owner: Santiago Faci)
[11:50:22] <wikibugs>	 (Merged) jenkins-bot: page-analytics deployment: AQS 2.0 refactoring [deployment-charts] - https://gerrit.wikimedia.org/r/1036261 (https://phabricator.wikimedia.org/T360523) (owner: Santiago Faci)
[11:50:27] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T364299)', diff saved to https://phabricator.wikimedia.org/P63524 and previous config saved to /var/cache/conftool/dbconfig/20240529-115025-marostegui.json
[11:50:30] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2122.codfw.wmnet with reason: Maintenance
[11:50:32] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[11:50:35] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Switch db1235 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1037054 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[11:50:43] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2122.codfw.wmnet with reason: Maintenance
[11:50:51] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2122 (T364299)', diff saved to https://phabricator.wikimedia.org/P63525 and previous config saved to /var/cache/conftool/dbconfig/20240529-115051-marostegui.json
[11:53:54] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.hosts.reimage for host ml-staging2002.codfw.wmnet with OS bookworm
[11:54:46] <wikibugs>	 (PS2) Hashar: contint: enable zuul-merger daemon on contint2002 [puppet] - https://gerrit.wikimedia.org/r/1036762 (https://phabricator.wikimedia.org/T334517) (owner: Dzahn)
[11:55:55] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1235.eqiad.wmnet
[12:02:09] <wikibugs>	 (CR) Effie Mouzeli: [C:+2] memcached: test extstore on 10 servers [puppet] - https://gerrit.wikimedia.org/r/1037053 (https://phabricator.wikimedia.org/T352885) (owner: Effie Mouzeli)
[12:04:24] <logmsgbot>	 !log jiji@cumin2002 START - Cookbook sre.hosts.reimage for host mc2048.codfw.wmnet with OS bookworm
[12:05:29] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.reimage for host mc1048.eqiad.wmnet with OS bookworm
[12:06:56] <logmsgbot>	 !log sfaci@deploy1002 helmfile [staging] START helmfile.d/services/page-analytics: apply
[12:07:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T366123)', diff saved to https://phabricator.wikimedia.org/P63526 and previous config saved to /var/cache/conftool/dbconfig/20240529-120730-marostegui.json
[12:07:37] <stashbot>	 T366123: Drop gu_salt from globaluser in WMF prod - https://phabricator.wikimedia.org/T366123
[12:07:39] <wikibugs>	 (CR) Esanders: [C:+1] "I don't have +2 in this repo, but LGTM." [deployment-charts] - https://gerrit.wikimedia.org/r/1034860 (https://phabricator.wikimedia.org/T366093) (owner: Mvolz)
[12:07:50] <logmsgbot>	 !log sfaci@deploy1002 helmfile [staging] DONE helmfile.d/services/page-analytics: apply
[12:08:52] <logmsgbot>	 !log sfaci@deploy1002 helmfile [codfw] START helmfile.d/services/page-analytics: apply
[12:10:20] <logmsgbot>	 !log sfaci@deploy1002 helmfile [codfw] DONE helmfile.d/services/page-analytics: apply
[12:11:04] <wikibugs>	 (PS1) Slyngshede: IDP: Failover for 6.6.15 upgrade [dns] - https://gerrit.wikimedia.org/r/1037061 (https://phabricator.wikimedia.org/T366140)
[12:11:13] <logmsgbot>	 !log sfaci@deploy1002 helmfile [eqiad] START helmfile.d/services/page-analytics: apply
[12:12:49] <logmsgbot>	 !log sfaci@deploy1002 helmfile [eqiad] DONE helmfile.d/services/page-analytics: apply
[12:13:52] <wikibugs>	 (CR) Santiago Faci: [C:+2] geo-analytics deployment: AQS 2.0 refactoring [deployment-charts] - https://gerrit.wikimedia.org/r/1036192 (https://phabricator.wikimedia.org/T360525) (owner: Santiago Faci)
[12:14:37] <wikibugs>	 (Merged) jenkins-bot: geo-analytics deployment: AQS 2.0 refactoring [deployment-charts] - https://gerrit.wikimedia.org/r/1036192 (https://phabricator.wikimedia.org/T360525) (owner: Santiago Faci)
[12:14:41] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-staging2002.codfw.wmnet with reason: host reimage
[12:15:34] <logmsgbot>	 !log sfaci@deploy1002 helmfile [staging] START helmfile.d/services/geo-analytics: apply
[12:16:20] <logmsgbot>	 !log sfaci@deploy1002 helmfile [staging] DONE helmfile.d/services/geo-analytics: apply
[12:16:41] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] alertmanager: route Trust and Safety Product team alerts [puppet] - 10https://gerrit.wikimedia.org/r/1037056 (https://phabricator.wikimedia.org/T366165) (owner: 10Kosta Harlan)
[12:17:08] <logmsgbot>	 !log sfaci@deploy1002 helmfile [codfw] START helmfile.d/services/geo-analytics: apply
[12:17:12] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-staging2002.codfw.wmnet with reason: host reimage
[12:18:15] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1037061 (https://phabricator.wikimedia.org/T366140) (owner: 10Slyngshede)
[12:18:28] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] IDP: Failover for 6.6.15 upgrade [dns] - 10https://gerrit.wikimedia.org/r/1037061 (https://phabricator.wikimedia.org/T366140) (owner: 10Slyngshede)
[12:18:57] <logmsgbot>	 !log sfaci@deploy1002 helmfile [codfw] DONE helmfile.d/services/geo-analytics: apply
[12:19:02] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1048.eqiad.wmnet with reason: host reimage
[12:19:04] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] role::thanos::frontend: move all envoy TLS certs to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1036643 (https://phabricator.wikimedia.org/T344324) (owner: 10Elukey)
[12:19:40] <slyngs>	 !log Failover idp.wikimedia.org for CAS upgrade to 6.6.15
[12:19:41] <logmsgbot>	 !log sfaci@deploy1002 helmfile [eqiad] START helmfile.d/services/geo-analytics: apply
[12:19:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:20:54] <wikibugs>	 (03PS1) 10Ladsgroup: admin: Remove home files for several departed staff [puppet] - 10https://gerrit.wikimedia.org/r/1037062
[12:21:27] <logmsgbot>	 !log sfaci@deploy1002 helmfile [eqiad] DONE helmfile.d/services/geo-analytics: apply
[12:22:20] <wikibugs>	 (03CR) 10Santiago Faci: [C:03+2] media-analytics deployment: AQS 2.0 refactoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036194 (https://phabricator.wikimedia.org/T360526) (owner: 10Santiago Faci)
[12:22:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P63527 and previous config saved to /var/cache/conftool/dbconfig/20240529-122239-marostegui.json
[12:22:45] <logmsgbot>	 !log jiji@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2048.codfw.wmnet with reason: host reimage
[12:22:49] <wikibugs>	 (03Abandoned) 10Ladsgroup: admin: Remove home files for several departed staff [puppet] - 10https://gerrit.wikimedia.org/r/1037062 (owner: 10Ladsgroup)
[12:23:14] <wikibugs>	 (03Merged) 10jenkins-bot: media-analytics deployment: AQS 2.0 refactoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036194 (https://phabricator.wikimedia.org/T360526) (owner: 10Santiago Faci)
[12:23:42] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:24:22] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1048.eqiad.wmnet with reason: host reimage
[12:25:07] <logmsgbot>	 !log sfaci@deploy1002 helmfile [staging] START helmfile.d/services/media-analytics: apply
[12:26:03] <logmsgbot>	 !log sfaci@deploy1002 helmfile [staging] DONE helmfile.d/services/media-analytics: apply
[12:28:02] <logmsgbot>	 !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2048.codfw.wmnet with reason: host reimage
[12:29:11] <logmsgbot>	 !log sfaci@deploy1002 helmfile [codfw] START helmfile.d/services/media-analytics: apply
[12:30:40] <logmsgbot>	 !log sfaci@deploy1002 helmfile [codfw] DONE helmfile.d/services/media-analytics: apply
[12:34:49] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-staging2002.codfw.wmnet with OS bookworm
[12:35:34] <logmsgbot>	 !log sfaci@deploy1002 helmfile [eqiad] START helmfile.d/services/media-analytics: apply
[12:36:36] <wikibugs>	 (03CR) 10Elukey: [V:03+1 C:03+2] role::thanos::frontend: move all envoy TLS certs to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1036643 (https://phabricator.wikimedia.org/T344324) (owner: 10Elukey)
[12:37:09] <logmsgbot>	 !log sfaci@deploy1002 helmfile [eqiad] DONE helmfile.d/services/media-analytics: apply
[12:37:42] <ottomata>	 upcoming backport deployers: I have to drop kid off at daycare, may be back slightly after the hour.  cc RoanKattouw Lucas_WMDE etc.
[12:37:47] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P63528 and previous config saved to /var/cache/conftool/dbconfig/20240529-123746-marostegui.json
[12:38:04] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove skel files for former WMF staff members [puppet] - 10https://gerrit.wikimedia.org/r/1037064
[12:39:17] <elukey>	 !log move thanos-fe100[3,4] and thanos-fe2* to PKI TLS certs (envoy, backends for thanos-swift.discovery.wmnet) - T344324
[12:39:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:39:22] <stashbot>	 T344324: Maps Unavailability due to thanos-swift cfssl rollout  (14 Aug 2023) - https://phabricator.wikimedia.org/T344324
[12:39:33] <wikibugs>	 (03CR) 10Jelto: [C:03+2] contint: enable zuul-merger daemon on contint2002 [puppet] - 10https://gerrit.wikimedia.org/r/1036762 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn)
[12:40:27] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1048.eqiad.wmnet with OS bookworm
[12:42:46] <marostegui>	 !log recreate triggers on s7 codfw db maint db1155:3317 T366167 
[12:42:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:42:51] <stashbot>	 T366167: Update centralauth triggers - https://phabricator.wikimedia.org/T366167
[12:42:54] <marostegui>	 !log recreate triggers on s7 codfw db maint db2187:3317 T366167 
[12:42:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:43:12] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on db1196.eqiad.wmnet with reason: reimage
[12:43:25] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1196.eqiad.wmnet with reason: reimage
[12:43:53] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool db1196 T364290', diff saved to https://phabricator.wikimedia.org/P63529 and previous config saved to /var/cache/conftool/dbconfig/20240529-124352-arnaudb.json
[12:43:58] <stashbot>	 T364290: Upgrade s1 to MariaDB 10.6 - https://phabricator.wikimedia.org/T364290
[12:45:08] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db[1154,1196].eqiad.wmnet with reason: reimage db1196
[12:45:24] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db[1154,1196].eqiad.wmnet with reason: reimage db1196
[12:45:25] <logmsgbot>	 !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2048.codfw.wmnet with OS bookworm
[12:46:47] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db1196.eqiad.wmnet with OS bookworm
[12:49:07] <wikibugs>	 (03Abandoned) 10Ssingh: mw-api-ext: Add 20 replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1022062 (owner: 10Clément Goubert)
[12:49:29] <wikibugs>	 (03Abandoned) 10Ssingh: Disable Enterprise bypassing CDN rate limits [puppet] - 10https://gerrit.wikimedia.org/r/1022092 (owner: 10CDanis)
[12:49:56] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "Keith: just clearing up the backlog, do we still need to merge this? Thanks!" [dns] - 10https://gerrit.wikimedia.org/r/961132 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron)
[12:50:33] <wikibugs>	 (03PS1) 10Hashar: gerrit: enable change.diff3ConflictView [puppet] - 10https://gerrit.wikimedia.org/r/1037065 (https://phabricator.wikimedia.org/T359821)
[12:50:48] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] "Tested and LGTM, thank you! Adding other o11y folks as heads up" [puppet] - 10https://gerrit.wikimedia.org/r/1036763 (owner: 10JHathaway)
[12:52:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T366123)', diff saved to https://phabricator.wikimedia.org/P63530 and previous config saved to /var/cache/conftool/dbconfig/20240529-125255-marostegui.json
[12:52:58] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[12:53:01] <stashbot>	 T366123: Drop gu_salt from globaluser in WMF prod - https://phabricator.wikimedia.org/T366123
[12:53:11] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[12:53:35] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] "Perfect!" [puppet] - 10https://gerrit.wikimedia.org/r/1034961 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking)
[12:53:42] <wikibugs>	 (03PS2) 10Hashar: gerrit: enable change.diff3ConflictView [puppet] - 10https://gerrit.wikimedia.org/r/1037065 (https://phabricator.wikimedia.org/T359821)
[12:53:56] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: mariadb::core
[12:54:06] <wikibugs>	 (03CR) 10Hashar: "I will upgrade Gerrit to 3.9.x on Monday and we can apply that setting ahead of time to have the feature enabled as we upgrade. `diff3` is" [puppet] - 10https://gerrit.wikimedia.org/r/1037065 (https://phabricator.wikimedia.org/T359821) (owner: 10Hashar)
[12:54:54] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.reimage for host mc1047.eqiad.wmnet with OS bookworm
[12:54:58] <wikibugs>	 (03PS1) 10Filippo Giunchedi: rsyslog: notify receiver on cert change [puppet] - 10https://gerrit.wikimedia.org/r/1037066
[12:55:00] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch mariadb::core to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1037067 (https://phabricator.wikimedia.org/T349619)
[12:55:05] <logmsgbot>	 !log jiji@cumin2002 START - Cookbook sre.hosts.reimage for host mc2047.codfw.wmnet with OS bookworm
[12:55:19] <wikibugs>	 (03CR) 10CI reject: [V:04-1] rsyslog: notify receiver on cert change [puppet] - 10https://gerrit.wikimedia.org/r/1037066 (owner: 10Filippo Giunchedi)
[12:56:04] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Switch mariadb::core to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1037067 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[12:57:06] <wikibugs>	 (03PS5) 10Ayounsi: Add SameSite=Lax attribute to NetworkProbeLimit cookie [puppet] - 10https://gerrit.wikimedia.org/r/989457 (https://phabricator.wikimedia.org/T342624)
[12:57:41] <ottomata>	 I'm back!
[12:57:46] <wikibugs>	 (03CR) 10CDanis: [C:03+2] Add SameSite=Lax attribute to NetworkProbeLimit cookie [puppet] - 10https://gerrit.wikimedia.org/r/989457 (https://phabricator.wikimedia.org/T342624) (owner: 10Ayounsi)
[12:57:46] <wikibugs>	 (03CR) 10Brouberol: [C:04-1] "While these services will need to be removed from the service catalog, this is too soon. You should follow the instructions at https://wik" [puppet] - 10https://gerrit.wikimedia.org/r/1036994 (https://phabricator.wikimedia.org/T366137) (owner: 10Stevemunene)
[12:59:30] <wikibugs>	 (03PS2) 10Filippo Giunchedi: rsyslog: notify receiver on cert change [puppet] - 10https://gerrit.wikimedia.org/r/1037066
[12:59:42] <wikibugs>	 06SRE, 06serviceops, 13Patch-For-Review: k8s master capacity issues - https://phabricator.wikimedia.org/T366094#9841984 (10CDanis) >>! In T366094#9841558, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-sre), href=https://sal.toolforge.org/log/OwwWxI8BGiVuUzOd3n4x} [2024-05-29T11:23:04Z]...
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240529T1300).
[13:00:05] <jouncebot>	 ottomata: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:11] <ottomata>	 o/
[13:00:57] <ottomata>	 Hi, it's been a while since I've deployed config, and I only really knew how to do one file at a time.
[13:01:02] <ottomata>	 https://deploy-commands.toolforge.org/bacc/985023 looks new(ish) to me
[13:01:04] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1196.eqiad.wmnet with reason: host reimage
[13:01:19] <ottomata>	 I can do it if it is really that easy :)
[13:01:50] <cdanis>	 ottomata: `scap backport` is really that easy, yes :)
[13:02:01] <ottomata>	 okay, i'm the only one in the window, so I am proceeding
[13:02:26] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: mariadb::core
[13:03:51] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by otto@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata)
[13:04:30] <wikibugs>	 (03Merged) 10jenkins-bot: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata)
[13:04:33] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1196.eqiad.wmnet with reason: host reimage
[13:05:29] <logmsgbot>	 !log otto@deploy1002 Started scap: Backport for [[gerrit:985023|Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org (T353817 T323828)]]
[13:05:34] <wikibugs>	 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9841998 (10MoritzMuehlenhoff)
[13:05:35] <stashbot>	 T353817: Create legacy EventLogging proxy HTTP intake (for MediaWikiPingback) endpoint to EventGate - https://phabricator.wikimedia.org/T353817
[13:05:35] <stashbot>	 T323828: Update Pingback to use the Event Platform - https://phabricator.wikimedia.org/T323828
[13:07:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T364299)', diff saved to https://phabricator.wikimedia.org/P63531 and previous config saved to /var/cache/conftool/dbconfig/20240529-130713-marostegui.json
[13:07:20] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[13:08:21] <logmsgbot>	 !log otto@deploy1002 otto: Backport for [[gerrit:985023|Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org (T353817 T323828)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:08:22] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1047.eqiad.wmnet with reason: host reimage
[13:10:58] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1047.eqiad.wmnet with reason: host reimage
[13:11:26] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] benthos:cache: switch to rfc5424 format [puppet] - 10https://gerrit.wikimedia.org/r/1036711 (https://phabricator.wikimedia.org/T365718) (owner: 10Fabfur)
[13:11:29] <moritzm>	 !log installing apache2 security updates
[13:11:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:13:22] <logmsgbot>	 !log jiji@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2047.codfw.wmnet with reason: host reimage
[13:14:39] <logmsgbot>	 !log otto@deploy1002 otto: Continuing with sync
[13:15:06] <wikibugs>	 (03PS1) 10Marostegui: es*.yaml: Clean up puppet7 lines [puppet] - 10https://gerrit.wikimedia.org/r/1037069
[13:16:03] <fabfur>	 !log temporarily disabling puppet on A:cp to roll out https://gerrit.wikimedia.org/r/c/operations/puppet/+/1036711 (T365718)
[13:16:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:16:08] <stashbot>	 T365718: Switch HAProxy/Benthos to rfc5424 - https://phabricator.wikimedia.org/T365718
[13:16:41] <logmsgbot>	 !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2047.codfw.wmnet with reason: host reimage
[13:17:26] <wikibugs>	 (03CR) 10Fabfur: [V:03+1 C:03+2] benthos:cache: switch to rfc5424 format [puppet] - 10https://gerrit.wikimedia.org/r/1036711 (https://phabricator.wikimedia.org/T365718) (owner: 10Fabfur)
[13:21:09] <Lucas_WMDE>	 o/
[13:21:20] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T363086#9842054 (10VRiley-WMF) a:05Jclark-ctr→03VRiley-WMF
[13:21:58] <Lucas_WMDE>	 ottomata: yeah, `scap backport` should be all you need :)
[13:22:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P63532 and previous config saved to /var/cache/conftool/dbconfig/20240529-132221-marostegui.json
[13:23:54] <logmsgbot>	 !log otto@deploy1002 Finished scap: Backport for [[gerrit:985023|Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org (T353817 T323828)]] (duration: 18m 25s)
[13:24:04] <stashbot>	 T353817: Create legacy EventLogging proxy HTTP intake (for MediaWikiPingback) endpoint to EventGate - https://phabricator.wikimedia.org/T353817
[13:24:05] <stashbot>	 T323828: Update Pingback to use the Event Platform - https://phabricator.wikimedia.org/T323828
[13:25:32] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[13:25:46] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[13:25:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1174 (T366123)', diff saved to https://phabricator.wikimedia.org/P63533 and previous config saved to /var/cache/conftool/dbconfig/20240529-132553-marostegui.json
[13:26:00] <stashbot>	 T366123: Drop gu_salt from globaluser in WMF prod - https://phabricator.wikimedia.org/T366123
[13:26:50] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1196.eqiad.wmnet with OS bookworm
[13:27:26] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1196 (re)pooling @ 10%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63534 and previous config saved to /var/cache/conftool/dbconfig/20240529-132726-arnaudb.json
[13:27:38] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1047.eqiad.wmnet with OS bookworm
[13:28:19] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool db1169 T364290', diff saved to https://phabricator.wikimedia.org/P63535 and previous config saved to /var/cache/conftool/dbconfig/20240529-132818-arnaudb.json
[13:28:24] <stashbot>	 T364290: Upgrade s1 to MariaDB 10.6 - https://phabricator.wikimedia.org/T364290
[13:28:47] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on db1169.eqiad.wmnet with reason: reimage
[13:28:53] <wikibugs>	 (03CR) 10Bking: [C:03+2] dse-k8s: add new airflow service to k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/1034961 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking)
[13:29:00] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1169.eqiad.wmnet with reason: reimage
[13:30:03] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db1169.eqiad.wmnet with OS bookworm
[13:33:00] <wikibugs>	 06SRE, 10SRE-swift-storage: Consolidate TLS cert puppetry for ms and thanos swift frontends - https://phabricator.wikimedia.org/T356412#9842102 (10elukey) 05Stalled→03Resolved a:03elukey Thanos-Swift is running with PKI TLS certs, so now all Swift clusters use PKI. The puppet code seems already clean...
[13:33:08] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9842107 (10elukey)
[13:34:14] <wikibugs>	 (03CR) 10Elukey: [C:03+1] maps: Switch kartotherian on maps2007 to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1036236 (https://phabricator.wikimedia.org/T360778) (owner: 10Muehlenhoff)
[13:35:17] <wikibugs>	 (03PS1) 10Bking: Revert "dse-k8s: add new airflow service to k8s cluster" [puppet] - 10https://gerrit.wikimedia.org/r/1037014
[13:35:52] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] Revert "dse-k8s: add new airflow service to k8s cluster" [puppet] - 10https://gerrit.wikimedia.org/r/1037014 (owner: 10Bking)
[13:36:18] <logmsgbot>	 !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2047.codfw.wmnet with OS bookworm
[13:37:00] <wikibugs>	 (03PS1) 10Marostegui: redact_sanitarium.sh: Update sanitarium hosts [puppet] - 10https://gerrit.wikimedia.org/r/1037072
[13:37:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P63536 and previous config saved to /var/cache/conftool/dbconfig/20240529-133729-marostegui.json
[13:38:14] <wikibugs>	 (03PS1) 10NMW03: Enable wmgUseSandboxLink for Swahili Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037073 (https://phabricator.wikimedia.org/T365970)
[13:38:34] <wikibugs>	 (03PS4) 10Marostegui: mariadb: Promote db1192 to master [puppet] - 10https://gerrit.wikimedia.org/r/1035315 (https://phabricator.wikimedia.org/T364541)
[13:38:46] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove ms-fe certs [puppet] - 10https://gerrit.wikimedia.org/r/1037074 (https://phabricator.wikimedia.org/T357750)
[13:39:19] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1037074 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff)
[13:42:11] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1037066 (owner: 10Filippo Giunchedi)
[13:42:32] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1196 (re)pooling @ 25%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63537 and previous config saved to /var/cache/conftool/dbconfig/20240529-134232-arnaudb.json
[13:42:43] <wikibugs>	 (03PS1) 10Jgreen: Add an icinga/nsca collector for Fundraising kafka client cert expire check. [puppet] - 10https://gerrit.wikimedia.org/r/1037075 (https://phabricator.wikimedia.org/T360779)
[13:42:56] <ottomata>	 thanks cdanis that was pretty easy :)
[13:43:13] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1036763 (owner: 10JHathaway)
[13:43:47] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1169.eqiad.wmnet with reason: host reimage
[13:45:43] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:46:01] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] rsyslog: notify receiver on cert change [puppet] - 10https://gerrit.wikimedia.org/r/1037066 (owner: 10Filippo Giunchedi)
[13:46:37] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: host reimage
[13:49:20] <wikibugs>	 (03PS1) 10Bking: dse-k8s: add new airflow service to k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/1037077 (https://phabricator.wikimedia.org/T363001)
[13:51:20] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1037077 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking)
[13:52:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T364299)', diff saved to https://phabricator.wikimedia.org/P63538 and previous config saved to /var/cache/conftool/dbconfig/20240529-135237-marostegui.json
[13:52:40] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2150.codfw.wmnet with reason: Maintenance
[13:52:43] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[13:52:53] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2150.codfw.wmnet with reason: Maintenance
[13:53:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2150 (T364299)', diff saved to https://phabricator.wikimedia.org/P63539 and previous config saved to /var/cache/conftool/dbconfig/20240529-135300-marostegui.json
[13:54:01] <wikibugs>	 (03CR) 10MVernon: [C:03+1] "Looks reasonable to me (assuming PCC doesn't lie!)" [puppet] - 10https://gerrit.wikimedia.org/r/1037074 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff)
[13:55:19] <effie>	 !log label  wikikube-ctrl1002 as master 
[13:55:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:56] <logmsgbot>	 !log jiji@cumin1002 conftool action : set/pooled=yes:weight=10; selector: name=wikikube-ctrl1002.eqiad.wmnet
[13:57:01] <Lucas_WMDE>	 question for the assembled deployers here. I’m running a maintenance script (T315510, latest comments) which is expected to take about a week longer to finish
[13:57:02] <stashbot>	 T315510: Start maintenance script to backfill talk page comment database - https://phabricator.wikimedia.org/T315510
[13:57:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137 (T364069)', diff saved to https://phabricator.wikimedia.org/P63540 and previous config saved to /var/cache/conftool/dbconfig/20240529-135706-marostegui.json
[13:57:13] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[13:57:19] <Lucas_WMDE>	 but I’m on holiday starting tomorrow, so I won’t be able to report whether the script finished successfully or not
[13:57:28] <Lucas_WMDE>	 does that sound okay? or should I stop the script now and hand it over to someone else?
[13:57:39] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1196 (re)pooling @ 50%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63541 and previous config saved to /var/cache/conftool/dbconfig/20240529-135738-arnaudb.json
[13:58:35] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] "`" [puppet] - 10https://gerrit.wikimedia.org/r/1037077 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking)
[13:58:42] <jinxer-wm>	 FIRING: KubernetesCalicoDown: kubernetes2032.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubernetes2032.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[13:59:00] <wikibugs>	 (03CR) 10Ottomata: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata)
[13:59:02] <hnowlan>	 wat
[13:59:49] <wikibugs>	 (03PS2) 10CDanis: otelcol: add three new k8s ctrl IPs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036708 (https://phabricator.wikimedia.org/T366094)
[13:59:49] <wikibugs>	 (03PS1) 10CDanis: otelcol: disable logs & metrics pipelines [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037083 (https://phabricator.wikimedia.org/T366094)
[14:00:05] <jouncebot>	 Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240529T1400)
[14:00:16] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9842253 (10VRiley-WMF) Worked with Dell on kafka-main1009, we were able to replace some of the parts (Power Interface Board, and Right Control Panel) Which go...
[14:01:35] <wikibugs>	 06SRE, 06serviceops, 13Patch-For-Review: k8s master capacity issues - https://phabricator.wikimedia.org/T366094#9842257 (10akosiaris) I've gone ahead and created the following dashboard today [T366094](https://grafana-rw.wikimedia.org/d/d304d897-54ea-4062-a504-6f2567ed7dba/t366094?orgId=1&from=1716974133223...
[14:02:50] <wikibugs>	 (03CR) 10Bking: [C:03+2] dse-k8s: add new airflow service to k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/1037077 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking)
[14:04:30] <wikibugs>	 (03CR) 10Stevemunene: [C:03+2] provision datahub service records [dns] - 10https://gerrit.wikimedia.org/r/1035734 (https://phabricator.wikimedia.org/T363299) (owner: 10Stevemunene)
[14:04:42] <wikibugs>	 (03PS3) 10Stevemunene: provision datahub service records [dns] - 10https://gerrit.wikimedia.org/r/1035734 (https://phabricator.wikimedia.org/T363299)
[14:04:57] <wikibugs>	 10ops-eqiad, 06SRE, 06cloud-services-team, 10Cloud-VPS, and 2 others: Degraded RAID on cloudcephosd1031 - https://phabricator.wikimedia.org/T364060#9842272 (10Jclark-ctr) @dcaro  the drive was listed as ready in idrac    Converted to non-raid should be visible now
[14:05:13] <wikibugs>	 (03CR) 10Stevemunene: [V:03+2 C:03+2] provision datahub service records [dns] - 10https://gerrit.wikimedia.org/r/1035734 (https://phabricator.wikimedia.org/T363299) (owner: 10Stevemunene)
[14:07:36] <icinga-wm_>	 RECOVERY - Disk space on backup1007 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=backup1007&var-datasource=eqiad+prometheus/ops
[14:08:18] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1169.eqiad.wmnet with OS bookworm
[14:09:13] <wikibugs>	 (03CR) 10Gehel: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1032772 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol)
[14:09:15] <wikibugs>	 06SRE, 06serviceops, 13Patch-For-Review: k8s master capacity issues - https://phabricator.wikimedia.org/T366094#9842288 (10akosiaris) >>! In T366094#9840665, @CDanis wrote:  Thanks for writing down all of this.  >  ===== This was a capacity crunch triggered by expensive operations > * For the past few months...
[14:09:18] <wikibugs>	 (03CR) 10Bking: [C:03+2] dse-k8s: add airflow-analytics-test namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035015 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking)
[14:09:56] <wikibugs>	 (03CR) 10Brouberol: [V:03+1 C:03+2] global_config: register IP/port for the datahubsearch opensearch cluster [puppet] - 10https://gerrit.wikimedia.org/r/1032772 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol)
[14:10:07] <wikibugs>	 (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2024-05-13-145903 to 2024-05-23-164021 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037084 (https://phabricator.wikimedia.org/T337589)
[14:10:17] <wikibugs>	 (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2024-05-13-145650 to 2024-05-28-185827 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037085 (https://phabricator.wikimedia.org/T348370)
[14:11:09] <wikibugs>	 (03CR) 10Stevemunene: [C:03+2] trafficserver: add datahub redirects to ATS [puppet] - 10https://gerrit.wikimedia.org/r/1035731 (https://phabricator.wikimedia.org/T365668) (owner: 10Stevemunene)
[14:11:11] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] maps: Switch kartotherian on maps2007 to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1036236 (https://phabricator.wikimedia.org/T360778) (owner: 10Muehlenhoff)
[14:11:15] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63542 and previous config saved to /var/cache/conftool/dbconfig/20240529-141114-arnaudb.json
[14:11:51] <wikibugs>	 (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade orchestrator from 2024-05-13-145903 to 2024-05-23-164021 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037084 (https://phabricator.wikimedia.org/T337589) (owner: 10Jforrester)
[14:12:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137', diff saved to https://phabricator.wikimedia.org/P63543 and previous config saved to /var/cache/conftool/dbconfig/20240529-141213-marostegui.json
[14:12:45] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1196 (re)pooling @ 75%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63544 and previous config saved to /var/cache/conftool/dbconfig/20240529-141244-arnaudb.json
[14:12:47] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2024-05-13-145903 to 2024-05-23-164021 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037084 (https://phabricator.wikimedia.org/T337589) (owner: 10Jforrester)
[14:13:37] <wikibugs>	 06SRE, 06serviceops, 13Patch-For-Review: k8s master capacity issues - https://phabricator.wikimedia.org/T366094#9842327 (10akosiaris) >>! In T366094#9841984, @CDanis wrote: >>>! In T366094#9841558, @Stashbot wrote: >> {nav icon=file, name=Mentioned in SAL (#wikimedia-sre), href=https://sal.toolforge.org/log/...
[14:13:54] <wikibugs>	 (03PS10) 10Brennen Bearnes: gitlab-settings: add timer for configure-projects [puppet] - 10https://gerrit.wikimedia.org/r/1035820 (https://phabricator.wikimedia.org/T355097)
[14:14:22] <wikibugs>	 (03CR) 10Brennen Bearnes: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1035820 (https://phabricator.wikimedia.org/T355097) (owner: 10Brennen Bearnes)
[14:14:30] <logmsgbot>	 !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[14:15:03] <brouberol>	 I'm going to deploy admin_ng to deploy a small external-services addition
[14:15:08] <logmsgbot>	 !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[14:15:14] <wikibugs>	 07Puppet, 10Wikidata, 06Wikidata Dev Team, 10wmde-wikidata-tech, and 2 others: Remove the WDCM clone (stats1007) - https://phabricator.wikimedia.org/T351072#9842333 (10Lucas_Werkmeister_WMDE) I think we can resolve both.
[14:15:42] <logmsgbot>	 !log jforrester@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[14:16:33] <logmsgbot>	 !log brouberol@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[14:16:52] <logmsgbot>	 !log jforrester@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[14:16:56] <logmsgbot>	 !log jforrester@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[14:16:58] <logmsgbot>	 !log brouberol@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[14:17:21] <logmsgbot>	 !log brouberol@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[14:18:02] <wikibugs>	 (03PS1) 10Fabfur: Revert "benthos:cache: switch to rfc5424 format" [puppet] - 10https://gerrit.wikimedia.org/r/1037015
[14:18:03] <logmsgbot>	 !log brouberol@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[14:18:15] <logmsgbot>	 !log jforrester@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[14:19:06] <wikibugs>	 (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade evaluators from 2024-05-13-145650 to 2024-05-28-185827 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037085 (https://phabricator.wikimedia.org/T348370) (owner: 10Jforrester)
[14:19:22] <logmsgbot>	 !log brouberol@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[14:19:49] <logmsgbot>	 !log brouberol@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[14:20:36] <logmsgbot>	 !log brouberol@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[14:20:39] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+1] otelcol: disable logs & metrics pipelines [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037083 (https://phabricator.wikimedia.org/T366094) (owner: 10CDanis)
[14:21:12] <logmsgbot>	 !log brouberol@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[14:21:20] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2024-05-13-145650 to 2024-05-28-185827 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037085 (https://phabricator.wikimedia.org/T348370) (owner: 10Jforrester)
[14:22:04] <logmsgbot>	 !log brouberol@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[14:22:13] <wikibugs>	 (03CR) 10CDanis: [C:03+2] otelcol: add three new k8s ctrl IPs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036708 (https://phabricator.wikimedia.org/T366094) (owner: 10CDanis)
[14:22:17] <wikibugs>	 (03CR) 10CDanis: [C:03+2] otelcol: disable logs & metrics pipelines [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037083 (https://phabricator.wikimedia.org/T366094) (owner: 10CDanis)
[14:22:25] <logmsgbot>	 !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[14:22:46] <logmsgbot>	 !log brouberol@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[14:22:53] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] Revert "benthos:cache: switch to rfc5424 format" [puppet] - 10https://gerrit.wikimedia.org/r/1037015 (owner: 10Fabfur)
[14:23:58] <brouberol>	 klausman elukey: Hi! There's an istio-related pending admin-ng change on ml-serve-{eqiad,codfw}. Is that safe to deploy?
[14:24:03] <logmsgbot>	 !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[14:24:28] <logmsgbot>	 !log jforrester@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[14:24:32] <logmsgbot>	 !log brouberol@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
[14:24:55] <logmsgbot>	 !log brouberol@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[14:25:15] <wikibugs>	 (03Merged) 10jenkins-bot: otelcol: add three new k8s ctrl IPs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036708 (https://phabricator.wikimedia.org/T366094) (owner: 10CDanis)
[14:25:17] <wikibugs>	 (03Merged) 10jenkins-bot: otelcol: disable logs & metrics pipelines [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037083 (https://phabricator.wikimedia.org/T366094) (owner: 10CDanis)
[14:25:55] <logmsgbot>	 !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[14:26:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T366123)', diff saved to https://phabricator.wikimedia.org/P63545 and previous config saved to /var/cache/conftool/dbconfig/20240529-142619-marostegui.json
[14:26:28] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 25%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63546 and previous config saved to /var/cache/conftool/dbconfig/20240529-142627-arnaudb.json
[14:26:30] <stashbot>	 T366123: Drop gu_salt from globaluser in WMF prod - https://phabricator.wikimedia.org/T366123
[14:26:37] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.reimage for host mc1046.eqiad.wmnet with OS bookworm
[14:26:40] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[14:26:41] <logmsgbot>	 !log jiji@cumin2002 START - Cookbook sre.hosts.reimage for host mc2046.codfw.wmnet with OS bookworm
[14:26:49] <logmsgbot>	 !log jforrester@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[14:26:55] <logmsgbot>	 !log jforrester@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[14:27:14] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[14:27:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137', diff saved to https://phabricator.wikimedia.org/P63547 and previous config saved to /var/cache/conftool/dbconfig/20240529-142721-marostegui.json
[14:27:30] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main1010.eqiad.wmnet with OS bullseye
[14:27:38] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9842432 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host kafka-main1010.eqiad.wmnet with OS bullseye
[14:27:51] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1196 (re)pooling @ 100%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63548 and previous config saved to /var/cache/conftool/dbconfig/20240529-142750-arnaudb.json
[14:28:30] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool db1163 T364290', diff saved to https://phabricator.wikimedia.org/P63549 and previous config saved to /var/cache/conftool/dbconfig/20240529-142830-arnaudb.json
[14:28:36] <stashbot>	 T364290: Upgrade s1 to MariaDB 10.6 - https://phabricator.wikimedia.org/T364290
[14:28:43] <logmsgbot>	 !log jforrester@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[14:28:48] <klausman>	 brouberol: in a meeting, will get back to you in a bit
[14:28:49] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on db1163.eqiad.wmnet with reason: reimage
[14:29:02] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1163.eqiad.wmnet with reason: reimage
[14:30:04] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db1163.eqiad.wmnet with OS bookworm
[14:33:00] <wikibugs>	 (03PS1) 10CDobbins: purged: roll out use_pki flag to all of drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1037089 (https://phabricator.wikimedia.org/T360506)
[14:33:04] <logmsgbot>	 !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[14:33:26] <icinga-wm_>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 39 probes of 790 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[14:33:49] <wikibugs>	 (03PS11) 10Brennen Bearnes: gitlab-settings: add timer for configure-projects [puppet] - 10https://gerrit.wikimedia.org/r/1035820 (https://phabricator.wikimedia.org/T355097)
[14:35:40] <wikibugs>	 (03CR) 10Brennen Bearnes: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1035820 (https://phabricator.wikimedia.org/T355097) (owner: 10Brennen Bearnes)
[14:36:32] <wikibugs>	 (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2681/console" [puppet] - 10https://gerrit.wikimedia.org/r/1037089 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins)
[14:37:32] <fabfur>	 !log enabled puppet on A:cp as https://gerrit.wikimedia.org/r/c/operations/puppet/+/1036711 has been reverted (not applied anywhere but cp4037) (T365718)
[14:37:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:37:37] <stashbot>	 T365718: Switch HAProxy/Benthos to rfc5424 - https://phabricator.wikimedia.org/T365718
[14:38:26] <icinga-wm_>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 30 probes of 790 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[14:38:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:39:54] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1046.eqiad.wmnet with reason: host reimage
[14:40:38] <wikibugs>	 (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2682/co" [puppet] - 10https://gerrit.wikimedia.org/r/1037089 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins)
[14:41:08] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.dns.netbox
[14:41:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P63550 and previous config saved to /var/cache/conftool/dbconfig/20240529-144129-marostegui.json
[14:41:41] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 50%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63551 and previous config saved to /var/cache/conftool/dbconfig/20240529-144140-arnaudb.json
[14:42:21] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1046.eqiad.wmnet with reason: host reimage
[14:42:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137 (T364069)', diff saved to https://phabricator.wikimedia.org/P63552 and previous config saved to /var/cache/conftool/dbconfig/20240529-144229-marostegui.json
[14:42:32] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance
[14:42:36] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[14:42:45] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance
[14:43:20] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Discovery IPs for apus service - mvernon@cumin2002"
[14:43:38] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1163.eqiad.wmnet with reason: host reimage
[14:43:45] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host kafka-main1010.mgmt.eqiad.wmnet with reboot policy FORCED
[14:43:52] <wikibugs>	 (03CR) 10CDobbins: [V:03+1] "I believe this is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/1037089 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins)
[14:44:15] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Discovery IPs for apus service - mvernon@cumin2002"
[14:44:15] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:44:52] <logmsgbot>	 !log jiji@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2046.codfw.wmnet with reason: host reimage
[14:45:25] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-main1010.mgmt.eqiad.wmnet with reboot policy FORCED
[14:45:38] <wikibugs>	 (03CR) 10Pppery: [C:03+1] "Probably would have made more sense to do this on the translatewiki.net side rather than via Gerrit, and I would hold off merging this for" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1037026 (https://phabricator.wikimedia.org/T365853) (owner: 10Aklapper)
[14:45:45] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host kafka-main1010.mgmt.eqiad.wmnet with reboot policy FORCED
[14:46:29] <wikibugs>	 (03CR) 10Ladsgroup: redact_sanitarium.sh: Update sanitarium hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1037072 (owner: 10Marostegui)
[14:47:00] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-main1010.mgmt.eqiad.wmnet with reboot policy FORCED
[14:47:01] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1163.eqiad.wmnet with reason: host reimage
[14:47:21] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host kafka-main1010.mgmt.eqiad.wmnet with reboot policy FORCED
[14:47:33] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/1037089 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins)
[14:48:26] <wikibugs>	 (03CR) 10Marostegui: redact_sanitarium.sh: Update sanitarium hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1037072 (owner: 10Marostegui)
[14:48:30] <wikibugs>	 (03PS2) 10Marostegui: redact_sanitarium.sh: Update sanitarium hosts [puppet] - 10https://gerrit.wikimedia.org/r/1037072
[14:48:49] <wikibugs>	 (03CR) 10Marostegui: redact_sanitarium.sh: Update sanitarium hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1037072 (owner: 10Marostegui)
[14:48:59] <wikibugs>	 (03CR) 10Aklapper: "I'm just very clueless about the process so if there's something on the twn side instead I'm cool with that too." [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1037026 (https://phabricator.wikimedia.org/T365853) (owner: 10Aklapper)
[14:49:11] <klausman>	 brouberol: yes, that change can be pushed (or I can do it, if you prefer)
[14:49:29] <brouberol>	 if you could, that'd be great! thanks
[14:49:34] <klausman>	 on it
[14:49:44] <logmsgbot>	 !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2046.codfw.wmnet with reason: host reimage
[14:49:48] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[14:49:59] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1037072 (owner: 10Marostegui)
[14:50:10] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+1] redact_sanitarium.sh: Update sanitarium hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1037072 (owner: 10Marostegui)
[14:50:32] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] redact_sanitarium.sh: Update sanitarium hosts [puppet] - 10https://gerrit.wikimedia.org/r/1037072 (owner: 10Marostegui)
[14:50:46] <wikibugs>	 (03PS1) 10MVernon: Add apus svc records in codfw and eqiad [dns] - 10https://gerrit.wikimedia.org/r/1037095 (https://phabricator.wikimedia.org/T279621)
[14:50:49] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-main1010.mgmt.eqiad.wmnet with reboot policy FORCED
[14:50:52] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[14:52:05] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[14:52:46] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[14:53:02] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1041']
[14:53:41] <wikibugs>	 (03PS1) 10Marostegui: pc1014: Remove puppet7 entries [puppet] - 10https://gerrit.wikimedia.org/r/1037096
[14:53:52] <klausman>	 brouberol: all done
[14:53:59] <brouberol>	 appreciated!
[14:54:36] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host kafka-main1010.mgmt.eqiad.wmnet with reboot policy FORCED
[14:54:37] <logmsgbot>	 !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: sync
[14:55:26] <icinga-wm_>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 36 probes of 790 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[14:55:43] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:56:13] <logmsgbot>	 !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: sync
[14:56:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P63553 and previous config saved to /var/cache/conftool/dbconfig/20240529-145637-marostegui.json
[14:56:47] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63554 and previous config saved to /var/cache/conftool/dbconfig/20240529-145646-arnaudb.json
[14:58:06] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1046.eqiad.wmnet with OS bookworm
[15:00:28] <icinga-wm_>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 32 probes of 790 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[15:04:51] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1041']
[15:05:01] <logmsgbot>	 !log robh@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['cloudvirt1041']
[15:05:24] <logmsgbot>	 !log andrew@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1041']
[15:05:28] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1041']
[15:05:50] <wikibugs>	 (03PS12) 10Brennen Bearnes: gitlab-settings: add timer for configure-projects [puppet] - 10https://gerrit.wikimedia.org/r/1035820 (https://phabricator.wikimedia.org/T355097)
[15:06:00] <logmsgbot>	 !log robh@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['cloudvirt1041']
[15:06:29] <wikibugs>	 (03PS2) 10MVernon: Add apus svc records in codfw and eqiad [dns] - 10https://gerrit.wikimedia.org/r/1037095 (https://phabricator.wikimedia.org/T279621)
[15:06:46] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host kafka-main1009.mgmt.eqiad.wmnet with reboot policy FORCED
[15:06:57] <wikibugs>	 06SRE, 10CAS-SSO, 06Infrastructure-Foundations, 10Release-Engineering-Team (Priority Backlog 📥): Correct IDP login page Privacy Policy - https://phabricator.wikimedia.org/T350129#9842517 (10Pppery) Is there an estimated timeframe for when that will be?
[15:07:01] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-main1009.mgmt.eqiad.wmnet with reboot policy FORCED
[15:07:22] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1037069 (owner: 10Marostegui)
[15:07:22] <logmsgbot>	 !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2046.codfw.wmnet with OS bookworm
[15:07:22] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host kafka-main1009.mgmt.eqiad.wmnet with reboot policy FORCED
[15:07:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T364299)', diff saved to https://phabricator.wikimedia.org/P63555 and previous config saved to /var/cache/conftool/dbconfig/20240529-150757-marostegui.json
[15:08:03] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[15:08:05] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-main1009.mgmt.eqiad.wmnet with reboot policy FORCED
[15:08:57] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1163.eqiad.wmnet with OS bookworm
[15:09:08] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main1009.eqiad.wmnet with OS bullseye
[15:09:13] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9842538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host kafka-main1009.eqiad.wmnet with OS bullseye
[15:09:44] <wikibugs>	 06SRE, 10CAS-SSO, 06Infrastructure-Foundations, 10Release-Engineering-Team (Priority Backlog 📥): Correct IDP login page Privacy Policy - https://phabricator.wikimedia.org/T350129#9842536 (10MoritzMuehlenhoff) 05Open→03Resolved It's already live, we updated CAS two hours ago. If you log into idp.wik...
[15:11:02] <wikibugs>	 (03PS5) 10Elukey: redfish: expand support for Supermicro hosts [software/spicerack] - 10https://gerrit.wikimedia.org/r/1036704 (https://phabricator.wikimedia.org/T365372)
[15:11:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T366123)', diff saved to https://phabricator.wikimedia.org/P63556 and previous config saved to /var/cache/conftool/dbconfig/20240529-151145-marostegui.json
[15:11:48] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[15:11:51] <stashbot>	 T366123: Drop gu_salt from globaluser in WMF prod - https://phabricator.wikimedia.org/T366123
[15:11:53] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63557 and previous config saved to /var/cache/conftool/dbconfig/20240529-151152-arnaudb.json
[15:12:06] <wikibugs>	 (03CR) 10Elukey: redfish: expand support for Supermicro hosts (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1036704 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey)
[15:12:12] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[15:12:16] <wikibugs>	 (03PS2) 10JHathaway: rsyslog: include slashes in program names [puppet] - 10https://gerrit.wikimedia.org/r/1036763
[15:12:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1181 (T366123)', diff saved to https://phabricator.wikimedia.org/P63558 and previous config saved to /var/cache/conftool/dbconfig/20240529-151219-marostegui.json
[15:12:55] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] es*.yaml: Clean up puppet7 lines [puppet] - 10https://gerrit.wikimedia.org/r/1037069 (owner: 10Marostegui)
[15:13:24] <wikibugs>	 (03CR) 10Pppery: [C:03+1] "The change to make would be to edit https://translatewiki.net/wiki/Phabricator:arcanist-core-3a7b8e3fb7aa607f/qqq, and ditto for the other" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1037026 (https://phabricator.wikimedia.org/T365853) (owner: 10Aklapper)
[15:13:28] <icinga-wm_>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 48 probes of 790 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[15:14:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T366123)', diff saved to https://phabricator.wikimedia.org/P63559 and previous config saved to /var/cache/conftool/dbconfig/20240529-151430-marostegui.json
[15:14:33] <dancy>	 jouncebot now
[15:14:33] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 45 minute(s)
[15:14:45] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1036763 (owner: 10JHathaway)
[15:14:56] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1163 (re)pooling @ 10%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63560 and previous config saved to /var/cache/conftool/dbconfig/20240529-151455-arnaudb.json
[15:15:01] <wikibugs>	 (03CR) 10Pppery: [C:03+1] "(Translation changes made via Gerrit do work - they cause FuzzyBot to update the page on translatewiki. But it would be cleaner IMO to do " [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1037026 (https://phabricator.wikimedia.org/T365853) (owner: 10Aklapper)
[15:15:35] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] Add apus svc records in codfw and eqiad [dns] - 10https://gerrit.wikimedia.org/r/1037095 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon)
[15:16:01] <dancy>	 jan_drewniak: Are you around to test https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/1036665 if I deploy it?
[15:16:02] <wikibugs>	 06SRE, 10CAS-SSO, 06Infrastructure-Foundations, 10Release-Engineering-Team (Priority Backlog 📥): Correct IDP login page Privacy Policy - https://phabricator.wikimedia.org/T350129#9842612 (10Pppery) Thanks.
[15:16:50] <wikibugs>	 (03CR) 10Elukey: [C:03+1] Remove ms-fe certs [puppet] - 10https://gerrit.wikimedia.org/r/1037074 (https://phabricator.wikimedia.org/T357750) (owner: 10Muehlenhoff)
[15:16:53] <jan_drewniak>	 dancy: hi! I'm around, but it turns out there are more issues with the approach, we're just debating what to do now.
[15:17:07] <dancy>	 ok. I'll wait for word from you.
[15:17:38] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1041']
[15:17:42] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036750 (https://phabricator.wikimedia.org/T359643) (owner: 10Ahmon Dancy)
[15:17:59] <logmsgbot>	 !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1041']
[15:18:02] <logmsgbot>	 !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: sync
[15:18:02] <logmsgbot>	 !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/services/mw-wikifunctions: sync
[15:18:03] <logmsgbot>	 !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: sync
[15:18:03] <logmsgbot>	 !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/services/mw-parsoid: sync
[15:18:03] <logmsgbot>	 !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: sync
[15:18:03] <logmsgbot>	 !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: sync
[15:18:11] <wikibugs>	 (03CR) 10CI reject: [V:04-1] redfish: expand support for Supermicro hosts [software/spicerack] - 10https://gerrit.wikimedia.org/r/1036704 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey)
[15:18:19] <logmsgbot>	 !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: sync
[15:18:21] <logmsgbot>	 !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-wikifunctions: sync
[15:18:48] <wikibugs>	 (03Merged) 10jenkins-bot: Remove the php symlink (v2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036750 (https://phabricator.wikimedia.org/T359643) (owner: 10Ahmon Dancy)
[15:19:18] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] rsyslog: include slashes in program names [puppet] - 10https://gerrit.wikimedia.org/r/1036763 (owner: 10JHathaway)
[15:19:20] <logmsgbot>	 !log dancy@deploy1002 Started scap: Backport for [[gerrit:1036750|Remove the php symlink (v2) (T359643)]]
[15:19:29] <stashbot>	 T359643: Get rid of the /srv/mediawiki/php symbolic link - https://phabricator.wikimedia.org/T359643
[15:19:32] <James_F>	 dancy: Lovely work removing the symlink!
[15:19:40] <logmsgbot>	 !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: sync
[15:19:42] <logmsgbot>	 !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: sync
[15:19:43] <James_F>	 It was the bane of my deploy-life.
[15:19:51] <logmsgbot>	 !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: sync
[15:20:00] <dancy>	 James_F: Thanks!  It was always confusing/annoying to me.
[15:20:13] <James_F>	 Back in the day all the mwscript calls would run through it.
[15:20:18] <logmsgbot>	 !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: sync
[15:20:48] <James_F>	 So every time you synced (this was pre-k8s) you could, but also might not, start running the "wrong" version of the code in some places, and break stuff. Or not! Fun times.
[15:21:38] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1041']
[15:22:06] <logmsgbot>	 !log dancy@deploy1002 dancy: Backport for [[gerrit:1036750|Remove the php symlink (v2) (T359643)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[15:22:10] <logmsgbot>	 !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1041']
[15:23:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P63561 and previous config saved to /var/cache/conftool/dbconfig/20240529-152305-marostegui.json
[15:23:10] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.dns.netbox
[15:23:20] <logmsgbot>	 !log dancy@deploy1002 dancy: Continuing with sync
[15:23:31] <wikibugs>	 (03CR) 10Brennen Bearnes: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1035820 (https://phabricator.wikimedia.org/T355097) (owner: 10Brennen Bearnes)
[15:23:46] <wikibugs>	 06SRE, 06Traffic: Anycast ns1.wikimedia.org - https://phabricator.wikimedia.org/T366193 (10ssingh) 03NEW
[15:23:52] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1041']
[15:23:56] <cdanis>	 James_F: sounds like exactly what you want for things like db schema migrations
[15:24:32] <James_F>	 cdanis: Or purging cache of corrupted contents, or rotating the logs when they're about to reach the privacy cut-off, or…
[15:24:38] <cdanis>	 mhm
[15:24:58] <James_F>	 All these deploy-fails pass, like tears in the rain.
[15:25:00] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host kafka-main1010.mgmt.eqiad.wmnet with reboot policy FORCED
[15:25:02] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-main1010.mgmt.eqiad.wmnet with reboot policy FORCED
[15:25:42] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: correct IPs for apus - mvernon@cumin2002"
[15:26:32] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: correct IPs for apus - mvernon@cumin2002"
[15:26:32] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:27:50] <wikibugs>	 (03PS1) 10Jdlrobson: Revert "Wrap tables with JS" [skins/Vector] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1037018
[15:27:50] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host kafka-main1010.mgmt.eqiad.wmnet with reboot policy FORCED
[15:27:54] <wikibugs>	 (03CR) 10MVernon: [C:03+2] Add apus svc records in codfw and eqiad [dns] - 10https://gerrit.wikimedia.org/r/1037095 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon)
[15:28:22] <wikibugs>	 (03PS2) 10Jdlrobson: Revert "Wrap tables with JS" [skins/Vector] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1037018 (https://phabricator.wikimedia.org/T330527)
[15:29:04] <logmsgbot>	 !log robh@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['cloudvirt1041']
[15:29:36] <wikibugs>	 06SRE, 06serviceops, 10API Platform (RESTBase Deprecation Roadmap): Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995#9842704 (10Jdforrester-WMF)
[15:29:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P63562 and previous config saved to /var/cache/conftool/dbconfig/20240529-152937-marostegui.json
[15:30:04] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1163 (re)pooling @ 25%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63563 and previous config saved to /var/cache/conftool/dbconfig/20240529-153001-arnaudb.json
[15:30:15] <wikibugs>	 (03CR) 10Elukey: "Need to fix CI's -1 sigh" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1036704 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey)
[15:30:17] <wikibugs>	 (03PS1) 10JHathaway: rsyslog: kafka_shipper, use global_entry function [puppet] - 10https://gerrit.wikimedia.org/r/1037098
[15:30:40] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1037098 (owner: 10JHathaway)
[15:30:41] <wikibugs>	 (03CR) 10Jdlrobson: [C:04-1] "We're just discussing an alternative less risky approach here:  https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/1037018 after an" [skins/Vector] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1036665 (https://phabricator.wikimedia.org/T330527) (owner: 10Jdlrobson)
[15:31:43] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1041']
[15:31:57] <logmsgbot>	 !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1041']
[15:32:15] <wikibugs>	 (03PS6) 10Elukey: redfish: expand support for Supermicro hosts [software/spicerack] - 10https://gerrit.wikimedia.org/r/1036704 (https://phabricator.wikimedia.org/T365372)
[15:32:23] <logmsgbot>	 !log dancy@deploy1002 Finished scap: Backport for [[gerrit:1036750|Remove the php symlink (v2) (T359643)]] (duration: 13m 03s)
[15:32:28] <stashbot>	 T359643: Get rid of the /srv/mediawiki/php symbolic link - https://phabricator.wikimedia.org/T359643
[15:32:39] <wikibugs>	 06SRE, 06Traffic: Anycast ns1.wikimedia.org - https://phabricator.wikimedia.org/T366193#9842720 (10ssingh) p:05Triage→03Medium
[15:34:36] <wikibugs>	 (03PS7) 10Elukey: redfish: expand support for Supermicro hosts [software/spicerack] - 10https://gerrit.wikimedia.org/r/1036704 (https://phabricator.wikimedia.org/T365372)
[15:34:51] <wikibugs>	 (03PS4) 10Dreamy Jazz: [CheckUser] Stop writing old for event table migration on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013386 (https://phabricator.wikimedia.org/T360686)
[15:38:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P63564 and previous config saved to /var/cache/conftool/dbconfig/20240529-153813-marostegui.json
[15:38:26] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1041']
[15:39:17] <logmsgbot>	 !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1041']
[15:40:57] <wikibugs>	 (03CR) 10CI reject: [V:04-1] redfish: expand support for Supermicro hosts [software/spicerack] - 10https://gerrit.wikimedia.org/r/1036704 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey)
[15:44:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P63565 and previous config saved to /var/cache/conftool/dbconfig/20240529-154446-marostegui.json
[15:45:01] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1041']
[15:45:11] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1163 (re)pooling @ 50%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63566 and previous config saved to /var/cache/conftool/dbconfig/20240529-154510-arnaudb.json
[15:45:29] <wikibugs>	 (03CR) 10Effie Mouzeli: "@Dduvall, thank you very much! It is sad to see blubberoid go. But it had a good run." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036716 (https://phabricator.wikimedia.org/T365742) (owner: 10Dduvall)
[15:45:50] <wikibugs>	 (03CR) 10Ahmon Dancy: [C:03+1] docker_registry_ha: replace deprecated /-/jwks endpoint on gitlab [puppet] - 10https://gerrit.wikimedia.org/r/1037043 (https://phabricator.wikimedia.org/T365675) (owner: 10Jelto)
[15:45:56] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "lgtm, let me know when this should be merged" [puppet] - 10https://gerrit.wikimedia.org/r/1037065 (https://phabricator.wikimedia.org/T359821) (owner: 10Hashar)
[15:46:07] <wikibugs>	 (03PS1) 10Jcrespo: dbbackups: Migrate s1 backups to MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1037107 (https://phabricator.wikimedia.org/T364290)
[15:46:16] <wikibugs>	 (03CR) 10Jforrester: "🫡 Farewell, Blubberoid." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036716 (https://phabricator.wikimedia.org/T365742) (owner: 10Dduvall)
[15:47:21] <wikibugs>	 (03CR) 10Volans: "The last PSes seem to have diverged a bit from the agreed path" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1005531 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb)
[15:48:26] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-main1010.mgmt.eqiad.wmnet with reboot policy FORCED
[15:48:30] <icinga-wm_>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 34 probes of 790 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[15:48:32] <logmsgbot>	 !log jynus@cumin1002 START - Cookbook sre.hosts.downtime for 5:00:00 on db2141.codfw.wmnet with reason: upgrade to 10.6
[15:48:46] <logmsgbot>	 !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2141.codfw.wmnet with reason: upgrade to 10.6
[15:48:46] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host kafka-main1010.mgmt.eqiad.wmnet with reboot policy FORCED
[15:49:00] <logmsgbot>	 !log jynus@cumin1002 START - Cookbook sre.hosts.downtime for 5:00:00 on dbprov1003.eqiad.wmnet with reason: upgrade to 10.6
[15:49:02] <logmsgbot>	 !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on dbprov1003.eqiad.wmnet with reason: upgrade to 10.6
[15:49:12] <logmsgbot>	 !log jynus@cumin1002 START - Cookbook sre.hosts.downtime for 5:00:00 on dbprov2003.codfw.wmnet with reason: upgrade to 10.6
[15:49:25] <logmsgbot>	 !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on dbprov2003.codfw.wmnet with reason: upgrade to 10.6
[15:52:10] <logmsgbot>	 !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cloudvirt1041']
[15:53:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T364299)', diff saved to https://phabricator.wikimedia.org/P63567 and previous config saved to /var/cache/conftool/dbconfig/20240529-155321-marostegui.json
[15:53:25] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2159.codfw.wmnet with reason: Maintenance
[15:53:27] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[15:53:38] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2159.codfw.wmnet with reason: Maintenance
[15:53:40] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2187.codfw.wmnet with reason: Maintenance
[15:53:42] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2187.codfw.wmnet with reason: Maintenance
[15:53:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2159 (T364299)', diff saved to https://phabricator.wikimedia.org/P63568 and previous config saved to /var/cache/conftool/dbconfig/20240529-155349-marostegui.json
[15:55:06] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main1009.eqiad.wmnet with OS bullseye
[15:55:18] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9842870 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host kafka-main1009.eqiad.wmnet with OS bullseye executed...
[15:55:40] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1041.eqiad.wmnet with OS bookworm
[15:55:51] <wikibugs>	 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: cloudvirt1041: can't boot after reimage - https://phabricator.wikimedia.org/T364984#9842872 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudvirt1041.eqiad.wmnet with OS bookworm
[15:55:54] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9842874 (10akosiaris) The fail for kafka-main1009 is expected with the current recipe btw. Let me have a quick look.
[15:55:56] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-main1010.mgmt.eqiad.wmnet with reboot policy FORCED
[15:56:31] <wikibugs>	 (03PS1) 10Ahmon Dancy: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1037108
[15:56:44] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host kafka-main1010.mgmt.eqiad.wmnet with reboot policy FORCED
[15:56:56] <wikibugs>	 (03CR) 10Ahmon Dancy: [C:03+2] Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1037108 (owner: 10Ahmon Dancy)
[15:56:59] <wikibugs>	 (03CR) 10Tchanders: [C:03+1] [CheckUser] Stop writing old for event table migration on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013386 (https://phabricator.wikimedia.org/T360686) (owner: 10Dreamy Jazz)
[15:57:26] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9842877 (10akosiaris) >>! In T363212#9842874, @akosiaris wrote: > The fail for kafka-main1009 is expected with the current recipe btw. Let me have a quick loo...
[15:57:38] <wikibugs>	 (03Merged) 10jenkins-bot: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1037108 (owner: 10Ahmon Dancy)
[15:59:01] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-main1010.mgmt.eqiad.wmnet with reboot policy FORCED
[15:59:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T366123)', diff saved to https://phabricator.wikimedia.org/P63569 and previous config saved to /var/cache/conftool/dbconfig/20240529-155954-marostegui.json
[15:59:58] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1191.eqiad.wmnet with reason: Maintenance
[16:00:00] <stashbot>	 T366123: Drop gu_salt from globaluser in WMF prod - https://phabricator.wikimedia.org/T366123
[16:00:11] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1191.eqiad.wmnet with reason: Maintenance
[16:00:17] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1163 (re)pooling @ 75%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63570 and previous config saved to /var/cache/conftool/dbconfig/20240529-160016-arnaudb.json
[16:00:43] <jinxer-wm>	 FIRING: [2x] KubernetesCalicoDown: kubernetes2032.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[16:01:09] <logmsgbot>	 !log jiji@cumin2002 START - Cookbook sre.hosts.reimage for host mc2045.codfw.wmnet with OS bookworm
[16:01:13] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.reimage for host mc1045.eqiad.wmnet with OS bookworm
[16:04:18] <ChrisDobbins901_>	 !log sudo cumin 'A:cp and A:drmrs' 'disable-puppet "merging CR 1037089"'
[16:04:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:04:57] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host kafka-main1010.mgmt.eqiad.wmnet with reboot policy FORCED
[16:05:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T366123)', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20240529-160528-marostegui.json
[16:05:41] <stashbot>	 T366123: Drop gu_salt from globaluser in WMF prod - https://phabricator.wikimedia.org/T366123
[16:06:30] <icinga-wm_>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 39 probes of 790 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[16:09:33] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-main1010.mgmt.eqiad.wmnet with reboot policy FORCED
[16:10:27] <wikibugs>	 (03CR) 10CDobbins: [V:03+1 C:03+2] purged: roll out use_pki flag to all of drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1037089 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins)
[16:10:27] <wikibugs>	 (03CR) 10Lucas Werkmeister: [C:03+1] gerrit: enable change.diff3ConflictView [puppet] - 10https://gerrit.wikimedia.org/r/1037065 (https://phabricator.wikimedia.org/T359821) (owner: 10Hashar)
[16:10:54] <wikibugs>	 (03PS1) 10CDanis: otelcol: limit collected k8s data [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037113 (https://phabricator.wikimedia.org/T366094)
[16:11:30] <icinga-wm_>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 32 probes of 790 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[16:12:33] <wikibugs>	 07Puppet, 10Wikidata, 06Wikidata Dev Team, 10wmde-wikidata-tech, and 2 others: Remove the WDCM clone (stats1007) - https://phabricator.wikimedia.org/T351072#9842955 (10AndrewTavis_WMDE) Perfect, @Lucas_Werkmeister_WMDE! Glad to have this all cleared up :)
[16:13:13] <wikibugs>	 07Puppet, 10Wikidata, 06Wikidata Dev Team, 10wmde-wikidata-tech, and 2 others: Remove the WDCM clone (stats1007) - https://phabricator.wikimedia.org/T351072#9842957 (10AndrewTavis_WMDE) 05Open→03Resolved a:03AndrewTavis_WMDE
[16:13:22] <wikibugs>	 (03PS1) 10JHathaway: rsyslog kafka: add postfix programs [puppet] - 10https://gerrit.wikimedia.org/r/1037114 (https://phabricator.wikimedia.org/T325395)
[16:14:03] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1037114 (https://phabricator.wikimedia.org/T325395) (owner: 10JHathaway)
[16:14:26] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T366134#9842964 (10Papaul) 05Open→03Resolved a:03Papaul complete
[16:14:28] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1045.eqiad.wmnet with reason: host reimage
[16:15:04] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host kafka-main1010.mgmt.eqiad.wmnet with reboot policy FORCED
[16:15:23] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1163 (re)pooling @ 100%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63572 and previous config saved to /var/cache/conftool/dbconfig/20240529-161522-arnaudb.json
[16:15:26] <wikibugs>	 (03PS2) 10CDanis: otelcol: limit collected k8s data [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037113 (https://phabricator.wikimedia.org/T366094)
[16:16:27] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] vrts: add missing comma to vrts_aliases.py [puppet] - 10https://gerrit.wikimedia.org/r/1036760 (https://phabricator.wikimedia.org/T284145) (owner: 10Dzahn)
[16:17:13] <wikibugs>	 (03CR) 10Jcrespo: [C:03+2] "https://puppet-compiler.wmflabs.org/output/1037107/2683/" [puppet] - 10https://gerrit.wikimedia.org/r/1037107 (https://phabricator.wikimedia.org/T364290) (owner: 10Jcrespo)
[16:17:13] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for STran, Madalina, Tchanders and JayCano - https://phabricator.wikimedia.org/T366198 (10JayCano) 03NEW
[16:17:14] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] rsyslog kafka: add postfix programs [puppet] - 10https://gerrit.wikimedia.org/r/1037114 (https://phabricator.wikimedia.org/T325395) (owner: 10JHathaway)
[16:17:36] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1045.eqiad.wmnet with reason: host reimage
[16:18:32] <ChrisDobbins901_>	 !log sudo cumin -b1 -s60 'A:cp and A:drmrs' 'run-puppet-agent --enable "merging CR 1037089"'
[16:18:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:19:13] <logmsgbot>	 !log jiji@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2045.codfw.wmnet with reason: host reimage
[16:19:20] <wikibugs>	 (03PS3) 10CDanis: otelcol: limit collected k8s data [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037113 (https://phabricator.wikimedia.org/T366094)
[16:19:36] <jhathaway>	 jynus: can I merge in your s1 backup patch?
[16:20:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P63573 and previous config saved to /var/cache/conftool/dbconfig/20240529-162040-marostegui.json
[16:21:05] <jynus>	 jhathaway: I was asking you on the other channel
[16:21:08] <jynus>	 please do
[16:21:16] <jhathaway>	 nod!
[16:22:12] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations: Netbox errors caused by system board replacement - https://phabricator.wikimedia.org/T358542#9843039 (10Volans) a:03Volans
[16:22:40] <logmsgbot>	 !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2045.codfw.wmnet with reason: host reimage
[16:23:01] <wikibugs>	 (03Abandoned) 10Jdlrobson: Limit responsive tables to .wikitables [skins/Vector] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1036665 (https://phabricator.wikimedia.org/T330527) (owner: 10Jdlrobson)
[16:23:55] <jan_drewniak>	 dancy: Hi, regarding the train blocker, we've decided to revert the original change; this is the patch that can be deployed now: https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/1037018
[16:24:36] <dancy>	 jan_drewniak:  OK.  I'll start right now.
[16:24:39] <icinga-wm_>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 43 probes of 790 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[16:25:04] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1002 using scap backport" [skins/Vector] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1037018 (https://phabricator.wikimedia.org/T330527) (owner: 10Jdlrobson)
[16:25:37] <icinga-wm_>	 PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs1019 is CRITICAL: CRITICAL: Service pybal.service has not been restarted after /etc/pybal/pybal.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[16:25:42] <sukhe>	 huh
[16:25:43] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:27:00] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-main1010.mgmt.eqiad.wmnet with reboot policy FORCED
[16:28:06] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main1010.eqiad.wmnet with OS bullseye
[16:28:21] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9843081 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host kafka-main1010.eqiad.wmnet with OS bullseye
[16:29:41] <icinga-wm_>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 30 probes of 790 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[16:29:58] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main1009.eqiad.wmnet with OS bullseye
[16:30:11] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9843083 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host kafka-main1009.eqiad.wmnet with OS bullseye
[16:31:03] <wikibugs>	 (03PS1) 10Dzahn: mx: stop ignoring VRTS alias errors, email on error [puppet] - 10https://gerrit.wikimedia.org/r/1037117 (https://phabricator.wikimedia.org/T284145)
[16:32:06] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main1009.eqiad.wmnet with reason: host reimage
[16:32:09] <stashbot>	 jclark@cumin1002: Failed to log message to wiki. Somebody should check the error logs.
[16:32:29] <icinga-wm_>	 RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs1019 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted
[16:32:30] <sukhe>	 !log restart pybal on lvs1019
[16:32:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:33:39] <icinga-wm_>	 PROBLEM - Host parse1002 is DOWN: PING CRITICAL - Packet loss = 100%
[16:34:27] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1045.eqiad.wmnet with OS bookworm
[16:35:21] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main1009.eqiad.wmnet with reason: host reimage
[16:35:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P63574 and previous config saved to /var/cache/conftool/dbconfig/20240529-163549-marostegui.json
[16:36:20] <wikibugs>	 (03PS5) 10JHathaway: spf recs update: phabricator, gitlab, wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1036739 (https://phabricator.wikimedia.org/T366113)
[16:37:49] <wikibugs>	 (03CR) 10Muehlenhoff: mx: stop ignoring VRTS alias errors, email on error (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1037117 (https://phabricator.wikimedia.org/T284145) (owner: 10Dzahn)
[16:38:02] <wikibugs>	 (03PS2) 10JHathaway: rsyslog: kafka_shipper, use global_entry function [puppet] - 10https://gerrit.wikimedia.org/r/1037098
[16:38:08] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1037098 (owner: 10JHathaway)
[16:38:42] <jinxer-wm>	 FIRING: [2x] KubernetesCalicoDown: kubernetes2032.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[16:39:11] <wikibugs>	 (03PS1) 10Jsn.sherman: CommonSettings: correct AutoModerator load order [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037118
[16:39:25] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] "Thanks Dallas, Arzhel, & Eoghan for the reviews" [dns] - 10https://gerrit.wikimedia.org/r/1036739 (https://phabricator.wikimedia.org/T366113) (owner: 10JHathaway)
[16:40:16] <wikibugs>	 (03PS2) 10Dzahn: mx: stop ignoring VRTS alias errors [puppet] - 10https://gerrit.wikimedia.org/r/1037117 (https://phabricator.wikimedia.org/T284145)
[16:40:20] <wikibugs>	 (03CR) 10Dzahn: mx: stop ignoring VRTS alias errors (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1037117 (https://phabricator.wikimedia.org/T284145) (owner: 10Dzahn)
[16:40:39] <logmsgbot>	 !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2045.codfw.wmnet with OS bookworm
[16:42:15] <icinga-wm_>	 RECOVERY - Host parse1002 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms
[16:42:40] <wikibugs>	 (03PS2) 10Jsn.sherman: CommonSettings: correct AutoModerator load order [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037118 (https://phabricator.wikimedia.org/T366203)
[16:43:42] <jinxer-wm>	 FIRING: [2x] KubernetesCalicoDown: kubernetes2032.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[16:47:09] <wikibugs>	 (PS1) JHathaway: Revert "Revert "phabricator: Move outbound email to mx-out{1001,2001}.wikimedia.org"" [puppet] - https://gerrit.wikimedia.org/r/1037129
[16:49:01] <wikibugs>	 (CR) JHathaway: [C:+2] Revert "Revert "phabricator: Move outbound email to mx-out{1001,2001}.wikimedia.org"" [puppet] - https://gerrit.wikimedia.org/r/1037129 (owner: JHathaway)
[16:49:04] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netops: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9843219 (Papaul)
[16:49:29] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for rickijay - https://phabricator.wikimedia.org/T365574#9843222 (Dzahn) >>! In T365574#9829202, @jon_amar-WMDE wrote: > Hi @Dzahn I'm not clear whether I can provide approval (I'm the Product Manager for Wik...
[16:49:55] <wikibugs>	 (Merged) jenkins-bot: Revert "Wrap tables with JS" [skins/Vector] (wmf/1.43.0-wmf.7) - https://gerrit.wikimedia.org/r/1037018 (https://phabricator.wikimedia.org/T330527) (owner: Jdlrobson)
[16:50:25] <logmsgbot>	 !log dancy@deploy1002 Started scap: Backport for [[gerrit:1037018|Revert "Wrap tables with JS" (T330527)]]
[16:50:30] <stashbot>	 T330527: Wider tables overlap sticky page tools (Upstream Minerva's responsive table styles to core SkinModule) - https://phabricator.wikimedia.org/T330527
[16:50:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T366123)', diff saved to https://phabricator.wikimedia.org/P63575 and previous config saved to /var/cache/conftool/dbconfig/20240529-165057-marostegui.json
[16:51:00] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1194.eqiad.wmnet with reason: Maintenance
[16:51:03] <stashbot>	 T366123: Drop gu_salt from globaluser in WMF prod - https://phabricator.wikimedia.org/T366123
[16:51:13] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1194.eqiad.wmnet with reason: Maintenance
[16:51:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1194 (T366123)', diff saved to https://phabricator.wikimedia.org/P63576 and previous config saved to /var/cache/conftool/dbconfig/20240529-165121-marostegui.json
[16:52:11] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[16:52:18] <wikibugs>	 ops-eqiad, SRE, DC-Ops: ManagementSSHDown parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T363086#9843239 (VRiley-WMF) Investigated this unit with the assistance of Dell. After some troubleshooting and pulling logs, they will be sending out a new motherboard as a replacement (tomorrow). Wil...
[16:52:53] <wikibugs>	 (Abandoned) Herron: pyrra add service dns entries [dns] - https://gerrit.wikimedia.org/r/961132 (https://phabricator.wikimedia.org/T302995) (owner: Herron)
[16:53:22] <wikibugs>	 (Abandoned) Herron: services: add pyrra conftool-data and service stub entry [puppet] - https://gerrit.wikimedia.org/r/961129 (https://phabricator.wikimedia.org/T302995) (owner: Herron)
[16:53:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T366123)', diff saved to https://phabricator.wikimedia.org/P63577 and previous config saved to /var/cache/conftool/dbconfig/20240529-165333-marostegui.json
[16:53:40] <wikibugs>	 (Abandoned) Herron: pyrra: use load balancing [puppet] - https://gerrit.wikimedia.org/r/961130 (https://phabricator.wikimedia.org/T302995) (owner: Herron)
[16:53:53] <wikibugs>	 (PS1) JHathaway: rsyslog: fix undef var in global entry [puppet] - https://gerrit.wikimedia.org/r/1037121
[16:54:19] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1037121 (owner: JHathaway)
[16:57:05] <wikibugs>	 (CR) JHathaway: [C:+2] rsyslog: fix undef var in global entry [puppet] - https://gerrit.wikimedia.org/r/1037121 (owner: JHathaway)
[16:58:53] <wikibugs>	 (PS3) JHathaway: rsyslog kafka_shipper: use the new global_entry function [puppet] - https://gerrit.wikimedia.org/r/1037098
[16:59:11] <logmsgbot>	 !log stevemunene@deploy1002 Started deploy [airflow-dags/analytics@229b278]: (no justification provided)
[16:59:19] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1037098 (owner: JHathaway)
[16:59:38] <logmsgbot>	 !log stevemunene@deploy1002 Finished deploy [airflow-dags/analytics@229b278]: (no justification provided) (duration: 00m 26s)
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240529T1700)
[17:01:34] <dancy>	 I'm getting testserver check failures during scap mediawiki deployment:
[17:01:34] <dancy>	 ```
[17:01:34] <dancy>	 17:00:31 Check 'check_testservers_k8s' failed: Sending to mwdebug.discovery.wmnet...
[17:01:34] <dancy>	 https://foundation.wikimedia.org/wiki/Home (/srv/deployment/httpbb-tests/appserver/test_foundation.yaml:2)
[17:01:34] <dancy>	     ERROR: HTTPSConnectionPool(host='mwdebug.discovery.wmnet', port=4444): Max retries exceeded with url: /wiki/Home (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7fb042d89a90>, 'Connection to mwdebug.discovery.wmnet timed out. (connect timeout=10)'))
[17:01:35] <dancy>	 ```
[17:01:41] <dancy>	 rzl: Any ideas?
[17:01:58] <dancy>	 The error persists when retrying
[17:02:16] <rzl>	 curious, taking a look
[17:02:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T364299)', diff saved to https://phabricator.wikimedia.org/P63578 and previous config saved to /var/cache/conftool/dbconfig/20240529-170242-marostegui.json
[17:02:49] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[17:02:56] <hashar>	 port 4444?
[17:03:08] <wikibugs>	 (PS8) Elukey: redfish: expand support for Supermicro hosts [software/spicerack] - https://gerrit.wikimedia.org/r/1036704 (https://phabricator.wikimedia.org/T365372)
[17:03:28] <dancy>	 hashar: nod.. as configured in /etc/scap.cfg: `testservers_check_cmd_k8s: httpbb /srv/deployment/httpbb-tests/appserver/* --hosts=mwdebug.discovery.wmnet --https_port=4444 --retry_on_timeout`
[17:04:42] <mutante>	 there are a ton of open ports on mwdebug1001. 4444 is not one of them.
[17:04:59] <dancy>	 mwdebug1001 is bare metal.  This is the k8s check failing
[17:05:08] <mutante>	 ah, nod
[17:05:09] <rzl>	 interesting, `curl https://foundation.wikimedia.org/wiki/Home --resolve 'foundation.wikimedia.org:443:4444'` works reliably but I'm getting the same timeout from httpbb, still looking
[17:05:17] <hashar>	 so we would have lost the kubernetes debug pod?
[17:05:24] <wikibugs>	 SRE, Cassandra, Data Products, serviceops, and 2 others: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9843292 (Scott_French)
[17:05:25] <rzl>	 er, because I messed up the --resolve :) hang on
[17:05:27] <dancy>	 Hopping into my team meating.
[17:05:33] <hashar>	 meat time!
[17:05:37] <dancy>	 meating!
[17:05:38] <dancy>	 haha
[17:05:42] <dancy>	 that sounds delicious
[17:05:49] <hashar>	 how do you want your steak today?
[17:05:49] <icinga-wm_>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - mwdebug_4444: Servers parse1013.eqiad.wmnet, mw1442.eqiad.wmnet, kubernetes1022.eqiad.wmnet, mw1384.eqiad.wmnet, mw1479.eqiad.wmnet, mw1470.eqiad.wmnet, kubernetes1021.eqiad.wmnet, mw1430.eqiad.wmnet, mw1388.eqiad.wmnet, mw1482.eqiad.wmnet, parse1009.eqiad.wmnet, mw1449.eqiad.wmnet, mw1391.eqiad.wmnet, parse1024.eqiad.wmnet, mw1408.eqiad.wmnet, mw14
[17:05:49] <icinga-wm_>	 wmnet, mw1357.eqiad.wmnet, kubernetes1017.eqiad.wmnet, mw1425.eqiad.wmnet, mw1397.eqiad.wmnet, kubernetes1005.eqiad.wmnet, kubernetes1051.eqiad.wmnet, kubernetes1058.eqiad.wmnet, mw1452.eqiad.wmnet, mw1356.eqiad.wmnet, mw1374.eqiad.wmnet, mw1414.eqiad.wmnet, mw1371.eqiad.wmnet, mw1473.eqiad.wmnet, mw1392.eqiad.wmnet, kubernetes1028.eqiad.wmnet, mw1485.eqiad.wmnet, kubernetes1043.eqiad.wmnet, kubernetes1008.eqiad.wmnet, mw1362.eqiad.wmnet
[17:05:49] <icinga-wm_>	 eqiad.wmnet, mw1463.eqiad.wmnet, mw1421.eqiad.wmnet, mw1441.eqiad.wmnet, parse1006.eqiad.wmnet, parse1004.eqiad.wmnet, parse1016.eqiad.wmnet, kubernetes1052.eqiad.wmnet, parse1022.eqiad https://wikitech.wikimedia.org/wiki/PyBal
[17:05:54] <hashar>	 ah
[17:05:58] <rzl>	 well, that's probably not unrelated
[17:06:00] <hashar>	 more stuff exploding with pybal ..
[17:06:31] <wikibugs>	 (CR) Reedy: [C:+1] CommonSettings: correct AutoModerator load order [mediawiki-config] - https://gerrit.wikimedia.org/r/1037118 (https://phabricator.wikimedia.org/T366203) (owner: Jsn.sherman)
[17:06:49] <icinga-wm_>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[17:08:10] <logmsgbot>	 !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1041.eqiad.wmnet with OS bookworm
[17:08:27] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1041.eqiad.wmnet with OS bookworm
[17:08:36] <wikibugs>	 ops-eqiad, SRE, cloud-services-team, DC-Ops: cloudvirt1041: can't boot after reimage - https://phabricator.wikimedia.org/T364984#9843300 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudvirt1041.eqiad.wmnet with OS bookworm
[17:08:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P63579 and previous config saved to /var/cache/conftool/dbconfig/20240529-170841-marostegui.json
[17:09:49] <icinga-wm_>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - mwdebug_4444: Servers parse1011.eqiad.wmnet, mw1462.eqiad.wmnet, kubernetes1025.eqiad.wmnet, mw1457.eqiad.wmnet, mw1442.eqiad.wmnet, mw1478.eqiad.wmnet, kubernetes1037.eqiad.wmnet, mw1384.eqiad.wmnet, mw1479.eqiad.wmnet, kubernetes1044.eqiad.wmnet, mw1449.eqiad.wmnet, mw1399.eqiad.wmnet, mw1424.eqiad.wmnet, parse1024.eqiad.wmnet, mw1454.eqiad.wmnet,
[17:09:49] <icinga-wm_>	 0.eqiad.wmnet, mw1423.eqiad.wmnet, mw1496.eqiad.wmnet, kubernetes1060.eqiad.wmnet, mw1466.eqiad.wmnet, kubernetes1059.eqiad.wmnet, mw1469.eqiad.wmnet, mw1394.eqiad.wmnet, mw1452.eqiad.wmnet, mw1422.eqiad.wmnet, mw1374.eqiad.wmnet, mw1414.eqiad.wmnet, parse1020.eqiad.wmnet, mw1485.eqiad.wmnet, kubernetes1019.eqiad.wmnet, kubernetes1009.eqiad.wmnet, mw1448.eqiad.wmnet, mw1381.eqiad.wmnet, mw1362.eqiad.wmnet, kubernetes1042.eqiad.wmnet, mw1
[17:09:49] <icinga-wm_>	 .wmnet, kubernetes1056.eqiad.wmnet, kubernetes1029.eqiad.wmnet, mw1472.eqiad.wmnet, parse1022.eqiad.wmnet, kubernetes1032.eqiad.wmnet, parse1017.eqiad.wmnet, mw1440.eqiad.wmnet, kuberne https://wikitech.wikimedia.org/wiki/PyBal
[17:10:40] <rzl>	 httpbb was passing but is failing again, so this isn't an httpbb problem but httpbb surfacing a load-balancing problem
[17:11:32] <wikibugs>	 SRE, collaboration-services, Infrastructure-Foundations, Mail: Postfix outbound rollout sequence, mx-out - https://phabricator.wikimedia.org/T365395#9843301 (jhathaway)
[17:13:25] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good, can you please also take care of merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1031761 ?" [puppet] - https://gerrit.wikimedia.org/r/1037117 (https://phabricator.wikimedia.org/T284145) (owner: Dzahn)
[17:13:57] <wikibugs>	 (CR) Dduvall: "That works for me. Thanks, @effie!" [deployment-charts] - https://gerrit.wikimedia.org/r/1036716 (https://phabricator.wikimedia.org/T365742) (owner: Dduvall)
[17:14:29] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host kafka-main1010.eqiad.wmnet with OS bullseye
[17:16:47] <icinga-wm_>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 36 probes of 790 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[17:17:51] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P63580 and previous config saved to /var/cache/conftool/dbconfig/20240529-171750-marostegui.json
[17:19:42] <wikibugs>	 (PS3) Muehlenhoff: vrts: Stop ignoring errors from alias sync [puppet] - https://gerrit.wikimedia.org/r/1031761 (https://phabricator.wikimedia.org/T284145)
[17:20:40] <wikibugs>	 (CR) Dzahn: [C:+2] vrts: Stop ignoring errors from alias sync [puppet] - https://gerrit.wikimedia.org/r/1031761 (https://phabricator.wikimedia.org/T284145) (owner: Muehlenhoff)
[17:23:23] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1041.eqiad.wmnet with reason: host reimage
[17:23:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P63581 and previous config saved to /var/cache/conftool/dbconfig/20240529-172349-marostegui.json
[17:23:56] <cdanis>	 rzl: I'm back at keys and looking now
[17:25:44] <rzl>	 thanks, still looking too -- I'm still not 100% sure this isn't just a genuine mw-debug issue caused by the deploy, but it doesn't look like it
[17:26:01] <dancy>	 This is what I was trying to deploy: https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/1037018
[17:26:06] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1041.eqiad.wmnet with reason: host reimage
[17:26:20] <rzl>	 `curl https://foundation.wikimedia.org/wiki/Main_Page --connect-to foundation.wikimedia.org:443:mwdebug.discovery.wmnet:4444` also hangs, so it definitely isn't just httpbb
[17:27:31] <rzl>	 `curl https://foundation.wikimedia.org/wiki/Main_Page --connect-to foundation.wikimedia.org:443:mw-web.discovery.wmnet:4450` works so it isn't all of mw-on-k8s
[17:29:36] <wikibugs>	 (PS3) Dzahn: mx: stop ignoring VRTS alias errors [puppet] - https://gerrit.wikimedia.org/r/1037117 (https://phabricator.wikimedia.org/T284145)
[17:30:10] <cdanis>	 rzl: https://grafana.wikimedia.org/d/000000422/pybal-service?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-server=All&var-service=mwdebug_4444
[17:30:19] <wikibugs>	 (CR) Dzahn: mx: stop ignoring VRTS alias errors (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1037117 (https://phabricator.wikimedia.org/T284145) (owner: Dzahn)
[17:31:41] <icinga-wm_>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 34 probes of 790 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[17:32:24] <rzl>	 hrm, and scap started at 16:50
[17:32:35] <wikibugs>	 (CR) Dzahn: mx: stop ignoring VRTS alias errors (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1037117 (https://phabricator.wikimedia.org/T284145) (owner: Dzahn)
[17:32:41] <wikibugs>	 (Abandoned) Dzahn: mx: stop ignoring VRTS alias errors [puppet] - https://gerrit.wikimedia.org/r/1037117 (https://phabricator.wikimedia.org/T284145) (owner: Dzahn)
[17:32:59] <rzl>	 it doesn't necessarily need to have been the actual code getting deployed, but that looks likely to have been the trigger for whatever this is
[17:32:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P63582 and previous config saved to /var/cache/conftool/dbconfig/20240529-173258-marostegui.json
[17:34:36] <rzl>	 the actual mw-debug pods are 35m and 38m old so they're not dying, and I don't immediately see anything in logs on the k8s side
[17:35:31] <cdanis>	 May 29 17:35:03 lvs1019 pybal[273937]: [mwdebug_4444] INFO: Leaving previously pooled but down server mw1439.eqiad.wmnet pooled
[17:35:33] <cdanis>	 May 29 17:35:03 lvs1019 pybal[273937]: [mwdebug_4444] ERROR: Monitoring instance IdleConnection reports server mw1393.eqiad.wmnet (enabled/up/pooled) down: User timeout caused connection failure.
[17:35:35] <cdanis>	 May 29 17:35:03 lvs1019 pybal[273937]: [mwdebug_4444] ERROR: Could not depool server mw1393.eqiad.wmnet because of too many down!
[17:35:37] <cdanis>	 May 29 17:35:03 lvs1019 pybal[273937]: [mwdebug_4444 IdleConnection] WARN: mw1393.eqiad.wmnet (enabled/down/pooled): Connection to 10.64.16.151:4444 failed.
[17:35:39] <cdanis>	 May 29 17:35:03 lvs1019 pybal[273937]: [mwdebug_4444] ERROR: Monitoring instance IdleConnection reports server parse1003.eqiad.wmnet (enabled/up/pooled) down: User timeout caused connection failure.
[17:35:41] <cdanis>	 May 29 17:35:03 lvs1019 pybal[273937]: [mwdebug_4444] ERROR: Could not depool server parse1003.eqiad.wmnet because of too many down!
[17:35:43] <cdanis>	 May 29 17:35:03 lvs1019 pybal[273937]: [mwdebug_4444 IdleConnection] WARN: parse1003.eqiad.wmnet (enabled/down/pooled): Connection to 10.64.0.121:4444 failed.
[17:36:14] <cdanis>	 I don't know how to quickly check but I think that presently lvs1019/lvs1020 can connect to 0 of the kubernetes hosts on 4444
[17:38:35] <cdanis>	 ah wait, that's not true
[17:38:43] <hashar>	 lvs1019 started doing a bunch of disk writes, possibly writing logs: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=lvs1019&var-datasource=thanos&var-cluster=lvs&from=now-1h&to=now&viewPanel=35
[17:38:54] <cdanis>	 the situation is much worse on lvs1019, on lvs1020 it is actually okay-ish
[17:38:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T366123)', diff saved to https://phabricator.wikimedia.org/P63583 and previous config saved to /var/cache/conftool/dbconfig/20240529-173857-marostegui.json
[17:39:00] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[17:39:03] <stashbot>	 T366123: Drop gu_salt from globaluser in WMF prod - https://phabricator.wikimedia.org/T366123
[17:39:09] <hashar>	 and it has TCP errors https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=lvs1019&var-datasource=thanos&var-cluster=lvs&from=now-1h&to=now&viewPanel=31
[17:39:14] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[17:39:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1202 (T366123)', diff saved to https://phabricator.wikimedia.org/P63584 and previous config saved to /var/cache/conftool/dbconfig/20240529-173921-marostegui.json
[17:39:40] <cdanis>	 SYN retransmits are consistent with what I'm seeing yeah
[17:40:35] <hashar>	 that is all I know :-]
[17:41:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T366123)', diff saved to https://phabricator.wikimedia.org/P63585 and previous config saved to /var/cache/conftool/dbconfig/20240529-174132-marostegui.json
[17:41:59] <wikibugs>	 (CR) Muehlenhoff: mx: stop ignoring VRTS alias errors (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1037117 (https://phabricator.wikimedia.org/T284145) (owner: Dzahn)
[17:42:27] <wikibugs>	 (Restored) Dzahn: mx: stop ignoring VRTS alias errors [puppet] - https://gerrit.wikimedia.org/r/1037117 (https://phabricator.wikimedia.org/T284145) (owner: Dzahn)
[17:43:14] <cdanis>	 rzl: I think this is a Calico issue
[17:43:43] <icinga-wm_>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 42 probes of 790 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[17:43:56] <rzl>	 -> #wikimedia-sre
[17:45:43] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:48:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T364299)', diff saved to https://phabricator.wikimedia.org/P63586 and previous config saved to /var/cache/conftool/dbconfig/20240529-174806-marostegui.json
[17:48:09] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2168.codfw.wmnet with reason: Maintenance
[17:48:12] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[17:48:22] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2168.codfw.wmnet with reason: Maintenance
[17:48:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2168 (T364299)', diff saved to https://phabricator.wikimedia.org/P63587 and previous config saved to /var/cache/conftool/dbconfig/20240529-174829-marostegui.json
[17:53:43] <icinga-wm_>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 32 probes of 790 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[17:56:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P63588 and previous config saved to /var/cache/conftool/dbconfig/20240529-175640-marostegui.json
[17:59:50] <wikibugs>	 (PS4) Dzahn: mx: stop ignoring VRTS alias errors [puppet] - https://gerrit.wikimedia.org/r/1037117 (https://phabricator.wikimedia.org/T284145)
[18:00:05] <jouncebot>	 dancy and andre: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Train log triage with CPT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240529T1800).
[18:00:05] <jouncebot>	 dancy and andre: Your horoscope predicts another MediaWiki train - Utc-7+Utc-0 Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240529T1800).
[18:00:27] <dancy>	 I'm holding the train until k8s issues are worked out.
[18:01:02] <wikibugs>	 (CR) Dzahn: "both defaults are false, so just removing all 3 lines then" [puppet] - https://gerrit.wikimedia.org/r/1037117 (https://phabricator.wikimedia.org/T284145) (owner: Dzahn)
[18:04:43] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T352010)', diff saved to https://phabricator.wikimedia.org/P63589 and previous config saved to /var/cache/conftool/dbconfig/20240529-180442-ladsgroup.json
[18:04:49] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[18:04:53] <icinga-wm_>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[18:07:35] <akosiaris>	 dancy: I think we figured it out, you can unhold the train
[18:07:44] <dancy>	 Excellent.
[18:07:54] <dancy>	 Re-doing the backport that I was originally attempting first.
[18:08:26] <logmsgbot>	 !log dancy@deploy1002 Started scap: Backport for [[gerrit:1037018|Revert "Wrap tables with JS" (T330527)]]
[18:08:32] <stashbot>	 T330527: Wider tables overlap sticky page tools (Upstream Minerva's responsive table styles to core SkinModule) - https://phabricator.wikimedia.org/T330527
[18:10:59] <logmsgbot>	 !log dancy@deploy1002 dancy and jdlrobson: Backport for [[gerrit:1037018|Revert "Wrap tables with JS" (T330527)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[18:11:45] <dancy>	 jan_drewniak: Can you verify that the revert fixed the problem on testservers?
[18:11:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P63590 and previous config saved to /var/cache/conftool/dbconfig/20240529-181148-marostegui.json
[18:12:16] <jan_drewniak>	 dancy: ok taking a look now
[18:15:43] <icinga-wm_>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 39 probes of 790 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[18:15:56] <jan_drewniak>	 Jdlrobson: can you verify the fix? My computer just crashed :/
[18:17:43] <dancy>	 Bummer!
[18:19:51] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P63592 and previous config saved to /var/cache/conftool/dbconfig/20240529-181950-ladsgroup.json
[18:20:43] <icinga-wm_>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 32 probes of 790 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[18:24:30] <jan_drewniak>	 dancy: we are good to sync
[18:24:40] <dancy>	 Excellent.  Proceeding
[18:24:42] <logmsgbot>	 !log dancy@deploy1002 dancy and jdlrobson: Continuing with sync
[18:26:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T366123)', diff saved to https://phabricator.wikimedia.org/P63593 and previous config saved to /var/cache/conftool/dbconfig/20240529-182656-marostegui.json
[18:26:59] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1227.eqiad.wmnet with reason: Maintenance
[18:27:02] <stashbot>	 T366123: Drop gu_salt from globaluser in WMF prod - https://phabricator.wikimedia.org/T366123
[18:27:12] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1227.eqiad.wmnet with reason: Maintenance
[18:27:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1227 (T366123)', diff saved to https://phabricator.wikimedia.org/P63594 and previous config saved to /var/cache/conftool/dbconfig/20240529-182719-marostegui.json
[18:29:37] <wikibugs>	 (CR) Dzahn: [C:+2] mx: stop ignoring VRTS alias errors [puppet] - https://gerrit.wikimedia.org/r/1037117 (https://phabricator.wikimedia.org/T284145) (owner: Dzahn)
[18:29:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:31:19] <rzl>	 ^ not exactly "expected" but let's call it status quo
[18:31:33] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002"
[18:32:43] <icinga-wm_>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 39 probes of 790 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[18:32:45] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002"
[18:32:47] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1041.eqiad.wmnet with OS bookworm
[18:32:54] <wikibugs>	 ops-eqiad, SRE, cloud-services-team, DC-Ops: cloudvirt1041: can't boot after reimage - https://phabricator.wikimedia.org/T364984#9843563 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host cloudvirt1041.eqiad.wmnet with OS bookworm completed: - cloudvirt...
[18:33:36] <logmsgbot>	 !log dancy@deploy1002 Finished scap: Backport for [[gerrit:1037018|Revert "Wrap tables with JS" (T330527)]] (duration: 25m 10s)
[18:33:42] <stashbot>	 T330527: Wider tables overlap sticky page tools (Upstream Minerva's responsive table styles to core SkinModule) - https://phabricator.wikimedia.org/T330527
[18:34:00] <dancy>	 Rolling the train.
[18:34:19] <wikibugs>	 (PS1) TrainBranchBot: group1 wikis to 1.43.0-wmf.7 [mediawiki-config] - https://gerrit.wikimedia.org/r/1037148 (https://phabricator.wikimedia.org/T361401)
[18:34:21] <wikibugs>	 (CR) TrainBranchBot: [C:+2] group1 wikis to 1.43.0-wmf.7 [mediawiki-config] - https://gerrit.wikimedia.org/r/1037148 (https://phabricator.wikimedia.org/T361401) (owner: TrainBranchBot)
[18:34:59] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P63595 and previous config saved to /var/cache/conftool/dbconfig/20240529-183458-ladsgroup.json
[18:34:59] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Grant Access to analytics-privatedata-users, wmf for Sonja Perry - https://phabricator.wikimedia.org/T365766#9843568 (SonjaPerry) L3 signed, thank you!
[18:35:01] <wikibugs>	 (Merged) jenkins-bot: group1 wikis to 1.43.0-wmf.7 [mediawiki-config] - https://gerrit.wikimedia.org/r/1037148 (https://phabricator.wikimedia.org/T361401) (owner: TrainBranchBot)
[18:35:07] <icinga-wm_>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1012 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[18:37:43] <icinga-wm_>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 26 probes of 790 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[18:38:21] <wikibugs>	 (PS1) Bernard Wang: POC: Wrap tables with JS [skins/Vector] (wmf/1.43.0-wmf.7) - https://gerrit.wikimedia.org/r/1037131
[18:41:54] <cdanis>	 !log 💙cdanis@lvs1020.eqiad.wmnet ~ 🕝☕ sudo systemctl restart pybal.service
[18:41:57] <icinga-wm_>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[18:41:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:42:12] <wikibugs>	 ops-eqiad, SRE, cloud-services-team, DC-Ops: cloudvirt1041: can't boot after reimage - https://phabricator.wikimedia.org/T364984#9843591 (Andrew) a:Jclark-ctr→None After a nic firmware upgrade things seem to be working. It took a couple of tries (suspicious!) but now the host is imaged an...
[18:44:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:47:34] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9843636 (10wiki_willy) a:03VRiley-WMF
[18:48:29] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9843643 (10wiki_willy)
[18:48:45] <wikibugs>	 (03PS1) 10Ahmon Dancy: httpbb-tests: Update https://donate.wikimedia.org redirect Location [puppet] - 10https://gerrit.wikimedia.org/r/1037149 (https://phabricator.wikimedia.org/T351325)
[18:49:00] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: codfw:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366205#9843647 (10wiki_willy)
[18:49:27] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: codfw:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366205#9843655 (10wiki_willy)
[18:49:35] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9843656 (10wiki_willy)
[18:50:07] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T352010)', diff saved to https://phabricator.wikimedia.org/P63597 and previous config saved to /var/cache/conftool/dbconfig/20240529-185006-ladsgroup.json
[18:50:10] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2126.codfw.wmnet with reason: Maintenance
[18:50:14] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[18:50:23] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2126.codfw.wmnet with reason: Maintenance
[18:50:25] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[18:50:27] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[18:50:35] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2126 (T352010)', diff saved to https://phabricator.wikimedia.org/P63598 and previous config saved to /var/cache/conftool/dbconfig/20240529-185035-ladsgroup.json
[18:55:38] <wikibugs>	 (03PS2) 10Ahmon Dancy: httpbb-tests: test_foundation.yaml: Update a donate.wikimedia.org expected redirect [puppet] - 10https://gerrit.wikimedia.org/r/1037149 (https://phabricator.wikimedia.org/T351325)
[18:55:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T366123)', diff saved to https://phabricator.wikimedia.org/P63599 and previous config saved to /var/cache/conftool/dbconfig/20240529-185541-marostegui.json
[18:55:47] <stashbot>	 T366123: Drop gu_salt from globaluser in WMF prod - https://phabricator.wikimedia.org/T366123
[18:57:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T364299)', diff saved to https://phabricator.wikimedia.org/P63600 and previous config saved to /var/cache/conftool/dbconfig/20240529-185719-marostegui.json
[18:57:25] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[18:59:02] <wikibugs>	 (03CR) 10CI reject: [V:04-1] httpbb-tests: test_foundation.yaml: Update a donate.wikimedia.org expected redirect [puppet] - 10https://gerrit.wikimedia.org/r/1037149 (https://phabricator.wikimedia.org/T351325) (owner: 10Ahmon Dancy)
[18:59:45] <wikibugs>	 (03PS1) 10Ebernhardson: cirrus: Send weighted tags to known clusters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037153
[18:59:53] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Joely Rooke WMDE - https://phabricator.wikimedia.org/T366145#9843702 (10Dzahn) Hi @KFrancis , @JoelyRooke-WMDE will need the usual NDA for WMDE employees. Thanks  Hi @JoelyRooke-WMDE If you could send an email to Katie (https://meta.wikimedia.org/wik...
[19:00:04] <wikibugs>	 (03PS3) 10Ahmon Dancy: httpbb-tests: Update a donate.wikimedia.org expected redirect [puppet] - 10https://gerrit.wikimedia.org/r/1037149 (https://phabricator.wikimedia.org/T351325)
[19:01:02] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to wmf/analytics-privatedata-users for derenrich - https://phabricator.wikimedia.org/T365381#9843720 (10Dzahn) @derenrich From your direct manager by leaving a comment on this ticket, please.
[19:03:27] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to wmf/analytics-privatedata-users for derenrich - https://phabricator.wikimedia.org/T365381#9843724 (10derenrich) >>! In T365381#9843720, @Dzahn wrote: > @derenrich From your direct manager by leaving a comment on this ticket, please.  @Dzahn that already happened....
[19:03:42] <wikibugs>	 (03CR) 10RLazarus: [C:03+2] httpbb-tests: Update a donate.wikimedia.org expected redirect [puppet] - 10https://gerrit.wikimedia.org/r/1037149 (https://phabricator.wikimedia.org/T351325) (owner: 10Ahmon Dancy)
[19:04:01] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to wmf/analytics-privatedata-users for derenrich - https://phabricator.wikimedia.org/T365381#9843726 (10Dzahn) My bad, see my edit above though.
[19:07:31] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to nda/logstash for Sohom Datta - https://phabricator.wikimedia.org/T366032#9843733 (10KFrancis) Hi all, the NDA has been sent out for signatures.  I'll confirm when it's complete.
[19:10:46] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for STran, Madalina, Tchanders and JayCano - https://phabricator.wikimedia.org/T366198#9843745 (10Dzahn) Hi @JayCano sorry for the hassle but this isn't an LDAP group, so it's not really an LDAP-Access-Request.  This is the right form f...
[19:10:51] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P63601 and previous config saved to /var/cache/conftool/dbconfig/20240529-191049-marostegui.json
[19:10:53] <wikibugs>	 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: cloudvirt1041: can't boot after reimage - https://phabricator.wikimedia.org/T364984#9843740 (10Andrew) a:03aborrero This host is up and seems stable, but VMs running on it cannot reach the internet.  Since this host was being moved from a 2-nic to 1-ni...
[19:12:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P63602 and previous config saved to /var/cache/conftool/dbconfig/20240529-191227-marostegui.json
[19:17:34] <logmsgbot>	 !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.43.0-wmf.7  refs T361401
[19:17:39] <stashbot>	 T361401: 1.43.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T361401
[19:21:12] <wikibugs>	 (03PS1) 10JHathaway: wikipedia.org dmarc: change to quarantine [dns] - 10https://gerrit.wikimedia.org/r/1037154 (https://phabricator.wikimedia.org/T211403)
[19:22:19] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Mail, 13Patch-For-Review, 07Security: Domains of most projects do not have DMARC policy - https://phabricator.wikimedia.org/T211403#9843816 (10jhathaway) Patch added to change wikipedia.org's policy to quarantine.
[19:25:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P63603 and previous config saved to /var/cache/conftool/dbconfig/20240529-192559-marostegui.json
[19:27:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P63604 and previous config saved to /var/cache/conftool/dbconfig/20240529-192735-marostegui.json
[19:32:11] <icinga-wm_>	 RECOVERY - Categories update lag on wdqs1016 is OK: OK - Categories lag: 14:32:10.202067 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag
[19:32:11] <icinga-wm_>	 RECOVERY - Categories update lag on wdqs1012 is OK: OK - Categories lag: 14:32:10.223435 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag
[19:32:11] <icinga-wm_>	 RECOVERY - Categories update lag on wdqs1014 is OK: OK - Categories lag: 14:32:10.256696 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag
[19:32:13] <icinga-wm_>	 RECOVERY - Categories update lag on wdqs1020 is OK: OK - Categories lag: 14:32:11.934951 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag
[19:35:17] <icinga-wm_>	 RECOVERY - Categories update lag on wdqs2015 is OK: OK - Categories lag: 14:35:15.504675 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag
[19:35:17] <icinga-wm_>	 RECOVERY - Categories update lag on wdqs2021 is OK: OK - Categories lag: 14:35:15.519092 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag
[19:35:17] <icinga-wm_>	 RECOVERY - Categories update lag on wdqs2019 is OK: OK - Categories lag: 14:35:15.523440 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag
[19:35:17] <icinga-wm_>	 RECOVERY - Categories update lag on wdqs2017 is OK: OK - Categories lag: 14:35:15.534935 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag
[19:36:49] <logmsgbot>	 !log ebernhardson@deploy1002 Started deploy [airflow-dags/search@3287de9]: bump discolytics to 0.22.0
[19:37:17] <logmsgbot>	 !log ebernhardson@deploy1002 Finished deploy [airflow-dags/search@3287de9]: bump discolytics to 0.22.0 (duration: 00m 27s)
[19:39:07] <wikibugs>	 (03PS3) 10David Martin: Add a stream for tracking the API of WikiLambda [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1017962 (https://phabricator.wikimedia.org/T356228)
[19:41:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T366123)', diff saved to https://phabricator.wikimedia.org/P63605 and previous config saved to /var/cache/conftool/dbconfig/20240529-194107-marostegui.json
[19:41:10] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[19:41:15] <stashbot>	 T366123: Drop gu_salt from globaluser in WMF prod - https://phabricator.wikimedia.org/T366123
[19:41:23] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[19:42:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T364299)', diff saved to https://phabricator.wikimedia.org/P63606 and previous config saved to /var/cache/conftool/dbconfig/20240529-194245-marostegui.json
[19:42:48] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2182.codfw.wmnet with reason: Maintenance
[19:42:52] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[19:43:01] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2182.codfw.wmnet with reason: Maintenance
[19:43:06] <wikibugs>	 (03PS4) 10CDanis: otelcol: limit collected k8s data [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037113 (https://phabricator.wikimedia.org/T366094)
[19:43:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2182 (T364299)', diff saved to https://phabricator.wikimedia.org/P63607 and previous config saved to /var/cache/conftool/dbconfig/20240529-194309-marostegui.json
[19:45:59] <wikibugs>	 (03PS1) 10Ahmon Dancy: Make header expected/got failure output multiline for easier human viewing [software/httpbb] - 10https://gerrit.wikimedia.org/r/1037156
[19:46:51] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for STran, Madalina, Tchanders and JayCano - https://phabricator.wikimedia.org/T366198#9843932 (10Aklapper) 05Open→03Invalid Please see / follow the "Analytics" entry under "I need access or permissions to..." on the https://pha...
[19:46:54] <wikibugs>	 (03PS1) 10JHathaway: wikipedia.org spf: indicate mail is not sent from this domain. [dns] - 10https://gerrit.wikimedia.org/r/1037157 (https://phabricator.wikimedia.org/T211403)
[19:47:11] <icinga-wm_>	 RECOVERY - Categories update lag on wdqs1013 is OK: OK - Categories lag: 14:47:10.174585 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag
[19:47:11] <icinga-wm_>	 RECOVERY - Categories update lag on wdqs1011 is OK: OK - Categories lag: 14:47:10.195421 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag
[19:47:13] <icinga-wm_>	 RECOVERY - Categories update lag on wdqs1021 is OK: OK - Categories lag: 14:47:11.903546 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag
[19:47:29] <wikibugs>	 (03PS3) 10Dzahn: gerrit: add parameter to toggle lfs_replica_sync [puppet] - 10https://gerrit.wikimedia.org/r/1036771 (https://phabricator.wikimedia.org/T363196)
[19:47:42] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Make header expected/got failure output multiline for easier human viewing [software/httpbb] - 10https://gerrit.wikimedia.org/r/1037156 (owner: 10Ahmon Dancy)
[19:47:46] <wikibugs>	 (03CR) 10CI reject: [V:04-1] gerrit: add parameter to toggle lfs_replica_sync [puppet] - 10https://gerrit.wikimedia.org/r/1036771 (https://phabricator.wikimedia.org/T363196) (owner: 10Dzahn)
[19:48:18] <wikibugs>	 (03CR) 10RLazarus: [C:03+1] otelcol: limit collected k8s data [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037113 (https://phabricator.wikimedia.org/T366094) (owner: 10CDanis)
[19:48:24] <wikibugs>	 (03CR) 10CDanis: [C:03+2] otelcol: limit collected k8s data [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037113 (https://phabricator.wikimedia.org/T366094) (owner: 10CDanis)
[19:49:00] <wikibugs>	 (03PS1) 10Ahmon Dancy: Add more junk to .gitignore [software/httpbb] - 10https://gerrit.wikimedia.org/r/1037158
[19:49:55] <wikibugs>	 (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1036771 (https://phabricator.wikimedia.org/T363196) (owner: 10Dzahn)
[19:50:13] <icinga-wm_>	 RECOVERY - Categories update lag on wdqs2018 is OK: OK - Categories lag: 14:50:12.311387 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag
[19:50:13] <icinga-wm_>	 RECOVERY - Categories update lag on wdqs2016 is OK: OK - Categories lag: 14:50:12.317364 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag
[19:50:15] <icinga-wm_>	 RECOVERY - Categories update lag on wdqs2020 is OK: OK - Categories lag: 14:50:13.815870 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Categories_update_lag
[19:50:59] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Add more junk to .gitignore [software/httpbb] - 10https://gerrit.wikimedia.org/r/1037158 (owner: 10Ahmon Dancy)
[19:51:20] <wikibugs>	 (03Merged) 10jenkins-bot: otelcol: limit collected k8s data [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037113 (https://phabricator.wikimedia.org/T366094) (owner: 10CDanis)
[19:51:46] <logmsgbot>	 !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[19:52:03] <wikibugs>	 (03CR) 10Ahmon Dancy: "Not sure what's up with the tests." [software/httpbb] - 10https://gerrit.wikimedia.org/r/1037158 (owner: 10Ahmon Dancy)
[19:56:11] <wikibugs>	 (03PS4) 10Dzahn: gerrit: add parameter to toggle lfs_replica_sync [puppet] - 10https://gerrit.wikimedia.org/r/1036771 (https://phabricator.wikimedia.org/T363196)
[19:58:42] <logmsgbot>	 !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240529T2000).
[20:00:05] <jouncebot>	 JSherman: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:18] <JSherman>	 here and happy to self deploy
[20:01:15] <Jdlrobson>	 o/ present but somehow in the wrong window again
[20:02:05] <JSherman>	 Jdlrobson: are you here for https://gerrit.wikimedia.org/r/c/1034480/ ?
[20:02:23] <Jdlrobson>	 Fixed: https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=2183257&oldid=2183214
[20:02:53] <Jdlrobson>	 nope for the follow up to that: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Popups/+/1036664 JSherman 
[20:03:07] <JSherman>	 thanks
[20:04:30] <Nemoralis>	 hi, who is the deployer
[20:04:50] <JSherman>	 Jdlrobson: it looks like it's simplifying things. Was it tested on beta already?
[20:05:07] <icinga-wm_>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1012 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[20:05:09] <Jdlrobson>	 yeh
[20:05:13] <Nemoralis>	 jouncebot now
[20:05:13] <jouncebot>	 For the next 0 hour(s) and 54 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240529T2000)
[20:05:17] <Jdlrobson>	 we did the config change yesterday
[20:05:37] <Jdlrobson>	 We want to do this now in case we need to revert before the Thursday train.. but it's easy to fix!
[20:05:51] <Jdlrobson>	 s/fix/test
[20:06:02] <JSherman>	 Nemoralis: haven't heard from one of the listed deployers, but I was about to self deploy and then do Jdlrobson's patch too if needed
[20:06:17] <Nemoralis>	 I have patch too
[20:06:36] <JSherman>	 Okay, I'm about to start mine.
[20:06:49] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037118 (https://phabricator.wikimedia.org/T366203) (owner: 10Jsn.sherman)
[20:07:33] <Jdlrobson>	 JSherman: are merges to deploy branches still taking 30mins + ?
[20:08:15] <JSherman>	 Honestly, I don't know; it's been pretty variable week to week in my experience
[20:08:45] <wikibugs>	 (03PS2) 10NMW03: Enable wmgUseSandboxLink for Swahili Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037073 (https://phabricator.wikimedia.org/T365970)
[20:09:26] * cjming thanks JSherman for deploying!
[20:09:42] <wikibugs>	 (03Merged) 10jenkins-bot: CommonSettings: correct AutoModerator load order [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037118 (https://phabricator.wikimedia.org/T366203) (owner: 10Jsn.sherman)
[20:10:12] <logmsgbot>	 !log jsn@deploy1002 Started scap: Backport for [[gerrit:1037118|CommonSettings: correct AutoModerator load order (T366203)]]
[20:10:18] <stashbot>	 T366203: Check/move/document code in CommonSettings.php after require of CommonSettings-labs.php - https://phabricator.wikimedia.org/T366203
[20:10:18] <cjming>	 Jdlrobson: in my experience, it's waiting for the CI to finish on release branches - so backports are time-consuming - sometimes over 20 mins to merge
[20:11:30] <cjming>	 config takes a few mins to merge -- and deploying to the test servers, then production have taken longer than i recall in recent memory
[20:12:46] <JSherman>	 Jdlrobson: yeah it looks like gate-and-submit jobs are still running 20+ minutes, which lines up with what cjming is saying
[20:12:49] <logmsgbot>	 !log jsn@deploy1002 jsn: Backport for [[gerrit:1037118|CommonSettings: correct AutoModerator load order (T366203)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:12:53] <logmsgbot>	 !log jsn@deploy1002 jsn: Continuing with sync
[20:12:58] <cjming>	 so if there's a backport, i tend to manually +2 it while deploying a config patch
[20:13:03] <wikibugs>	 (03PS1) 10Scott French: function-evaluator: ensure all containers have securityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037162 (https://phabricator.wikimedia.org/T362978)
[20:13:17] <wikibugs>	 (03PS1) 10Scott French: function-orchestrator: ensure all containers have securityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037163 (https://phabricator.wikimedia.org/T362978)
[20:13:36] <wikibugs>	 (03PS1) 10Scott French: wikifeeds: ensure all containers have securityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037164 (https://phabricator.wikimedia.org/T362978)
[20:13:47] <wikibugs>	 (03PS1) 10Scott French: toolhub: ensure all containers have securityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037165 (https://phabricator.wikimedia.org/T362978)
[20:14:00] <wikibugs>	 (03PS1) 10Scott French: thumbor: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037166 (https://phabricator.wikimedia.org/T362978)
[20:14:01] <JSherman>	 cjming: yeah, I did that a couple weeks ago and then got scared that it was wrong so I -1ed it.
[20:14:32] <JSherman>	 I'll go ahead with Jdlrobson's patch.
[20:14:45] <wikibugs>	 (03CR) 10Jsn.sherman: [C:03+2] feature(Popups): Conditional User Defaults Implementation [extensions/Popups] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1036664 (https://phabricator.wikimedia.org/T364347) (owner: 10Jdlrobson)
[20:14:55] <wikibugs>	 (03CR) 10CI reject: [V:04-1] thumbor: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037166 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French)
[20:15:19] <wikibugs>	 (03PS2) 10Bernard Wang: POC: Wrap tables with JS [skins/Vector] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1037131
[20:16:14] <JSherman>	 Nemoralis: your patch is a really straightforward config change, so I might be able to do it while we wait on Jdlrobson's backport to run through ci.
[20:16:32] <Nemoralis>	 (y)
[20:17:52] <JSherman>	 Nemoralis: not that I expect any trouble, but are you set up to test with the debug extension?
[20:18:00] <Nemoralis>	 yes
[20:21:35] <logmsgbot>	 !log jsn@deploy1002 Finished scap: Backport for [[gerrit:1037118|CommonSettings: correct AutoModerator load order (T366203)]] (duration: 11m 22s)
[20:21:41] <stashbot>	 T366203: Check/move/document code in CommonSettings.php after require of CommonSettings-labs.php - https://phabricator.wikimedia.org/T366203
[20:21:46] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037073 (https://phabricator.wikimedia.org/T365970) (owner: 10NMW03)
[20:22:51] <wikibugs>	 (03Merged) 10jenkins-bot: Enable wmgUseSandboxLink for Swahili Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037073 (https://phabricator.wikimedia.org/T365970) (owner: 10NMW03)
[20:23:22] <logmsgbot>	 !log jsn@deploy1002 Started scap: Backport for [[gerrit:1037073|Enable wmgUseSandboxLink for Swahili Wikipedia (T365970)]]
[20:23:28] <stashbot>	 T365970: Add "Sandbox" link to top bar on Swahili Wikipedia - https://phabricator.wikimedia.org/T365970
[20:23:41] <wikibugs>	 (03PS3) 10Bernard Wang: POC: t Wrap tables with JS [skins/Vector] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1037131
[20:23:51] <wikibugs>	 (03Merged) 10jenkins-bot: feature(Popups): Conditional User Defaults Implementation [extensions/Popups] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1036664 (https://phabricator.wikimedia.org/T364347) (owner: 10Jdlrobson)
[20:25:43] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:25:52] <logmsgbot>	 !log jsn@deploy1002 jsn and nmw03: Backport for [[gerrit:1037073|Enable wmgUseSandboxLink for Swahili Wikipedia (T365970)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:26:27] <JSherman>	 Nemoralis: please test
[20:29:22] <wikibugs>	 (03PS4) 10Bernard Wang: POC: Wrap tables with JS [skins/Vector] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1037131
[20:30:51] <wikibugs>	 (03PS5) 10Bernard Wang: POC: Wrap tables with JS [skins/Vector] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1037131
[20:31:41] <JSherman>	 Nemoralis: I went ahead and tested for you since Jdlrobson is waiting. I verified that the sandbox link is enabled for sw wiki on the debug host.
[20:31:55] <JSherman>	 proceeding
[20:32:00] <logmsgbot>	 !log jsn@deploy1002 jsn and nmw03: Continuing with sync
[20:35:04] <wikibugs>	 (03PS2) 10Scott French: thumbor: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037166 (https://phabricator.wikimedia.org/T362978)
[20:35:11] <JSherman>	 Jdlrobson: FWIW, the gate-and-submit-wmf job for your backport only took 9 minutes. I stuck the other config change in front of you because I expected it to take longer. Apologies for the wait.
[20:35:44] <Jdlrobson>	 JSherman: np
[20:37:10] <Nemoralis>	 JSherman sorry I was disconnected. It looks like my patch has been deployed
[20:37:35] <JSherman>	 Nemoralis: yep, I went ahead and verified that sw had the sandbox link on the debug host
[20:37:45] <Nemoralis>	 thanks!
[20:37:52] <Nemoralis>	 I can close the phab task now
[20:38:06] <JSherman>	 good deal!
[20:38:55] <JSherman>	 well, I suppose you should wait to verify that it makes it to sw wiki on the other hosts as well
[20:39:16] <JSherman>	 we're about halfway through the php-fpm restarts
[20:40:27] <Nemoralis>	 alright
[20:40:31] <logmsgbot>	 !log jsn@deploy1002 Finished scap: Backport for [[gerrit:1037073|Enable wmgUseSandboxLink for Swahili Wikipedia (T365970)]] (duration: 17m 08s)
[20:40:37] <stashbot>	 T365970: Add "Sandbox" link to top bar on Swahili Wikipedia - https://phabricator.wikimedia.org/T365970
[20:40:54] <JSherman>	 Nemoralis: and it's done; you should see the changes live on swwiki
[20:41:10] <Nemoralis>	 thanks again!
[20:41:52] <JSherman>	 Nemoralis: no prob!
[20:41:52] <logmsgbot>	 !log jsn@deploy1002 Started scap: Backport for [[gerrit:1036664|feature(Popups): Conditional User Defaults Implementation (T364347)]]
[20:41:58] <stashbot>	 T364347: Popups: Make use of conditional user defaults - https://phabricator.wikimedia.org/T364347
[20:43:43] <jinxer-wm>	 FIRING: KubernetesCalicoDown: kubernetes2032.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubernetes2032.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[20:44:00] <cdanis>	 uh
[20:44:22] <logmsgbot>	 !log jsn@deploy1002 jsn and jdlrobson: Backport for [[gerrit:1036664|feature(Popups): Conditional User Defaults Implementation (T364347)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:44:46] <JSherman>	 Jdlrobson: please test
[20:45:11] <dancy>	 I love seeing the self-organization around backports. Nice work folks.
[20:45:29] <Jdlrobson>	 JSherman: on it
[20:47:05] <dancy>	 Jdlrobson: By the way, yesterday when I was doing a backport I got a warning about a change of yours that had been merged but not deployed.  Please make sure to fully scap backport beta-only config changes.  Scap is smart enough to not do a full production for beta-only changes.  Leaving a merged change undeployed is confusing for whoever deploys after you.
[20:47:27] <dancy>	 *full production deployment.
[20:48:00] <Jdlrobson>	 dancy: which change, sorry? I didn't merge anything yesterday (I don't have deploy rights)
[20:48:35] <dancy>	 lemme dig it up
[20:48:57] <Jdlrobson>	 JSherman: unfortunately it looks like there is a problem with this patch so it should be cancelled.
[20:49:08] <JSherman>	 Jdlrobson: ack
[20:49:12] <logmsgbot>	 !log jsn@deploy1002 Sync cancelled.
[20:49:54] <dancy>	 Jdlrobson: It was https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1036720
[20:50:39] <JSherman>	 Jdlrobson: reverting
[20:50:52] <dancy>	 Jdlrobson: Thanks for the info.  I'll remind the +2'er
[20:52:24] <JSherman>	 dancy: on the revert, it looks like I need to fix my git config on the deployment host; can I just ctrl-c out of the scap revert to fix it?
[20:52:46] <dancy>	 yes, it's always safe to control-c scap
[20:53:14] <JSherman>	 excellent; scap has been awesome in my experience (of about 3 weeks)
[20:54:06] <dancy>	 But, depending on when you control-c, a change may be partially deployed, so there may be some action that needs to be taken to get to a consistent state (such as re-running, or backporting something else with a fix, etc.).
[20:54:24] <dancy>	 Glad you like it!
[20:55:56] <Jdlrobson>	 dancy: ack.
[20:56:14] <Jdlrobson>	 JSherman: sorry about the need for the cancel; that was unexpected :(
[20:56:26] <wikibugs>	 (03CR) 10CI reject: [V:04-1] POC: Wrap tables with JS [skins/Vector] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1037131 (owner: 10Bernard Wang)
[20:57:31] <cdanis>	 JSherman: are you done deploying?
[20:57:51] <JSherman>	 cdanis: I'm muddling my way through a revert currently
[20:57:56] <cdanis>	 ah okay, npnp
[20:58:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T364299)', diff saved to https://phabricator.wikimedia.org/P63608 and previous config saved to /var/cache/conftool/dbconfig/20240529-205813-marostegui.json
[20:58:19] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[21:00:05] <jouncebot>	 Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240529T2100)
[21:00:33] <JSherman>	 dancy: It's asking me to do gerrit https user/password authentication for the revert. Should I be forwarding my gerrit ssh key etc?
[21:00:49] <dancy>	 hmm.. this is when using `scap backport --revert ..` ?
[21:01:27] <JSherman>	 yep
[21:02:07] <JSherman>	 ```
[21:02:07] <JSherman>	 jsn@deploy1002:/srv/mediawiki-staging$ scap backport --revert 1036664
[21:02:07] <JSherman>	 21:00:54 Checking whether changes are in a branch and version deployed to production...
[21:02:07] <JSherman>	 21:00:54 Reverting 1 change(s)
[21:02:07] <JSherman>	 Already on 'wmf/1.43.0-wmf.7'
[21:02:08] <JSherman>	 Your branch is ahead of 'origin/wmf/1.43.0-wmf.7' by 1 commit.
[21:02:08] <JSherman>	 ```
[21:02:33] <dancy>	 Hmm. That's a bug.  Please file a phab ticket w/ the transcript and we'll fix it.    In the meantime you'll need to create the revert commit some other way (e.g, using the Gerrit UI).
[21:02:58] <JSherman>	 dancy: wilco; thanks!
[21:05:16] <JSherman>	 dancy: just to verify: I should do a revert after cancelling a sync at the test step, yes?
[21:05:32] <dancy>	 yes.
[21:05:45] <JSherman>	 good deal; ty
[21:06:20] <wikibugs>	 (03PS1) 10Jsn.sherman: Revert "feature(Popups): Conditional User Defaults Implementation" [extensions/Popups] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1037132
[21:06:36] <dancy>	 and since the broken change never made it past testservers, you could cancel the deployment of the revert after testservers.
[21:06:54] <JSherman>	 dancy: just what I was about to ask!
[21:07:03] <dancy>	 (if you know that no deployments happened in between)
[21:07:19] <JSherman>	 okay, so I should be able to just scap deploy the revert
[21:07:26] <dancy>	 nod
[21:08:43] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy1002 using scap backport" [extensions/Popups] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1037132 (owner: 10Jsn.sherman)
[21:13:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P63609 and previous config saved to /var/cache/conftool/dbconfig/20240529-211321-marostegui.json
[21:14:39] <wikibugs>	 (03PS2) 10Scott French: toolhub: ensure all containers have securityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037165 (https://phabricator.wikimedia.org/T362978)
[21:18:29] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "feature(Popups): Conditional User Defaults Implementation" [extensions/Popups] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1037132 (owner: 10Jsn.sherman)
[21:19:00] <logmsgbot>	 !log jsn@deploy1002 Started scap: Backport for [[gerrit:1037132|Revert "feature(Popups): Conditional User Defaults Implementation"]]
[21:21:34] <logmsgbot>	 !log jsn@deploy1002 jsn: Backport for [[gerrit:1037132|Revert "feature(Popups): Conditional User Defaults Implementation"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:21:40] <logmsgbot>	 !log jsn@deploy1002 Sync cancelled.
[21:21:43] <JSherman>	 Jdlrobson: you should be reverted
[21:28:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P63610 and previous config saved to /var/cache/conftool/dbconfig/20240529-212830-marostegui.json
[21:31:33] <JSherman>	 cdanis: you should be good to go btw
[21:37:41] <JSherman>	 dancy: created a phab task at https://phabricator.wikimedia.org/T366217
[21:37:50] <dancy>	 Thanks!
[21:38:29] <wikibugs>	 (03PS1) 10CDanis: freshen hardcoded IDP addresses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037179 (https://phabricator.wikimedia.org/T365855)
[21:38:52] <wikibugs>	 (03PS1) 10RLazarus: Fix tests for Python 3.8+ [software/httpbb] - 10https://gerrit.wikimedia.org/r/1037180
[21:40:31] <wikibugs>	 (03CR) 10CDanis: [C:03+2] freshen hardcoded IDP addresses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037179 (https://phabricator.wikimedia.org/T365855) (owner: 10CDanis)
[21:41:18] <wikibugs>	 (03Merged) 10jenkins-bot: freshen hardcoded IDP addresses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037179 (https://phabricator.wikimedia.org/T365855) (owner: 10CDanis)
[21:41:58] <logmsgbot>	 !log cdanis@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[21:42:31] <logmsgbot>	 !log cdanis@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[21:43:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T364299)', diff saved to https://phabricator.wikimedia.org/P63611 and previous config saved to /var/cache/conftool/dbconfig/20240529-214338-marostegui.json
[21:43:41] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2198.codfw.wmnet with reason: Maintenance
[21:43:44] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[21:43:54] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2198.codfw.wmnet with reason: Maintenance
[21:44:57] <wikibugs>	 (03PS2) 10CDanis: jaeger: link to Mediawiki debug Logstash [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035829 (https://phabricator.wikimedia.org/T320549)
[21:45:43] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:46:17] <wikibugs>	 (03CR) 10CDanis: [C:03+2] jaeger: link to Mediawiki debug Logstash (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035829 (https://phabricator.wikimedia.org/T320549) (owner: 10CDanis)
[21:47:10] <wikibugs>	 (03Merged) 10jenkins-bot: jaeger: link to Mediawiki debug Logstash [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035829 (https://phabricator.wikimedia.org/T320549) (owner: 10CDanis)
[21:47:23] <logmsgbot>	 !log cdanis@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[21:47:27] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] "As pointed out by elukey on the linked ticket, we don't install systemd-coredump. There is one single system here https://debmonitor.wikim" [puppet] - 10https://gerrit.wikimedia.org/r/1029235 (https://phabricator.wikimedia.org/T236253) (owner: 10Ahmon Dancy)
[21:47:59] <logmsgbot>	 !log cdanis@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[21:54:40] <wikibugs>	 (03CR) 10Ahmon Dancy: [C:03+2] "Thanks!" [software/httpbb] - 10https://gerrit.wikimedia.org/r/1037180 (owner: 10RLazarus)
[21:56:20] <wikibugs>	 (03Merged) 10jenkins-bot: Fix tests for Python 3.8+ [software/httpbb] - 10https://gerrit.wikimedia.org/r/1037180 (owner: 10RLazarus)
[21:57:23] <wikibugs>	 (03PS2) 10Ahmon Dancy: Add more junk to .gitignore [software/httpbb] - 10https://gerrit.wikimedia.org/r/1037158
[21:57:23] <wikibugs>	 (03PS2) 10Ahmon Dancy: Make header expected/got failure output multiline for easier human viewing [software/httpbb] - 10https://gerrit.wikimedia.org/r/1037156
[22:00:19] <wikibugs>	 (03CR) 10RLazarus: [C:03+2] Add more junk to .gitignore [software/httpbb] - 10https://gerrit.wikimedia.org/r/1037158 (owner: 10Ahmon Dancy)
[22:00:29] <wikibugs>	 (03CR) 10RLazarus: [C:03+2] "Thanks for this!" [software/httpbb] - 10https://gerrit.wikimedia.org/r/1037156 (owner: 10Ahmon Dancy)
[22:00:38] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main1009.eqiad.wmnet with OS bullseye
[22:00:47] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9844395 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host kafka-main1009.eqiad.wmnet with OS bullseye
[22:01:47] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main1010.eqiad.wmnet with OS bullseye
[22:01:54] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9844410 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host kafka-main1010.eqiad.wmnet with OS bullseye
[22:02:58] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main1009.eqiad.wmnet with reason: host reimage
[22:03:41] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main1010.eqiad.wmnet with reason: host reimage
[22:04:59] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[22:05:01] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main1009.eqiad.wmnet with OS bullseye
[22:05:07] <logmsgbot>	 !log jclark@cumin1002 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 2:00:00 on kafka-main1010.eqiad.wmnet with reason: host reimage
[22:05:08] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9844420 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host kafka-main1009.eqiad.wmnet with OS bullseye completed...
[22:06:24] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main1009.eqiad.wmnet with reason: host reimage
[22:07:38] <wikibugs>	 (03Merged) 10jenkins-bot: Add more junk to .gitignore [software/httpbb] - 10https://gerrit.wikimedia.org/r/1037158 (owner: 10Ahmon Dancy)
[22:07:39] <wikibugs>	 (03Merged) 10jenkins-bot: Make header expected/got failure output multiline for easier human viewing [software/httpbb] - 10https://gerrit.wikimedia.org/r/1037156 (owner: 10Ahmon Dancy)
[22:09:40] <icinga-wm_>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:09:40] <icinga-wm_>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:10:02] <icinga-wm_>	 PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:11:02] <wikibugs>	 (03PS1) 10RLazarus: Release v0.0.4. [software/httpbb] - 10https://gerrit.wikimedia.org/r/1037186
[22:13:40] <wikibugs>	 (03CR) 10RLazarus: [C:03+2] Release v0.0.4. [software/httpbb] - 10https://gerrit.wikimedia.org/r/1037186 (owner: 10RLazarus)
[22:15:16] <wikibugs>	 (03Merged) 10jenkins-bot: Release v0.0.4. [software/httpbb] - 10https://gerrit.wikimedia.org/r/1037186 (owner: 10RLazarus)
[22:16:56] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main1009.eqiad.wmnet with OS bullseye
[22:17:02] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9844458 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host kafka-main1009.eqiad.wmnet with OS bullseye completed...
[22:18:55] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9844468 (10Jclark-ctr)
[22:27:33] <rzl>	 dancy: lolsob, the test is failing in the debian build for bullseye -- I was moving too fast, it depends on the version of the jsonschema package, not the Python version 🙃
[22:27:49] <rzl>	 I'll get it untangled and release a new version properly, but if it doesn't happen before I turn into a pumpkin in 33 minutes, it'll be tomorrow
[22:28:40] <icinga-wm_>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:29:06] <icinga-wm_>	 RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:29:40] <icinga-wm_>	 RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:30:10] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Grant Access to analytics-privatedata-users, wmf for Sonja Perry - https://phabricator.wikimedia.org/T365766#9844495 (10colewhite)
[22:31:19] <dancy>	 rzl: good times. :-)
[22:31:36] <dancy>	 rzl: No rush.
[22:31:40] <rzl>	 👍
[22:32:31] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to wmf/analytics-privatedata-users for derenrich - https://phabricator.wikimedia.org/T365381#9844511 (10Ahoelzl) Approved.
[22:38:39] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2200.codfw.wmnet with reason: Maintenance
[22:38:52] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2200.codfw.wmnet with reason: Maintenance
[22:41:14] <wikibugs>	 (03Abandoned) 10Jdlrobson: POC: Wrap tables with JS [skins/Vector] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1037131 (owner: 10Bernard Wang)
[22:49:10] <icinga-wm_>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 190480992 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[22:50:10] <icinga-wm_>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 8344 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[22:52:51] <wikibugs>	 (03PS1) 10Jdlrobson: Popups setting should be string not integer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037189 (https://phabricator.wikimedia.org/T364347)
[22:52:57] <wikibugs>	 (03PS1) 10RLazarus: Really fix tests for jsonschema. [software/httpbb] - 10https://gerrit.wikimedia.org/r/1037190
[22:54:28] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Really fix tests for jsonschema. [software/httpbb] - 10https://gerrit.wikimedia.org/r/1037190 (owner: 10RLazarus)
[22:54:50] <rzl>	 that commit message was asking for it, I guess
[22:55:02] <swfrench-wmf>	 lol
[22:55:27] <wikibugs>	 (03PS2) 10RLazarus: Really fix tests for jsonschema. [software/httpbb] - 10https://gerrit.wikimedia.org/r/1037190
[22:56:03] <wikibugs>	 (03PS1) 10Jdlrobson: Revert "feature(Popups): Conditional User Defaults Implementation" [extensions/Popups] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1037133 (https://phabricator.wikimedia.org/T364347)
[22:56:29] <wikibugs>	 (03PS2) 10Jdlrobson: Revert "feature(Popups): Conditional User Defaults Implementation" [extensions/Popups] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1037133 (https://phabricator.wikimedia.org/T364347)
[22:56:35] <wikibugs>	 (03PS3) 10Jdlrobson: Revert "feature(Popups): Conditional User Defaults Implementation" [extensions/Popups] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1037133 (https://phabricator.wikimedia.org/T364347)
[22:56:50] <wikibugs>	 (03CR) 10Stoyofuku-wmf: [C:03+1] "😭" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037189 (https://phabricator.wikimedia.org/T364347) (owner: 10Jdlrobson)
[23:00:46] <wikibugs>	 (03Abandoned) 10Jdlrobson: Revert "feature(Popups): Conditional User Defaults Implementation" [extensions/Popups] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1037133 (https://phabricator.wikimedia.org/T364347) (owner: 10Jdlrobson)
[23:02:32] <wikibugs>	 06SRE, 06serviceops: k8s master capacity issues - https://phabricator.wikimedia.org/T366094#9844588 (10CDanis) >>! In T366094#9842327, @akosiaris wrote: > I am gonna disagree on this one. [This](https://grafana-rw.wikimedia.org/d/d304d897-54ea-4062-a504-6f2567ed7dba/t366094?orgId=1&from=1716910376624&to=171691...
[23:05:54] <wikibugs>	 (03PS1) 10Scott French: termbox: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037193 (https://phabricator.wikimedia.org/T362978)
[23:06:09] <wikibugs>	 (03PS1) 10Scott French: similar-users: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037194 (https://phabricator.wikimedia.org/T362978)
[23:06:23] <wikibugs>	 (03PS1) 10Scott French: kask: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037195 (https://phabricator.wikimedia.org/T362978)
[23:06:38] <wikibugs>	 (03PS1) 10Scott French: chromium-render: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037196 (https://phabricator.wikimedia.org/T362978)
[23:15:49] <wikibugs>	 (03CR) 10Jforrester: [C:03+1] Add a stream for tracking the API of WikiLambda [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1017962 (https://phabricator.wikimedia.org/T356228) (owner: 10David Martin)
[23:29:04] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2208.codfw.wmnet with reason: Maintenance
[23:29:17] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2208.codfw.wmnet with reason: Maintenance
[23:29:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2208 (T364299)', diff saved to https://phabricator.wikimedia.org/P63612 and previous config saved to /var/cache/conftool/dbconfig/20240529-232924-marostegui.json
[23:29:34] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[23:38:27] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1036600
[23:38:27] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1036600 (owner: 10TrainBranchBot)
[23:59:16] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1036600 (owner: 10TrainBranchBot)