[00:03:44] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1052444 (owner: TrainBranchBot)
[00:25:33] <icinga-wm>	 PROBLEM - BFD status on cr1-drmrs is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:25:33] <icinga-wm>	 PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:26:17] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:26:17] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:33:33] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.85 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:42:33] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.03 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:55:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T367856)', diff saved to https://phabricator.wikimedia.org/P65912 and previous config saved to /var/cache/conftool/dbconfig/20240708-005501-marostegui.json
[00:55:05] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[01:10:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P65913 and previous config saved to /var/cache/conftool/dbconfig/20240708-011008-marostegui.json
[01:25:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P65914 and previous config saved to /var/cache/conftool/dbconfig/20240708-012515-marostegui.json
[01:40:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T367856)', diff saved to https://phabricator.wikimedia.org/P65915 and previous config saved to /var/cache/conftool/dbconfig/20240708-014022-marostegui.json
[01:40:24] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1200.eqiad.wmnet with reason: Maintenance
[01:40:26] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[01:40:37] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1200.eqiad.wmnet with reason: Maintenance
[01:40:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1200 (T367856)', diff saved to https://phabricator.wikimedia.org/P65916 and previous config saved to /var/cache/conftool/dbconfig/20240708-014044-marostegui.json
[01:48:05] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on deploy1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[01:49:44] <jinxer-wm>	 FIRING: KeyholderUnarmed: 19 unarmed Keyholder key(s) on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[02:00:35] <jinxer-wm>	 FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:39:17] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:49:17] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:59:17] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:59:21] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:06:41] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: rsync-deployment_module.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:37:11] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[03:44:17] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:11:20] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 08 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1050596 (https://phabricator.wikimedia.org/T367134) (owner: TChin)
[04:37:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T367856)', diff saved to https://phabricator.wikimedia.org/P65917 and previous config saved to /var/cache/conftool/dbconfig/20240708-043738-marostegui.json
[04:37:42] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[04:52:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P65918 and previous config saved to /var/cache/conftool/dbconfig/20240708-045246-marostegui.json
[05:02:10] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db2123 to s5 master [puppet] - https://gerrit.wikimedia.org/r/1052446 (https://phabricator.wikimedia.org/T369478)
[05:02:55] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 25 hosts with reason: Primary switchover s5 T369478
[05:02:58] <stashbot>	 T369478: Switchover s5 master (db2213 -> db2123) - https://phabricator.wikimedia.org/T369478
[05:03:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2123 with weight 0 T369478', diff saved to https://phabricator.wikimedia.org/P65919 and previous config saved to /var/cache/conftool/dbconfig/20240708-050301-root.json
[05:03:16] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 25 hosts with reason: Primary switchover s5 T369478
[05:04:13] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Promote db2123 to s5 master [puppet] - https://gerrit.wikimedia.org/r/1052446 (https://phabricator.wikimedia.org/T369478) (owner: Gerrit maintenance bot)
[05:05:35] <jinxer-wm>	 FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:15:37] <icinga-wm>	 PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[05:16:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db2123 from dump/slow', diff saved to https://phabricator.wikimedia.org/P65920 and previous config saved to /var/cache/conftool/dbconfig/20240708-051605-marostegui.json
[05:16:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P65921 and previous config saved to /var/cache/conftool/dbconfig/20240708-051615-marostegui.json
[05:18:13] <marostegui>	 !log Starting s5 codfw failover from db2213 to db2123 - T369478
[05:18:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:18:16] <stashbot>	 T369478: Switchover s5 master (db2213 -> db2123) - https://phabricator.wikimedia.org/T369478
[05:18:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2123 to s5 primary T369478', diff saved to https://phabricator.wikimedia.org/P65922 and previous config saved to /var/cache/conftool/dbconfig/20240708-051840-root.json
[05:19:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2213 T369478', diff saved to https://phabricator.wikimedia.org/P65923 and previous config saved to /var/cache/conftool/dbconfig/20240708-051935-root.json
[05:20:39] <icinga-wm>	 RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[05:24:29] <wikibugs>	 (PS1) Marostegui: db2213: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1052447
[05:24:35] <marostegui>	 !log Deploy schema change on s5 codfw db2213 dbmaint T367856
[05:24:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:24:38] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[05:24:44] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2213.codfw.wmnet with reason: Long schema change
[05:24:46] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2213.codfw.wmnet with reason: Long schema change
[05:25:05] <wikibugs>	 (CR) Marostegui: [C:+2] db2213: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1052447 (owner: Marostegui)
[05:29:49] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 28306
[05:30:06] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 28306
[05:31:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T367856)', diff saved to https://phabricator.wikimedia.org/P65925 and previous config saved to /var/cache/conftool/dbconfig/20240708-053122-marostegui.json
[05:31:24] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1210.eqiad.wmnet with reason: Maintenance
[05:31:25] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[05:31:27] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1210.eqiad.wmnet with reason: Maintenance
[05:31:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1210 (T367856)', diff saved to https://phabricator.wikimedia.org/P65926 and previous config saved to /var/cache/conftool/dbconfig/20240708-053133-marostegui.json
[05:32:20] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 6447
[05:33:38] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 6447
[05:33:49] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 132167
[05:34:31] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 132167
[05:34:42] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 4788
[05:35:48] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 4788
[05:36:00] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 999
[05:36:09] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 999
[05:36:18] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 28352
[05:36:34] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 28352
[05:36:46] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 61672
[05:37:00] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 61672
[05:37:11] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 268248
[05:37:27] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 268248
[05:37:54] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 18013
[05:38:20] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 18013
[05:38:24] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 61942
[05:38:38] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 61942
[05:38:52] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 263522
[05:39:07] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 263522
[05:39:10] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 17072
[05:39:42] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 17072
[05:48:05] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on deploy1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[05:49:44] <jinxer-wm>	 FIRING: KeyholderUnarmed: 19 unarmed Keyholder key(s) on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[05:59:21] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 28008
[05:59:36] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 28008
[06:01:47] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 270052
[06:02:43] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 270052
[06:03:18] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 52468
[06:04:17] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:04:21] <jinxer-wm>	 FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:04:42] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 52468
[06:04:46] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 7738
[06:05:23] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 7738
[06:05:42] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 52320
[06:06:46] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 52320
[06:08:27] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 269783
[06:08:37] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 269783
[06:09:21] <jinxer-wm>	 RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:28] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 61512
[06:11:25] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 61512
[06:13:21] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 27768
[06:13:54] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:13:58] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 27768
[06:14:00] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:14:02] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 137409
[06:15:53] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 137409
[06:16:13] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 52455
[06:16:48] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 3.905 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:16:52] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52338 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:17:39] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 52455
[06:28:04] <wikibugs>	 SRE-Sprint-Week-Sustainability-March2023, conftool, Traffic, Patch-For-Review, Sustainability (Incident Followup): requestctl can't act on cache hits - https://phabricator.wikimedia.org/T317794#9959652 (Joe)
[07:00:05] <jouncebot>	 Amir1 and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240708T0700).
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:01:47] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 270052
[07:02:10] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 270052
[07:06:41] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: rsync-deployment_module.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:12:30] <wikibugs>	 (PS1) Slyngshede: Template: Fix missing success styling on logout. [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1052580
[07:16:31] <wikibugs>	 (CR) Hashar: [C:-1] gerrit: Add if statement for reason in PatchSetAbandoned (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1051134 (owner: Paladox)
[07:18:13] <wikibugs>	 (CR) Slyngshede: [V:+2 C:+1] Template: Fix missing success styling on logout. [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1052580 (owner: Slyngshede)
[07:18:19] <wikibugs>	 (CR) Slyngshede: [V:+2 C:+2] Template: Fix missing success styling on logout. [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/1052580 (owner: Slyngshede)
[07:36:36] <wikibugs>	 (PS2) Hashar: gerrit: enable built-in log rotation [puppet] - https://gerrit.wikimedia.org/r/1049090 (https://phabricator.wikimedia.org/T367505)
[07:37:12] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[07:38:56] <wikibugs>	 (CR) Hashar: [C:+1] "We have successfully upgraded to Gerrit 3.10 and can now configure it to handle the logrotation instead of using a home made systemd timer" [puppet] - https://gerrit.wikimedia.org/r/1049090 (https://phabricator.wikimedia.org/T367505) (owner: Hashar)
[07:47:48] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:47:48] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:48:12] <icinga-wm>	 RECOVERY - BFD status on cr1-drmrs is OK: UP: 3 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:48:22] <icinga-wm>	 RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:02:39] <wikibugs>	 (CR) Arnaudb: [C:+2] mariadb: recording rules to monitor [puppet] - https://gerrit.wikimedia.org/r/1050376 (https://phabricator.wikimedia.org/T367283) (owner: Arnaudb)
[08:16:58] <icinga-wm>	 PROBLEM - Disk space on mw1446 is CRITICAL: DISK CRITICAL - free space: / 1445 MB (0% inode=99%): /tmp 1445 MB (0% inode=99%): /var/tmp 1445 MB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw1446&var-datasource=eqiad+prometheus/ops
[08:26:56] <godog>	 !log re-enable business hours americas oncall - T369122
[08:26:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:59] <stashbot>	 T369122: On-call batphone escalation configuration holidays FY2024/25 - https://phabricator.wikimedia.org/T369122
[08:27:40] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: aptrepo: add new component thirdparty/kubeadm-k8s-1-25 [puppet] - https://gerrit.wikimedia.org/r/1052667 (https://phabricator.wikimedia.org/T369163)
[08:29:21] <wikibugs>	 (Abandoned) Arturo Borrero Gonzalez: aptrepro: enable thirdparty/kubeadm-k8s-1-24 for buster and bullseye [puppet] - https://gerrit.wikimedia.org/r/1010906 (https://phabricator.wikimedia.org/T359619) (owner: Arturo Borrero Gonzalez)
[08:31:07] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: aptrepo: add new component thirdparty/kubeadm-k8s-1-25 [puppet] - https://gerrit.wikimedia.org/r/1052667 (https://phabricator.wikimedia.org/T369163)
[08:31:52] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1052667 (https://phabricator.wikimedia.org/T369163) (owner: Arturo Borrero Gonzalez)
[08:33:07] <wikibugs>	 (PS3) Arturo Borrero Gonzalez: aptrepo: add new component thirdparty/kubeadm-k8s-1-25 [puppet] - https://gerrit.wikimedia.org/r/1052667 (https://phabricator.wikimedia.org/T369163)
[08:33:15] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1052667 (https://phabricator.wikimedia.org/T369163) (owner: Arturo Borrero Gonzalez)
[08:35:26] <icinga-wm>	 PROBLEM - BFD status on cr1-drmrs is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:35:28] <icinga-wm>	 PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:35:50] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:35:50] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:38:11] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C:+2] aptrepo: add new component thirdparty/kubeadm-k8s-1-25 [puppet] - https://gerrit.wikimedia.org/r/1052667 (https://phabricator.wikimedia.org/T369163) (owner: Arturo Borrero Gonzalez)
[08:42:21] <arturo>	 !log update packages for thirdparty/kubeadm-k8s-1-25 bookworm-wikimedia in apt1002 (T369163)
[08:42:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:24] <wikibugs>	 SRE, LDAP-Access-Requests: Grant access to wmf to lferreira - https://phabricator.wikimedia.org/T369348#9959948 (Lferreira) @Aklapper Done!
[08:42:27] <stashbot>	 T369163: toolforge: prepare deb packages for k8s 1.25 - https://phabricator.wikimedia.org/T369163
[08:44:48] <wikibugs>	 SRE, Data-Engineering, Dumps-Generation, Data Products (Data Products Sprint 16), and 2 others: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9959951 (elukey) Folks today I found snapshot1017 with puppet disable for mo...
[08:46:19] <wikibugs>	 SRE, Data-Engineering, Dumps-Generation, Data Products (Data Products Sprint 16), and 2 others: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9959956 (Marostegui) I don't think we should be leaving a host with puppet d...
[08:47:39] <wikibugs>	 (PS1) Marostegui: installserver: Allow pc2017 reimage [puppet] - https://gerrit.wikimedia.org/r/1052671 (https://phabricator.wikimedia.org/T368919)
[08:49:10] <wikibugs>	 (Abandoned) JMeybohm: Remove kubetcd200[4-6] from etcd SRV records [dns] - https://gerrit.wikimedia.org/r/1034447 (https://phabricator.wikimedia.org/T353464) (owner: JMeybohm)
[08:50:27] <Dreamy_Jazz>	 !log Restarting MediaModeration scanning script - https://wikitech.wikimedia.org/wiki/MediaModeration
[08:50:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:43] <Dreamy_Jazz>	 !log Running `foreachwikiindblist group1.dblist extensions/CheckUser/maintenance/deleteReadOldRowsInCuChanges.php --batch-size=200` in a tmux session
[08:51:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:52] <wikibugs>	 (CR) Marostegui: [C:+2] installserver: Allow pc2017 reimage [puppet] - https://gerrit.wikimedia.org/r/1052671 (https://phabricator.wikimedia.org/T368919) (owner: Marostegui)
[08:56:18] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw
[08:57:30] <icinga-wm>	 RECOVERY - BFD status on cr1-drmrs is OK: UP: 3 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:57:32] <icinga-wm>	 RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:57:52] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:57:52] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:59:29] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
[09:06:57] <Dreamy_Jazz>	 jouncebot: nowandnext
[09:06:57] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 53 minute(s)
[09:06:57] <jouncebot>	 In 0 hour(s) and 53 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240708T1000)
[09:07:04] <wikibugs>	 (PS2) Elukey: redfish: add the add_account function [software/spicerack] - https://gerrit.wikimedia.org/r/1052311 (https://phabricator.wikimedia.org/T365372)
[09:07:31] <wikibugs>	 (CR) Elukey: "Still testing if the code works on Dells" [software/spicerack] - https://gerrit.wikimedia.org/r/1052311 (https://phabricator.wikimedia.org/T365372) (owner: Elukey)
[09:08:23] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: aptrepo: introduce component thirdparty/k9s for bookworm-wikimedia [puppet] - https://gerrit.wikimedia.org/r/1052677 (https://phabricator.wikimedia.org/T366061)
[09:10:50] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C:+2] aptrepo: introduce component thirdparty/k9s for bookworm-wikimedia [puppet] - https://gerrit.wikimedia.org/r/1052677 (https://phabricator.wikimedia.org/T366061) (owner: Arturo Borrero Gonzalez)
[09:14:44] <wikibugs>	 SRE-swift-storage, Data-Persistence, MediaWiki-Uploading: Upload errors due to swift failures, 503s - https://phabricator.wikimedia.org/T369388#9959990 (MatthewVernon) p:Unbreak!→Medium
[09:16:18] <wikibugs>	 SRE-swift-storage, Data-Persistence, MediaWiki-Uploading: Upload errors due to swift failures, 503s - https://phabricator.wikimedia.org/T369388#9959995 (MatthewVernon) [this is likely related to T360913]
[09:17:31] <arturo>	 !log aborrero@apt1002:~$ sudo -i reprepro --component thirdparty/k9s includedeb bookworm-wikimedia /home/aborrero/k9s_linux_amd64.deb (T366061)
[09:17:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:33] <stashbot>	 T366061: [infra,k8s] package k9s for use in kubernetes - https://phabricator.wikimedia.org/T366061
[09:18:02] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: toolforge: k8s: control: install k9s [puppet] - https://gerrit.wikimedia.org/r/1052678 (https://phabricator.wikimedia.org/T366061)
[09:21:12] <wikibugs>	 (PS2) Hnowlan: shellbox-video: increase replicas, namespace resources [deployment-charts] - https://gerrit.wikimedia.org/r/1050375 (https://phabricator.wikimedia.org/T356241)
[09:21:37] <wikibugs>	 (CR) Volans: "Nice addition! One main comment inline." [software/spicerack] - https://gerrit.wikimedia.org/r/1052311 (https://phabricator.wikimedia.org/T365372) (owner: Elukey)
[09:22:40] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C:+2] toolforge: k8s: control: install k9s [puppet] - https://gerrit.wikimedia.org/r/1052678 (https://phabricator.wikimedia.org/T366061) (owner: Arturo Borrero Gonzalez)
[09:23:17] <wikibugs>	 (CR) Elukey: [C:+2] services: lower mesh's envoy concurrency to 8 for Wikifeeds [deployment-charts] - https://gerrit.wikimedia.org/r/1052262 (https://phabricator.wikimedia.org/T368238) (owner: Elukey)
[09:23:24] <wikibugs>	 (CR) Elukey: [C:+2] services: upgrade mesh's envoy Docker version [deployment-charts] - https://gerrit.wikimedia.org/r/1052263 (https://phabricator.wikimedia.org/T368366) (owner: Elukey)
[09:23:30] <wikibugs>	 (PS2) Elukey: services: upgrade mesh's envoy Docker version [deployment-charts] - https://gerrit.wikimedia.org/r/1052263 (https://phabricator.wikimedia.org/T368366)
[09:23:30] <wikibugs>	 (CR) CI reject: [V:-1] services: upgrade mesh's envoy Docker version [deployment-charts] - https://gerrit.wikimedia.org/r/1052263 (https://phabricator.wikimedia.org/T368366) (owner: Elukey)
[09:23:39] <wikibugs>	 (CR) Elukey: "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/1052263 (https://phabricator.wikimedia.org/T368366) (owner: Elukey)
[09:24:44] <wikibugs>	 (Merged) jenkins-bot: services: upgrade mesh's envoy Docker version [deployment-charts] - https://gerrit.wikimedia.org/r/1052263 (https://phabricator.wikimedia.org/T368366) (owner: Elukey)
[09:26:32] <icinga-wm>	 PROBLEM - Disk space on mw1445 is CRITICAL: DISK CRITICAL - free space: / 11869 MB (2% inode=99%): /tmp 11869 MB (2% inode=99%): /var/tmp 11869 MB (2% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw1445&var-datasource=eqiad+prometheus/ops
[09:31:36] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/wikifeeds: sync
[09:31:45] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifeeds: sync
[09:32:32] <logmsgbot>	 !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/wikifeeds: sync
[09:32:59] <logmsgbot>	 !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: sync
[09:36:23] <wikibugs>	 (03CR) 10Btullis: [C:03+2] cephcsi: Grant the provisioner access to the ceph userID secret [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052341 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis)
[09:38:34] <logmsgbot>	 !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifeeds: sync
[09:38:58] <logmsgbot>	 !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: sync
[09:39:53] <wikibugs>	 (03Merged) 10jenkins-bot: cephcsi: Grant the provisioner access to the ceph userID secret [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052341 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis)
[09:41:34] <logmsgbot>	 !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/api-gateway: sync
[09:41:38] <wikibugs>	 (03CR) 10Jelto: [C:03+2] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1051208 (owner: 10Ahmon Dancy)
[09:41:52] <logmsgbot>	 !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/api-gateway: sync
[09:42:58] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): "AFAICT this only implements half of T368632 – what about the Wikiproiektu namespace?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052372 (https://phabricator.wikimedia.org/T368632) (owner: 10GergesShamon)
[09:44:34] <logmsgbot>	 !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/rest-gateway: sync
[09:44:42] <logmsgbot>	 !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: sync
[09:46:01] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1049090 (https://phabricator.wikimedia.org/T367505) (owner: 10Hashar)
[09:48:05] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on deploy1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[09:49:34] <logmsgbot>	 !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/api-gateway: sync
[09:49:44] <jinxer-wm>	 FIRING: KeyholderUnarmed: 19 unarmed Keyholder key(s) on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[09:49:51] <logmsgbot>	 !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: sync
[09:50:17] <wikibugs>	 (03PS1) 10Filippo Giunchedi: clinic-duty: update equinix parsing [software] - 10https://gerrit.wikimedia.org/r/1052688
[09:50:27] <logmsgbot>	 !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/rest-gateway: sync
[09:50:33] <logmsgbot>	 !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: sync
[09:55:06] <logmsgbot>	 !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: sync
[09:58:15] <logmsgbot>	 !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: sync
[09:58:34] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.76 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240708T1000)
[10:00:20] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[10:00:51] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[10:02:49] <logmsgbot>	 !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: sync
[10:04:22] <wikibugs>	 (03PS2) 10Filippo Giunchedi: clinic-duty: update equinix parsing [software] - 10https://gerrit.wikimedia.org/r/1052688
[10:05:41] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] clinic-duty: update equinix parsing [software] - 10https://gerrit.wikimedia.org/r/1052688 (owner: 10Filippo Giunchedi)
[10:06:34] <logmsgbot>	 !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync
[10:06:38] <wikibugs>	 (03PS1) 10Elukey: role::deployment_server::kubernetes: update Envoy's version [puppet] - 10https://gerrit.wikimedia.org/r/1052691 (https://phabricator.wikimedia.org/T368366)
[10:08:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T367856)', diff saved to https://phabricator.wikimedia.org/P65927 and previous config saved to /var/cache/conftool/dbconfig/20240708-100804-marostegui.json
[10:08:07] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[10:10:35] <wikibugs>	 (03CR) 10Elukey: "My plan is to send an email to ops@ announcing the diff, so people will be able to rollout the new envoy version during next deployments (" [puppet] - 10https://gerrit.wikimedia.org/r/1052691 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey)
[10:15:34] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 20.99 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:23:11] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P65928 and previous config saved to /var/cache/conftool/dbconfig/20240708-102311-marostegui.json
[10:23:40] <wikibugs>	 (03PS1) 10Marostegui: Revert "db2213: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1052693
[10:23:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2213 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P65929 and previous config saved to /var/cache/conftool/dbconfig/20240708-102347-root.json
[10:24:06] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] Revert "db2213: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1052693 (owner: 10Marostegui)
[10:26:07] <wikibugs>	 (03PS2) 10GergesShamon: [euwiki] Enable Visual Editor in namespace Project [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052372 (https://phabricator.wikimedia.org/T368632)
[10:26:26] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: rsync-deployment_module.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:26:45] <wikibugs>	 (03CR) 10CI reject: [V:04-1] [euwiki] Enable Visual Editor in namespace Project [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052372 (https://phabricator.wikimedia.org/T368632) (owner: 10GergesShamon)
[10:27:44] <wikibugs>	 (03PS3) 10GergesShamon: [euwiki] Enable Visual Editor in namespace Project [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052372 (https://phabricator.wikimedia.org/T368632)
[10:29:09] <wikibugs>	 (03PS4) 10GergesShamon: [euwiki] Enable Visual Editor in namespaces Project and Wikiproiektu [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052372 (https://phabricator.wikimedia.org/T368632)
[10:31:09] <wikibugs>	 (03PS4) 10Paladox: gerrit: Add if statement for reason in PatchSetAbandoned [puppet] - 10https://gerrit.wikimedia.org/r/1051134
[10:31:50] <wikibugs>	 (03CR) 10Paladox: gerrit: Add if statement for reason in PatchSetAbandoned (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1051134 (owner: 10Paladox)
[10:32:29] <wikibugs>	 (03PS1) 10Btullis: Disable monitoring on clouddb1021 prior to decommissioning [puppet] - 10https://gerrit.wikimedia.org/r/1052696 (https://phabricator.wikimedia.org/T368518)
[10:33:10] <wikibugs>	 (03PS2) 10Btullis: Disable monitoring on clouddb1021 prior to decommissioning [puppet] - 10https://gerrit.wikimedia.org/r/1052696 (https://phabricator.wikimedia.org/T368518)
[10:33:56] <wikibugs>	 (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3172/console" [puppet] - 10https://gerrit.wikimedia.org/r/1052696 (https://phabricator.wikimedia.org/T368518) (owner: 10Btullis)
[10:35:11] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] Disable monitoring on clouddb1021 prior to decommissioning [puppet] - 10https://gerrit.wikimedia.org/r/1052696 (https://phabricator.wikimedia.org/T368518) (owner: 10Btullis)
[10:38:19] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P65930 and previous config saved to /var/cache/conftool/dbconfig/20240708-103818-marostegui.json
[10:38:37] <wikibugs>	 (03PS1) 10Lucas Werkmeister (WMDE): Add entity-schema to $wgWBRepoSettings['searchIndexTypes'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052699 (https://phabricator.wikimedia.org/T369495)
[10:38:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2213 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P65931 and previous config saved to /var/cache/conftool/dbconfig/20240708-103854-root.json
[10:39:36] <wikibugs>	 (03PS1) 10JMeybohm: aux: Add securityContext to istio components [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052700 (https://phabricator.wikimedia.org/T362978)
[10:39:40] <wikibugs>	 (03PS1) 10JMeybohm: dse: Add securityContext to istio components [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052701 (https://phabricator.wikimedia.org/T362978)
[10:39:44] <wikibugs>	 (03PS1) 10JMeybohm: ml: Add securityContext to istio components [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052702 (https://phabricator.wikimedia.org/T362978)
[10:40:39] <wikibugs>	 (03CR) 10Btullis: [V:03+1 C:03+2] Disable monitoring on clouddb1021 prior to decommissioning [puppet] - 10https://gerrit.wikimedia.org/r/1052696 (https://phabricator.wikimedia.org/T368518) (owner: 10Btullis)
[10:40:49] <wikibugs>	 (03CR) 10JMeybohm: "I'm not sure this is completely correct as the config structure differs from what we use on wikikube (and CNI is configured). So please do" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052702 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm)
[10:40:56] <wikibugs>	 (03CR) 10JMeybohm: "I'm not sure this is completely correct as the config structure differs from what we use on wikikube (and CNI is configured). So please do" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052701 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm)
[10:41:49] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 272432
[10:42:03] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 272432
[10:42:36] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 262476
[10:42:52] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 262476
[10:42:53] <wikibugs>	 (03PS2) 10JMeybohm: dse: Add securityContext to istio components [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052701 (https://phabricator.wikimedia.org/T362978)
[10:43:00] <wikibugs>	 (03PS2) 10JMeybohm: ml: Add securityContext to istio components [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052702 (https://phabricator.wikimedia.org/T362978)
[10:43:28] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 268248
[10:43:38] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 268248
[10:43:41] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 270359
[10:43:52] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 270359
[10:45:01] <fabfur>	 !log rebooting A:cp-esams (T366555)
[10:45:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:12] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-text_esams
[10:45:13] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-upload_esams
[10:52:44] <jinxer-wm>	 FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[10:53:17] <jynus>	 fabfur: expected?
[10:53:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T367856)', diff saved to https://phabricator.wikimedia.org/P65932 and previous config saved to /var/cache/conftool/dbconfig/20240708-105325-marostegui.json
[10:53:28] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1213.eqiad.wmnet with reason: Maintenance
[10:53:29] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[10:53:40] <jayme>	 !incidents
[10:53:40] <sirenbot>	 4840 (UNACKED)  HaproxyUnavailable cache_text global sre (thanos-rule)
[10:53:40] <sirenbot>	 4839 (RESOLVED)  ATSBackendErrorsHigh cache_text sre (phabricator.discovery.wmnet eqiad)
[10:53:41] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1213.eqiad.wmnet with reason: Maintenance
[10:53:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1213 (T367856)', diff saved to https://phabricator.wikimedia.org/P65933 and previous config saved to /var/cache/conftool/dbconfig/20240708-105348-marostegui.json
[10:53:49] <jayme>	 !ack 4840
[10:53:49] <sirenbot>	 4840 (ACKED)  HaproxyUnavailable cache_text global sre (thanos-rule)
[10:53:52] <jelto>	 it looks like the availability is recovering again
[10:54:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2213 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P65934 and previous config saved to /var/cache/conftool/dbconfig/20240708-105400-root.json
[10:54:01] <jayme>	 3min into lunch >D
[10:54:04] <wikibugs>	 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536#9960396 (10Lucas_Werkmeister_WMDE) {T355292} should probably be a subtask of this (or maybe a subtask of T321899)? At least I’ve been told th...
[10:54:37] <arnaudb>	 I just cut my thumb cooking :D
[10:54:45] <jayme>	 ouch
[10:55:12] <jayme>	 arnaudb: I think you can go plaster yourself ;)
[10:55:29] <arnaudb>	 oh it's done, it was just before the phone rang :D
[10:55:48] <logmsgbot>	 !log fabfur@cumin1002 cookbooks.sre.cdn.roll-reboot finished rebooting cp3074.esams.wmnet
[10:55:56] <logmsgbot>	 !log fabfur@cumin1002 cookbooks.sre.cdn.roll-reboot finished rebooting cp3066.esams.wmnet
[10:56:14] <jayme>	 ah, thought you got terrified by it ringing :)
[10:56:23] <wikibugs>	 06SRE, 10MoveComms-Support, 10MW-on-K8s, 06serviceops, and 2 others: Move 100% of external traffic to Kubernetes - https://phabricator.wikimedia.org/T362323#9960410 (10Clement_Goubert)
[10:56:25] <wikibugs>	 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536#9960411 (10Clement_Goubert)
[10:56:27] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: rsync-deployment_module.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:56:49] <arnaudb>	 a string of bad luck i'd say
[10:57:06] <arnaudb>	 we had a bump on CDN but it seems gone
[10:57:14] <jayme>	 hey fabfur - there was an availability blip during your cp reboots, could that be related?
[10:57:44] <jinxer-wm>	 RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[10:58:15] <claime>	 jayme: it's a bit early to get plastered though
[10:58:16] <claime>	 :p
[10:58:42] <jayme>	 claime: what took you so long?!
[10:59:42] <claime>	 Blame painkillers
[10:59:47] * claime is not touching production today
[11:00:03] <jayme>	 ok, fair. You're excused this time
[11:03:04] <Dreamy_Jazz>	 BTW currently running a script that is deleting a lot of rows from `cu_changes` and is currently on `s4`, which might explain the replication lag for `s4` cloud DBs.
[11:04:02] <Dreamy_Jazz>	 It seems to be resolved based on the log, so I am not going to stop my script at the moment.
[11:04:47] <marostegui>	 Dreamy_Jazz: they aren't lagging at the moment
[11:04:50] <marostegui>	 arnaudb: ^
[11:05:16] <Dreamy_Jazz>	 AFAIK `cu_changes` is excluded from the cloud DBs by the sanitarium hosts, but I presume that the deletion statements still need to be filtered somehow.
[11:06:48] <arnaudb>	 there was a bit of replag on clouddb Dreamy_Jazz but this is expected on that host, threshold tweaking is currently ongoing
[11:06:58] <Dreamy_Jazz>	 👍
[11:06:59] <arnaudb>	 unless you saw something hidden?
[11:07:05] <arnaudb>	 👀
[11:07:18] <Dreamy_Jazz>	 No, just was looking at the scroll-back and saw a replication lag alert
[11:07:25] <arnaudb>	 ack thanks
[11:09:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2213 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P65935 and previous config saved to /var/cache/conftool/dbconfig/20240708-110905-root.json
[11:09:23] <wikibugs>	 06SRE, 10MoveComms-Support, 10MW-on-K8s, 06serviceops, and 2 others: Move 100% of external traffic to Kubernetes - https://phabricator.wikimedia.org/T362323#9960448 (10Clement_Goubert)
[11:16:33] <wikibugs>	 06SRE, 10MoveComms-Support, 10MW-on-K8s, 06serviceops, and 2 others: Move 100% of external traffic to Kubernetes - https://phabricator.wikimedia.org/T362323#9960456 (10Clement_Goubert)
[11:20:25] <wikibugs>	 (03PS4) 10Jforrester: [wikifunctionswiki] Disable MobileFrontend in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010270 (https://phabricator.wikimedia.org/T349408)
[11:20:30] <wikibugs>	 (03PS5) 10Jforrester: [wikifunctionswiki] Disable MobileFrontend in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010270 (https://phabricator.wikimedia.org/T349408)
[11:20:37] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 08 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010270 (https://phabricator.wikimedia.org/T349408) (owner: 10Jforrester)
[11:20:39] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 08 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010270 (https://phabricator.wikimedia.org/T349408) (owner: 10Jforrester)
[11:21:19] <wikibugs>	 06SRE, 10MoveComms-Support, 10MW-on-K8s, 06serviceops, and 2 others: Move 100% of external traffic to Kubernetes - https://phabricator.wikimedia.org/T362323#9960472 (10Clement_Goubert) 05Open→03Resolved The work this task tracked is now completed. Remaining migrations {T352650}, {T355292}, {T355292...
[11:22:27] <wikibugs>	 (03PS2) 10Jforrester: wikifunctions: Raise CPU limit in orchestrator from 200m to 400m [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051813 (https://phabricator.wikimedia.org/T368892)
[11:22:39] <wikibugs>	 (03CR) 10Jforrester: [C:03+2] wikifunctions: Raise CPU limit in orchestrator from 200m to 400m [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051813 (https://phabricator.wikimedia.org/T368892) (owner: 10Jforrester)
[11:23:49] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Raise CPU limit in orchestrator from 200m to 400m [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051813 (https://phabricator.wikimedia.org/T368892) (owner: 10Jforrester)
[11:24:12] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2213 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P65936 and previous config saved to /var/cache/conftool/dbconfig/20240708-112411-root.json
[11:24:14] <logmsgbot>	 !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[11:24:16] <logmsgbot>	 !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[11:24:39] <logmsgbot>	 !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[11:24:42] <logmsgbot>	 !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[11:25:11] <logmsgbot>	 !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[11:25:28] <logmsgbot>	 !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[11:25:59] <logmsgbot>	 !log jforrester@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[11:26:49] <logmsgbot>	 !log jforrester@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[11:26:54] <logmsgbot>	 !log jforrester@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[11:27:41] <logmsgbot>	 !log jforrester@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[11:29:07] <wikibugs>	 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536#9960486 (10Clement_Goubert) 05Open→03In progress
[11:34:10] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance
[11:34:23] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance
[11:34:26] <wikibugs>	 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120#9960507 (10Clement_Goubert) 05In progress→03Resolved All internal traffic has been migrated.
[11:36:02] <wikibugs>	 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, 10Release-Engineering-Team (Seen): Deploy mediawiki kubernetes services - https://phabricator.wikimedia.org/T321786#9960518 (10Clement_Goubert)
[11:37:12] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[11:37:14] <icinga-wm>	 PROBLEM - BFD status on cr1-drmrs is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:37:14] <icinga-wm>	 PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[11:37:18] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[11:37:22] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:37:52] <wikibugs>	 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536#9960524 (10Clement_Goubert) >>! In T290536#9960396, @Lucas_Werkmeister_WMDE wrote: > {T355292} should probably be a subtask of this (or maybe...
[11:39:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2213 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P65937 and previous config saved to /var/cache/conftool/dbconfig/20240708-113917-root.json
[11:41:40] <wikibugs>	 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536#9960525 (10Clement_Goubert)
[11:42:10] <wikibugs>	 (03PS1) 10Phuedx: lib: Update metrics-platform to 84ed8dcbe7c9 [extensions/EventLogging] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1052711
[11:42:37] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 08 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/EventLogging] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1052711 (owner: 10Phuedx)
[11:46:02] <wikibugs>	 (03CR) 10Ayounsi: [V:03+1] "No rush at all. I'm fine deploying it in a few weeks as it's a small edge case of the full routed ganeti setup." [puppet] - 10https://gerrit.wikimedia.org/r/1052109 (https://phabricator.wikimedia.org/T362392) (owner: 10Ayounsi)
[11:47:30] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 262476
[11:47:51] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 262476
[11:54:04] <wikibugs>	 (03PS1) 10Btullis: Configure reuse-parts for an-mariadb servers [puppet] - 10https://gerrit.wikimedia.org/r/1052712 (https://phabricator.wikimedia.org/T365503)
[11:54:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2213 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P65938 and previous config saved to /var/cache/conftool/dbconfig/20240708-115422-root.json
[11:58:16] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Configure reuse-parts for an-mariadb servers [puppet] - 10https://gerrit.wikimedia.org/r/1052712 (https://phabricator.wikimedia.org/T365503) (owner: 10Btullis)
[12:00:28] <wikibugs>	 06SRE, 10MoveComms-Support, 10MW-on-K8s, 06serviceops, and 2 others: Move 100% of external traffic to Kubernetes - https://phabricator.wikimedia.org/T362323#9960571 (10Lucas_Werkmeister_WMDE) /me shakes fist at Phorge for not letting me award this task another token  🪙🪙🪙🪙🪙
[12:17:52] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for Uniquemia - https://phabricator.wikimedia.org/T369500 (10EUwandu-WMF) 03NEW
[12:19:02] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-mariadb1002.eqiad.wmnet with OS bookworm
[12:20:46] <wikibugs>	 (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052723
[12:27:30] <logmsgbot>	 !log btullis@deploy1002 Started deploy [airflow-dags/analytics@a2faba7]: (no justification provided)
[12:27:57] <logmsgbot>	 !log btullis@deploy1002 Finished deploy [airflow-dags/analytics@a2faba7]: (no justification provided) (duration: 00m 27s)
[12:28:13] <wikibugs>	 (03PS1) 10Slyngshede: data.yaml: Remove shell access for ezachte and chelsyx. [puppet] - 10https://gerrit.wikimedia.org/r/1052728
[12:29:29] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] role::deployment_server::kubernetes: update Envoy's version [puppet] - 10https://gerrit.wikimedia.org/r/1052691 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey)
[12:29:37] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 311.98 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:32:53] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-mariadb1002.eqiad.wmnet with reason: host reimage
[12:35:14] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-mariadb1002.eqiad.wmnet with reason: host reimage
[12:36:13] <logmsgbot>	 !log fabfur@cumin1002 cookbooks.sre.cdn.roll-reboot finished rebooting cp3075.esams.wmnet
[12:36:31] <logmsgbot>	 !log fabfur@cumin1002 cookbooks.sre.cdn.roll-reboot finished rebooting cp3067.esams.wmnet
[12:43:11] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Pool with small weight T365805', diff saved to https://phabricator.wikimedia.org/P65939 and previous config saved to /var/cache/conftool/dbconfig/20240708-124310-marostegui.json
[12:43:14] <stashbot>	 T365805: Test MariaDB 10.11 - https://phabricator.wikimedia.org/T365805
[12:44:19] <wikibugs>	 (03PS1) 10Filippo Giunchedi: pontoon: support puppet 7 / puppetserver and openstack API [puppet] - 10https://gerrit.wikimedia.org/r/1052730 (https://phabricator.wikimedia.org/T352640)
[12:44:21] <wikibugs>	 (03PS1) 10Filippo Giunchedi: pontoon: add frontend proxy capability to LB [puppet] - 10https://gerrit.wikimedia.org/r/1052731 (https://phabricator.wikimedia.org/T352640)
[12:44:43] <wikibugs>	 (03CR) 10CI reject: [V:04-1] pontoon: support puppet 7 / puppetserver and openstack API [puppet] - 10https://gerrit.wikimedia.org/r/1052730 (https://phabricator.wikimedia.org/T352640) (owner: 10Filippo Giunchedi)
[12:46:36] <wikibugs>	 (03CR) 10CDanis: [C:03+1] haproxy,hiera: Test bwlimit per url on cp4051 [puppet] - 10https://gerrit.wikimedia.org/r/1052064 (https://phabricator.wikimedia.org/T317799) (owner: 10Vgutierrez)
[12:47:48] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] haproxy,hiera: Test bwlimit per url on cp4051 [puppet] - 10https://gerrit.wikimedia.org/r/1052064 (https://phabricator.wikimedia.org/T317799) (owner: 10Vgutierrez)
[12:48:26] <vgutierrez>	 !log test bwlimit per url on cp4051 - T317799
[12:48:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:48:29] <stashbot>	 T317799: Rate limiting for hotlinked images - https://phabricator.wikimedia.org/T317799
[12:49:02] <wikibugs>	 (03CR) 10Elukey: [C:03+2] role::deployment_server::kubernetes: update Envoy's version [puppet] - 10https://gerrit.wikimedia.org/r/1052691 (https://phabricator.wikimedia.org/T368366) (owner: 10Elukey)
[12:49:29] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 08 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052165 (https://phabricator.wikimedia.org/T9496) (owner: 10Pppery)
[12:50:23] <wikibugs>	 (03CR) 10CDanis: [C:03+1] conftool/cli: add option to log actions with a reason string [software/conftool] - 10https://gerrit.wikimedia.org/r/1052307 (https://phabricator.wikimedia.org/T369366) (owner: 10Ssingh)
[12:51:48] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[12:51:54] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-mariadb1002.eqiad.wmnet with OS bookworm
[12:51:56] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[12:57:14] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] "There is a subtle difference here in terms of what Bird does with the information.  With the address%interface syntax that just adds the i" [puppet] - 10https://gerrit.wikimedia.org/r/1052109 (https://phabricator.wikimedia.org/T362392) (owner: 10Ayounsi)
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240708T1300).
[13:00:05] <jouncebot>	 Gerges, tchin, James_F, phuedx, and pppery: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:10] <Pppery>	 Here
[13:00:14] <Gerges>	 Hi
[13:01:42] <James_F>	 Hey.
[13:02:56] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[13:03:06] <tchin>	 hello
[13:03:09] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[13:03:10] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[13:03:26] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[13:03:34] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1165 (T367781)', diff saved to https://phabricator.wikimedia.org/P65940 and previous config saved to /var/cache/conftool/dbconfig/20240708-130333-arnaudb.json
[13:03:36] <stashbot>	 T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[13:04:42] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T367781)', diff saved to https://phabricator.wikimedia.org/P65941 and previous config saved to /var/cache/conftool/dbconfig/20240708-130441-arnaudb.json
[13:06:31] <icinga-wm>	 RECOVERY - Disk space on mw1445 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw1445&var-datasource=eqiad+prometheus/ops
[13:06:53] <fabfur>	 sorry jynus, apparently irccloud stopped alerting me about mentions
[13:11:21] <wikibugs>	 (03PS2) 10Filippo Giunchedi: pontoon: support puppet 7 / puppetserver and openstack API [puppet] - 10https://gerrit.wikimedia.org/r/1052730 (https://phabricator.wikimedia.org/T352640)
[13:11:21] <wikibugs>	 (03PS2) 10Filippo Giunchedi: pontoon: add frontend proxy capability to LB [puppet] - 10https://gerrit.wikimedia.org/r/1052731 (https://phabricator.wikimedia.org/T352640)
[13:11:42] <urbanecm>	 i missed the ping somehow
[13:11:46] <urbanecm>	 is anyone deploying?
[13:12:03] <James_F>	 Looks like no
[13:12:14] <urbanecm>	 let's get started then
[13:13:43] <urbanecm>	 hello Gerges, do we have a 👍 for the VE enabling from someone on the Editing team (as the VE maintainers)? AFAIK, they'd like to review before a deployment like this one happens. 
[13:14:06] <wikibugs>	 (03PS3) 10TChin: EventStreamConfig: Add hive ingestion defaults [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050596 (https://phabricator.wikimedia.org/T367134)
[13:14:13] <wikibugs>	 (03CR) 10Urbanecm: [C:03+2] EventStreamConfig: Add hive ingestion defaults [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050596 (https://phabricator.wikimedia.org/T367134) (owner: 10TChin)
[13:14:57] <wikibugs>	 (03CR) 10Urbanecm: [C:03+2] [wikifunctionswiki] Disable MobileFrontend in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010270 (https://phabricator.wikimedia.org/T349408) (owner: 10Jforrester)
[13:15:10] <wikibugs>	 (03Merged) 10jenkins-bot: EventStreamConfig: Add hive ingestion defaults [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050596 (https://phabricator.wikimedia.org/T367134) (owner: 10TChin)
[13:15:24] <wikibugs>	 (03CR) 10Urbanecm: [C:03+2] lib: Update metrics-platform to 84ed8dcbe7c9 [extensions/EventLogging] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1052711 (owner: 10Phuedx)
[13:15:45] <wikibugs>	 (03CR) 10CI reject: [V:04-1] pontoon: support puppet 7 / puppetserver and openstack API [puppet] - 10https://gerrit.wikimedia.org/r/1052730 (https://phabricator.wikimedia.org/T352640) (owner: 10Filippo Giunchedi)
[13:15:48] <wikibugs>	 (03Merged) 10jenkins-bot: [wikifunctionswiki] Disable MobileFrontend in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1010270 (https://phabricator.wikimedia.org/T349408) (owner: 10Jforrester)
[13:16:24] <wikibugs>	 (03PS1) 10Vgutierrez: hiera: Fix cp4051 bwlimit configuration [puppet] - 10https://gerrit.wikimedia.org/r/1052736 (https://phabricator.wikimedia.org/T317799)
[13:16:25] <urbanecm>	 Pppery: would you mind securing a +1 on  https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1052165, please? 
[13:16:59] <icinga-wm>	 RECOVERY - Disk space on mw1446 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw1446&var-datasource=eqiad+prometheus/ops
[13:17:10] <logmsgbot>	 !log urbanecm@deploy1002 Started scap sync-world: Backport for [[gerrit:1050596|EventStreamConfig: Add hive ingestion defaults (T367134)]], [[gerrit:1010270|[wikifunctionswiki] Disable MobileFrontend in production (T349408)]]
[13:17:14] <stashbot>	 T367134: [Refine Refactoring] Integrate Refine workflow configuration into ESC - https://phabricator.wikimedia.org/T367134
[13:17:15] <stashbot>	 T349408: Use responsive Vector-2022 instead of Minerva for Wikifunctions Mobile and drop the secondary domain/MobileFrontend part - https://phabricator.wikimedia.org/T349408
[13:17:20] <wikibugs>	 (03CR) 10CDanis: [C:03+1] hiera: Fix cp4051 bwlimit configuration [puppet] - 10https://gerrit.wikimedia.org/r/1052736 (https://phabricator.wikimedia.org/T317799) (owner: 10Vgutierrez)
[13:17:41] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] hiera: Fix cp4051 bwlimit configuration [puppet] - 10https://gerrit.wikimedia.org/r/1052736 (https://phabricator.wikimedia.org/T317799) (owner: 10Vgutierrez)
[13:19:25] <urbanecm>	 actually... phuedx doesn't appear to be around, removing the +2
[13:19:49] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P65942 and previous config saved to /var/cache/conftool/dbconfig/20240708-131948-arnaudb.json
[13:20:35] <urbanecm>	 sent them a slack message, they're joining
[13:20:38] <wikibugs>	 (03CR) 10Urbanecm: [C:03+2] lib: Update metrics-platform to 84ed8dcbe7c9 [extensions/EventLogging] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1052711 (owner: 10Phuedx)
[13:20:58] <Lucas_WMDE>	 oops, I also totally missed the ping
[13:21:09] * Lucas_WMDE lets urbanecm deploy
[13:21:20] <phuedx>	 o/
[13:21:22] <urbanecm>	 hi phuedx!
[13:21:31] <urbanecm>	 waiting on CI currently
[13:22:53] <wikibugs>	 (03PS1) 10Effie Mouzeli: memcached: enable extstore to eqiad only [puppet] - 10https://gerrit.wikimedia.org/r/1052739 (https://phabricator.wikimedia.org/T352885)
[13:23:16] <wikibugs>	 (03CR) 10CI reject: [V:04-1] memcached: enable extstore to eqiad only [puppet] - 10https://gerrit.wikimedia.org/r/1052739 (https://phabricator.wikimedia.org/T352885) (owner: 10Effie Mouzeli)
[13:24:02] <wikibugs>	 (03PS2) 10Effie Mouzeli: memcached: enable extstore to eqiad only [puppet] - 10https://gerrit.wikimedia.org/r/1052739 (https://phabricator.wikimedia.org/T352885)
[13:24:25] <wikibugs>	 (03CR) 10Ssingh: "Thanks for the review folks!" [software/conftool] - 10https://gerrit.wikimedia.org/r/1052307 (https://phabricator.wikimedia.org/T369366) (owner: 10Ssingh)
[13:24:26] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] conftool/cli: add option to log actions with a reason string [software/conftool] - 10https://gerrit.wikimedia.org/r/1052307 (https://phabricator.wikimedia.org/T369366) (owner: 10Ssingh)
[13:24:48] <wikibugs>	 (03PS3) 10Filippo Giunchedi: pontoon: support puppet 7 / puppetserver and openstack API [puppet] - 10https://gerrit.wikimedia.org/r/1052730 (https://phabricator.wikimedia.org/T352640)
[13:24:48] <wikibugs>	 (03PS3) 10Filippo Giunchedi: pontoon: add frontend proxy capability to LB [puppet] - 10https://gerrit.wikimedia.org/r/1052731 (https://phabricator.wikimedia.org/T352640)
[13:25:33] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Move asw-c-codfw and asw-d-codfw CR uplinks to Spine switches - https://phabricator.wikimedia.org/T366941#9960830 (10Papaul) @cmooney the 18th works for me thanks.
[13:26:58] <wikibugs>	 06SRE, 06Traffic, 13Patch-For-Review: Migrate DNS depooling of sites from operations/dns (git) to confctl - https://phabricator.wikimedia.org/T369366#9960835 (10ssingh)
[13:27:08] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade Management routers to 22.4R3-S2 - https://phabricator.wikimedia.org/T369504 (10cmooney) 03NEW p:05Triage→03Medium
[13:27:41] <urbanecm>	 Pppery: Gerges: reminding about my pings from above, can you take a look please?
[13:28:13] <Pppery>	 I saw that ping. Was thinking about who to add as reviewers, though.
[13:28:52] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] mariadb: add monitoring on io pressure for mariadb hosts [alerts] - 10https://gerrit.wikimedia.org/r/1049196 (https://phabricator.wikimedia.org/T367281) (owner: 10Arnaudb)
[13:29:12] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213 (T367856)', diff saved to https://phabricator.wikimedia.org/P65943 and previous config saved to /var/cache/conftool/dbconfig/20240708-132911-marostegui.json
[13:29:12] <wikibugs>	 (03CR) 10Arnaudb: [V:03+1 C:03+2] mariadb: add monitoring on io pressure for mariadb hosts [alerts] - 10https://gerrit.wikimedia.org/r/1049196 (https://phabricator.wikimedia.org/T367281) (owner: 10Arnaudb)
[13:29:15] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[13:29:18] <wikibugs>	 (03CR) 10CI reject: [V:04-1] pontoon: support puppet 7 / puppetserver and openstack API [puppet] - 10https://gerrit.wikimedia.org/r/1052730 (https://phabricator.wikimedia.org/T352640) (owner: 10Filippo Giunchedi)
[13:30:28] <wikibugs>	 (03Merged) 10jenkins-bot: mariadb: add monitoring on io pressure for mariadb hosts [alerts] - 10https://gerrit.wikimedia.org/r/1049196 (https://phabricator.wikimedia.org/T367281) (owner: 10Arnaudb)
[13:31:56] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster relforge: security update - bking@cumin2002 - T366555
[13:31:58] <logmsgbot>	 !log urbanecm@deploy1002 tchin, jforrester, urbanecm: Backport for [[gerrit:1050596|EventStreamConfig: Add hive ingestion defaults (T367134)]], [[gerrit:1010270|[wikifunctionswiki] Disable MobileFrontend in production (T349408)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:32:02] <urbanecm>	 finally
[13:32:03] <stashbot>	 T367134: [Refine Refactoring] Integrate Refine workflow configuration into ESC - https://phabricator.wikimedia.org/T367134
[13:32:03] <stashbot>	 T349408: Use responsive Vector-2022 instead of Minerva for Wikifunctions Mobile and drop the secondary domain/MobileFrontend part - https://phabricator.wikimedia.org/T349408
[13:32:17] <urbanecm>	 tchin: James_F: please take a look at the first two changes at mwdebug, if possible :)
[13:32:19] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster relforge: security update - bking@cumin2002 - T366555
[13:32:32] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster relforge: security update - bking@cumin2002 - T366555
[13:32:53] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052325 (https://phabricator.wikimedia.org/T356241) (owner: 10Kamila Součková)
[13:32:59] <James_F>	 urbanecm: LGTM.
[13:33:08] <urbanecm>	 ty
[13:34:04] <wikibugs>	 (03PS4) 10Filippo Giunchedi: pontoon: support puppet 7 / puppetserver and openstack API [puppet] - 10https://gerrit.wikimedia.org/r/1052730 (https://phabricator.wikimedia.org/T352640)
[13:34:04] <wikibugs>	 (03PS4) 10Filippo Giunchedi: pontoon: add frontend proxy capability to LB [puppet] - 10https://gerrit.wikimedia.org/r/1052731 (https://phabricator.wikimedia.org/T352640)
[13:34:56] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P65944 and previous config saved to /var/cache/conftool/dbconfig/20240708-133456-arnaudb.json
[13:35:01] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 236 threshold =0.15 breach: cluster_name: relforge-eqiad, status: red, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 236, active_shards: 236, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 236, delayed_unassigned_shards: 0, number_of_pending_tasks: 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 527, active_shards_percent_as_number: 50.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[13:35:01] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9400 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 7 threshold =0.15 breach: cluster_name: relforge-eqiad-small-alpha, status: yellow, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 9, active_shards: 9, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 7, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 56.25 https://wikitech.wikimedia.org/wiki/Search%23Administration
[13:35:05] <wikibugs>	 (03CR) 10Pppery: "Adding the author and approver of the original patch that added the functionality I'm fixing (https://gerrit.wikimedia.org/r/c/operations/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052165 (https://phabricator.wikimedia.org/T9496) (owner: 10Pppery)
[13:36:01] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9400 on relforge1003 is OK: OK - elasticsearch status relforge-eqiad-small-alpha: cluster_name: relforge-eqiad-small-alpha, status: green, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 9, active_shards: 16, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[13:36:33] <urbanecm>	 tchin: what about you?
[13:36:48] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1052728 (owner: 10Slyngshede)
[13:37:01] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on relforge1003 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: green, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 252, active_shards: 472, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[13:37:14] <wikibugs>	 (03CR) 10Marostegui: mediawiki: Start the table catalog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1043246 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup)
[13:37:37] <Pppery>	 I added some people as reviewers to my missing.php patch, but that probably won't take place during this backport window, so call it not done today and I am going to reschedule it for a later window
[13:38:24] <wikibugs>	 (03PS1) 10Effie Mouzeli: mw-mcrouter: rollout to eqiad mw-web [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052741 (https://phabricator.wikimedia.org/T346690)
[13:38:31] <wikibugs>	 (03CR) 10CI reject: [V:04-1] pontoon: support puppet 7 / puppetserver and openstack API [puppet] - 10https://gerrit.wikimedia.org/r/1052730 (https://phabricator.wikimedia.org/T352640) (owner: 10Filippo Giunchedi)
[13:38:39] <urbanecm>	 Pppery: thanks and sorry for the delay :). 
[13:38:58] <tchin>	 looks good
[13:39:03] <logmsgbot>	 !log urbanecm@deploy1002 tchin, jforrester, urbanecm: Continuing with sync
[13:39:06] <urbanecm>	 proceeding then, thanks
[13:39:15] <wikibugs>	 (03PS1) 10Ssingh: Release 3.0.2 [software/conftool] - 10https://gerrit.wikimedia.org/r/1052742
[13:40:03] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9400 on relforge1004 is CRITICAL: CRITICAL - elasticsearch inactive shards 16 threshold =0.15 breach: cluster_name: relforge-eqiad-small-alpha, status: red, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 16, delayed_unassigned_shards: 0, number_of_pending_tasks: 4, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 1436, active_shards_percent_as_number: 0.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[13:40:23] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on relforge1004 is CRITICAL: CRITICAL - elasticsearch inactive shards 336 threshold =0.15 breach: cluster_name: relforge-eqiad, status: red, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 136, active_shards: 136, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 334, delayed_unassigned_shards: 0, number_of_pending_tasks: 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 28.8135593220339 https://wikitech.wikimedia.org/wiki/Search%23Administration
[13:41:03] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9400 on relforge1004 is OK: OK - elasticsearch status relforge-eqiad-small-alpha: cluster_name: relforge-eqiad-small-alpha, status: green, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 9, active_shards: 16, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[13:41:03] <wikibugs>	 (03CR) 10Ssingh: "I am not sure about the conftool release cycle and if this warrants a new release or not so I will leave that to you. Please feel free to " [software/conftool] - 10https://gerrit.wikimedia.org/r/1052742 (owner: 10Ssingh)
[13:41:23] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on relforge1004 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: green, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 252, active_shards: 472, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[13:41:43] <icinga-wm>	 PROBLEM - Check unit status of push_cross_cluster_settings_9200 on relforge1003 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:42:02] <wikibugs>	 (03Merged) 10jenkins-bot: lib: Update metrics-platform to 84ed8dcbe7c9 [extensions/EventLogging] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1052711 (owner: 10Phuedx)
[13:42:07] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster relforge: security update - bking@cumin2002 - T366555
[13:42:32] <wikibugs>	 (03CR) 10Elukey: [C:03+1] aux: Add securityContext to istio components [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052700 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm)
[13:42:46] <wikibugs>	 (03CR) 10Elukey: [C:03+2] aux: Add securityContext to istio components [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052700 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm)
[13:43:29] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: ats: Route /api/ to /w/rest.php on mw-api-ext [puppet] - 10https://gerrit.wikimedia.org/r/1052745 (https://phabricator.wikimedia.org/T364400)
[13:44:05] <wikibugs>	 (03PS5) 10Filippo Giunchedi: pontoon: support puppet 7 / puppetserver and openstack API [puppet] - 10https://gerrit.wikimedia.org/r/1052730 (https://phabricator.wikimedia.org/T352640)
[13:44:06] <wikibugs>	 (03PS5) 10Filippo Giunchedi: pontoon: add frontend proxy capability to LB [puppet] - 10https://gerrit.wikimedia.org/r/1052731 (https://phabricator.wikimedia.org/T352640)
[13:44:19] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213', diff saved to https://phabricator.wikimedia.org/P65945 and previous config saved to /var/cache/conftool/dbconfig/20240708-134418-marostegui.json
[13:47:16] <wikibugs>	 (03CR) 10Ladsgroup: mediawiki: Start the table catalog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1043246 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup)
[13:47:48] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1050596|EventStreamConfig: Add hive ingestion defaults (T367134)]], [[gerrit:1010270|[wikifunctionswiki] Disable MobileFrontend in production (T349408)]] (duration: 30m 38s)
[13:47:52] <stashbot>	 T367134: [Refine Refactoring] Integrate Refine workflow configuration into ESC - https://phabricator.wikimedia.org/T367134
[13:47:53] <stashbot>	 T349408: Use responsive Vector-2022 instead of Minerva for Wikifunctions Mobile and drop the secondary domain/MobileFrontend part - https://phabricator.wikimedia.org/T349408
[13:47:56] <urbanecm>	 and synced!
[13:47:56] <wikibugs>	 (03CR) 10Marostegui: mediawiki: Start the table catalog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1043246 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup)
[13:48:15] <James_F>	 Whee.
[13:48:15] <logmsgbot>	 !log urbanecm@deploy1002 Started scap sync-world: Backport for [[gerrit:1052711|lib: Update metrics-platform to 84ed8dcbe7c9]]
[13:48:22] <urbanecm>	 continuing with the last one
[13:48:40] <wikibugs>	 (03CR) 10CI reject: [V:04-1] pontoon: support puppet 7 / puppetserver and openstack API [puppet] - 10https://gerrit.wikimedia.org/r/1052730 (https://phabricator.wikimedia.org/T352640) (owner: 10Filippo Giunchedi)
[13:48:53] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] "It looks good for now, maybe once we start using it, we'll notice stuff that needs changing to adapt more to our needs." [puppet] - 10https://gerrit.wikimedia.org/r/1043246 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup)
[13:50:03] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T367781)', diff saved to https://phabricator.wikimedia.org/P65946 and previous config saved to /var/cache/conftool/dbconfig/20240708-135002-arnaudb.json
[13:50:05] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[13:50:06] <stashbot>	 T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[13:50:18] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[13:50:25] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1168 (T367781)', diff saved to https://phabricator.wikimedia.org/P65947 and previous config saved to /var/cache/conftool/dbconfig/20240708-135024-arnaudb.json
[13:50:33] <logmsgbot>	 !log urbanecm@deploy1002 phuedx, urbanecm: Backport for [[gerrit:1052711|lib: Update metrics-platform to 84ed8dcbe7c9]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:50:44] <urbanecm>	 phuedx: can you take a look at mwdebug, please?
[13:50:55] <phuedx>	 urbanecm: On it
[13:51:33] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T367781)', diff saved to https://phabricator.wikimedia.org/P65948 and previous config saved to /var/cache/conftool/dbconfig/20240708-135132-arnaudb.json
[13:51:43] <icinga-wm>	 RECOVERY - Check unit status of push_cross_cluster_settings_9200 on relforge1003 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:52:03] <Gerges>	 urbanecm: How about https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1052372 ?
[13:52:26] <urbanecm>	 Gerges: i asked a question a couple of lines above, but did not receive a response :). 
[13:52:28] <urbanecm>	 let me repaste
[13:52:39] <urbanecm>	 15:13 <urbanecm> hello Gerges, do we have a 👍 for the VE enabling from someone on the Editing team (as the VE maintainers)? AFAIK, they'd like to review before a deployment like this one happens. 
[13:53:34] <phuedx>	 urbanecm: LGTM
[13:53:39] <urbanecm>	 thanks, proceeding
[13:53:41] <logmsgbot>	 !log urbanecm@deploy1002 phuedx, urbanecm: Continuing with sync
[13:53:51] <Gerges>	 So what do I do?
[13:54:08] <urbanecm>	 Gerges: do we have the plus one from Editing team, or not?
[13:55:19] <Gerges>	 Do I need to wait for the Editing team's review?
[13:55:20] <wikibugs>	 (03CR) 10Btullis: dse-k8s-services: Add net-new chart for Airflow (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking)
[13:55:42] <wikibugs>	 (03PS6) 10Filippo Giunchedi: pontoon: support puppet 7 / puppetserver and openstack API [puppet] - 10https://gerrit.wikimedia.org/r/1052730 (https://phabricator.wikimedia.org/T352640)
[13:55:42] <wikibugs>	 (03PS6) 10Filippo Giunchedi: pontoon: add frontend proxy capability to LB [puppet] - 10https://gerrit.wikimedia.org/r/1052731 (https://phabricator.wikimedia.org/T352640)
[13:56:37] <urbanecm>	 Gerges: if it didn't happen already, yes. it might be a good idea to ping them on the task (I can ask once I'm done with the deployment).
[13:57:11] <Gerges>	 Okay 
[13:58:51] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1052711|lib: Update metrics-platform to 84ed8dcbe7c9]] (duration: 10m 36s)
[13:58:56] <urbanecm>	 and synced
[13:58:59] <urbanecm>	 anything else?
[13:59:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213', diff saved to https://phabricator.wikimedia.org/P65949 and previous config saved to /var/cache/conftool/dbconfig/20240708-135925-marostegui.json
[13:59:56] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1184 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/1052748 (https://phabricator.wikimedia.org/T369514)
[14:00:01] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: wmnet: Update s1-master alias [dns] - 10https://gerrit.wikimedia.org/r/1052749 (https://phabricator.wikimedia.org/T369514)
[14:00:33] <urbanecm>	 Gerges: I commented on the task: https://phabricator.wikimedia.org/T368632#9961075. Let's see what they say. 
[14:01:11] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2203 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/1052751 (https://phabricator.wikimedia.org/T369515)
[14:05:43] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] mobileapps: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043107 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi)
[14:05:55] <wikibugs>	 (03PS3) 10Filippo Giunchedi: mobileapps: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043107 (https://phabricator.wikimedia.org/T320563)
[14:06:40] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P65950 and previous config saved to /var/cache/conftool/dbconfig/20240708-140640-arnaudb.json
[14:06:59] <icinga-wm>	 PROBLEM - Disk space on mw1446 is CRITICAL: DISK CRITICAL - free space: / 5095 MB (1% inode=99%): /tmp 5095 MB (1% inode=99%): /var/tmp 5095 MB (1% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw1446&var-datasource=eqiad+prometheus/ops
[14:09:50] <wikibugs>	 (03PS1) 10Btullis: Puppetize the disabling of the misc dumps on snapshot1017 [puppet] - 10https://gerrit.wikimedia.org/r/1052752 (https://phabricator.wikimedia.org/T368098)
[14:10:44] <wikibugs>	 (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3173/co" [puppet] - 10https://gerrit.wikimedia.org/r/1052752 (https://phabricator.wikimedia.org/T368098) (owner: 10Btullis)
[14:13:06] <claime>	 !log cleaning up old shellbox files on mw1446
[14:13:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213 (T367856)', diff saved to https://phabricator.wikimedia.org/P65951 and previous config saved to /var/cache/conftool/dbconfig/20240708-141432-marostegui.json
[14:14:35] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1216.eqiad.wmnet with reason: Maintenance
[14:14:36] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[14:14:48] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1216.eqiad.wmnet with reason: Maintenance
[14:16:53] <icinga-wm>	 RECOVERY - Disk space on mw1446 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw1446&var-datasource=eqiad+prometheus/ops
[14:16:58] <logmsgbot>	 !log fabfur@cumin1002 cookbooks.sre.cdn.roll-reboot finished rebooting cp3076.esams.wmnet
[14:17:07] <logmsgbot>	 !log fabfur@cumin1002 cookbooks.sre.cdn.roll-reboot finished rebooting cp3068.esams.wmnet
[14:17:24] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.reboot
[14:17:25] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.reboot (exit_code=99)
[14:17:56] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.reboot
[14:17:56] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.reboot (exit_code=99)
[14:18:03] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for xiaoxiao - https://phabricator.wikimedia.org/T369519 (10XiaoXiao-WMF) 03NEW
[14:18:37] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 24.36 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[14:20:13] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.reboot
[14:20:14] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.reboot (exit_code=99)
[14:20:25] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.reboot
[14:20:26] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.reboot (exit_code=99)
[14:20:42] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] mobileapps: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043107 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi)
[14:20:50] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.reboot
[14:20:50] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.reboot (exit_code=99)
[14:21:19] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.reboot
[14:21:19] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.reboot (exit_code=99)
[14:21:47] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P65952 and previous config saved to /var/cache/conftool/dbconfig/20240708-142147-arnaudb.json
[14:21:50] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.reboot
[14:21:50] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.reboot (exit_code=99)
[14:21:57] <logmsgbot>	 !log filippo@deploy1002 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[14:22:34] <logmsgbot>	 !log filippo@deploy1002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[14:22:35] <logmsgbot>	 !log filippo@deploy1002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[14:22:48] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: Update modules/README.md [deployment-charts] - 10https://gerrit.wikimedia.org/r/953553
[14:23:09] <logmsgbot>	 !log filippo@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[14:23:45] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "Revisit this. I've added a few more things and I'll take a look at some point into what sextant does to fix the incompatibility issues." [deployment-charts] - 10https://gerrit.wikimedia.org/r/953553 (owner: 10Alexandros Kosiaris)
[14:25:57] <wikibugs>	 (03CR) 10Herron: [V:03+1] "Thanks! Great points and agreed overall. I'm hoping to revisit this to see how the metrics behave in Pyrra today, and assuming we can leav" [puppet] - 10https://gerrit.wikimedia.org/r/1051439 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron)
[14:27:42] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.reboot
[14:27:43] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.reboot (exit_code=99)
[14:31:45] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1011.eqiad.wmnet
[14:34:01] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1011.eqiad.wmnet
[14:36:21] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2161 - https://phabricator.wikimedia.org/T369229#9961274 (10Jhancock.wm) request submitted with Dell. SR193625600. might have a spare on hand to get it up now. the SR will allow us to replace the spare. will lyk
[14:36:55] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T367781)', diff saved to https://phabricator.wikimedia.org/P65953 and previous config saved to /var/cache/conftool/dbconfig/20240708-143654-arnaudb.json
[14:36:56] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[14:36:58] <stashbot>	 T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[14:37:10] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[14:37:17] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1180 (T367781)', diff saved to https://phabricator.wikimedia.org/P65954 and previous config saved to /var/cache/conftool/dbconfig/20240708-143716-arnaudb.json
[14:37:37] <wikibugs>	 (03CR) 10Ebernhardson: [C:03+2] "seems reasonable, looks to already be applied in prod." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051358 (https://phabricator.wikimedia.org/T362310) (owner: 10Peter Fischer)
[14:38:48] <wikibugs>	 (03Merged) 10jenkins-bot: Search update pipeline: reduce client-side rate-limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051358 (https://phabricator.wikimedia.org/T362310) (owner: 10Peter Fischer)
[14:39:17] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:39:26] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T367781)', diff saved to https://phabricator.wikimedia.org/P65955 and previous config saved to /var/cache/conftool/dbconfig/20240708-143925-arnaudb.json
[14:42:30] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for Uniquemia - https://phabricator.wikimedia.org/T369500#9961302 (10fgiunchedi) Hello @EUwandu-WMF, I couldn't find the uniquemia account on wikitech, or at least one with `euwandu-ctr@wikimedia.org` as its email, what wikitech account should we be using? tha...
[14:43:50] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1011.eqiad.wmnet
[14:43:51] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cloudcephosd1011.eqiad.wmnet
[14:44:23] <wikibugs>	 (03PS1) 10TChin: EventStreamConfig: Enable hive ingestion for mediawiki.page-delete [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052762 (https://phabricator.wikimedia.org/T367134)
[14:46:36] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2161 - https://phabricator.wikimedia.org/T369229#9961346 (10Marostegui) Thank you!
[14:49:09] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for xiaoxiao - https://phabricator.wikimedia.org/T369519#9961374 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Hello @XiaoXiao-WMF; I've added you to the `wmf` LDAP group. I'm tentatively resolving the task, though please reopen if something is amiss
[14:49:59] <icinga-wm>	 PROBLEM - Disk space on mw1438 is CRITICAL: DISK CRITICAL - free space: / 10721 MB (2% inode=99%): /tmp 10721 MB (2% inode=99%): /var/tmp 10721 MB (2% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw1438&var-datasource=eqiad+prometheus/ops
[14:51:14] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] "This will shift around 100rps from mw-web to mw-api-ext. It shouldn't need a replica bump, but we should still keep an eye on latency a" [puppet] - 10https://gerrit.wikimedia.org/r/1052745 (https://phabricator.wikimedia.org/T364400) (owner: 10Alexandros Kosiaris)
[14:51:17] <wikibugs>	 (03CR) 10Ebernhardson: [C:03+1] cirrus: add cirrussearch-legacy-updater dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052135 (owner: 10DCausse)
[14:51:35] <claime>	 !log cleaning up old shellbox files on mw1438
[14:51:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:56] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host search-loader1002.eqiad.wmnet
[14:52:11] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host search-loader1002.eqiad.wmnet
[14:53:22] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host search-loader1002.eqiad.wmnet
[14:53:26] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host search-loader1002.eqiad.wmnet
[14:53:38] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Request access to servers Dcops group - https://phabricator.wikimedia.org/T360356#9961390 (10elukey)
[14:53:45] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host search-loader1002.eqiad.wmnet
[14:54:10] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] mw-mcrouter: rollout to eqiad mw-web [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052741 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli)
[14:54:33] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P65956 and previous config saved to /var/cache/conftool/dbconfig/20240708-145432-arnaudb.json
[14:56:37] <icinga-wm>	 RECOVERY - Disk space on mw1438 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw1438&var-datasource=eqiad+prometheus/ops
[14:57:21] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host search-loader1002.eqiad.wmnet
[14:59:07] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1011.eqiad.wmnet with OS bullseye
[14:59:17] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:02:42] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 08 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052285 (https://phabricator.wikimedia.org/T369342) (owner: 10NMW03)
[15:04:15] <jinxer-wm>	 FIRING: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[15:04:29] <wikibugs>	 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data Products (Data Products Sprint 16), and 2 others: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9961423 (10xcollazo) >>! In T368098#9959951, @elukey wrote: > Folks today I fo...
[15:07:31] <wikibugs>	 (03PS4) 10Ladsgroup: mediawiki: Start the table catalog [puppet] - 10https://gerrit.wikimedia.org/r/1043246 (https://phabricator.wikimedia.org/T363581)
[15:07:36] <wikibugs>	 (03CR) 10Ladsgroup: [V:03+2 C:03+2] mediawiki: Start the table catalog [puppet] - 10https://gerrit.wikimedia.org/r/1043246 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup)
[15:07:45] <wikibugs>	 (03CR) 10Btullis: [V:03+1 C:03+2] Puppetize the disabling of the misc dumps on snapshot1017 [puppet] - 10https://gerrit.wikimedia.org/r/1052752 (https://phabricator.wikimedia.org/T368098) (owner: 10Btullis)
[15:07:49] <wikibugs>	 (03CR) 10Ladsgroup: [V:03+2 C:03+2] mediawiki: Start the table catalog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1043246 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup)
[15:09:40] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P65957 and previous config saved to /var/cache/conftool/dbconfig/20240708-150939-arnaudb.json
[15:11:44] <jinxer-wm>	 FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[15:12:46] <wikibugs>	 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data Products (Data Products Sprint 16), and 2 others: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#9961448 (10xcollazo) >>! In T368098#9953045, @Ladsgroup wrote: >>>! In T368098...
[15:12:50] <wikibugs>	 (03PS1) 10Scott French: commons-impact-analytics: bump image to v1.0.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052765 (https://phabricator.wikimedia.org/T361835)
[15:13:36] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1011.eqiad.wmnet with reason: host reimage
[15:13:47] <wikibugs>	 (03CR) 10Volans: "Approach looks good to me, question inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1037806 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey)
[15:14:15] <jinxer-wm>	 RESOLVED: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[15:16:44] <jinxer-wm>	 RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[15:16:46] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1011.eqiad.wmnet with reason: host reimage
[15:20:43] <jinxer-wm>	 FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[15:21:05] <wikibugs>	 (03CR) 10Volans: [C:03+1] "makes sense to me (to be tested ;) )" [cookbooks] - 10https://gerrit.wikimedia.org/r/1050445 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi)
[15:22:23] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Bumping db1227 weight (T366852)', diff saved to https://phabricator.wikimedia.org/P65958 and previous config saved to /var/cache/conftool/dbconfig/20240708-152222-ladsgroup.json
[15:22:26] <stashbot>	 T366852: Discover and fix under-utilized replicas - https://phabricator.wikimedia.org/T366852
[15:24:16] <wikibugs>	 (03CR) 10Volans: "One concern inline" [software/homer] - 10https://gerrit.wikimedia.org/r/1050377 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi)
[15:24:47] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T367781)', diff saved to https://phabricator.wikimedia.org/P65959 and previous config saved to /var/cache/conftool/dbconfig/20240708-152446-arnaudb.json
[15:24:49] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1187.eqiad.wmnet with reason: Maintenance
[15:24:50] <stashbot>	 T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[15:25:02] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1187.eqiad.wmnet with reason: Maintenance
[15:25:09] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1187 (T367781)', diff saved to https://phabricator.wikimedia.org/P65960 and previous config saved to /var/cache/conftool/dbconfig/20240708-152508-arnaudb.json
[15:25:43] <jinxer-wm>	 RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[15:27:18] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T367781)', diff saved to https://phabricator.wikimedia.org/P65961 and previous config saved to /var/cache/conftool/dbconfig/20240708-152717-arnaudb.json
[15:30:04] <jouncebot>	 jan_drewniak: Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240708T1530). Please do the needful.
[15:30:48] <wikibugs>	 (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052766 (https://phabricator.wikimedia.org/T128546)
[15:34:19] <wikibugs>	 (03CR) 10Jdrewniak: [C:03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052766 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[15:34:57] <wikibugs>	 (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052766 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[15:36:31] <wikibugs>	 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: [spicerack] python-kafka does not support python 3.12, there's a fix but there has not been any releases since 2020 - https://phabricator.wikimedia.org/T354410#9961563 (10Volans) @elukey do you know how much of an effort would it be to change library ba...
[15:36:44] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM, question inline" [software/homer] - 10https://gerrit.wikimedia.org/r/1050262 (owner: 10Ayounsi)
[15:37:12] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[15:38:07] <wikibugs>	 (03CR) 10RLazarus: [C:03+1] commons-impact-analytics: bump image to v1.0.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052765 (https://phabricator.wikimedia.org/T361835) (owner: 10Scott French)
[15:38:14] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'.
[15:38:18] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'.
[15:39:08] <wikibugs>	 (03PS1) 10JHathaway: wikipedia.org spf: indicate mail is sent from this domain. [dns] - 10https://gerrit.wikimedia.org/r/1052768 (https://phabricator.wikimedia.org/T369341)
[15:41:53] <wikibugs>	 (03CR) 10Scott French: [C:03+2] commons-impact-analytics: bump image to v1.0.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052765 (https://phabricator.wikimedia.org/T361835) (owner: 10Scott French)
[15:42:25] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P65962 and previous config saved to /var/cache/conftool/dbconfig/20240708-154224-arnaudb.json
[15:42:25] <wikibugs>	 (03CR) 10EoghanGaffney: [C:03+1] "This covers a good number of the affected domains but there are some others, we can deal with them in a separate patch!" [dns] - 10https://gerrit.wikimedia.org/r/1052768 (https://phabricator.wikimedia.org/T369341) (owner: 10JHathaway)
[15:42:43] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] wikipedia.org spf: indicate mail is sent from this domain. [dns] - 10https://gerrit.wikimedia.org/r/1052768 (https://phabricator.wikimedia.org/T369341) (owner: 10JHathaway)
[15:42:58] <wikibugs>	 (03Merged) 10jenkins-bot: commons-impact-analytics: bump image to v1.0.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052765 (https://phabricator.wikimedia.org/T361835) (owner: 10Scott French)
[15:43:23] <wikibugs>	 (03PS1) 10Arnaudb: bashrc: change option on alias [puppet] - 10https://gerrit.wikimedia.org/r/1052769
[15:43:25] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] bashrc: change option on alias [puppet] - 10https://gerrit.wikimedia.org/r/1052769 (owner: 10Arnaudb)
[15:44:12] <wikibugs>	 (03CR) 10Dwisehaupt: "Yes, there should be a new endpoint to check. I brought it up with fr-tech last week before the US holiday and plan to have an answer soon" [puppet] - 10https://gerrit.wikimedia.org/r/1052062 (https://phabricator.wikimedia.org/T368114) (owner: 10Filippo Giunchedi)
[15:44:18] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[15:44:36] <logmsgbot>	 !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/commons-impact-analytics: apply
[15:44:49] <logmsgbot>	 !log jdrewniak@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:1046698| Bumping portals to master (T128546)]] (duration: 07m 54s)
[15:44:56] <stashbot>	 T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[15:45:01] <logmsgbot>	 !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/commons-impact-analytics: apply
[15:45:11] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[15:45:30] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[15:45:52] <logmsgbot>	 !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[15:46:48] <logmsgbot>	 !log pfischer@deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[15:47:29] <logmsgbot>	 !log pfischer@deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[15:48:53] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:51:18] <logmsgbot>	 !log jdrewniak@deploy1002 Synchronized portals: Wikimedia Portals Update: [[gerrit:1046698| Bumping portals to master (T128546)]] (duration: 06m 28s)
[15:51:21] <stashbot>	 T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[15:51:26] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:54:05] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:55:02] <wikibugs>	 06SRE, 06serviceops, 10Data Products (Data Products Sprint 16), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9961705 (10Scott_French) Thanks, @SGupta-WMF!  @mforns - The v1.0.1 image is n...
[15:57:19] <wikibugs>	 (03PS1) 10Ahmon Dancy: class scap::scripts: Drop logstash_checker.py, phase 1 [puppet] - 10https://gerrit.wikimedia.org/r/1052771
[15:57:19] <wikibugs>	 (03PS1) 10Ahmon Dancy: class scap::scripts: Drop logstash_checker.py, phase 2 [puppet] - 10https://gerrit.wikimedia.org/r/1052772
[15:57:32] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P65963 and previous config saved to /var/cache/conftool/dbconfig/20240708-155731-arnaudb.json
[15:57:42] <wikibugs>	 (03CR) 10CI reject: [V:04-1] class scap::scripts: Drop logstash_checker.py, phase 1 [puppet] - 10https://gerrit.wikimedia.org/r/1052771 (owner: 10Ahmon Dancy)
[15:57:43] <logmsgbot>	 !log fabfur@cumin1002 cookbooks.sre.cdn.roll-reboot finished rebooting cp3069.esams.wmnet
[15:57:54] <logmsgbot>	 !log fabfur@cumin1002 cookbooks.sre.cdn.roll-reboot finished rebooting cp3077.esams.wmnet
[15:58:53] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:59:13] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:59:44] <jinxer-wm>	 FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[16:00:14] <wikibugs>	 (03PS2) 10Ahmon Dancy: class scap::scripts: Drop logstash_checker.py, phase 1 [puppet] - 10https://gerrit.wikimedia.org/r/1052771
[16:00:14] <wikibugs>	 (03PS2) 10Ahmon Dancy: class scap::scripts: Drop logstash_checker.py, phase 2 [puppet] - 10https://gerrit.wikimedia.org/r/1052772
[16:01:26] <jinxer-wm>	 RESOLVED: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:02:02] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Move asw-c-codfw and asw-d-codfw CR uplinks to Spine switches - https://phabricator.wikimedia.org/T366941#9961749 (10cmooney)
[16:02:43] <wikibugs>	 (03CR) 10Ahmon Dancy: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1052772 (owner: 10Ahmon Dancy)
[16:03:18] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: mesh: Add faultinjection capabilities (c/p part) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052775
[16:03:18] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: mesh: Add faultinjection capabilities [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052776
[16:03:54] <wikibugs>	 (03CR) 10Ahmon Dancy: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1052771 (owner: 10Ahmon Dancy)
[16:04:43] <jinxer-wm>	 RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[16:05:23] <wikibugs>	 (03PS1) 10David Caro: cloudcephosd1011: update interface names [puppet] - 10https://gerrit.wikimedia.org/r/1052777 (https://phabricator.wikimedia.org/T309789)
[16:06:12] <wikibugs>	 (03CR) 10David Caro: [C:03+2] cloudcephosd1011: update interface names [puppet] - 10https://gerrit.wikimedia.org/r/1052777 (https://phabricator.wikimedia.org/T309789) (owner: 10David Caro)
[16:06:47] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977#9961762 (10cmooney)
[16:07:18] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#9961779 (10cmooney)
[16:08:41] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1011.eqiad.wmnet with OS bullseye
[16:09:11] <logmsgbot>	 !log dcaro@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1011.eqiad.wmnet
[16:09:21] <wikibugs>	 (03CR) 10Ahmon Dancy: "Not sure what's up with PCC" [puppet] - 10https://gerrit.wikimedia.org/r/1052771 (owner: 10Ahmon Dancy)
[16:10:43] <jinxer-wm>	 FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[16:12:40] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T367781)', diff saved to https://phabricator.wikimedia.org/P65964 and previous config saved to /var/cache/conftool/dbconfig/20240708-161238-arnaudb.json
[16:12:42] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1201.eqiad.wmnet with reason: Maintenance
[16:12:45] <stashbot>	 T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[16:12:55] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1201.eqiad.wmnet with reason: Maintenance
[16:13:02] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1201 (T367781)', diff saved to https://phabricator.wikimedia.org/P65965 and previous config saved to /var/cache/conftool/dbconfig/20240708-161302-arnaudb.json
[16:15:11] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T367781)', diff saved to https://phabricator.wikimedia.org/P65966 and previous config saved to /var/cache/conftool/dbconfig/20240708-161510-arnaudb.json
[16:15:19] <logmsgbot>	 !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1011.eqiad.wmnet
[16:15:35] <jinxer-wm>	 FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:15:44] <jinxer-wm>	 RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[16:20:43] <jinxer-wm>	 FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[16:25:43] <jinxer-wm>	 RESOLVED: [2x] OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[16:26:40] <Amir1>	 jouncebot: nowandnext
[16:26:40] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 33 minute(s)
[16:26:41] <jouncebot>	 In 0 hour(s) and 33 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240708T1700)
[16:26:41] <jouncebot>	 In 0 hour(s) and 33 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240708T1700)
[16:30:15] <wikibugs>	 (PS2) Ladsgroup: Reduce frequency of two query pages in commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1052058 (https://phabricator.wikimedia.org/T369024)
[16:30:18] <wikibugs>	 (CR) Ladsgroup: [C:+2] Reduce frequency of two query pages in commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1052058 (https://phabricator.wikimedia.org/T369024) (owner: Ladsgroup)
[16:30:18] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P65967 and previous config saved to /var/cache/conftool/dbconfig/20240708-163017-arnaudb.json
[16:30:48] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1052058 (https://phabricator.wikimedia.org/T369024) (owner: Ladsgroup)
[16:31:03] <wikibugs>	 (Merged) jenkins-bot: Reduce frequency of two query pages in commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1052058 (https://phabricator.wikimedia.org/T369024) (owner: Ladsgroup)
[16:31:18] <logmsgbot>	 !log ladsgroup@deploy1002 Started scap sync-world: Backport for [[gerrit:1052058|Reduce frequency of two query pages in commonswiki (T369024)]]
[16:31:27] <stashbot>	 T369024: SpecialUncategorizedPages slow query - https://phabricator.wikimedia.org/T369024
[16:33:35] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:1052058|Reduce frequency of two query pages in commonswiki (T369024)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[16:34:02] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup: Continuing with sync
[16:36:10] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmf for xiaoxiao - https://phabricator.wikimedia.org/T369519#9961950 (Aklapper)
[16:39:08] <logmsgbot>	 !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1052058|Reduce frequency of two query pages in commonswiki (T369024)]] (duration: 07m 50s)
[16:39:11] <stashbot>	 T369024: SpecialUncategorizedPages slow query - https://phabricator.wikimedia.org/T369024
[16:45:25] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P65968 and previous config saved to /var/cache/conftool/dbconfig/20240708-164524-arnaudb.json
[16:50:57] <wikibugs>	 (PS1) Ottomata: EventLoggingLegacyProxy - move endpoint to w/beacon/event.php [mediawiki-config] - https://gerrit.wikimedia.org/r/1052782 (https://phabricator.wikimedia.org/T353817)
[16:51:04] <wikibugs>	 ops-codfw, SRE, DBA, DC-Ops: Degraded RAID on db2161 - https://phabricator.wikimedia.org/T369229#9962007 (Jhancock.wm) no spare, but got confirmation that the replacement is ordered. Should be here very soon.
[16:51:56] <wikibugs>	 (CR) Ottomata: EventLoggingLegacyProxy - move endpoint to w/beacon/event.php (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1052782 (https://phabricator.wikimedia.org/T353817) (owner: Ottomata)
[16:54:47] <wikibugs>	 (PS1) Herron: wip [alerts] - https://gerrit.wikimedia.org/r/1052784
[16:55:55] <wikibugs>	 (PS2) Anzx: jawiki: add throttle rule for edit-a-thon July 11-18, 2024 [mediawiki-config] - https://gerrit.wikimedia.org/r/1052781 (https://phabricator.wikimedia.org/T369522)
[16:56:18] <wikibugs>	 (PS62) Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001)
[16:56:23] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 08 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1052781 (https://phabricator.wikimedia.org/T369522) (owner: Anzx)
[17:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240708T1700)
[17:00:04] <jouncebot>	 ryankemper: Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240708T1700). Please do the needful.
[17:00:32] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T367781)', diff saved to https://phabricator.wikimedia.org/P65969 and previous config saved to /var/cache/conftool/dbconfig/20240708-170031-arnaudb.json
[17:00:34] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1224.eqiad.wmnet with reason: Maintenance
[17:00:36] <stashbot>	 T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[17:00:47] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1224.eqiad.wmnet with reason: Maintenance
[17:00:54] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1224 (T367781)', diff saved to https://phabricator.wikimedia.org/P65970 and previous config saved to /var/cache/conftool/dbconfig/20240708-170053-arnaudb.json
[17:01:26] <jinxer-wm>	 FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:02:05] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:02:43] <jinxer-wm>	 FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[17:03:03] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T367781)', diff saved to https://phabricator.wikimedia.org/P65971 and previous config saved to /var/cache/conftool/dbconfig/20240708-170302-arnaudb.json
[17:07:43] <jinxer-wm>	 RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[17:14:17] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:14:35] <wikibugs>	 (CR) Scott French: [C:+1] Remove legacy appserver and api records [dns] - https://gerrit.wikimedia.org/r/1050304 (https://phabricator.wikimedia.org/T367949) (owner: Clément Goubert)
[17:15:22] <wikibugs>	 (CR) Scott French: [C:+1] service.yaml: Switch api and appserver to lvs_setup 1/3 [puppet] - https://gerrit.wikimedia.org/r/1050381 (https://phabricator.wikimedia.org/T367949) (owner: Clément Goubert)
[17:18:11] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P65972 and previous config saved to /var/cache/conftool/dbconfig/20240708-171810-arnaudb.json
[17:18:52] <wikibugs>	 (CR) Scott French: "From reading [0], it sounds like the `service::catalog` entries need to move to `service_setup` in this step as well (i.e., before the PyB" [puppet] - https://gerrit.wikimedia.org/r/1050382 (https://phabricator.wikimedia.org/T367949) (owner: Clément Goubert)
[17:23:12] <wikibugs>	 (CR) Scott French: [C:+1] Remove conftool-data and service catalog for legacy appservers 3/3 [puppet] - https://gerrit.wikimedia.org/r/1050383 (https://phabricator.wikimedia.org/T367949) (owner: Clément Goubert)
[17:23:20] <wikibugs>	 (CR) JHathaway: [C:+1] data.yaml: Remove shell access for ezachte and chelsyx. [puppet] - https://gerrit.wikimedia.org/r/1052728 (owner: Slyngshede)
[17:24:00] <wikibugs>	 (PS63) Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001)
[17:32:20] <wikibugs>	 (PS2) Herron: istio_sli_avail: alert if metric goes absent [alerts] - https://gerrit.wikimedia.org/r/1052784 (https://phabricator.wikimedia.org/T352756)
[17:33:17] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P65973 and previous config saved to /var/cache/conftool/dbconfig/20240708-173316-arnaudb.json
[17:34:44] <wikibugs>	 (CR) Hashar: [C:+1] "This can be merged anytime :)" [puppet] - https://gerrit.wikimedia.org/r/1049090 (https://phabricator.wikimedia.org/T367505) (owner: Hashar)
[17:34:45] <wikibugs>	 SRE, DNS, fundraising-tech-ops, Traffic, Patch-For-Review: Cleanup unused DNS subdomains - https://phabricator.wikimedia.org/T367012#9962213 (Dzahn) Thanks @AKanji-WMF Are you still using http://mandrillapp.com/ / MailChimp for fundraising emails with benefactors.wikimedia.org ?
[17:34:56] <wikibugs>	 (PS1) Ottomata: mediawiki.org - Apache rewrite /beacon/event -> /w/beacon/event.php [puppet] - https://gerrit.wikimedia.org/r/1052791 (https://phabricator.wikimedia.org/T353817)
[17:35:44] <jinxer-wm>	 FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[17:35:47] <wikibugs>	 (PS2) Ottomata: mediawiki.org - Apache rewrite /beacon/event -> /w/beacon/event.php [puppet] - https://gerrit.wikimedia.org/r/1052791 (https://phabricator.wikimedia.org/T353817)
[17:35:57] <wikibugs>	 (CR) Dzahn: [C:+2] gerrit: enable built-in log rotation [puppet] - https://gerrit.wikimedia.org/r/1049090 (https://phabricator.wikimedia.org/T367505) (owner: Hashar)
[17:37:21] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Infrastructure, and 2 others: Migrate puppet merges to a cookbook - https://phabricator.wikimedia.org/T366355#9962238 (jhathaway) I agree that decoupling makes sense and that it is worth the effort to try and run the current script on the puppets...
[17:38:21] <logmsgbot>	 !log fabfur@cumin1002 cookbooks.sre.cdn.roll-reboot finished rebooting cp3070.esams.wmnet
[17:38:44] <wikibugs>	 (CR) Bking: "Acknowledged" [deployment-charts] - https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) (owner: Bking)
[17:40:43] <jinxer-wm>	 RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[17:40:51] <logmsgbot>	 !log fabfur@cumin1002 cookbooks.sre.cdn.roll-reboot finished rebooting cp3078.esams.wmnet
[17:41:28] <wikibugs>	 ops-eqiad, SRE, DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T368766#9962325 (VRiley-WMF) @Eevans It did. I was planning on swapping the unit back. Is there a good time to proceed with this?
[17:41:37] <wikibugs>	 SRE, collaboration-services, DBA, Patch-For-Review: Update grants for mailman - https://phabricator.wikimedia.org/T367833#9962322 (Ladsgroup) Open→Resolved ^ dropped the user in production on m5.
[17:45:30] <wikibugs>	 (CR) Dzahn: [C:+2] "how are we handling the service restart and make sure it's not forgotten? We have removed the logrotation now so let's not run out of disk" [puppet] - https://gerrit.wikimedia.org/r/1049090 (https://phabricator.wikimedia.org/T367505) (owner: Hashar)
[17:48:24] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T367781)', diff saved to https://phabricator.wikimedia.org/P65974 and previous config saved to /var/cache/conftool/dbconfig/20240708-174823-arnaudb.json
[17:48:25] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1225.eqiad.wmnet with reason: Maintenance
[17:48:27] <stashbot>	 T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[17:48:39] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1225.eqiad.wmnet with reason: Maintenance
[17:48:59] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1231.eqiad.wmnet with reason: Maintenance
[17:49:12] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1231.eqiad.wmnet with reason: Maintenance
[17:49:19] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1231 (T367781)', diff saved to https://phabricator.wikimedia.org/P65975 and previous config saved to /var/cache/conftool/dbconfig/20240708-174918-arnaudb.json
[17:50:27] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T367781)', diff saved to https://phabricator.wikimedia.org/P65976 and previous config saved to /var/cache/conftool/dbconfig/20240708-175026-arnaudb.json
[17:50:53] <wikibugs>	 (PS1) JHathaway: wikipedia.org spf: add a comment [dns] - https://gerrit.wikimedia.org/r/1052792 (https://phabricator.wikimedia.org/T369341)
[17:51:36] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Infrastructure, and 2 others: Migrate puppet merges to a cookbook - https://phabricator.wikimedia.org/T366355#9962378 (CDanis) +1
[17:52:40] <wikibugs>	 (CR) CDanis: [C:+1] merge_cli: fix a puppet-merge.sh comment [puppet] - https://gerrit.wikimedia.org/r/1052260 (https://phabricator.wikimedia.org/T366355) (owner: Elukey)
[17:52:44] <jinxer-wm>	 FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[17:54:21] <wikibugs>	 (CR) JHathaway: [C:+1] MediaWiki: Allow Bitu to be used as a 2FA proxy. [software/bitu] - https://gerrit.wikimedia.org/r/1052085 (https://phabricator.wikimedia.org/T359551) (owner: Slyngshede)
[17:54:32] <wikibugs>	 (CR) JHathaway: [C:+2] wikipedia.org spf: add a comment [dns] - https://gerrit.wikimedia.org/r/1052792 (https://phabricator.wikimedia.org/T369341) (owner: JHathaway)
[17:56:26] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:56:55] <wikibugs>	 (CR) JHathaway: [C:+1] "looks good, root@wikimedia.org is also another option" [puppet] - https://gerrit.wikimedia.org/r/1051846 (owner: Dzahn)
[17:57:43] <jinxer-wm>	 RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[17:58:15] <wikibugs>	 (CR) JHathaway: [C:+1] puppetmaster::gitclone: disarm pre-commit and post-commit hooks [puppet] - https://gerrit.wikimedia.org/r/1052261 (https://phabricator.wikimedia.org/T368023) (owner: Elukey)
[17:58:24] <wikibugs>	 (PS1) Ssingh: P:dns::auth: indentation clean-up, no code change [puppet] - https://gerrit.wikimedia.org/r/1052793
[17:59:24] <wikibugs>	 (CR) Ssingh: [V:+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3174/console" [puppet] - https://gerrit.wikimedia.org/r/1052793 (owner: Ssingh)
[18:01:32] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Infrastructure, and 2 others: Migrate puppet merges to a cookbook - https://phabricator.wikimedia.org/T366355#9962431 (CDanis) >>! In T366355#9954359, @elukey wrote: > I've also checked what puppet-merge does behind the scenes, and the gist of it...
[18:02:05] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[18:02:18] <wikibugs>	 (CR) Ssingh: [V:+1 C:+2] P:dns::auth: indentation clean-up, no code change [puppet] - https://gerrit.wikimedia.org/r/1052793 (owner: Ssingh)
[18:02:49] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host search-loader2002.codfw.wmnet
[18:05:34] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P65977 and previous config saved to /var/cache/conftool/dbconfig/20240708-180533-arnaudb.json
[18:05:35] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:06:37] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host search-loader2002.codfw.wmnet
[18:09:17] <jinxer-wm>	 RESOLVED: [3x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:14:17] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:16:44] <jinxer-wm>	 FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[18:20:41] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P65978 and previous config saved to /var/cache/conftool/dbconfig/20240708-182041-arnaudb.json
[18:21:44] <jinxer-wm>	 RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[18:34:30] <wikibugs>	 (CR) Andrea Denisse: [C:+1] "I just added a nit regarding variable naming but I see that other parts of the code (unrelated to this patch) use the same variable name (" [puppet] - https://gerrit.wikimedia.org/r/1052731 (https://phabricator.wikimedia.org/T352640) (owner: Filippo Giunchedi)
[18:35:48] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T367781)', diff saved to https://phabricator.wikimedia.org/P65979 and previous config saved to /var/cache/conftool/dbconfig/20240708-183548-arnaudb.json
[18:35:50] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance
[18:35:52] <stashbot>	 T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[18:36:04] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance
[18:36:18] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2114.codfw.wmnet with reason: Maintenance
[18:36:32] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2114.codfw.wmnet with reason: Maintenance
[18:36:39] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2124.codfw.wmnet with reason: Maintenance
[18:36:52] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2124.codfw.wmnet with reason: Maintenance
[18:36:59] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2124 (T367781)', diff saved to https://phabricator.wikimedia.org/P65980 and previous config saved to /var/cache/conftool/dbconfig/20240708-183658-arnaudb.json
[18:38:27] <wikibugs>	 (CR) Dzahn: [C:+2] puppetmaster: change git sender email address to git@wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1051846 (owner: Dzahn)
[18:39:24] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T367781)', diff saved to https://phabricator.wikimedia.org/P65981 and previous config saved to /var/cache/conftool/dbconfig/20240708-183923-arnaudb.json
[18:39:32] <wikibugs>	 (CR) Andrea Denisse: [C:+1] "LGTM, thank you!" [puppet] - https://gerrit.wikimedia.org/r/1052730 (https://phabricator.wikimedia.org/T352640) (owner: Filippo Giunchedi)
[18:44:38] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 08 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1051853 (owner: Catrope)
[18:45:34] <wikibugs>	 (CR) Dzahn: [C:+2] "changed but I think we now need a list admin to allow the "non-member sender address". Trying to find out who that is." [puppet] - https://gerrit.wikimedia.org/r/1051846 (owner: Dzahn)
[18:49:12] <wikibugs>	 (PS1) Ssingh: conftool-data: add geodns schema [puppet] - https://gerrit.wikimedia.org/r/1052803 (https://phabricator.wikimedia.org/T369366)
[18:49:13] <wikibugs>	 (PS1) Ssingh: P:dns::auth::update: maintain admin_state via confd [puppet] - https://gerrit.wikimedia.org/r/1052804 (https://phabricator.wikimedia.org/T369366)
[18:50:14] <wikibugs>	 (CR) Ssingh: [V:+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1052804 (https://phabricator.wikimedia.org/T369366) (owner: Ssingh)
[18:54:31] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P65982 and previous config saved to /var/cache/conftool/dbconfig/20240708-185430-arnaudb.json
[19:02:41] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.reboot
[19:02:47] <wikibugs>	 (PS1) Dzahn: Revert "puppetmaster: change git sender email address to git@wikimedia.org" [puppet] - https://gerrit.wikimedia.org/r/1052805
[19:05:20] <wikibugs>	 (CR) Dzahn: [C:+2] Revert "puppetmaster: change git sender email address to git@wikimedia.org" [puppet] - https://gerrit.wikimedia.org/r/1052805 (owner: Dzahn)
[19:06:46] <wikibugs>	 (CR) Krinkle: EventLoggingLegacyProxy - move endpoint to w/beacon/event.php (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1052782 (https://phabricator.wikimedia.org/T353817) (owner: Ottomata)
[19:09:33] <wikibugs>	 SRE, Incident Tooling: wikimediastatus.net help popups are mobile-unfriendly and keyboard-inaccessible - https://phabricator.wikimedia.org/T327201#9962670 (CDanis) >>! In T327201#9958666, @DMacks wrote: > It is still not fixed on my desktop-Mac Firefox. There is no longer a scrollbar, but the box is stil...
[19:09:38] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P65983 and previous config saved to /var/cache/conftool/dbconfig/20240708-190937-arnaudb.json
[19:12:46] <wikibugs>	 (PS1) Dzahn: Revert^2 "puppetmaster: change git sender email address to git@wikimedia.org" [puppet] - https://gerrit.wikimedia.org/r/1052807
[19:20:50] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[19:20:59] <wikibugs>	 (CR) Dzahn: [C:+2] Revert^2 "puppetmaster: change git sender email address to git@wikimedia.org" [puppet] - https://gerrit.wikimedia.org/r/1052807 (owner: Dzahn)
[19:21:02] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[19:21:10] <logmsgbot>	 !log fabfur@cumin1002 cookbooks.sre.cdn.roll-reboot finished rebooting cp3071.esams.wmnet
[19:21:26] <logmsgbot>	 !log fabfur@cumin1002 cookbooks.sre.cdn.roll-reboot finished rebooting cp3079.esams.wmnet
[19:24:47] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T367781)', diff saved to https://phabricator.wikimedia.org/P65984 and previous config saved to /var/cache/conftool/dbconfig/20240708-192444-arnaudb.json
[19:24:48] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2129.codfw.wmnet with reason: Maintenance
[19:24:54] <stashbot>	 T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[19:25:02] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2129.codfw.wmnet with reason: Maintenance
[19:25:08] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2129 (T367781)', diff saved to https://phabricator.wikimedia.org/P65985 and previous config saved to /var/cache/conftool/dbconfig/20240708-192508-arnaudb.json
[19:27:35] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T367781)', diff saved to https://phabricator.wikimedia.org/P65986 and previous config saved to /var/cache/conftool/dbconfig/20240708-192735-arnaudb.json
[19:37:12] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[19:37:31] <wikibugs>	 (CR) Aude: [C:+1] "looks good. tested this locally" [mediawiki-config] - https://gerrit.wikimedia.org/r/1051853 (owner: Catrope)
[19:38:36] <wikibugs>	 (CR) Ottomata: "Abandoning this based on discussion:" [mediawiki-config] - https://gerrit.wikimedia.org/r/1052782 (https://phabricator.wikimedia.org/T353817) (owner: Ottomata)
[19:38:39] <wikibugs>	 (Abandoned) Ottomata: EventLoggingLegacyProxy - move endpoint to w/beacon/event.php [mediawiki-config] - https://gerrit.wikimedia.org/r/1052782 (https://phabricator.wikimedia.org/T353817) (owner: Ottomata)
[19:39:02] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.reboot (exit_code=0)
[19:42:42] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P65987 and previous config saved to /var/cache/conftool/dbconfig/20240708-194242-arnaudb.json
[19:44:16] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2149.codfw.wmnet with reason: Maintenance
[19:44:29] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2149.codfw.wmnet with reason: Maintenance
[19:44:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2149 (T367856)', diff saved to https://phabricator.wikimedia.org/P65988 and previous config saved to /var/cache/conftool/dbconfig/20240708-194435-marostegui.json
[19:44:39] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[19:45:31] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 137, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:46:23] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:57:50] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P65989 and previous config saved to /var/cache/conftool/dbconfig/20240708-195749-arnaudb.json
[19:58:46] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1051709 (https://phabricator.wikimedia.org/T363587) (owner: Gmodena)
[19:59:41] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: That opportune time for a UTC late backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240708T2000).
[20:00:05] <jouncebot>	 Nemoralis, anzx, and RoanKattouw: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:01:26] <jinxer-wm>	 FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:08:55] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.reboot
[20:12:57] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T367781)', diff saved to https://phabricator.wikimedia.org/P65990 and previous config saved to /var/cache/conftool/dbconfig/20240708-201256-arnaudb.json
[20:12:59] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2151.codfw.wmnet with reason: Maintenance
[20:13:00] <stashbot>	 T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[20:13:12] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2151.codfw.wmnet with reason: Maintenance
[20:13:19] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2151 (T367781)', diff saved to https://phabricator.wikimedia.org/P65991 and previous config saved to /var/cache/conftool/dbconfig/20240708-201318-arnaudb.json
[20:14:06] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] "This package is actually installed on every single machine. (" [puppet] - 10https://gerrit.wikimedia.org/r/1052383 (https://phabricator.wikimedia.org/T369322) (owner: 10Urbanecm)
[20:15:45] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T367781)', diff saved to https://phabricator.wikimedia.org/P65992 and previous config saved to /var/cache/conftool/dbconfig/20240708-201545-arnaudb.json
[20:17:21] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] stewards: Add Phabricator API configuration [puppet] - 10https://gerrit.wikimedia.org/r/1052185 (https://phabricator.wikimedia.org/T369322) (owner: 10Urbanecm)
[20:18:27] <RoanKattouw>	 I guess nobody is doing the deployment yet? I can start
[20:19:22] <RoanKattouw>	 And nobody is here for the other patches?
[20:19:37] <RoanKattouw>	 Alright well then I'll finish my lunch and then deploy my patch
[20:27:53] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.reboot (exit_code=0)
[20:28:37] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.reboot
[20:30:53] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P65993 and previous config saved to /var/cache/conftool/dbconfig/20240708-203052-arnaudb.json
[20:35:28] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.reboot (exit_code=99)
[20:35:33] <wikibugs>	 (03PS1) 10Btullis: Allow dse-k8s-worker hosts to access ceph ports [puppet] - 10https://gerrit.wikimedia.org/r/1052812 (https://phabricator.wikimedia.org/T327259)
[20:35:41] <wikibugs>	 (03CR) 10Ahmon Dancy: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1052771 (owner: 10Ahmon Dancy)
[20:36:04] <wikibugs>	 (03PS2) 10Btullis: Allow dse-k8s-worker hosts to access ceph ports [puppet] - 10https://gerrit.wikimedia.org/r/1052812 (https://phabricator.wikimedia.org/T327259)
[20:36:49] <wikibugs>	 (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3176/co" [puppet] - 10https://gerrit.wikimedia.org/r/1052812 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis)
[20:38:27] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051853 (owner: 10Catrope)
[20:38:28] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.reboot
[20:38:44] <wikibugs>	 (03PS2) 10Catrope: Graph extension: Add tracking for data sources used in <graph> tags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051853
[20:38:50] <wikibugs>	 (03CR) 10TrainBranchBot: "Approved by catrope@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051853 (owner: 10Catrope)
[20:38:57] <icinga-wm>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 41 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[20:39:27] <wikibugs>	 (03Merged) 10jenkins-bot: Graph extension: Add tracking for data sources used in <graph> tags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051853 (owner: 10Catrope)
[20:39:44] <logmsgbot>	 !log catrope@deploy1002 Started scap sync-world: Backport for [[gerrit:1051853|Graph extension: Add tracking for data sources used in <graph> tags]]
[20:40:23] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1230.eqiad.wmnet with reason: Maintenance
[20:40:36] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1230.eqiad.wmnet with reason: Maintenance
[20:40:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1230 (T367856)', diff saved to https://phabricator.wikimedia.org/P65994 and previous config saved to /var/cache/conftool/dbconfig/20240708-204042-marostegui.json
[20:40:46] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[20:42:06] <logmsgbot>	 !log catrope@deploy1002 catrope: Backport for [[gerrit:1051853|Graph extension: Add tracking for data sources used in <graph> tags]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:42:47] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.reboot (exit_code=99)
[20:43:49] <wikibugs>	 (03PS3) 10Btullis: Allow dse-k8s-worker hosts to access ceph ports [puppet] - 10https://gerrit.wikimedia.org/r/1052812 (https://phabricator.wikimedia.org/T327259)
[20:43:58] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs1022.eqiad.wmnet
[20:46:00] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P65995 and previous config saved to /var/cache/conftool/dbconfig/20240708-204559-arnaudb.json
[20:46:21] <wikibugs>	 (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3178/co" [puppet] - 10https://gerrit.wikimedia.org/r/1052812 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis)
[20:47:21] <wikibugs>	 (03CR) 10Btullis: Allow dse-k8s-worker hosts to access ceph ports [puppet] - 10https://gerrit.wikimedia.org/r/1052812 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis)
[20:47:33] <logmsgbot>	 !log catrope@deploy1002 catrope: Continuing with sync
[20:47:49] <wikibugs>	 (03PS1) 10Andrew Bogott: trove guest agent: look for cinder volume on /sdb [puppet] - 10https://gerrit.wikimedia.org/r/1052814
[20:47:56] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Allow dse-k8s-worker hosts to access ceph ports [puppet] - 10https://gerrit.wikimedia.org/r/1052812 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis)
[20:48:23] <wikibugs>	 (03PS2) 10Andrew Bogott: trove guest agent: look for cinder volume on /sdb [puppet] - 10https://gerrit.wikimedia.org/r/1052814
[20:48:57] <icinga-wm>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 27 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[20:49:00] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] trove guest agent: look for cinder volume on /sdb [puppet] - 10https://gerrit.wikimedia.org/r/1052814 (owner: 10Andrew Bogott)
[20:49:31] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 225, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:49:55] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 138, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:50:23] <Nemoralis>	 o/
[20:50:24] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs1022.eqiad.wmnet
[20:50:32] <Nemoralis>	 I forgot that I have a deployment
[20:52:45] <logmsgbot>	 !log catrope@deploy1002 Finished scap: Backport for [[gerrit:1051853|Graph extension: Add tracking for data sources used in <graph> tags]] (duration: 13m 00s)
[20:55:19] <Nemoralis>	 catrope: are you able to deploy my patch too?
[20:55:45] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs1023.eqiad.wmnet
[20:56:26] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:56:43] <jinxer-wm>	 FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[20:59:20] <wikibugs>	 06SRE, 06Traffic: Regression: Reading spam blacklists of all projects suddenly returns status 429 on fifth consecutive read - https://phabricator.wikimedia.org/T369414#9963012 (10bd808) I expect that the `?action=raw` query string is what is causing you to run into a rate limit. I think you will have a better...
[20:59:41] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:00:05] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: Time to do the Weekly Security deployment window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240708T2100).
[21:01:07] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T367781)', diff saved to https://phabricator.wikimedia.org/P65996 and previous config saved to /var/cache/conftool/dbconfig/20240708-210106-arnaudb.json
[21:01:09] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2158.codfw.wmnet with reason: Maintenance
[21:01:10] <stashbot>	 T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[21:01:23] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2158.codfw.wmnet with reason: Maintenance
[21:01:24] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2187.codfw.wmnet with reason: Maintenance
[21:01:37] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2187.codfw.wmnet with reason: Maintenance
[21:01:43] <jinxer-wm>	 RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[21:01:45] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2158 (T367781)', diff saved to https://phabricator.wikimedia.org/P65997 and previous config saved to /var/cache/conftool/dbconfig/20240708-210144-arnaudb.json
[21:01:45] <logmsgbot>	 !log fabfur@cumin1002 cookbooks.sre.cdn.roll-reboot finished rebooting cp3072.esams.wmnet
[21:02:02] <logmsgbot>	 !log fabfur@cumin1002 cookbooks.sre.cdn.roll-reboot finished rebooting cp3080.esams.wmnet
[21:02:43] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs1023.eqiad.wmnet
[21:04:11] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T367781)', diff saved to https://phabricator.wikimedia.org/P65998 and previous config saved to /var/cache/conftool/dbconfig/20240708-210410-arnaudb.json
[21:05:17] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic[1093-1095].eqiad.wmnet with reason: T348977
[21:05:23] <RoanKattouw>	 Nemoralis: Sorry for the delay, yes I'll deploy yours now
[21:05:28] <stashbot>	 T348977: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977
[21:05:35] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic[1093-1095].eqiad.wmnet with reason: T348977
[21:05:41] <wikibugs>	 (03PS2) 10NMW03: Enable VisualEditor by default on Italian Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052285 (https://phabricator.wikimedia.org/T369342)
[21:05:47] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic109[3-5]* for T348977 - bking@cumin2002
[21:05:50] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic109[3-5]* for T348977 - bking@cumin2002
[21:05:51] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052285 (https://phabricator.wikimedia.org/T369342) (owner: 10NMW03)
[21:06:30] <wikibugs>	 (03Merged) 10jenkins-bot: Enable VisualEditor by default on Italian Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052285 (https://phabricator.wikimedia.org/T369342) (owner: 10NMW03)
[21:06:46] <logmsgbot>	 !log catrope@deploy1002 Started scap sync-world: Backport for [[gerrit:1052285|Enable VisualEditor by default on Italian Wikibooks (T369342)]]
[21:06:48] <stashbot>	 T369342: Enable VisualEditor by default on Italian Wikibooks - https://phabricator.wikimedia.org/T369342
[21:07:15] <RoanKattouw>	 Nemoralis1: Hi, just in case you missed it, I started deploying your patch
[21:07:42] <Nemoralis1>	 thank you, I can test it when it is available
[21:09:21] <logmsgbot>	 !log catrope@deploy1002 catrope, nmw03: Backport for [[gerrit:1052285|Enable VisualEditor by default on Italian Wikibooks (T369342)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:09:40] <Nemoralis>	 testing...
[21:10:44] <Nemoralis>	 RoanKattouw: LGTM
[21:10:53] <RoanKattouw>	 Thanks, continuing
[21:10:56] <logmsgbot>	 !log catrope@deploy1002 catrope, nmw03: Continuing with sync
[21:13:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:13:40] <wikibugs>	 (03CR) 10Ebernhardson: [C:03+1] wdqs: enable throttling only for requests coming from the CDN [puppet] - 10https://gerrit.wikimedia.org/r/1048485 (https://phabricator.wikimedia.org/T361950) (owner: 10DCausse)
[21:14:21] <wikibugs>	 (03PS1) 10Btullis: cephcsi: Bump the image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052818 (https://phabricator.wikimedia.org/T327259)
[21:14:31] <wikibugs>	 10SRE-swift-storage, 07Wikimedia-production-error: Unable to undelete File:Boston_Bruins.svg - https://phabricator.wikimedia.org/T369299#9963067 (10Sreejithk2000) It worked today when I tried. Closing the bug hence.
[21:14:41] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for Uniquemia - https://phabricator.wikimedia.org/T369500#9963066 (10EUwandu-WMF) Hello @fgiunchedi , Please can you check again to see if it works now? Here is the screenshot of my sign-in with Uniquemia on Wikitech as well if it is helpful{F56297464}
[21:15:12] <wikibugs>	 10SRE-swift-storage, 07Wikimedia-production-error: Unable to undelete File:Boston_Bruins.svg - https://phabricator.wikimedia.org/T369299#9963068 (10Sreejithk2000) 05Open→03Resolved a:03Sreejithk2000
[21:16:08] <logmsgbot>	 !log catrope@deploy1002 Finished scap: Backport for [[gerrit:1052285|Enable VisualEditor by default on Italian Wikibooks (T369342)]] (duration: 09m 23s)
[21:16:11] <stashbot>	 T369342: Enable VisualEditor by default on Italian Wikibooks - https://phabricator.wikimedia.org/T369342
[21:16:36] <wikibugs>	 (03CR) 10Dzahn: admin: add approvers to group analytics-research-admins [puppet] - 10https://gerrit.wikimedia.org/r/1049239 (https://phabricator.wikimedia.org/T276465) (owner: 10Dzahn)
[21:18:05] <RoanKattouw>	 Nemoralis: All done
[21:18:12] <Nemoralis>	 thank you!
[21:18:25] <wikibugs>	 (03CR) 10Btullis: [C:03+2] cephcsi: Bump the image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052818 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis)
[21:19:18] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P65999 and previous config saved to /var/cache/conftool/dbconfig/20240708-211918-arnaudb.json
[21:20:09] <wikibugs>	 (03PS4) 10Dzahn: admin: add approvers to group analytics-research-admins [puppet] - 10https://gerrit.wikimedia.org/r/1049239 (https://phabricator.wikimedia.org/T276465)
[21:20:44] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] "Miriam agreed on the ticket and also confirmed Xiao Xiao" [puppet] - 10https://gerrit.wikimedia.org/r/1049239 (https://phabricator.wikimedia.org/T276465) (owner: 10Dzahn)
[21:21:36] <wikibugs>	 (03Merged) 10jenkins-bot: cephcsi: Bump the image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052818 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis)
[21:21:58] <wikibugs>	 06SRE, 10Infrastructure Security, 06Infrastructure-Foundations, 13Patch-For-Review: puppet admin module: Assign approvers to unix groups - https://phabricator.wikimedia.org/T276465#9963101 (10Dzahn)
[21:23:53] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[21:24:29] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: add approvers to analytics-research-admins - https://phabricator.wikimedia.org/T368435#9963118 (10Dzahn) Also thanks @Volans for the details and suggesting to add docs to Wikitech
[21:24:49] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[21:27:07] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: add approvers to analytics-research-admins - https://phabricator.wikimedia.org/T368435#9963110 (10Dzahn) 05Open→03Resolved a:03Dzahn Thank you @Miriam! You and Xiao Xiao have been added to the code base. So far this isn't happe...
[21:28:56] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] "https://puppet-compiler.wmflabs.org/output/1046121/3179/miscweb1003.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1046121 (https://phabricator.wikimedia.org/T364367) (owner: 10Stevemunene)
[21:34:26] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P66000 and previous config saved to /var/cache/conftool/dbconfig/20240708-213425-arnaudb.json
[21:37:11] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:37:11] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:37:19] <icinga-wm>	 RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:38:03] <icinga-wm>	 RECOVERY - BFD status on cr1-drmrs is OK: UP: 3 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:41:52] <wikibugs>	 (03Abandoned) 10Jforrester: Drop experimental mediawiki-dev chart, unused(?) [deployment-charts] - 10https://gerrit.wikimedia.org/r/953633 (owner: 10Jforrester)
[21:42:52] <wikibugs>	 (03CR) 10Bartosz Dziewoński: Handle sso.wikimedia.org domain (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) (owner: 10Gergő Tisza)
[21:46:01] <wikibugs>	 06SRE, 06serviceops, 10Data Products (Data Products Sprint 16), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9963189 (10mforns) @Scott_French Thank you! We would like to bring up the prod...
[21:48:44] <jinxer-wm>	 FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[21:49:12] <jinxer-wm>	 FIRING: ProbeDown: Service miscweb2003:443 has failed probes (http_query_scholarly_wikidata_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:49:33] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T367781)', diff saved to https://phabricator.wikimedia.org/P66001 and previous config saved to /var/cache/conftool/dbconfig/20240708-214932-arnaudb.json
[21:49:35] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2169.codfw.wmnet with reason: Maintenance
[21:49:36] <stashbot>	 T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[21:49:48] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2169.codfw.wmnet with reason: Maintenance
[21:49:55] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2169 (T367781)', diff saved to https://phabricator.wikimedia.org/P66002 and previous config saved to /var/cache/conftool/dbconfig/20240708-214954-arnaudb.json
[21:51:12] <wikibugs>	 (03PS1) 10Bking: elastic: test envoy TLS terminator in relforge [puppet] - 10https://gerrit.wikimedia.org/r/1052819 (https://phabricator.wikimedia.org/T368950)
[21:51:45] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1052819 (https://phabricator.wikimedia.org/T368950) (owner: 10Bking)
[21:52:20] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T367781)', diff saved to https://phabricator.wikimedia.org/P66003 and previous config saved to /var/cache/conftool/dbconfig/20240708-215220-arnaudb.json
[21:52:54] <wikibugs>	 (03PS2) 10Bking: elastic: test envoy TLS terminator in relforge [puppet] - 10https://gerrit.wikimedia.org/r/1052819 (https://phabricator.wikimedia.org/T368950)
[21:53:43] <jinxer-wm>	 RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[21:54:12] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service miscweb1003:443 has failed probes (http_query_main_wikidata_org_collab_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:55:24] <wikibugs>	 (03CR) 10CI reject: [V:04-1] elastic: test envoy TLS terminator in relforge [puppet] - 10https://gerrit.wikimedia.org/r/1052819 (https://phabricator.wikimedia.org/T368950) (owner: 10Bking)
[21:59:13] <wikibugs>	 (03PS3) 10Bking: elastic: test envoy TLS terminator in relforge [puppet] - 10https://gerrit.wikimedia.org/r/1052819 (https://phabricator.wikimedia.org/T368950)
[21:59:35] <wikibugs>	 (03CR) 10CI reject: [V:04-1] elastic: test envoy TLS terminator in relforge [puppet] - 10https://gerrit.wikimedia.org/r/1052819 (https://phabricator.wikimedia.org/T368950) (owner: 10Bking)
[22:02:29] <inflatador>	 https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/elasticsearch/manifests/tlsproxy.pp
[22:05:38] <wikibugs>	 (03PS4) 10Bking: elastic: test envoy TLS terminator in relforge [puppet] - 10https://gerrit.wikimedia.org/r/1052819 (https://phabricator.wikimedia.org/T368950)
[22:06:00] <wikibugs>	 (03CR) 10CI reject: [V:04-1] elastic: test envoy TLS terminator in relforge [puppet] - 10https://gerrit.wikimedia.org/r/1052819 (https://phabricator.wikimedia.org/T368950) (owner: 10Bking)
[22:07:28] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P66004 and previous config saved to /var/cache/conftool/dbconfig/20240708-220727-arnaudb.json
[22:07:32] <wikibugs>	 (03PS5) 10Bking: elastic: test envoy TLS terminator in relforge [puppet] - 10https://gerrit.wikimedia.org/r/1052819 (https://phabricator.wikimedia.org/T368950)
[22:09:12] <wikibugs>	 (03PS6) 10Bking: elastic: test envoy TLS terminator in relforge [puppet] - 10https://gerrit.wikimedia.org/r/1052819 (https://phabricator.wikimedia.org/T368950)
[22:10:35] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1052819 (https://phabricator.wikimedia.org/T368950) (owner: 10Bking)
[22:11:42] <wikibugs>	 10SRE-swift-storage, 07Wikimedia-production-error: Unable to undelete File:Boston_Bruins.svg - https://phabricator.wikimedia.org/T369299#9963274 (10Aklapper) a:05Sreejithk2000→03None
[22:18:26] <wikibugs>	 (03PS7) 10Bking: elastic: test envoy TLS terminator in relforge [puppet] - 10https://gerrit.wikimedia.org/r/1052819 (https://phabricator.wikimedia.org/T368950)
[22:21:05] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1052819 (https://phabricator.wikimedia.org/T368950) (owner: 10Bking)
[22:22:35] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P66005 and previous config saved to /var/cache/conftool/dbconfig/20240708-222234-arnaudb.json
[22:25:44] <jinxer-wm>	 FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[22:26:42] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.wdqs.reboot
[22:29:31] <wikibugs>	 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frnetmon1002, pay-lb1001, pay-lb1002 - https://phabricator.wikimedia.org/T369565 (10RobH) 03NEW
[22:30:18] <wikibugs>	 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frnetmon1002, pay-lb1001, pay-lb1002 - https://phabricator.wikimedia.org/T369565#9963351 (10RobH)
[22:30:43] <jinxer-wm>	 RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[22:30:51] <wikibugs>	 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frnetmon1002, pay-lb1001, pay-lb1002 - https://phabricator.wikimedia.org/T369565#9963359 (10RobH)
[22:31:30] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:32:30] <wikibugs>	 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frqueue2003, pay-lb2001, pay-lb2002 - https://phabricator.wikimedia.org/T369566 (10RobH) 03NEW
[22:32:46] <wikibugs>	 10ops-codfw, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frqueue2003, pay-lb2001, pay-lb2002 - https://phabricator.wikimedia.org/T369566#9963385 (10RobH)
[22:37:42] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T367781)', diff saved to https://phabricator.wikimedia.org/P66006 and previous config saved to /var/cache/conftool/dbconfig/20240708-223741-arnaudb.json
[22:37:43] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2180.codfw.wmnet with reason: Maintenance
[22:37:45] <stashbot>	 T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[22:37:46] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2180.codfw.wmnet with reason: Maintenance
[22:37:53] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2180 (T367781)', diff saved to https://phabricator.wikimedia.org/P66007 and previous config saved to /var/cache/conftool/dbconfig/20240708-223752-arnaudb.json
[22:38:43] <jinxer-wm>	 FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[22:40:06] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T367781)', diff saved to https://phabricator.wikimedia.org/P66008 and previous config saved to /var/cache/conftool/dbconfig/20240708-224006-arnaudb.json
[22:42:53] <logmsgbot>	 !log fabfur@cumin1002 cookbooks.sre.cdn.roll-reboot finished rebooting cp3081.esams.wmnet
[22:42:53] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on A:cp-upload_esams
[22:43:43] <jinxer-wm>	 RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[22:45:43] <jinxer-wm>	 FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[22:46:03] <wikibugs>	 (03CR) 10Herron: [C:03+1] pontoon: support puppet 7 / puppetserver and openstack API [puppet] - 10https://gerrit.wikimedia.org/r/1052730 (https://phabricator.wikimedia.org/T352640) (owner: 10Filippo Giunchedi)
[22:46:36] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.reboot (exit_code=0)
[22:52:49] <icinga-wm>	 PROBLEM - Host cp3073 is DOWN: PING CRITICAL - Packet loss = 100%
[22:55:14] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P66009 and previous config saved to /var/cache/conftool/dbconfig/20240708-225513-arnaudb.json
[22:55:44] <jinxer-wm>	 RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[23:03:44] <jinxer-wm>	 FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[23:08:01] <wikibugs>	 (PS2) Dwisehaupt: prometheus: adjust fr payments-listener endpoint monitoring [puppet] - https://gerrit.wikimedia.org/r/1052062 (https://phabricator.wikimedia.org/T368114) (owner: Filippo Giunchedi)
[23:08:09] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Grant Access to analytics-privatedata-users for Sharvaniharan - https://phabricator.wikimedia.org/T368566#9963466 (ATsay-WMF) I approve this, thanks!
[23:08:43] <jinxer-wm>	 RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[23:09:49] <wikibugs>	 (CR) Dwisehaupt: "@fgiunchedi@wikimedia.org I have updated the URL to the new endpoint we can test. It should be clear to roll out when you are ready." [puppet] - https://gerrit.wikimedia.org/r/1052062 (https://phabricator.wikimedia.org/T368114) (owner: Filippo Giunchedi)
[23:10:21] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P66010 and previous config saved to /var/cache/conftool/dbconfig/20240708-231020-arnaudb.json
[23:24:18] <wikibugs>	 (CR) Dzahn: [C:+2] "Merging this before the sites actually existed in DNS caused 12 monitoring alerts. 8 for search-platform and 4 for collab." [puppet] - https://gerrit.wikimedia.org/r/1046121 (https://phabricator.wikimedia.org/T364367) (owner: Stevemunene)
[23:25:28] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T367781)', diff saved to https://phabricator.wikimedia.org/P66011 and previous config saved to /var/cache/conftool/dbconfig/20240708-232527-arnaudb.json
[23:25:30] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2193.codfw.wmnet with reason: Maintenance
[23:25:32] <stashbot>	 T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[23:25:43] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2193.codfw.wmnet with reason: Maintenance
[23:25:50] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2193 (T367781)', diff saved to https://phabricator.wikimedia.org/P66012 and previous config saved to /var/cache/conftool/dbconfig/20240708-232549-arnaudb.json
[23:27:29] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T367856)', diff saved to https://phabricator.wikimedia.org/P66013 and previous config saved to /var/cache/conftool/dbconfig/20240708-232728-marostegui.json
[23:27:32] <stashbot>	 T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[23:28:04] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T367781)', diff saved to https://phabricator.wikimedia.org/P66014 and previous config saved to /var/cache/conftool/dbconfig/20240708-232803-arnaudb.json
[23:29:12] <wikibugs>	 (CR) Dzahn: [C:+2] "I'll revert for now. This will need DNS changes and ATS config changes first." [puppet] - https://gerrit.wikimedia.org/r/1046121 (https://phabricator.wikimedia.org/T364367) (owner: Stevemunene)
[23:29:53] <wikibugs>	 (PS1) Dzahn: Revert "wdqs: microsites for wdqs graph split" [puppet] - https://gerrit.wikimedia.org/r/1052826
[23:30:48] <wikibugs>	 (PS2) Dzahn: Revert "wdqs: microsites for wdqs graph split" [puppet] - https://gerrit.wikimedia.org/r/1052826 (https://phabricator.wikimedia.org/T364367)
[23:32:44] <jinxer-wm>	 FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[23:33:47] <wikibugs>	 (CR) Dzahn: [C:+2] Revert "wdqs: microsites for wdqs graph split" [puppet] - https://gerrit.wikimedia.org/r/1052826 (https://phabricator.wikimedia.org/T364367) (owner: Dzahn)
[23:34:26] <wikibugs>	 (CR) Dzahn: [V:+1 C:+2] "recheck" [puppet] - https://gerrit.wikimedia.org/r/1052826 (https://phabricator.wikimedia.org/T364367) (owner: Dzahn)
[23:37:12] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[23:38:23] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1052827
[23:38:23] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1052827 (owner: TrainBranchBot)
[23:42:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P66015 and previous config saved to /var/cache/conftool/dbconfig/20240708-234235-marostegui.json
[23:42:44] <jinxer-wm>	 RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans
[23:43:11] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P66016 and previous config saved to /var/cache/conftool/dbconfig/20240708-234310-arnaudb.json
[23:52:10] <logmsgbot>	 !log fabfur@cumin1002 END (FAIL) - Cookbook sre.cdn.roll-reboot (exit_code=1) rolling reboot on A:cp-text_esams
[23:57:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P66017 and previous config saved to /var/cache/conftool/dbconfig/20240708-235742-marostegui.json
[23:58:18] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P66018 and previous config saved to /var/cache/conftool/dbconfig/20240708-235817-arnaudb.json
[23:59:12] <jinxer-wm>	 RESOLVED: [4x] ProbeDown: Service miscweb1003:443 has failed probes (http_query_main_wikidata_org_collab_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:59:33] <mutante>	 ^ reverted a change to resolve those