[00:14:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318 (T342617)', diff saved to https://phabricator.wikimedia.org/P50390 and previous config saved to /var/cache/conftool/dbconfig/20230810-001414-ladsgroup.json
[00:14:18] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance
[00:14:20] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[00:14:31] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance
[00:14:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2181 (T342617)', diff saved to https://phabricator.wikimedia.org/P50391 and previous config saved to /var/cache/conftool/dbconfig/20230810-001437-ladsgroup.json
[00:23:23] <icinga-wm>	 RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 1/1 UP : OSPFv3: 1/1 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:25:29] <icinga-wm>	 PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:26:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T342617)', diff saved to https://phabricator.wikimedia.org/P50392 and previous config saved to /var/cache/conftool/dbconfig/20230810-002648-ladsgroup.json
[00:26:53] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[00:38:48] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/947388
[00:38:54] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/947388 (owner: TrainBranchBot)
[00:41:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P50393 and previous config saved to /var/cache/conftool/dbconfig/20230810-004154-ladsgroup.json
[00:43:11] <icinga-wm>	 RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:43:45] <icinga-wm>	 RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:44:01] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:44:01] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:54:43] <wikibugs>	 (CR) CI reject: [V: -1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/947388 (owner: TrainBranchBot)
[00:57:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P50394 and previous config saved to /var/cache/conftool/dbconfig/20230810-005701-ladsgroup.json
[01:02:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T342617)', diff saved to https://phabricator.wikimedia.org/P50395 and previous config saved to /var/cache/conftool/dbconfig/20230810-010212-ladsgroup.json
[01:02:19] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[01:12:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T342617)', diff saved to https://phabricator.wikimedia.org/P50396 and previous config saved to /var/cache/conftool/dbconfig/20230810-011207-ladsgroup.json
[01:12:09] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance
[01:12:12] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[01:12:22] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance
[01:12:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1214 (T342617)', diff saved to https://phabricator.wikimedia.org/P50397 and previous config saved to /var/cache/conftool/dbconfig/20230810-011228-ladsgroup.json
[01:17:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P50398 and previous config saved to /var/cache/conftool/dbconfig/20230810-011718-ladsgroup.json
[01:32:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P50399 and previous config saved to /var/cache/conftool/dbconfig/20230810-013225-ladsgroup.json
[01:47:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T342617)', diff saved to https://phabricator.wikimedia.org/P50400 and previous config saved to /var/cache/conftool/dbconfig/20230810-014731-ladsgroup.json
[01:47:35] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[02:00:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T342617)', diff saved to https://phabricator.wikimedia.org/P50401 and previous config saved to /var/cache/conftool/dbconfig/20230810-020012-ladsgroup.json
[02:00:22] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[02:06:33] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:15:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P50402 and previous config saved to /var/cache/conftool/dbconfig/20230810-021518-ladsgroup.json
[02:18:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:23:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:24:25] <wikibugs>	 (PS1) Mdaniels5757: add (I think even properly!) autopatrolled group with autopatrol right for wikifunctionswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/947495
[02:26:10] <wikibugs>	 (PS2) Mdaniels5757: add (I think even properly!) autopatrolled group with autopatrol right for wikifunctionswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/947495 (https://phabricator.wikimedia.org/T343946)
[02:30:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P50403 and previous config saved to /var/cache/conftool/dbconfig/20230810-023025-ladsgroup.json
[02:31:33] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:45:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T342617)', diff saved to https://phabricator.wikimedia.org/P50404 and previous config saved to /var/cache/conftool/dbconfig/20230810-024531-ladsgroup.json
[02:45:33] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1216.eqiad.wmnet with reason: Maintenance
[02:45:36] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[02:45:47] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1216.eqiad.wmnet with reason: Maintenance
[03:27:37] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
[03:27:51] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
[04:01:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T342617)', diff saved to https://phabricator.wikimedia.org/P50405 and previous config saved to /var/cache/conftool/dbconfig/20230810-040104-ladsgroup.json
[04:01:18] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[04:16:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P50406 and previous config saved to /var/cache/conftool/dbconfig/20230810-041610-ladsgroup.json
[04:31:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P50407 and previous config saved to /var/cache/conftool/dbconfig/20230810-043116-ladsgroup.json
[04:46:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T342617)', diff saved to https://phabricator.wikimedia.org/P50408 and previous config saved to /var/cache/conftool/dbconfig/20230810-044622-ladsgroup.json
[04:46:24] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1149.eqiad.wmnet with reason: Maintenance
[04:46:27] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[04:46:38] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1149.eqiad.wmnet with reason: Maintenance
[04:46:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T342617)', diff saved to https://phabricator.wikimedia.org/P50409 and previous config saved to /var/cache/conftool/dbconfig/20230810-044643-ladsgroup.json
[05:04:29] <icinga-wm>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:06:05] <icinga-wm>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[05:13:49] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:13:50] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good, one nit inline" [puppet] - https://gerrit.wikimedia.org/r/947425 (owner: Ssingh)
[05:13:55] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:13:55] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:21:18] <wikibugs>	 (CR) Muehlenhoff: profile::mirrors::serve: Remove Ferm-specific syntax (1 comment) [puppet] - https://gerrit.wikimedia.org/r/944768 (owner: Muehlenhoff)
[05:21:20] <wikibugs>	 (CR) Muehlenhoff: [C: +2] profile::mirrors::serve: Remove Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/944768 (owner: Muehlenhoff)
[05:22:45] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50421 bytes in 0.148 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:22:51] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.300 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:24:21] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:25:53] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1015.eqiad.wmnet
[05:27:38] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host bast5004.wikimedia.org
[05:27:41] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[05:29:42] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast5004.wikimedia.org - jmm@cumin2002"
[05:30:27] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast5004.wikimedia.org - jmm@cumin2002"
[05:30:27] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[05:30:27] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache bast5004.wikimedia.org on all recursors
[05:30:30] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast5004.wikimedia.org on all recursors
[05:30:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast5004.wikimedia.org - jmm@cumin2002"
[05:31:41] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast5004.wikimedia.org - jmm@cumin2002"
[05:32:19] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host bast5004.wikimedia.org with OS bookworm
[05:32:25] <wikibugs>	 SRE, Infrastructure-Foundations: Migrate bastions to Bookworm - https://phabricator.wikimedia.org/T343515 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host bast5004.wikimedia.org with OS bookworm
[05:35:59] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/945782 (owner: Muehlenhoff)
[05:50:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T342617)', diff saved to https://phabricator.wikimedia.org/P50410 and previous config saved to /var/cache/conftool/dbconfig/20230810-055005-ladsgroup.json
[05:50:09] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[05:51:45] <wikibugs>	 (CR) Muehlenhoff: [C: +2] zookeeper: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/945779 (owner: Muehlenhoff)
[05:59:03] <moritzm>	 !log installing tiff security updates
[05:59:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230810T0600)
[06:00:04] <jouncebot>	 kormat, marostegui, and Amir1: (Dis)respected human, time to deploy Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230810T0600). Please do the needful.
[06:01:01] <icinga-wm>	 RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:05:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P50411 and previous config saved to /var/cache/conftool/dbconfig/20230810-060511-ladsgroup.json
[06:05:14] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart rolling restart_daemons on A:maps-replica-codfw
[06:08:21] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart (exit_code=0) rolling restart_daemons on A:maps-replica-codfw
[06:09:13] <icinga-wm>	 RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[06:15:51] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 142, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:17:11] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:17:23] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:18:41] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:20:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P50412 and previous config saved to /var/cache/conftool/dbconfig/20230810-062017-ladsgroup.json
[06:20:18] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart rolling restart_daemons on A:maps-replica-eqiad
[06:23:25] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart (exit_code=0) rolling restart_daemons on A:maps-replica-eqiad
[06:24:47] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:24:59] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 142, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:26:19] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:26:31] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:32:47] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 142, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:34:09] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:35:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T342617)', diff saved to https://phabricator.wikimedia.org/P50413 and previous config saved to /var/cache/conftool/dbconfig/20230810-063523-ladsgroup.json
[06:35:26] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance
[06:35:28] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[06:35:50] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance
[06:35:52] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[06:36:05] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[06:36:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2155 (T342617)', diff saved to https://phabricator.wikimedia.org/P50414 and previous config saved to /var/cache/conftool/dbconfig/20230810-063611-ladsgroup.json
[06:46:33] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:47:39] <jinxer-wm>	 (Traffic bill over quota) firing: (4) Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota got acknowledged   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[06:47:53] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:48:26] <wikibugs>	 (PS3) Stevemunene: airflow-wmde: Create scap deployment source for wmde [puppet] - https://gerrit.wikimedia.org/r/940939 (https://phabricator.wikimedia.org/T340648)
[06:56:01] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/945782 (owner: Muehlenhoff)
[06:58:28] <wikibugs>	 (PS3) Stevemunene: airflow-wmde: configure wmde airflow instance [puppet] - https://gerrit.wikimedia.org/r/940938 (https://phabricator.wikimedia.org/T340648)
[06:58:30] <wikibugs>	 (PS4) Stevemunene: airflow-wmde: Create scap deployment source for wmde [puppet] - https://gerrit.wikimedia.org/r/940939 (https://phabricator.wikimedia.org/T340648)
[06:58:32] <wikibugs>	 (PS1) Stevemunene: airflow-wmde: create analytics-wmde user for airflow [puppet] - https://gerrit.wikimedia.org/r/947714
[07:00:04] <jouncebot>	 Amir1, apergos, and jnuche: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230810T0700).
[07:00:17] <apergos>	 morning! no trainees, no patches, no news. It's August!  have a nice day everybody and we'll see you all next time. 
[07:02:12] <wikibugs>	 (CR) Muehlenhoff: "PCC: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42814/console" [puppet] - https://gerrit.wikimedia.org/r/945782 (owner: Muehlenhoff)
[07:04:08] <wikibugs>	 (PS2) Stevemunene: airflow-wmde: create analytics-wmde user for airflow [puppet] - https://gerrit.wikimedia.org/r/947714
[07:05:32] <wikibugs>	 (PS4) Stevemunene: airflow-wmde: configure wmde airflow instance [puppet] - https://gerrit.wikimedia.org/r/940938 (https://phabricator.wikimedia.org/T340648)
[07:05:43] <wikibugs>	 (PS5) Stevemunene: airflow-wmde: Create scap deployment source for wmde [puppet] - https://gerrit.wikimedia.org/r/940939 (https://phabricator.wikimedia.org/T340648)
[07:06:55] <wikibugs>	 (PS1) Ayounsi: Enable sftp-server [homer/public] - https://gerrit.wikimedia.org/r/947715 (https://phabricator.wikimedia.org/T316544)
[07:07:39] <jinxer-wm>	 (Traffic bill over quota) resolved: (4) Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota got acknowledged   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[07:11:55] <wikibugs>	 (PS5) Ayounsi: [WIP] Initial SONiC config from Homer YAML [homer/public] - https://gerrit.wikimedia.org/r/940867 (https://phabricator.wikimedia.org/T320638)
[07:19:25] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host bast5004.wikimedia.org with OS bookworm
[07:19:25] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host bast5004.wikimedia.org
[07:19:30] <wikibugs>	 SRE, Infrastructure-Foundations: Migrate bastions to Bookworm - https://phabricator.wikimedia.org/T343515 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host bast5004.wikimedia.org with OS bookworm executed with errors: - bast5004 (**FAIL**)   - Removed from Puppet...
[07:21:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[07:25:36] <wikibugs>	 (PS5) Stevemunene: airflow-wmde: configure wmde airflow instance [puppet] - https://gerrit.wikimedia.org/r/940938 (https://phabricator.wikimedia.org/T340648)
[07:28:29] <wikibugs>	 (CR) JMeybohm: [C: -1] Update blubberoid to use certmanager certs (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/947373 (https://phabricator.wikimedia.org/T300033) (owner: Effie Mouzeli)
[07:36:18] <wikibugs>	 SRE, Infrastructure-Foundations, netops, Patch-For-Review: Add Dell switches support to Homer/Cookbooks - https://phabricator.wikimedia.org/T320638 (ayounsi) Quick status update regarding Homer. With those 3 patches: * Initial OpenConfig/SONiC support to wmf-netbox - https://gerrit.wikimedia.org/...
[07:44:15] <wikibugs>	 SRE, Infrastructure-Foundations, Patch-For-Review: Integrate Bookworm 12.1 point update - https://phabricator.wikimedia.org/T343121 (MoritzMuehlenhoff)
[07:48:24] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts bast5004.wikimedia.org
[07:48:35] <wikibugs>	 ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (phaultfinder)
[07:51:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[07:52:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[07:56:12] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] admin: update my bashrc [puppet] - https://gerrit.wikimedia.org/r/947379 (owner: Giuseppe Lavagetto)
[07:59:44] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast5004.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:00:30] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast5004.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:00:30] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:00:31] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts bast5004.wikimedia.org
[08:00:36] <wikibugs>	 SRE, Infrastructure-Foundations: Migrate bastions to Bookworm - https://phabricator.wikimedia.org/T343515 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `bast5004.wikimedia.org` - bast5004.wikimedia.org (**WARN**)   - //Host not found on Icinga, unable to downt...
[08:11:01] <_joe_>	 jouncebot: nowandnext
[08:11:01] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 48 minute(s)
[08:11:01] <jouncebot>	 In 1 hour(s) and 48 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230810T1000)
[08:11:01] <jouncebot>	 In 1 hour(s) and 48 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230810T1000)
[08:11:42] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] mediawiki: set requests based on php.workers [deployment-charts] - https://gerrit.wikimedia.org/r/943560 (https://phabricator.wikimedia.org/T342748) (owner: Clément Goubert)
[08:13:59] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to releasers-wikibase for adee_wmde - https://phabricator.wikimedia.org/T342969 (MoritzMuehlenhoff) a:fgiunchedi→Eevans >>! In T342969#9080553, @adee_wmde wrote: >>>! In T342969#9080463, @MoritzMuehlenhoff wrote: >> @adee_wmde You are using the same key...
[08:16:17] <wikibugs>	 (CR) Muehlenhoff: C:bigtop::hadoop move net-topology.py to files. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) (owner: Slyngshede)
[08:18:06] <wikibugs>	 (PS2) Fomafix: Add additional aliases for sr-cyrl and sr-latn next to sr-ec and sr-el [deployment-charts] - https://gerrit.wikimedia.org/r/790357 (https://phabricator.wikimedia.org/T117845)
[08:19:47] <wikibugs>	 (CR) Fomafix: Add additional aliases for sr-cyrl and sr-latn next to sr-ec and sr-el (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/790357 (https://phabricator.wikimedia.org/T117845) (owner: Fomafix)
[08:21:15] <godog>	 !log put back business hours americas for sre business hours escalation
[08:21:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:37] <godog>	 !log put back business hours americas for sre business hours escalation - T343812
[08:21:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:40] <stashbot>	 T343812: On-call batphone escalation configuration holidays Aug 2023 - https://phabricator.wikimedia.org/T343812
[08:21:51] <wikibugs>	 (CR) JMeybohm: [C: +2] mediawiki: set requests based on php.workers [deployment-charts] - https://gerrit.wikimedia.org/r/943560 (https://phabricator.wikimedia.org/T342748) (owner: Clément Goubert)
[08:22:50] <wikibugs>	 (Merged) jenkins-bot: mediawiki: set requests based on php.workers [deployment-charts] - https://gerrit.wikimedia.org/r/943560 (https://phabricator.wikimedia.org/T342748) (owner: Clément Goubert)
[08:26:42] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
[08:28:37] <wikibugs>	 (PS4) Samtar: IS: Enable Phonos on medium projects [mediawiki-config] - https://gerrit.wikimedia.org/r/936717 (https://phabricator.wikimedia.org/T336763)
[08:29:16] <TheresNoTime>	 jouncebot: nowandnext
[08:29:17] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 30 minute(s)
[08:29:17] <jouncebot>	 In 1 hour(s) and 30 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230810T1000)
[08:29:17] <jouncebot>	 In 1 hour(s) and 30 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230810T1000)
[08:31:36] <_joe_>	 TheresNoTime: hold your horses
[08:31:39] <_joe_>	 :)
[08:31:52] * TheresNoTime isn't going to deploy anything ^^
[08:36:50] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
[08:42:04] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts bast3007.wikimedia.org
[08:44:37] <wikibugs>	 (03PS1) 10JMeybohm: Revert "mediawiki: set requests based on php.workers" [deployment-charts] - 10https://gerrit.wikimedia.org/r/947792 (https://phabricator.wikimedia.org/T342748)
[08:45:06] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] Revert "mediawiki: set requests based on php.workers" [deployment-charts] - 10https://gerrit.wikimedia.org/r/947792 (https://phabricator.wikimedia.org/T342748) (owner: 10JMeybohm)
[08:46:15] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[08:51:31] <wikibugs>	 (03PS2) 10JMeybohm: Revert "mediawiki: set requests based on php.workers" [deployment-charts] - 10https://gerrit.wikimedia.org/r/947792 (https://phabricator.wikimedia.org/T342748)
[08:52:31] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] webperf: Remove Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/945782 (owner: 10Muehlenhoff)
[08:53:48] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: gather stats from haproxy for openstack and cloudlb [puppet] - 10https://gerrit.wikimedia.org/r/947354 (https://phabricator.wikimedia.org/T343885) (owner: 10David Caro)
[08:55:10] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] Revert "mediawiki: set requests based on php.workers" [deployment-charts] - 10https://gerrit.wikimedia.org/r/947792 (https://phabricator.wikimedia.org/T342748) (owner: 10JMeybohm)
[08:57:48] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-db1001.eqiad.wmnet
[08:58:20] <btullis>	 I am doing some airflow maintenance and rebooting a postgresql server. I have tried to put downtime in for everything, but there might be a bit of noise.
[08:58:53] <wikibugs>	 (03PS6) 10Stevemunene: airflow-wmde: configure wmde airflow instance [puppet] - 10https://gerrit.wikimedia.org/r/940938 (https://phabricator.wikimedia.org/T340648)
[09:00:20] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:00:56] <icinga-wm>	 PROBLEM - Checks that the local airflow scheduler for airflow @search is working properly on an-airflow1005 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-search /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1005.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[09:01:58] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:03:48] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:03:50] <urbanecm>	 !log mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=commonswiki --logwiki=metawiki 'CHUniZH' 'Musik CH' # T343867
[09:03:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:03:54] <stashbot>	 T343867: Unblock stuck global rename of Musik CH - https://phabricator.wikimedia.org/T343867
[09:04:03] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-db1001.eqiad.wmnet
[09:04:30] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-test-client1002.eqiad.wmnet
[09:04:47] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Revert "mediawiki: set requests based on php.workers" [deployment-charts] - 10https://gerrit.wikimedia.org/r/947792 (https://phabricator.wikimedia.org/T342748) (owner: 10JMeybohm)
[09:05:03] <taavi>	 urbanecm: looks like we have quite a few stuck renames atm. are you fixing those too or should I?
[09:05:14] <urbanecm>	 taavi: yep, working on it.
[09:05:20] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50420 bytes in 0.116 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:05:25] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-airflow1002.eqiad.wmnet
[09:05:34] <wikibugs>	 (CR) Jbond: [C: +1] Revert "logspam.pl: Filter out some persistent noise" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/945792 (https://phabricator.wikimedia.org/T323254) (owner: Bartosz Dziewoński)
[09:05:36] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "mediawiki: set requests based on php.workers" [deployment-charts] - 10https://gerrit.wikimedia.org/r/947792 (https://phabricator.wikimedia.org/T342748) (owner: 10JMeybohm)
[09:06:06] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-airflow1004.eqiad.wmnet
[09:06:10] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-airflow1002.eqiad.wmnet
[09:06:19] <wikibugs>	 (CR) Elukey: C:bigtop::hadoop move net-topology.py to files. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) (owner: Slyngshede)
[09:06:26] <urbanecm>	 taavi: since you're here: i think it's a good idea to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/947362/ soon. even though the list of IPs is not yet finalized, i think it's better to have the rule in place soon, and amend it as new info flows, rather than rushing the deployment seconds before Wikimania. what do you think?
[09:06:52] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-airflow1004.eqiad.wmnet
[09:07:32] <urbanecm>	 !log mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=commonswiki --logwiki=metawiki 'Garciajaysonpinolkwani98' 'Ne_Shokot_Pinolkwane'
[09:07:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:37] <taavi>	 yep, planning to do that today, after the current MW infra window or so
[09:07:40] <urbanecm>	 !log mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=amwiki --logwiki=metawiki 'Jean-Mahmood' 'User92259453'
[09:07:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:46] <urbanecm>	 taavi: okay, awesome, thanks :)
[09:08:09] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host karapace1001.eqiad.wmnet
[09:08:32] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Add Dell switches support to Homer/Cookbooks - https://phabricator.wikimedia.org/T320638 (10cmooney) Amazing work!  Looks great.  >>! In T320638#9082582, @ayounsi wrote: > * The ordering can be problematic (`# TODO needs to happen after the...
[09:08:53] <urbanecm>	 !log mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=enwiki --logwiki=metawiki 'Mittzy' 'Mittzy (usurped)'
[09:08:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:00] <urbanecm>	 !log mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=arwiki --logwiki=metawiki 'Qwertyoruiop' '3h6 1'
[09:09:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:19] <taavi>	 hmm, also I just noticed all of the stuck renames were done via Special:GlobalRenameUser and not via the queue. that makes me worried I broke something in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/934384, but I don't see anything
[09:09:38] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-client1002.eqiad.wmnet
[09:09:54] <urbanecm>	 taavi: hmm...let me test that
[09:10:24] <icinga-wm>	 RECOVERY - Checks that the local airflow scheduler for airflow @search is working properly on an-airflow1005 is OK: OK: /usr/bin/env AIRFLOW_HOME=/srv/airflow-search /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1005.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[09:12:12] <urbanecm>	 started https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress/Martin_Urbanec_(test_10-renamed), it started immediately
[09:12:30] <icinga-wm>	 PROBLEM - Check systemd state on kafkamon1003 is CRITICAL: CRITICAL - degraded: The following units failed: burrow-jumbo-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:12:31] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host karapace1001.eqiad.wmnet
[09:12:44] <urbanecm>	 so did a rename back
[09:15:14] <taavi>	 urbanecm: what happens if you input the username in a non-canonical format? so replace a space with an underscore, or a lowercase first letter, or similar
[09:15:24] <urbanecm>	 that was the rename back
[09:15:29] <urbanecm>	 but i can try other non-canonical formats
[09:19:28] <urbanecm>	 taavi: i managed to break it, but in a different way. 
[09:19:36] <urbanecm>	 https://meta.wikimedia.org/wiki/Special:CentralAuth/Martin_Urbanec_(test_10), https://meta.wikimedia.org/wiki/Special:CentralAuth/Martin_Urbanec_(test_10_renamed-02)
[09:20:04] <urbanecm>	 and https://meta.wikimedia.org/wiki/Special:CentralAuth/Martin_Urbanec_(test_10-renamed)
[09:20:30] <taavi>	 oops
[09:20:43] <urbanecm>	 the problem is i don't know how i broke it... trying more.
[09:22:22] <wikibugs>	 (03PS2) 10Ssingh: P:bird::anycast: use systemd::sysuser for creating the bird user [puppet] - 10https://gerrit.wikimedia.org/r/947425
[09:22:43] <wikibugs>	 (CR) Ssingh: P:bird::anycast: use systemd::sysuser for creating the bird user (1 comment) [puppet] - https://gerrit.wikimedia.org/r/947425 (owner: Ssingh)
[09:23:19] <taavi>	 in the meantime, I can fairly reliably reproduce the "jobs get lost" issue locally if the target username is in a non-canonical format. I'll update the task and see if I can come up with a fix
[09:23:32] <taavi>	 and reverting my patch does indeed fix the issue
[09:23:34] <wikibugs>	 (03PS4) 10Effie Mouzeli: Update blubberoid to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/947373 (https://phabricator.wikimedia.org/T300033)
[09:23:46] <icinga-wm>	 RECOVERY - Check systemd state on kafkamon1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:24:27] <urbanecm>	 taavi: yeah, and renaming one account twice seems to cause the other bug.
[09:25:29] <urbanecm>	 filing task...
[09:27:42] <wikibugs>	 SRE-tools, Cloud-VPS, Infrastructure-Foundations, Goal, cloud-services-team (FY2023/2024-Q1): cloudcumin: decide sudoers rules for users without global root - https://phabricator.wikimedia.org/T325067 (fnegri) p:Triage→Medium
[09:29:00] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to deployment for fkaelin - https://phabricator.wikimedia.org/T343957 (10Jelto)
[09:29:01] <urbanecm>	 filed T343956
[09:29:02] <stashbot>	 T343956: Renaming global account to non-canonical form causes rename jobs to be lost - https://phabricator.wikimedia.org/T343956
[09:29:24] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to deployment for fkaelin - https://phabricator.wikimedia.org/T343957 (10Jelto)
[09:32:20] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/947425 (owner: 10Ssingh)
[09:33:14] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[09:33:35] <urbanecm>	 and T343958
[09:33:35] <stashbot>	 T343958: Renaming one account multiple times creates duplicate global accounts - https://phabricator.wikimedia.org/T343958
[09:33:46] <urbanecm>	 taavi: does reverting the patch fix both issues? 
[09:34:17] <urbanecm>	 (i'd test, but i don't have CA set up (yet?) on my work laptop, and i don't have my personal laptop nearby atm)
[09:34:25] <wikibugs>	 (CR) JMeybohm: [C: +1] Update blubberoid to use certmanager certs (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/947373 (https://phabricator.wikimedia.org/T300033) (owner: Effie Mouzeli)
[09:34:27] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to  releasers-wikibase for RickiJay-WMDE - https://phabricator.wikimedia.org/T343508 (10RickiJay-WMDE) ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQDZMvLWML3HYfq2Tc1TvfUFInGtmN8DS01pcdYDetuiCklmTUFuRwYfeIhevlpwFKxauefEDs04YH/i0aupTfrGfORRtS/qLhn8lSQY3z73c/XlMOYwozfHeojc...
[09:36:23] <taavi>	 urbanecm: it does at least for the first one, but I think I have a one-line patch for the first one
[09:36:29] <taavi>	 will look at the second one after I'm done testing this
[09:36:40] <urbanecm>	 okay, ty. 
[09:37:36] * urbanecm leaves the accounts broken for now; i'll fix them once we fix the problem.
[09:38:33] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review: Spicerack: add distributed locking support - https://phabricator.wikimedia.org/T341973 (10jbond) >>! In T341973#9049479, @bking wrote: > Swift >  - CON: [[ https://platform.swiftstack.com/docs/introduction/openstack_swift.html#mass...
[09:41:00] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/944870 (owner: 10Muehlenhoff)
[09:42:00] <taavi>	 https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/947794/
[09:44:19] <wikibugs>	 (CR) Stevemunene: airflow-wmde: configure wmde airflow instance (2 comments) [puppet] - https://gerrit.wikimedia.org/r/940938 (https://phabricator.wikimedia.org/T340648) (owner: Stevemunene)
[09:46:05] <urbanecm>	 +2'ed.
[09:49:21] <taavi>	 unable to reproduce the second bug locally, could you clarify which usernames you're trying to rename at each step?
[09:50:04] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] Policy and definition updates for post-migration esams ranges [homer/public] - 10https://gerrit.wikimedia.org/r/944216 (https://phabricator.wikimedia.org/T343214) (owner: 10Cathal Mooney)
[09:52:31] <urbanecm>	 taavi: clarified the steps, according to my notes of what i did. 
[09:52:52] <urbanecm>	 (it might be something specific to WMF infra that's not present locally, theoretically)
[09:53:02] <wikibugs>	 (CR) Jbond: Modify install and apt server config to support Juniper ZTP via HTTP (3 comments) [puppet] - https://gerrit.wikimedia.org/r/942682 (https://phabricator.wikimedia.org/T336485) (owner: Cathal Mooney)
[09:53:27] <wikibugs>	 (03PS1) 10Urbanecm: GlobalRename: Ensure status database rows use the normalized name [extensions/CentralAuth] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/947454 (https://phabricator.wikimedia.org/T343956)
[09:54:02] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] P:bird::anycast: use systemd::sysuser for creating the bird user [puppet] - 10https://gerrit.wikimedia.org/r/947425 (owner: 10Ssingh)
[09:54:05] <wikibugs>	 (03CR) 10David Caro: [V: 03+1 C: 03+2] prometheus: gather stats from haproxy for openstack and cloudlb [puppet] - 10https://gerrit.wikimedia.org/r/947354 (https://phabricator.wikimedia.org/T343885) (owner: 10David Caro)
[09:54:07] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] bird: drop support for buster [puppet] - 10https://gerrit.wikimedia.org/r/947412 (owner: 10Ssingh)
[09:54:26] <sukhe>	 dcaro: ok to merge yours?
[09:54:32] <dcaro>	 sukhe: yes please :)
[09:54:33] <sukhe>	 David Caro: prometheus: gather stats from haproxy for openstack and cloudlb (b6592cf212)
[09:54:36] <sukhe>	 thanks
[09:55:55] <taavi>	 urbanecm: thanks, reproduced locally
[09:56:03] <urbanecm>	 👍
[09:56:59] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to releasers-wikibase for RickiJay-WMDE - https://phabricator.wikimedia.org/T343508 (RickiJay-WMDE) a:RickiJay-WMDE→None
[09:57:25] <wikibugs>	 10SRE, 10AQS2.0, 10Cassandra, 10serviceops, 10Service-deployment-requests: AQS 2.0 differentially private pageviews deploy API - https://phabricator.wikimedia.org/T343855 (10LSobanski)
[09:59:01] <wikibugs>	 (03CR) 10Btullis: "I think that the way I would tackle this is to try to avoid duplication." [puppet] - 10https://gerrit.wikimedia.org/r/947714 (owner: 10Stevemunene)
[10:00:04] <jouncebot>	 mvolz: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230810T1000).
[10:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230810T1000)
[10:07:16] <wikibugs>	 (03PS8) 10D3r1ck01: wmf-config: Remove wgContentTranslationDefaultParsoidClient cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/930798
[10:07:24] <wikibugs>	 (03CR) 10Stang: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947495 (https://phabricator.wikimedia.org/T343946) (owner: 10Mdaniels5757)
[10:07:43] <wikibugs>	 (CR) D3r1ck01: wmf-config: Remove wgContentTranslationDefaultParsoidClient cleanup (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/930798 (owner: D3r1ck01)
[10:09:37] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] Update blubberoid to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/947373 (https://phabricator.wikimedia.org/T300033) (owner: 10Effie Mouzeli)
[10:10:22] <wikibugs>	 (03Merged) 10jenkins-bot: Update blubberoid to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/947373 (https://phabricator.wikimedia.org/T300033) (owner: 10Effie Mouzeli)
[10:10:35] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host durum6002.drmrs.wmnet with OS bookworm
[10:10:35] <sukhe>	 BGP/BFD alerts expected in drmrs
[10:12:05] <wikibugs>	 (03PS1) 10EoghanGaffney: gitlab: Add missing options for objectstore and extract swift key [puppet] - 10https://gerrit.wikimedia.org/r/947798
[10:13:36] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-tool1007.eqiad.wmnet
[10:13:37] <wikibugs>	 (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42815/console" [puppet] - 10https://gerrit.wikimedia.org/r/947798 (owner: 10EoghanGaffney)
[10:14:50] <icinga-wm>	 PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:15:00] <icinga-wm>	 PROBLEM - BGP status on asw1-b13-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:15:03] <sukhe>	 expected
[10:16:29] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1090.eqiad.wmnet with OS bullseye
[10:17:29] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1007.eqiad.wmnet
[10:17:43] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-tool1010.eqiad.wmnet
[10:21:32] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job cloudlb_harpoxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:23:37] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1010.eqiad.wmnet
[10:26:33] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job cloudlb_harpoxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:29:08] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1090.eqiad.wmnet with reason: host reimage
[10:30:38] <taavi>	 urbanecm: found the other issue too! https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/947799
[10:32:10] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1090.eqiad.wmnet with reason: host reimage
[10:32:43] <logmsgbot>	 !log jiji@deploy1002 helmfile [staging] START helmfile.d/services/blubberoid: apply
[10:32:58] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum6002.drmrs.wmnet with reason: host reimage
[10:33:46] <logmsgbot>	 !log jiji@deploy1002 helmfile [staging] DONE helmfile.d/services/blubberoid: apply
[10:34:57] <logmsgbot>	 !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/blubberoid: apply
[10:36:07] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum6002.drmrs.wmnet with reason: host reimage
[10:36:16] <logmsgbot>	 !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/blubberoid: apply
[10:42:13] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:43:13] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 8.571 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:44:19] <urbanecm>	 taavi: thanks for fixing both issues! Commented on the patch; the explanation in the commit message should probably be on the task as well, to make it easier to link in code comments/etc (this seems likely to happen again when someone decides to refactor things). 
[10:44:40] <urbanecm>	 Will test once I get to my personal laptop, unless someone beats me :)
[10:45:46] <logmsgbot>	 !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/blubberoid: apply
[10:46:20] <logmsgbot>	 !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/blubberoid: apply
[10:46:32] <wikibugs>	 (03CR) 10Jbond: [C: 04-1] "idea looks good but minor bug" [puppet] - 10https://gerrit.wikimedia.org/r/944911 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[10:47:54] <taavi>	 urbanecm: thanks, fixed and will do
[10:48:08] <wikibugs>	 (03PS1) 10Effie Mouzeli: Update citoid to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/947801 (https://phabricator.wikimedia.org/T300033)
[10:55:53] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1090.eqiad.wmnet with OS bullseye
[10:58:28] <wikibugs>	 (03PS2) 10Muehlenhoff: Add base nftables sets which are equivalent to the main Ferm macros [puppet] - 10https://gerrit.wikimedia.org/r/944911 (https://phabricator.wikimedia.org/T336497)
[10:58:52] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add base nftables sets which are equivalent to the main Ferm macros [puppet] - 10https://gerrit.wikimedia.org/r/944911 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[11:00:13] <wikibugs>	 (03PS1) 10Effie Mouzeli: Update changeprop to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/947802 (https://phabricator.wikimedia.org/T300033)
[11:00:15] <wikibugs>	 (03PS3) 10Muehlenhoff: Add base nftables sets which are equivalent to the main Ferm macros [puppet] - 10https://gerrit.wikimedia.org/r/944911 (https://phabricator.wikimedia.org/T336497)
[11:04:31] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "minor optional follow up comment" [cookbooks] - 10https://gerrit.wikimedia.org/r/942638 (https://phabricator.wikimedia.org/T337028) (owner: 10Cathal Mooney)
[11:04:49] <wikibugs>	 (CR) Muehlenhoff: Add base nftables sets which are equivalent to the main Ferm macros (3 comments) [puppet] - https://gerrit.wikimedia.org/r/944911 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[11:06:36] <wikibugs>	 (03PS1) 10Effie Mouzeli: Update citoid to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/947804 (https://phabricator.wikimedia.org/T300033)
[11:09:00] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 04-1] "I am not sure this is correct, needs a little more thought" [deployment-charts] - 10https://gerrit.wikimedia.org/r/947802 (https://phabricator.wikimedia.org/T300033) (owner: 10Effie Mouzeli)
[11:09:11] <wikibugs>	 (03PS2) 10Muehlenhoff: firewall: Make more Ferm-specific setup conditional to the ferm provider [puppet] - 10https://gerrit.wikimedia.org/r/945557 (https://phabricator.wikimedia.org/T336497)
[11:09:52] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/944911 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[11:11:40] <wikibugs>	 (03Abandoned) 10Effie Mouzeli: Update citoid to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/947804 (https://phabricator.wikimedia.org/T300033) (owner: 10Effie Mouzeli)
[11:12:11] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/945557 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[11:12:24] <wikibugs>	 (03PS1) 10Effie Mouzeli: Update cxserver to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/947805 (https://phabricator.wikimedia.org/T300033)
[11:13:36] <wikibugs>	 (03PS2) 10Effie Mouzeli: Update citoid to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/947801 (https://phabricator.wikimedia.org/T300033)
[11:13:57] <wikibugs>	 (03PS1) 10Ssingh: wikimedia.org: update glue record for ns2 [dns] - 10https://gerrit.wikimedia.org/r/947807 (https://phabricator.wikimedia.org/T343942)
[11:14:00] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1091.eqiad.wmnet with OS bullseye
[11:14:11] <wikibugs>	 (CR) Muehlenhoff: firewall: Make more Ferm-specific setup conditional to the ferm provider (1 comment) [puppet] - https://gerrit.wikimedia.org/r/945557 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[11:14:37] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/945573 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[11:14:54] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wikimedia.org: update glue record for ns2 [dns] - 10https://gerrit.wikimedia.org/r/947807 (https://phabricator.wikimedia.org/T343942) (owner: 10Ssingh)
[11:17:54] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/947332 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[11:17:59] <wikibugs>	 (03PS2) 10Ssingh: wikimedia.org: update glue record for ns2 [dns] - 10https://gerrit.wikimedia.org/r/947807 (https://phabricator.wikimedia.org/T343942)
[11:18:17] <icinga-wm>	 RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:18:55] <icinga-wm>	 RECOVERY - BGP status on asw1-b13-drmrs.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:19:15] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/945586 (owner: 10Muehlenhoff)
[11:20:47] <wikibugs>	 (CR) Muehlenhoff: firewall: Ship a base profile for the nftables provider (1 comment) [puppet] - https://gerrit.wikimedia.org/r/945573 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[11:20:52] <wikibugs>	 (03PS1) 10Ssingh: hiera: update v4 IP for ns2 [puppet] - 10https://gerrit.wikimedia.org/r/947810 (https://phabricator.wikimedia.org/T343942)
[11:21:04] <taavi>	 jouncebot: nowandnext
[11:21:04] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 38 minute(s)
[11:21:04] <jouncebot>	 In 0 hour(s) and 38 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230810T1200)
[11:21:34] <wikibugs>	 (03PS1) 10Btullis: Temporarily disable the gobblin jobs on hadoop_test [puppet] - 10https://gerrit.wikimedia.org/r/947811 (https://phabricator.wikimedia.org/T329363)
[11:21:36] <wikibugs>	 (03PS1) 10Btullis: Re-enable the gobblin timers on hadoop_test [puppet] - 10https://gerrit.wikimedia.org/r/947812 (https://phabricator.wikimedia.org/T329363)
[11:21:50] <taavi>	 I'll deploy some config patches and a backport
[11:21:59] <wikibugs>	 (03CR) 10Ssingh: [C: 04-1] "Do not merge." [puppet] - 10https://gerrit.wikimedia.org/r/947810 (https://phabricator.wikimedia.org/T343942) (owner: 10Ssingh)
[11:22:11] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947361 (owner: 10Majavah)
[11:22:13] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947362 (https://phabricator.wikimedia.org/T343595) (owner: 10Majavah)
[11:22:54] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] GlobalRename: Ensure status database rows use the normalized name [extensions/CentralAuth] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/947454 (https://phabricator.wikimedia.org/T343956) (owner: 10Urbanecm)
[11:22:57] <wikibugs>	 (03Merged) 10jenkins-bot: throttle: remove expired rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947361 (owner: 10Majavah)
[11:22:59] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum6002.drmrs.wmnet with OS bookworm
[11:22:59] <wikibugs>	 (03Merged) 10jenkins-bot: throttle: add rules for Wikimania 2023 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947362 (https://phabricator.wikimedia.org/T343595) (owner: 10Majavah)
[11:23:04] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/945749 (owner: 10Muehlenhoff)
[11:23:20] <logmsgbot>	 !log taavi@deploy1002 Started scap: Backport for [[gerrit:947361|throttle: remove expired rules]], [[gerrit:947362|throttle: add rules for Wikimania 2023 (T343595)]]
[11:23:23] <stashbot>	 T343595: Increase account creation at Wikimania 2023 August 14-20 [Note: incomplete IP list] - https://phabricator.wikimedia.org/T343595
[11:23:24] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/945750 (owner: 10Muehlenhoff)
[11:24:27] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/945755 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[11:24:55] <logmsgbot>	 !log taavi@deploy1002 taavi: Backport for [[gerrit:947361|throttle: remove expired rules]], [[gerrit:947362|throttle: add rules for Wikimania 2023 (T343595)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[11:26:36] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/935523 (owner: 10Krinkle)
[11:27:18] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Temporarily disable the gobblin jobs on hadoop_test [puppet] - 10https://gerrit.wikimedia.org/r/947811 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis)
[11:27:23] <logmsgbot>	 !log taavi@deploy1002 taavi: Continuing with sync
[11:27:35] <wikibugs>	 (03Merged) 10jenkins-bot: GlobalRename: Ensure status database rows use the normalized name [extensions/CentralAuth] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/947454 (https://phabricator.wikimedia.org/T343956) (owner: 10Urbanecm)
[11:28:05] <wikibugs>	 (03Abandoned) 10Ori: Randomize thumbnail TTL to prevent stampedes [puppet] - 10https://gerrit.wikimedia.org/r/818145 (https://phabricator.wikimedia.org/T211661) (owner: 10Ori)
[11:28:38] <wikibugs>	 (03PS1) 10Jaime Nuche: releases jenkins: allow Scap to disable services on secondary hosts [puppet] - 10https://gerrit.wikimedia.org/r/947814 (https://phabricator.wikimedia.org/T343447)
[11:30:15] <icinga-wm>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:32:47] <wikibugs>	 (03Abandoned) 10Jbond: netbox: update netbox service to active/active [puppet] - 10https://gerrit.wikimedia.org/r/890385 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond)
[11:32:49] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-test-coord1001.eqiad.wmnet with OS bullseye
[11:33:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:34:51] <logmsgbot>	 !log taavi@deploy1002 Finished scap: Backport for [[gerrit:947361|throttle: remove expired rules]], [[gerrit:947362|throttle: add rules for Wikimania 2023 (T343595)]] (duration: 11m 30s)
[11:34:55] <stashbot>	 T343595: Increase account creation at Wikimania 2023 August 14-20 [Note: incomplete IP list] - https://phabricator.wikimedia.org/T343595
[11:35:16] <logmsgbot>	 !log taavi@deploy1002 Started scap: Backport for [[gerrit:947454|GlobalRename: Ensure status database rows use the normalized name (T343956)]]
[11:35:19] <stashbot>	 T343956: Renaming global account to non-canonical form causes rename jobs to be lost - https://phabricator.wikimedia.org/T343956
[11:35:27] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:35:29] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:36:41] <wikibugs>	 (03Abandoned) 10Jbond: netbox: update netbox so that its active/active [dns] - 10https://gerrit.wikimedia.org/r/890384 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond)
[11:36:48] <logmsgbot>	 !log taavi@deploy1002 taavi and urbanecm: Backport for [[gerrit:947454|GlobalRename: Ensure status database rows use the normalized name (T343956)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[11:36:59] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:37:25] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50421 bytes in 0.208 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:37:25] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.302 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:37:47] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] sre.puppet.sync-netbox-hiera: Add platform [cookbooks] - 10https://gerrit.wikimedia.org/r/931926 (https://phabricator.wikimedia.org/T329669) (owner: 10Jbond)
[11:38:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (18) High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:39:08] <logmsgbot>	 !log taavi@deploy1002 taavi and urbanecm: Continuing with sync
[11:39:34] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add manufacture to network devices - jbond@cumin1001 - T329669"
[11:39:37] <stashbot>	 T329669: Netbox: use the netbox to also sync networks - https://phabricator.wikimedia.org/T329669
[11:40:51] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add manufacture to network devices - jbond@cumin1001 - T329669"
[11:41:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T342617)', diff saved to https://phabricator.wikimedia.org/P50415 and previous config saved to /var/cache/conftool/dbconfig/20230810-114108-ladsgroup.json
[11:41:11] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[11:42:20] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/946919 (https://phabricator.wikimedia.org/T230600) (owner: 10Ayounsi)
[11:42:35] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1091.eqiad.wmnet with reason: host reimage
[11:44:28] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast3007.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[11:45:33] <logmsgbot>	 !log taavi@deploy1002 Finished scap: Backport for [[gerrit:947454|GlobalRename: Ensure status database rows use the normalized name (T343956)]] (duration: 10m 17s)
[11:45:36] <stashbot>	 T343956: Renaming global account to non-canonical form causes rename jobs to be lost - https://phabricator.wikimedia.org/T343956
[11:45:50] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-coord1001.eqiad.wmnet with reason: host reimage
[11:45:59] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1091.eqiad.wmnet with reason: host reimage
[11:48:52] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-coord1001.eqiad.wmnet with reason: host reimage
[11:53:17] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1131.eqiad.wmnet with reason: Maintenance
[11:53:31] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1131.eqiad.wmnet with reason: Maintenance
[11:53:38] <wikibugs>	 (03PS1) 10Jbond: tlsproxy::envoy: improve docs [puppet] - 10https://gerrit.wikimedia.org/r/947821
[11:55:11] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!  Although technically not the 'glue' record that's in the org zone not this wikimedia.org one :P" [dns] - 10https://gerrit.wikimedia.org/r/947807 (https://phabricator.wikimedia.org/T343942) (owner: 10Ssingh)
[11:56:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P50416 and previous config saved to /var/cache/conftool/dbconfig/20230810-115614-ladsgroup.json
[11:58:14] <wikibugs>	 (03PS1) 10Ori: Revert "Have the Swift rewrite proxy renew expiry headers" [puppet] - 10https://gerrit.wikimedia.org/r/947390
[11:58:18] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1126.eqiad.wmnet with reason: Maintenance
[11:58:31] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1126.eqiad.wmnet with reason: Maintenance
[11:58:33] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast3007.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[11:58:33] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:58:34] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts bast3007.wikimedia.org
[11:58:37] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Revert "Have the Swift rewrite proxy renew expiry headers" [puppet] - 10https://gerrit.wikimedia.org/r/947390 (owner: 10Ori)
[11:58:42] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Migrate bastions to Bookworm - https://phabricator.wikimedia.org/T343515 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `bast3007.wikimedia.org` - bast3007.wikimedia.org (**PASS**)   - Downtimed host on Icinga/Alertmanager   - F...
[12:00:04] <wikibugs>	 (03PS2) 10Ori: Revert "Have the Swift rewrite proxy renew expiry headers" [puppet] - 10https://gerrit.wikimedia.org/r/947390
[12:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230810T1200)
[12:00:27] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Revert "Have the Swift rewrite proxy renew expiry headers" [puppet] - 10https://gerrit.wikimedia.org/r/947390 (owner: 10Ori)
[12:00:49] <wikibugs>	 (03PS3) 10Ori: Revert "Have the Swift rewrite proxy renew expiry headers" [puppet] - 10https://gerrit.wikimedia.org/r/947390
[12:01:14] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Revert "Have the Swift rewrite proxy renew expiry headers" [puppet] - 10https://gerrit.wikimedia.org/r/947390 (owner: 10Ori)
[12:02:54] <wikibugs>	 (03CR) 10Ori: "Not tested." [puppet] - 10https://gerrit.wikimedia.org/r/947390 (owner: 10Ori)
[12:04:35] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm but see warning inline" [puppet] - 10https://gerrit.wikimedia.org/r/946559 (owner: 10Herron)
[12:04:45] <wikibugs>	 (03PS4) 10Ori: Revert "Have the Swift rewrite proxy renew expiry headers" [puppet] - 10https://gerrit.wikimedia.org/r/947390
[12:05:22] <wikibugs>	 (03PS5) 10Ori: Revert "Have the Swift rewrite proxy renew expiry headers" [puppet] - 10https://gerrit.wikimedia.org/r/947390 (https://phabricator.wikimedia.org/T211661)
[12:06:27] <wikibugs>	 (03PS3) 10Cathal Mooney: Policy and definition updates for post-migration esams ranges [homer/public] - 10https://gerrit.wikimedia.org/r/944216 (https://phabricator.wikimedia.org/T343214)
[12:08:17] <jinxer-wm>	 (NELHigh) firing: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[12:08:38] <jynus>	 checking
[12:08:45] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1091.eqiad.wmnet with OS bullseye
[12:08:50] <jayme>	 !incidents
[12:08:51] <sirenbot>	 3938 (UNACKED)  NELHigh sre (tcp.timed_out)
[12:08:51] <sirenbot>	 3937 (RESOLVED)  ATSBackendErrorsHigh cache_text sre (miscweb.discovery.wmnet eqsin)
[12:09:00] <jayme>	 !ack 3938
[12:09:00] <sirenbot>	 3938 (ACKED)  NELHigh sre (tcp.timed_out)
[12:09:14] <jynus>	 I don't see a spike yet on the logs
[12:09:22] <jynus>	 checking graphs
[12:09:51] <jynus>	 sustained since 12:01
[12:10:15] <jynus>	 it is acked
[12:10:18] <jynus>	 origin?
[12:10:27] <icinga-wm>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 38 probes of 778 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[12:10:38] <jynus>	 that points to eqsin
[12:11:03] <TheresNoTime>	 had a previous spike at 08:56 too
[12:11:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P50417 and previous config saved to /var/cache/conftool/dbconfig/20230810-121120-ladsgroup.json
[12:12:07] <jayme>	 yeah, nel points to text-lb.eqsin.wikimedia.org. as well for the tcp.timed_out
[12:12:09] <wikibugs>	 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10jbond) >>! In T320390#9068521, @Jelto wrote: > @jbond @SLyngshede-WMF do you have a idea how to change the name GitLab uses with O...
[12:12:57] <jynus>	 checking superset
[12:13:17] <jinxer-wm>	 (NELHigh) resolved: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[12:13:30] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Maintenance
[12:13:43] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Maintenance
[12:13:49] <jynus>	 TheresNoTime: thanks, will have a look at it too
[12:14:14] <TheresNoTime>	 jynus: (when you're not busy) which superset dash do you look at, just out of curiosity?
[12:14:38] <jynus>	 yeah, later when we are out of the incident (even if it resolved)
[12:14:39] <wikibugs>	 (03PS1) 10Btullis: Use a routable email address for sending kerberos details [puppet] - 10https://gerrit.wikimedia.org/r/947822 (https://phabricator.wikimedia.org/T318155)
[12:14:55] <TheresNoTime>	 (ack)
[12:15:43] <jynus>	 I think I have it, but switching to private channels
[12:15:45] <icinga-wm>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 2 probes of 778 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[12:17:09] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Announce new public IPv6 prefix from Amsterdam for knams migration - https://phabricator.wikimedia.org/T343216 (10cmooney) Range is being accepted by Arelion according to their looking glass: ` Router: adm-b6 / Amsterdam (Iron Mountain, Haarlem) Command: show bg...
[12:17:17] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Use a routable email address for sending kerberos details [puppet] - 10https://gerrit.wikimedia.org/r/947822 (https://phabricator.wikimedia.org/T318155) (owner: 10Btullis)
[12:22:02] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1092.eqiad.wmnet with OS bullseye
[12:25:12] <wikibugs>	 (03CR) 10Ssingh: [C: 04-1] "A bit unsure about this: the anycast IP already exists on lo so I am not sure if duplicating that is a good idea. Let's think a bit more." [puppet] - 10https://gerrit.wikimedia.org/r/947810 (https://phabricator.wikimedia.org/T343942) (owner: 10Ssingh)
[12:26:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T342617)', diff saved to https://phabricator.wikimedia.org/P50418 and previous config saved to /var/cache/conftool/dbconfig/20230810-122626-ladsgroup.json
[12:26:29] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[12:26:29] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Announce new public IPv6 prefix from Amsterdam for knams migration - https://phabricator.wikimedia.org/T343216 (10cmooney) HE also accepting and path I'm taking from home connection: `  core1.ams7.he.net> show ipv6 bgp routes detail 2a02:ec80:300::/48    Number...
[12:26:30] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[12:26:42] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[12:29:07] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:30:46] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Announce new public IPv6 prefix from Amsterdam for knams migration - https://phabricator.wikimedia.org/T343216 (10cmooney) Reachable from VPS in the UK although not sure exactly how it's coming in to us: ` root@uk:~# mtr -z -b -w -c 10 2a02:ec80:300:ffff::187 St...
[12:31:27] <icinga-wm>	 RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:34:35] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Announce new public IPv6 prefix from Amsterdam for knams migration - https://phabricator.wikimedia.org/T343216 (10cmooney) Also accepted by Liberty Global.  They also see a transit route via Tele2 (AS1257) so getting picked up there, as well as from Deutsche Tel...
[12:34:54] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: New IP and Vlan allocations for esams knams move - https://phabricator.wikimedia.org/T343214 (10cmooney)
[12:35:21] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Announce new public IPv6 prefix from Amsterdam for knams migration - https://phabricator.wikimedia.org/T343216 (10cmooney) 05Open→03Resolved
[12:38:02] <wikibugs>	 (03PS1) 10Ladsgroup: Enable url shortener in sidebar in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947823 (https://phabricator.wikimedia.org/T267921)
[12:38:45] <wikibugs>	 (03PS1) 10Btullis: Don't install python-is-python3 to presto servers [puppet] - 10https://gerrit.wikimedia.org/r/947824 (https://phabricator.wikimedia.org/T336281)
[12:39:32] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Don't install python-is-python3 to presto servers [puppet] - 10https://gerrit.wikimedia.org/r/947824 (https://phabricator.wikimedia.org/T336281) (owner: 10Btullis)
[12:39:50] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/947319 (owner: 10Muehlenhoff)
[12:40:49] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/947316 (owner: 10Muehlenhoff)
[12:41:50] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] Policy and definition updates for post-migration esams ranges [homer/public] - 10https://gerrit.wikimedia.org/r/944216 (https://phabricator.wikimedia.org/T343214) (owner: 10Cathal Mooney)
[12:42:09] <wikibugs>	 (03PS4) 10Cathal Mooney: Policy and definition updates for post-migration esams ranges [homer/public] - 10https://gerrit.wikimedia.org/r/944216 (https://phabricator.wikimedia.org/T343214)
[12:45:50] <wikibugs>	 (03PS1) 10Btullis: Remove settings relating to oozie on an-test-coord1001 [puppet] - 10https://gerrit.wikimedia.org/r/947826 (https://phabricator.wikimedia.org/T329363)
[12:46:08] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2161.codfw.wmnet with reason: Maintenance
[12:46:21] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2161.codfw.wmnet with reason: Maintenance
[12:47:19] <wikibugs>	 (03CR) 10Ayounsi: BGPalerter: mute software-update notifications (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/946919 (https://phabricator.wikimedia.org/T230600) (owner: 10Ayounsi)
[12:47:23] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] BGPalerter: mute software-update notifications [puppet] - 10https://gerrit.wikimedia.org/r/946919 (https://phabricator.wikimedia.org/T230600) (owner: 10Ayounsi)
[12:49:31] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[12:54:08] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1092.eqiad.wmnet with reason: host reimage
[12:57:17] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1092.eqiad.wmnet with reason: host reimage
[12:57:35] <wikibugs>	 (03PS2) 10Btullis: Remove settings relating to oozie on an-test-coord1001 [puppet] - 10https://gerrit.wikimedia.org/r/947826 (https://phabricator.wikimedia.org/T329363)
[12:57:37] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10User-jbond: Investigate the potential benefits of BGPalerter - https://phabricator.wikimedia.org/T230600 (10ayounsi) 05Open→03Resolved a:03jbond All done! Assigned to jbond as he did most of the work!
[12:58:55] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42818/console" [puppet] - 10https://gerrit.wikimedia.org/r/947826 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis)
[13:00:04] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: That opportune time is upon us again. Time for a UTC afternoon backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230810T1300).
[13:00:04] <jouncebot>	 Dreamy_Jazz and TheresNoTime: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:10] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "post-review typo I just noticed :|" [puppet] - 10https://gerrit.wikimedia.org/r/947354 (https://phabricator.wikimedia.org/T343885) (owner: 10David Caro)
[13:00:12] <Dreamy_Jazz>	 \o
[13:00:14] * TheresNoTime can deploy
[13:01:15] <TheresNoTime>	 Dreamy_Jazz: to confirm, you just need those scripts run?
[13:01:26] <Dreamy_Jazz>	 Yes
[13:01:30] <TheresNoTime>	 ack
[13:05:05] <TheresNoTime>	 (wait one)
[13:06:18] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: python3: update to python3-bullseye [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/947828
[13:06:20] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: httpd-fcgi: fix double logging issue [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/947829 (https://phabricator.wikimedia.org/T340935)
[13:06:39] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] python3: update to python3-bullseye [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/947828 (owner: 10Giuseppe Lavagetto)
[13:06:41] <TheresNoTime>	 `foreachwiki sql.php extensions/CheckUser/schema/mysql/cu_useragent_clienthints.sql` returns `Unable to open input file`, looking..
[13:06:46] <Dreamy_Jazz>	 If the scripts don't work, my intention was to add the tables to all wikis except testwiki.
[13:07:00] <wikibugs>	 (03PS1) 10Cathal Mooney: Reverse DNS includes for new /24 ranges assigned to esams [dns] - 10https://gerrit.wikimedia.org/r/947830 (https://phabricator.wikimedia.org/T343214)
[13:07:10] <Dreamy_Jazz>	 As testwiki already has the table
[13:07:53] <TheresNoTime>	 ack
[13:07:54] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Reverse DNS includes for new /24 ranges assigned to esams [dns] - 10https://gerrit.wikimedia.org/r/947830 (https://phabricator.wikimedia.org/T343214) (owner: 10Cathal Mooney)
[13:08:14] <TheresNoTime>	 looks like I need the full path, okay
[13:09:29] <TheresNoTime>	 !log `[samtar@mwmaint1002 ~]$ foreachwiki sql.php /srv/mediawiki-staging/php-1.41.0-wmf.20/extensions/CheckUser/schema/mysql/cu_useragent_clienthints.sql` for T258105
[13:09:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:09:33] <wikibugs>	 (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/947798 (owner: 10EoghanGaffney)
[13:09:33] <stashbot>	 T258105: Implement storage for User-Agent Client Hints header data - https://phabricator.wikimedia.org/T258105
[13:10:25] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] firewall: Ship a base profile for the nftables provider (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/945573 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[13:11:09] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] tlsproxy::envoy: improve docs [puppet] - 10https://gerrit.wikimedia.org/r/947821 (owner: 10Jbond)
[13:14:41] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm excluding the ci issue" [puppet] - 10https://gerrit.wikimedia.org/r/947822 (https://phabricator.wikimedia.org/T318155) (owner: 10Btullis)
[13:14:51] <TheresNoTime>	 !log `[samtar@mwmaint1002 ~]$ foreachwiki sql.php /srv/mediawiki-staging/php-1.41.0-wmf.20/extensions/CheckUser/schema/mysql/cu_useragent_clienthints_map.sql` for T258105
[13:14:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:14:55] <stashbot>	 T258105: Implement storage for User-Agent Client Hints header data - https://phabricator.wikimedia.org/T258105
[13:15:22] <TheresNoTime>	 Dreamy_Jazz: first script done, second running — I see the new table
[13:15:26] <Dreamy_Jazz>	 Thanks.
[13:15:32] <wikibugs>	 (03PS1) 10Jelto: gitlab_runner: add sonar-scanner-cli image to allowed_images [puppet] - 10https://gerrit.wikimedia.org/r/947832 (https://phabricator.wikimedia.org/T343975)
[13:15:41] <_joe_>	 James_F: sorry I just realized we never deployed https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/945534
[13:15:45] <TheresNoTime>	 `wikifunctionswiki` also already had the table, guessing that's expected
[13:16:03] <Dreamy_Jazz>	 Not sure, but I only requested it on testwiki
[13:16:18] <_joe_>	 jouncebot: now
[13:16:18] <jouncebot>	 For the next 0 hour(s) and 43 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230810T1300)
[13:16:45] <TheresNoTime>	 _joe_: want me to let you know when I'm done?
[13:16:46] <_joe_>	 TheresNoTime: can you ping me when you're done?
[13:16:48] <_joe_>	 yes :)
[13:16:49] <TheresNoTime>	 hah, yes
[13:16:50] <_joe_>	 lol
[13:17:08] <_joe_>	 I want to sync that patch for wikifunctions
[13:20:08] <TheresNoTime>	 (noting that I'm aware there's a lot of `WARNING`s being generated in logstash while these scripts run)
[13:20:14] <TheresNoTime>	 Dreamy_Jazz: both scripts run, I can see both of the new tables 
[13:20:21] <Dreamy_Jazz>	 Thanks.
[13:20:51] <wikibugs>	 (03PS5) 10Samtar: IS: Enable Phonos on medium projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936717 (https://phabricator.wikimedia.org/T336763)
[13:21:11] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: Add wikifunctions object cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/945534 (https://phabricator.wikimedia.org/T297815)
[13:21:39] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936717 (https://phabricator.wikimedia.org/T336763) (owner: 10Samtar)
[13:22:18] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1092.eqiad.wmnet with OS bullseye
[13:22:29] <wikibugs>	 (03Merged) 10jenkins-bot: IS: Enable Phonos on medium projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936717 (https://phabricator.wikimedia.org/T336763) (owner: 10Samtar)
[13:22:45] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:936717|IS: Enable Phonos on medium projects (T336763)]]
[13:22:48] <stashbot>	 T336763: Enable PhonosInlineAudioPlayerMode on all projects - https://phabricator.wikimedia.org/T336763
[13:24:04] <wikibugs>	 (03PS1) 10Marostegui: install_server: Do not reimage db2188 [puppet] - 10https://gerrit.wikimedia.org/r/947836
[13:24:18] <logmsgbot>	 !log samtar@deploy1002 samtar: Backport for [[gerrit:936717|IS: Enable Phonos on medium projects (T336763)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[13:24:28] * TheresNoTime testing
[13:24:42] <James_F>	 _joe_: Thanks for the reminder!
[13:24:57] <Lucas_WMDE>	 if there’s enough time at the end of the window (after _joe_), I’d like to do T343980 if that’s okay
[13:24:57] <stashbot>	 T343980: [IPM] Enable temporary accounts (IP Masking) on Beta Wikidata - https://phabricator.wikimedia.org/T343980
[13:25:09] <Lucas_WMDE>	 unless urbanecm or anyone else objects to enabling temp accounts on another wiki
[13:25:15] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db2188 [puppet] - 10https://gerrit.wikimedia.org/r/947836 (owner: 10Marostegui)
[13:26:05] <_joe_>	 James_F: I'm going to merge it now if there's space in this window
[13:26:59] <logmsgbot>	 !log samtar@deploy1002 samtar: Continuing with sync
[13:27:00] <wikibugs>	 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10brennen) > But users may not want to have their full name (cn?) in GitLab displayed (for example Rando McRandomface Jr). So the ui...
[13:27:06] <wikibugs>	 (03CR) 10David Caro: [V: 03+1 C: 03+2] prometheus: gather stats from haproxy for openstack and cloudlb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/947354 (https://phabricator.wikimedia.org/T343885) (owner: 10David Caro)
[13:27:33] <TheresNoTime>	 I'll be done after 936717 finishes syncing
[13:29:38] <wikibugs>	 (03PS3) 10Btullis: Remove settings relating to oozie on an-test-coord1001 [puppet] - 10https://gerrit.wikimedia.org/r/947826 (https://phabricator.wikimedia.org/T329363)
[13:30:46] <wikibugs>	 (03PS2) 10Ssingh: hiera: temporarily remove v4 IP for ns2 from authdns_addrs [puppet] - 10https://gerrit.wikimedia.org/r/947810 (https://phabricator.wikimedia.org/T343942)
[13:30:58] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42819/console" [puppet] - 10https://gerrit.wikimedia.org/r/947826 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis)
[13:31:50] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42820/console" [puppet] - 10https://gerrit.wikimedia.org/r/947810 (https://phabricator.wikimedia.org/T343942) (owner: 10Ssingh)
[13:32:13] <TheresNoTime>	 Lucas_WMDE: (bikeshedding, but..) is there a reason we're not considering enabling temp accounts to more beta cluster projects?
[13:32:40] <Lucas_WMDE>	 nothing in particular that I know of 🤷
[13:32:49] <TheresNoTime>	 (i.e. enabling it on beta wikidata, why not enable it just on the majority of beta projects)
[13:33:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:33:43] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:936717|IS: Enable Phonos on medium projects (T336763)]] (duration: 10m 58s)
[13:33:49] <stashbot>	 T336763: Enable PhonosInlineAudioPlayerMode on all projects - https://phabricator.wikimedia.org/T336763
[13:33:56] <TheresNoTime>	 _joe_: done, all yours
[13:34:02] <_joe_>	 TheresNoTime: thanks
[13:34:30] <Lucas_WMDE>	 TheresNoTime: I just don’t think that’s my call to make ^^
[13:34:40] <wikibugs>	 (03PS1) 10Andrew Bogott: Revert "Revert "cloudbackup200[12]: remove some spurious config from the last patch"" [puppet] - 10https://gerrit.wikimedia.org/r/947839
[13:35:07] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by oblivian@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/945534 (https://phabricator.wikimedia.org/T297815) (owner: 10Giuseppe Lavagetto)
[13:35:23] <TheresNoTime>	 Lucas_WMDE: [[WP:BOLD]] /sarcasm
[13:35:28] <Lucas_WMDE>	 lol
[13:35:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T342617)', diff saved to https://phabricator.wikimedia.org/P50419 and previous config saved to /var/cache/conftool/dbconfig/20230810-133534-ladsgroup.json
[13:35:37] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[13:35:46] <wikibugs>	 (03Merged) 10jenkins-bot: Add wikifunctions object cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/945534 (https://phabricator.wikimedia.org/T297815) (owner: 10Giuseppe Lavagetto)
[13:36:00] <logmsgbot>	 !log oblivian@deploy1002 Started scap: Backport for [[gerrit:945534|Add wikifunctions object cache (T297815)]]
[13:36:04] <stashbot>	 T297815: New Service Request memcached-wikifunctions - https://phabricator.wikimedia.org/T297815
[13:36:23] <Lucas_WMDE>	 Wikipedia has Be bold. Commons has Don’t be bold. Wikitech has a secret third thing
[13:36:51] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "cumin:O:dnsbox output: https://puppet-compiler.wmflabs.org/output/947810/42821/" [puppet] - 10https://gerrit.wikimedia.org/r/947810 (https://phabricator.wikimedia.org/T343942) (owner: 10Ssingh)
[13:37:16] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "Needs a second pair of eyes, please review the PCC as well." [puppet] - 10https://gerrit.wikimedia.org/r/947810 (https://phabricator.wikimedia.org/T343942) (owner: 10Ssingh)
[13:37:38] <logmsgbot>	 !log oblivian@deploy1002 oblivian: Backport for [[gerrit:945534|Add wikifunctions object cache (T297815)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[13:38:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:38:46] <logmsgbot>	 !log oblivian@deploy1002 oblivian: Continuing with sync
[13:39:06] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "lo               UNKNOWN        127.0.0.1/8 208.80.154.238/32 208.80.153.231/32 91.198.174.239/32 10.3.0.1/32 198.35.27.27/32 ::1/128" [puppet] - 10https://gerrit.wikimedia.org/r/947810 (https://phabricator.wikimedia.org/T343942) (owner: 10Ssingh)
[13:39:14] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Revert "Revert "cloudbackup200[12]: remove some spurious config from the last patch"" [puppet] - 10https://gerrit.wikimedia.org/r/947839 (owner: 10Andrew Bogott)
[13:43:05] <wikibugs>	 (03PS1) 10Andrew Bogott: remove 'cluster:wmcs' from cloudbackup2xxx host config [puppet] - 10https://gerrit.wikimedia.org/r/947842
[13:44:04] <jinxer-wm>	 (KubernetesAPILatency) firing: (3) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:44:53] <wikibugs>	 (03PS4) 10Btullis: Remove settings relating to oozie on an-test-coord1001 [puppet] - 10https://gerrit.wikimedia.org/r/947826 (https://phabricator.wikimedia.org/T329363)
[13:45:10] <logmsgbot>	 !log oblivian@deploy1002 Finished scap: Backport for [[gerrit:945534|Add wikifunctions object cache (T297815)]] (duration: 09m 09s)
[13:45:40] <wikibugs>	 (03PS1) 10Ssingh: bird: create /etc/bird without relying on postint [puppet] - 10https://gerrit.wikimedia.org/r/947843
[13:45:40] <stashbot>	 T297815: New Service Request memcached-wikifunctions - https://phabricator.wikimedia.org/T297815
[13:46:15] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42822/console" [puppet] - 10https://gerrit.wikimedia.org/r/947826 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis)
[13:46:42] <wikibugs>	 (03CR) 10Btullis: [V: 03+1 C: 03+2] Remove settings relating to oozie on an-test-coord1001 [puppet] - 10https://gerrit.wikimedia.org/r/947826 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis)
[13:47:18] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42823/console" [puppet] - 10https://gerrit.wikimedia.org/r/947843 (owner: 10Ssingh)
[13:47:26] <Emperor>	 !log depool and stop puppet on ms-fe2009 to test updated rewrite.py T211661
[13:47:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:47:30] <stashbot>	 T211661: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661
[13:47:36] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] remove 'cluster:wmcs' from cloudbackup2xxx host config [puppet] - 10https://gerrit.wikimedia.org/r/947842 (owner: 10Andrew Bogott)
[13:49:04] <jinxer-wm>	 (KubernetesAPILatency) resolved: (6) High Kubernetes API latency (PUT deployments) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:50:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P50420 and previous config saved to /var/cache/conftool/dbconfig/20230810-135040-ladsgroup.json
[13:51:53] <wikibugs>	 (03PS1) 10Andrew Bogott: Set profile::wmcs::backy2::backup_time in cloudbackup200x [puppet] - 10https://gerrit.wikimedia.org/r/947846
[13:51:58] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/935687 (https://phabricator.wikimedia.org/T307792) (owner: 10Slyngshede)
[13:52:52] <Emperor>	 !log restart puppet and repool ms-fe2009 after testing T211661
[13:52:53] <wikibugs>	 (03PS1) 10Lucas Werkmeister (WMDE): Enable IP Masking on Beta Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947847 (https://phabricator.wikimedia.org/T343980)
[13:52:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:52:56] <stashbot>	 T211661: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661
[13:53:45] <Lucas_WMDE>	 _joe_: are you done deploying?
[13:54:02] <_joe_>	 Lucas_WMDE: duh sorry yes it was a single patch
[13:54:10] <Lucas_WMDE>	 ok!
[13:54:22] <Lucas_WMDE>	 TheresNoTime: want to give https://gerrit.wikimedia.org/r/947847 a quick +1 before I deploy it? 🥺
[13:54:59] <wikibugs>	 (03CR) 10Samtar: [C: 03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947847 (https://phabricator.wikimedia.org/T343980) (owner: 10Lucas Werkmeister (WMDE))
[13:55:03] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Set profile::wmcs::backy2::backup_time in cloudbackup200x [puppet] - 10https://gerrit.wikimedia.org/r/947846 (owner: 10Andrew Bogott)
[13:55:05] <TheresNoTime>	 ^^
[13:55:11] <Lucas_WMDE>	 thx
[13:55:13] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install pc101[56] - https://phabricator.wikimedia.org/T342164 (10Jclark-ctr)
[13:55:29] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947847 (https://phabricator.wikimedia.org/T343980) (owner: 10Lucas Werkmeister (WMDE))
[13:55:58] <wikibugs>	 (03CR) 10MVernon: "I've tested this on ms-fe2009 (depooled, copied into place, restarted swift-proxy, ran rewrite_integration_test.py having made that +x loc" [puppet] - 10https://gerrit.wikimedia.org/r/947390 (https://phabricator.wikimedia.org/T211661) (owner: 10Ori)
[13:56:25] <Lucas_WMDE>	 jouncebot: next
[13:56:25] <jouncebot>	 In 2 hour(s) and 3 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230810T1600)
[13:56:28] <Lucas_WMDE>	 ok phew
[13:56:28] <wikibugs>	 (03Merged) 10jenkins-bot: Enable IP Masking on Beta Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947847 (https://phabricator.wikimedia.org/T343980) (owner: 10Lucas Werkmeister (WMDE))
[13:56:31] <Lucas_WMDE>	 it’ll probably overrun a bit
[13:56:46] <Lucas_WMDE>	 oh wait, no it won’t
[13:56:52] <Lucas_WMDE>	 no scap sync for beta-only changes ^^
[13:57:06] * Lucas_WMDE done
[13:57:50] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[13:57:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:57:56] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp1[098-113] - https://phabricator.wikimedia.org/T342159 (10Jclark-ctr)
[13:58:14] <wikibugs>	 10SRE, 10ops-codfw: ganeti2014: broken RAM - https://phabricator.wikimedia.org/T341546 (10Jhancock.wm) @MoritzMuehlenhoff you nailed it. Got that updated for you. Can you confirm that it's working as expected now?
[13:58:42] <wikibugs>	 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, and 2 others: Final steps for fully-Kubernetes Thumbor - https://phabricator.wikimedia.org/T334488 (10jijiki)
[14:01:16] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-test-coord1001.eqiad.wmnet with OS bullseye
[14:02:19] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] "overall lgtm but 1 comment" [puppet] - 10https://gerrit.wikimedia.org/r/947843 (owner: 10Ssingh)
[14:02:24] <wikibugs>	 (03CR) 10MVernon: [C: 03+1] "[I've looked at this one and it seems right to me; you already have +1s on all of these similar changes, so I don't propose to re-review t" [puppet] - 10https://gerrit.wikimedia.org/r/944959 (https://phabricator.wikimedia.org/T339298) (owner: 10Eevans)
[14:02:55] <James_F>	 I'll take the quiet moment to push out a few more WF patches.
[14:03:11] <wikibugs>	 (03PS3) 10Jforrester: Add wikifunctions-staff to wmgPrivilegedGroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946584 (https://phabricator.wikimedia.org/T342868)
[14:03:16] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946584 (https://phabricator.wikimedia.org/T342868) (owner: 10Jforrester)
[14:03:47] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] bird: create /etc/bird without relying on postint (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/947843 (owner: 10Ssingh)
[14:03:56] <wikibugs>	 (03Merged) 10jenkins-bot: Add wikifunctions-staff to wmgPrivilegedGroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946584 (https://phabricator.wikimedia.org/T342868) (owner: 10Jforrester)
[14:04:10] <logmsgbot>	 !log jforrester@deploy1002 Started scap: Backport for [[gerrit:946584|Add wikifunctions-staff to wmgPrivilegedGroups (T342868)]]
[14:04:18] <stashbot>	 T342868: Add oathauth-enable to wikifunctions-staff - https://phabricator.wikimedia.org/T342868
[14:05:42] <logmsgbot>	 !log jforrester@deploy1002 jforrester: Backport for [[gerrit:946584|Add wikifunctions-staff to wmgPrivilegedGroups (T342868)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[14:05:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P50421 and previous config saved to /var/cache/conftool/dbconfig/20230810-140546-ladsgroup.json
[14:06:07] <logmsgbot>	 !log jforrester@deploy1002 jforrester: Continuing with sync
[14:06:33] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job cloudlb_harpoxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:09:21] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "on PTO, but this LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/945573 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[14:11:07] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "on PTO, but this LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/944911 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[14:12:45] <logmsgbot>	 !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:946584|Add wikifunctions-staff to wmgPrivilegedGroups (T342868)]] (duration: 08m 35s)
[14:12:54] <stashbot>	 T342868: Add oathauth-enable to wikifunctions-staff - https://phabricator.wikimedia.org/T342868
[14:12:58] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "This LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/945586 (owner: 10Muehlenhoff)
[14:13:27] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/945809 (https://phabricator.wikimedia.org/T342753) (owner: 10Jforrester)
[14:13:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PUT secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:13:36] <wikibugs>	 (03PS2) 10Jforrester: Wikifunctions: Tell WikiLambda to stash results in our bespoke cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/945809 (https://phabricator.wikimedia.org/T342753)
[14:13:39] <wikibugs>	 (03CR) 10TrainBranchBot: "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/945809 (https://phabricator.wikimedia.org/T342753) (owner: 10Jforrester)
[14:14:21] <wikibugs>	 (03Merged) 10jenkins-bot: Wikifunctions: Tell WikiLambda to stash results in our bespoke cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/945809 (https://phabricator.wikimedia.org/T342753) (owner: 10Jforrester)
[14:14:38] <logmsgbot>	 !log jforrester@deploy1002 Started scap: Backport for [[gerrit:945809|Wikifunctions: Tell WikiLambda to stash results in our bespoke cache (T342753)]]
[14:14:41] <stashbot>	 T342753: Add MW caching for Wikifunctions functions calls into Wikimedia production - https://phabricator.wikimedia.org/T342753
[14:16:06] <logmsgbot>	 !log jforrester@deploy1002 jforrester: Backport for [[gerrit:945809|Wikifunctions: Tell WikiLambda to stash results in our bespoke cache (T342753)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[14:16:17] <logmsgbot>	 !log jforrester@deploy1002 jforrester: Continuing with sync
[14:16:33] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job cloudlb_harpoxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:16:42] <wikibugs>	 (03PS8) 10Herron: thanos-fe: switch to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/946559 (https://phabricator.wikimedia.org/T343987)
[14:17:41] <wikibugs>	 (03PS2) 10Jforrester: wikifunctions: Allow transwiki import from Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946541 (https://phabricator.wikimedia.org/T343365) (owner: 10Stang)
[14:18:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:18:58] <wikibugs>	 (03CR) 10Herron: thanos-fe: switch to cfssl (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/946559 (https://phabricator.wikimedia.org/T343987) (owner: 10Herron)
[14:20:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T342617)', diff saved to https://phabricator.wikimedia.org/P50422 and previous config saved to /var/cache/conftool/dbconfig/20230810-142053-ladsgroup.json
[14:20:58] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance
[14:21:00] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[14:21:11] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance
[14:21:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2172 (T342617)', diff saved to https://phabricator.wikimedia.org/P50423 and previous config saved to /var/cache/conftool/dbconfig/20230810-142117-ladsgroup.json
[14:22:13] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:22:53] <logmsgbot>	 !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:945809|Wikifunctions: Tell WikiLambda to stash results in our bespoke cache (T342753)]] (duration: 08m 15s)
[14:22:58] <stashbot>	 T342753: Add MW caching for Wikifunctions functions calls into Wikimedia production - https://phabricator.wikimedia.org/T342753
[14:23:35] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.244 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:25:04] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946541 (https://phabricator.wikimedia.org/T343365) (owner: 10Stang)
[14:25:44] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Allow transwiki import from Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/946541 (https://phabricator.wikimedia.org/T343365) (owner: 10Stang)
[14:25:58] <logmsgbot>	 !log jforrester@deploy1002 Started scap: Backport for [[gerrit:946541|wikifunctions: Allow transwiki import from Wikidata (T343365)]]
[14:26:01] <stashbot>	 T343365: Allow transwiki import from Wikidata to Wikifunctions - https://phabricator.wikimedia.org/T343365
[14:26:33] <wikibugs>	 (03CR) 10Ayounsi: "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/947810 (https://phabricator.wikimedia.org/T343942) (owner: 10Ssingh)
[14:27:29] <logmsgbot>	 !log jforrester@deploy1002 stang and jforrester: Backport for [[gerrit:946541|wikifunctions: Allow transwiki import from Wikidata (T343365)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[14:28:17] <logmsgbot>	 !log jforrester@deploy1002 stang and jforrester: Continuing with sync
[14:30:04] <wikibugs>	 (03PS2) 10Effie Mouzeli: thumbor: remove thumbor server configuration [puppet] - 10https://gerrit.wikimedia.org/r/946951 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan)
[14:30:08] <wikibugs>	 (03CR) 10EoghanGaffney: [V: 03+1 C: 03+2] gitlab: Add missing options for objectstore and extract swift key [puppet] - 10https://gerrit.wikimedia.org/r/947798 (owner: 10EoghanGaffney)
[14:31:13] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp1[098-113] - https://phabricator.wikimedia.org/T342159 (10RobH) >>! In T342159#9067390, @ssingh wrote: >>>! In T342159#9025176, @RobH wrote: >> Please note parent task 341588 has the range of cp1[090-105] however, cp1090 is already live/in us...
[14:34:27] <wikibugs>	 (03PS3) 10Effie Mouzeli: thumbor: remove thumbor server configuration [puppet] - 10https://gerrit.wikimedia.org/r/946951 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan)
[14:35:21] <logmsgbot>	 !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:946541|wikifunctions: Allow transwiki import from Wikidata (T343365)]] (duration: 09m 22s)
[14:35:24] <stashbot>	 T343365: Allow transwiki import from Wikidata to Wikifunctions - https://phabricator.wikimedia.org/T343365
[14:36:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (10) High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:36:36] <James_F>	 (All done.)
[14:36:57] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[14:37:32] <_joe_>	 James_F: I don't see any use of the caches though
[14:37:41] <wikibugs>	 (03PS4) 10Hnowlan: thumbor: remove thumbor server configuration [puppet] - 10https://gerrit.wikimedia.org/r/946951 (https://phabricator.wikimedia.org/T334488)
[14:38:14] <wikibugs>	 (03CR) 10Hnowlan: thumbor: remove thumbor server configuration (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/946951 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan)
[14:38:22] <wikibugs>	 (03PS6) 10Ori: Revert "Have the Swift rewrite proxy renew expiry headers" [puppet] - 10https://gerrit.wikimedia.org/r/947390
[14:38:37] <wikibugs>	 (03PS5) 10Effie Mouzeli: thumbor: remove thumbor server configuration [puppet] - 10https://gerrit.wikimedia.org/r/946951 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan)
[14:39:20] <wikibugs>	 (03CR) 10Ori: Revert "Have the Swift rewrite proxy renew expiry headers" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/947390 (owner: 10Ori)
[14:40:31] <wikibugs>	 (03PS1) 10Ayounsi: esams/knams: stop anycast advertisments [homer/public] - 10https://gerrit.wikimedia.org/r/947856
[14:40:43] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+1] "This is a historic moment." [puppet] - 10https://gerrit.wikimedia.org/r/946951 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan)
[14:41:35] <jinxer-wm>	 (KubernetesAPILatency) resolved: (10) High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:41:50] <wikibugs>	 (03PS1) 10Btullis: Create component/libmysql-java for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/947857 (https://phabricator.wikimedia.org/T329363)
[14:42:32] <wikibugs>	 (03CR) 10Btullis: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/947822 (https://phabricator.wikimedia.org/T318155) (owner: 10Btullis)
[14:44:56] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] Add dr0ptp4kt to wmcs-admin [puppet] - 10https://gerrit.wikimedia.org/r/947028 (https://phabricator.wikimedia.org/T343862) (owner: 10Dr0ptp4kt)
[14:47:27] <wikibugs>	 (03PS6) 10Effie Mouzeli: thumbor: remove thumbor server configuration [puppet] - 10https://gerrit.wikimedia.org/r/946951 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan)
[14:47:54] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] thumbor: remove thumbor server configuration [puppet] - 10https://gerrit.wikimedia.org/r/946951 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan)
[14:50:04] <wikibugs>	 (03PS7) 10Effie Mouzeli: thumbor: remove thumbor server configuration [puppet] - 10https://gerrit.wikimedia.org/r/946951 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan)
[14:50:29] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+1] thumbor: remove thumbor server configuration [puppet] - 10https://gerrit.wikimedia.org/r/946951 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan)
[14:53:13] <wikibugs>	 (03PS2) 10Btullis: Use a routable email address for sending kerberos details [puppet] - 10https://gerrit.wikimedia.org/r/947822 (https://phabricator.wikimedia.org/T318155)
[14:56:38] <wikibugs>	 (03CR) 10Btullis: Use a routable email address for sending kerberos details (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/947822 (https://phabricator.wikimedia.org/T318155) (owner: 10Btullis)
[14:59:11] <wikibugs>	 (03CR) 10MVernon: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/947390 (owner: 10Ori)
[14:59:50] <wikibugs>	 (03CR) 10JHathaway: [C: 03+1] profile::mirrors::serve: Remove Ferm-specific syntax (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/944768 (owner: 10Muehlenhoff)
[15:01:33] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Use a routable email address for sending kerberos details [puppet] - 10https://gerrit.wikimedia.org/r/947822 (https://phabricator.wikimedia.org/T318155) (owner: 10Btullis)
[15:02:53] <wikibugs>	 (03CR) 10MVernon: "Is the intention here to make similar changes to the ms swift proxies? I'd like to avoid more skew developing between thanos-swift-config " [puppet] - 10https://gerrit.wikimedia.org/r/946559 (https://phabricator.wikimedia.org/T343987) (owner: 10Herron)
[15:03:50] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10MW-on-K8s, 10serviceops-radar: Physical re-labeling of mw1497 and mw1498 to kubernetes1025 and kubernetes1026 - https://phabricator.wikimedia.org/T343708 (10jijiki)
[15:05:13] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10MW-on-K8s, 10serviceops-radar: Physical re-labeling of mw1497 and mw1498 to kubernetes1025 and kubernetes1026 - https://phabricator.wikimedia.org/T343708 (10jijiki)
[15:06:21] <wikibugs>	 (03CR) 10Ori: Revert "Have the Swift rewrite proxy renew expiry headers" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/947390 (owner: 10Ori)
[15:07:40] <James_F>	 _joe_: Hmm, it seems to be working from the application level, at least.
[15:09:31] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Add dr0ptp4kt to wmcs-admin [puppet] - 10https://gerrit.wikimedia.org/r/947028 (https://phabricator.wikimedia.org/T343862) (owner: 10Dr0ptp4kt)
[15:10:03] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:13:06] <wikibugs>	 (03PS1) 10Hnowlan: deployment_server: add new service geo-analytics [puppet] - 10https://gerrit.wikimedia.org/r/947862 (https://phabricator.wikimedia.org/T336400)
[15:13:18] <_joe_>	 James_F: do you have one key?
[15:13:38] <James_F>	 Not yet.
[15:13:57] <wikibugs>	 (03CR) 10Herron: thanos-fe: switch to cfssl (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/946559 (https://phabricator.wikimedia.org/T343987) (owner: 10Herron)
[15:14:09] <_joe_>	 I suspect for some reason we're sending requests to the wrong cluster
[15:14:49] <wikibugs>	 (03CR) 10Jbond: "see comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/947843 (owner: 10Ssingh)
[15:15:02] <wikibugs>	 (03CR) 10Eevans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/947390 (owner: 10Ori)
[15:15:03] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:15:56] <wikibugs>	 (03CR) 10MVernon: [C: 03+2] Revert "Have the Swift rewrite proxy renew expiry headers" [puppet] - 10https://gerrit.wikimedia.org/r/947390 (owner: 10Ori)
[15:16:13] <icinga-wm>	 PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,fstrim.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:16:29] <icinga-wm>	 PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:18:41] <wikibugs>	 (03CR) 10Jbond: "As a matter of process this should get sign-off from Nicholas as the group approver" [puppet] - 10https://gerrit.wikimedia.org/r/947028 (https://phabricator.wikimedia.org/T343862) (owner: 10Dr0ptp4kt)
[15:19:28] <wikibugs>	 (03PS1) 10Hnowlan: service, conftool: add base configuration for geo-analytics [puppet] - 10https://gerrit.wikimedia.org/r/947864 (https://phabricator.wikimedia.org/T336400)
[15:19:41] <icinga-wm>	 PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:20:45] <icinga-wm>	 PROBLEM - Disk space on config-master2001 is CRITICAL: DISK CRITICAL - free space: /run 145MiB (99% inode=0%): /run/credentials 145MiB (99% inode=0%): /run/systemd/incoming 145MiB (99% inode=0%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=config-master2001&var-datasource=codfw+prometheus/ops
[15:20:58] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe
[15:21:19] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to wmcs-admin for Wiki Replicas for dr0ptp4kt - https://phabricator.wikimedia.org/T343862 (10jbond) This change has already been merged however as a matter of process the following approvals should have been collected on this ticket  > - access request (or expans...
[15:21:34] <wikibugs>	 (03PS1) 10JMeybohm: CI: Bail out if admin_ng build fails completely [deployment-charts] - 10https://gerrit.wikimedia.org/r/947865 (https://phabricator.wikimedia.org/T343978)
[15:21:36] <wikibugs>	 (03PS1) 10JMeybohm: admin_ng: Add more configuration options for resourcequota and limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/947866 (https://phabricator.wikimedia.org/T343978)
[15:22:19] <icinga-wm>	 PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,fstrim.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:22:37] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10jijiki)
[15:23:58] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/946559 (https://phabricator.wikimedia.org/T343987) (owner: 10Herron)
[15:25:03] <icinga-wm>	 PROBLEM - Disk space on config-master1001 is CRITICAL: DISK CRITICAL - free space: /run 145MiB (99% inode=0%): /run/credentials 145MiB (99% inode=0%): /run/systemd/incoming 145MiB (99% inode=0%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=config-master1001&var-datasource=eqiad+prometheus/ops
[15:25:32] <wikibugs>	 (CR) Jbond: [C: +1] thanos-fe: switch to cfssl (1 comment) [puppet] - https://gerrit.wikimedia.org/r/946559 (https://phabricator.wikimedia.org/T343987) (owner: Herron)
[15:27:05] <wikibugs>	 (03PS2) 10JMeybohm: admin_ng: Add more configuration options for resourcequota and limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/947866 (https://phabricator.wikimedia.org/T343978)
[15:27:07] <James_F>	 _joe_: Not sure how to tell from the shell which BagOStuff I got back.
[15:27:23] <_joe_>	 James_F: check the prefix
[15:27:26] <_joe_>	 the routing prefix
[15:27:30] <wikibugs>	 (03PS1) 10JHathaway: dev env: cadvisor exporter, in container env listen on all interfaces [puppet] - 10https://gerrit.wikimedia.org/r/947868 (https://phabricator.wikimedia.org/T337972)
[15:28:06] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/947868 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway)
[15:28:25] <_joe_>	 James_F: how did you get the BagOStuff?
[15:28:49] <James_F>	 _joe_: `use MediaWiki\Extension\WikiLambda\WikiLambdaServices; WikiLambdaServices::getZObjectStash();`
[15:28:52] <wikibugs>	 (03CR) 10JHathaway: "kindly review!" [puppet] - 10https://gerrit.wikimedia.org/r/947868 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway)
[15:29:25] <wikibugs>	 10ops-eqiad, 10serviceops: Move eqiad thumbor hosts to kubernetes cluster thumbor1005,  thumbor1006 to kubernetes1057 and kubernetes1058 - https://phabricator.wikimedia.org/T343993 (10jijiki)
[15:29:47] <wikibugs>	 10ops-eqiad, 10serviceops: Move eqiad thumbor hosts to kubernetes cluster - https://phabricator.wikimedia.org/T343993 (10jijiki)
[15:30:33] <_joe_>	   ["routingPrefix":protected]=>
[15:30:33] <wikibugs>	 10ops-eqiad, 10serviceops: Move eqiad thumbor hosts to kubernetes cluster - https://phabricator.wikimedia.org/T343993 (10jijiki)
[15:30:34] <_joe_>	   string(9) "/local/wf"
[15:30:37] <_joe_>	 so that is correct
[15:30:44] <wikibugs>	 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, and 2 others: Final steps for fully-Kubernetes Thumbor - https://phabricator.wikimedia.org/T334488 (10jijiki)
[15:30:46] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe
[15:30:59] <icinga-wm>	 PROBLEM - Check systemd state on ms-fe1009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:31:42] <James_F>	 _joe_: Then hopefully it's going to the right place?
[15:32:26] <James_F>	 It's definitely getting cached somewhere.
[15:32:59] <_joe_>	 James_F: I'm trying to understand that :)
[15:33:01] <James_F>	 E.g. if I go to https://www.wikifunctions.org/wiki/Z801 and enter the string Wikimedia it echoes back correctly and says it did so 25 mins ago (cached result from my testing).
[15:33:07] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to wmcs-admin for Wiki Replicas for dr0ptp4kt - https://phabricator.wikimedia.org/T343862 (10Andrew) We discussed and supported this during the wmcs weekly meeting. Nicholas is on vacation and I'm approving as his proxy.
[15:34:39] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/947868 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway)
[15:35:05] <wikibugs>	 10ops-codfw, 10serviceops: Move codfw thumbor hosts to kubernetes cluster - https://phabricator.wikimedia.org/T343996 (10jijiki)
[15:35:26] <wikibugs>	 10ops-codfw, 10serviceops: Move codfw thumbor hosts to kubernetes cluster - https://phabricator.wikimedia.org/T343996 (10jijiki)
[15:36:15] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to wmcs-admin for Wiki Replicas for dr0ptp4kt - https://phabricator.wikimedia.org/T343862 (10Andrew) >>! In T343862#9084179, @jbond wrote: > This change has already been merged however as a matter of process the following approvals should have been collected on t...
[15:36:35] <wikibugs>	 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, and 2 others: Final steps for fully-Kubernetes Thumbor - https://phabricator.wikimedia.org/T334488 (10jijiki)
[15:36:38] <wikibugs>	 10ops-codfw, 10serviceops: Move codfw thumbor hosts to kubernetes cluster - https://phabricator.wikimedia.org/T343996 (10jijiki)
[15:39:45] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to wmcs-admin for Wiki Replicas for dr0ptp4kt - https://phabricator.wikimedia.org/T343862 (10jbond) >>! In T343862#9084235, @Andrew wrote: > We discussed and supported this during the wmcs weekly meeting. Nicholas is on vacation and I'm approving as his proxy.  a...
[15:40:26] <wikibugs>	 (03CR) 10Filippo Giunchedi: "This change is ready for review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/946551 (https://phabricator.wikimedia.org/T343302) (owner: 10Filippo Giunchedi)
[15:42:15] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[15:42:59] <icinga-wm>	 RECOVERY - Check systemd state on ms-fe1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:45:36] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] dev env: cadvisor exporter, in container env listen on all interfaces [puppet] - 10https://gerrit.wikimedia.org/r/947868 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway)
[15:45:47] <wikibugs>	 (03CR) 10JHathaway: [C: 03+2] dev env: cadvisor exporter, in container env listen on all interfaces [puppet] - 10https://gerrit.wikimedia.org/r/947868 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway)
[15:48:26] <wikibugs>	 (03PS2) 10Hnowlan: deployment_server: add new service geo-analytics [puppet] - 10https://gerrit.wikimedia.org/r/947862 (https://phabricator.wikimedia.org/T336400)
[15:51:27] <icinga-wm>	 PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:58:06] <wikibugs>	 (03PS1) 10Andrew Bogott: Remove yet more unneeded config from cloudbackup200x [puppet] - 10https://gerrit.wikimedia.org/r/947876
[16:00:06] <jouncebot>	 jbond and rzl: (Dis)respected human, time to deploy Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230810T1600). Please do the needful.
[16:00:06] <jouncebot>	 dancy: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[16:00:12] <dancy>	 o/
[16:00:19] <rzl>	 dancy: hey, really sorry to miss you on Tuesday, I was out sick
[16:00:27] <dancy>	 No problem!
[16:00:32] <dancy>	 sorry to hear you were ill!
[16:00:36] <wikibugs>	 (03CR) 10RLazarus: [C: 03+2] Revert "logspam.pl: Filter out some persistent noise" [puppet] - 10https://gerrit.wikimedia.org/r/945792 (https://phabricator.wikimedia.org/T323254) (owner: 10Bartosz Dziewoński)
[16:02:38] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Remove yet more unneeded config from cloudbackup200x [puppet] - 10https://gerrit.wikimedia.org/r/947876 (owner: 10Andrew Bogott)
[16:02:39] <rzl>	 dancy: manual puppet run on the mwlog hosts, right?
[16:02:55] <dancy>	 Yes please.
[16:04:52] <rzl>	 dancy: done, have a look
[16:09:10] <wikibugs>	 (03PS1) 10Btullis: Use the libmysql-java component on bullseye as well [puppet] - 10https://gerrit.wikimedia.org/r/947880 (https://phabricator.wikimedia.org/T329363)
[16:09:12] <wikibugs>	 (03PS1) 10Btullis: Use the libmariadb-java connector for sqoop [puppet] - 10https://gerrit.wikimedia.org/r/947881 (https://phabricator.wikimedia.org/T329363)
[16:09:54] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Create component/libmysql-java for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/947857 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis)
[16:10:16] <dancy>	 rzl: Everything still works. Thanks!
[16:11:01] <wikibugs>	 (03PS3) 10JMeybohm: admin_ng: Add more configuration options for resourcequota and limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/947866 (https://phabricator.wikimedia.org/T343978)
[16:11:05] <rzl>	 👍
[16:13:27] <wikibugs>	 (03CR) 10Eevans: [C: 03+2] admin: add darthmon to releasers-wikibase [puppet] - 10https://gerrit.wikimedia.org/r/946632 (https://phabricator.wikimedia.org/T342968) (owner: 10Eevans)
[16:14:47] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] wikimedia.org: update glue record for ns2 [dns] - 10https://gerrit.wikimedia.org/r/947807 (https://phabricator.wikimedia.org/T343942) (owner: 10Ssingh)
[16:15:05] <wikibugs>	 (03CR) 10Fabfur: [C: 03+1] "seems good" [puppet] - 10https://gerrit.wikimedia.org/r/947810 (https://phabricator.wikimedia.org/T343942) (owner: 10Ssingh)
[16:15:19] <sukhe>	 !log running authdns-update to update ns2 and point it to nsa.wikimedia.org
[16:15:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:21:57] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to releasers-wikibase for darthmon_wmde - https://phabricator.wikimedia.org/T342968 (10Eevans) Hi @darthmon_wmde, this should now be complete.  I'll close the issue, but don't hesitate to reopen if you have any issues!
[16:23:33] <icinga-wm>	 PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:23:38] <wikibugs>	 (03PS1) 10Sohom Datta: Enable EditInSequence on all wikisources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947883
[16:24:10] <wikibugs>	 (03PS5) 10JHathaway: Enforce using a node regex without the wmnet tld [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/933920
[16:26:20] <wikibugs>	 (03PS2) 10Btullis: Use the libmariadb-java connector for sqoop [puppet] - 10https://gerrit.wikimedia.org/r/947881 (https://phabricator.wikimedia.org/T329363)
[16:32:26] <wikibugs>	 (03PS2) 10Btullis: Use the libmysql-java component on bullseye as well [puppet] - 10https://gerrit.wikimedia.org/r/947880 (https://phabricator.wikimedia.org/T329363)
[16:32:28] <wikibugs>	 (03PS3) 10Btullis: Use the libmariadb-java connector for sqoop [puppet] - 10https://gerrit.wikimedia.org/r/947881 (https://phabricator.wikimedia.org/T329363)
[16:32:30] <wikibugs>	 (03PS1) 10Eevans: admin: update ssh key for user adri [puppet] - 10https://gerrit.wikimedia.org/r/947884 (https://phabricator.wikimedia.org/T342969)
[16:34:08] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42827/console" [puppet] - 10https://gerrit.wikimedia.org/r/947880 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis)
[16:35:16] <wikibugs>	 (03CR) 10Btullis: [V: 03+1 C: 03+2] Use the libmysql-java component on bullseye as well [puppet] - 10https://gerrit.wikimedia.org/r/947880 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis)
[16:37:08] <wikibugs>	 (03CR) 10Eevans: [C: 03+2] admin: update ssh key for user adri [puppet] - 10https://gerrit.wikimedia.org/r/947884 (https://phabricator.wikimedia.org/T342969) (owner: 10Eevans)
[16:38:50] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to releasers-wikibase for adee_wmde - https://phabricator.wikimedia.org/T342969 (Eevans) Open→Resolved Done!
[16:39:36] <wikibugs>	 (03PS1) 10Andrew Bogott: eqiad cloudceph config: add cloudcontrol1005.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/947886 (https://phabricator.wikimedia.org/T341495)
[16:40:03] <wikibugs>	 (03CR) 10JHathaway: "@jbond I think this is ready for inclusion, if you could help me with pushing a release that would be much appreciated!" [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/933920 (owner: 10JHathaway)
[16:40:08] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/947886 (https://phabricator.wikimedia.org/T341495) (owner: 10Andrew Bogott)
[16:40:40] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] eqiad cloudceph config: add cloudcontrol1005.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/947886 (https://phabricator.wikimedia.org/T341495) (owner: 10Andrew Bogott)
[16:41:31] <wikibugs>	 (03PS1) 10Ayounsi: drmrs: advertise 198.35.27.0/24 [homer/public] - 10https://gerrit.wikimedia.org/r/947887
[16:42:24] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] drmrs: advertise 198.35.27.0/24 [homer/public] - 10https://gerrit.wikimedia.org/r/947887 (owner: 10Ayounsi)
[16:42:28] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] drmrs: advertise 198.35.27.0/24 [homer/public] - 10https://gerrit.wikimedia.org/r/947887 (owner: 10Ayounsi)
[16:42:33] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] drmrs: advertise 198.35.27.0/24 [homer/public] - 10https://gerrit.wikimedia.org/r/947887 (owner: 10Ayounsi)
[16:44:47] <wikibugs>	 (03PS2) 10Eevans: admin: add roti to releasers-wikibase [puppet] - 10https://gerrit.wikimedia.org/r/947022 (https://phabricator.wikimedia.org/T342972)
[16:45:13] <wikibugs>	 (CR) Ssingh: [V: +1] bird: create /etc/bird without relying on postint (1 comment) [puppet] - https://gerrit.wikimedia.org/r/947843 (owner: Ssingh)
[16:46:04] <wikibugs>	 (03CR) 10Eevans: [C: 03+2] admin: add roti to releasers-wikibase [puppet] - 10https://gerrit.wikimedia.org/r/947022 (https://phabricator.wikimedia.org/T342972) (owner: 10Eevans)
[16:48:26] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to releasers-wikibase for roti_WMDE - https://phabricator.wikimedia.org/T342972 (Eevans) Open→Resolved Hi @roti_WMDE, this should now be complete. I am closing the ticket, but don't hesitate to reopen if you have any problems.
[16:48:48] <wikibugs>	 (03PS2) 10Ssingh: bird: add dependency for bird.conf on bird2 package [puppet] - 10https://gerrit.wikimedia.org/r/947843
[16:49:10] <icinga-wm>	 PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:50:05] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to releasers-wikibase for darthmon_wmde - https://phabricator.wikimedia.org/T342968 (Eevans) Open→Resolved
[16:50:14] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Re-enable the gobblin timers on hadoop_test [puppet] - 10https://gerrit.wikimedia.org/r/947812 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis)
[16:50:30] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42828/console" [puppet] - 10https://gerrit.wikimedia.org/r/947843 (owner: 10Ssingh)
[16:51:19] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to restricted for Tsevener - https://phabricator.wikimedia.org/T343596 (Eevans) a:Eevans→Tsevener
[16:53:10] <icinga-wm>	 PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:57:33] <wikibugs>	 (03PS8) 10JHathaway: site.pp: Drop top level domain names: .wmnet .org [puppet] - 10https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T342806)
[16:57:50] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T342806) (owner: 10JHathaway)
[16:59:45] <wikibugs>	 (03PS9) 10David Caro: replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691)
[17:00:04] <jouncebot>	 bd808: I, the Bot under the Fountain, call upon thee, The Deployer, to do Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230810T1700).
[17:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230810T1700)
[17:00:28] <icinga-wm>	 PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:02:18] <wikibugs>	 (03PS10) 10David Caro: replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691)
[17:03:22] <icinga-wm>	 PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:04:54] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: 10David Caro)
[17:06:13] <wikibugs>	 (CR) Jbond: [C: +1] Enforce using a node regex without the wmnet tld (1 comment) [puppet-lint/wmf_styleguide-check] - https://gerrit.wikimedia.org/r/933920 (owner: JHathaway)
[17:07:32] <icinga-wm>	 PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,fstrim.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:08:34] <icinga-wm>	 RECOVERY - cinder-volume process on cloudcontrol1005 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python.* /usr/bin/cinder-volume https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:08:53] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: lists.wikimedia.org pages should have a "who to contact" link - https://phabricator.wikimedia.org/T344000 (10Legoktm) >  Nobody knew the answer.  I find this hard to believe given we've worked with multiple functionaries on different list issues. In any case, you found the rig...
[17:12:51] <wikibugs>	 (CR) Majavah: replica_cnf_api: add envvars backend (2 comments) [puppet] - https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) (owner: David Caro)
[17:18:00] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] bird: add dependency for bird.conf on bird2 package [puppet] - 10https://gerrit.wikimedia.org/r/947843 (owner: 10Ssingh)
[17:19:23] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/947832 (https://phabricator.wikimedia.org/T343975) (owner: 10Jelto)
[17:19:38] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1 C: 03+2] bird: add dependency for bird.conf on bird2 package [puppet] - 10https://gerrit.wikimedia.org/r/947843 (owner: 10Ssingh)
[17:21:07] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host durum6002.drmrs.wmnet with OS bookworm
[17:21:45] <wikibugs>	 (03PS11) 10David Caro: replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691)
[17:25:58] <icinga-wm>	 PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:26:02] <icinga-wm>	 PROBLEM - BGP status on asw1-b13-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:26:08] <sukhe>	 expected ^ 
[17:43:00] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum6002.drmrs.wmnet with reason: host reimage
[17:46:22] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum6002.drmrs.wmnet with reason: host reimage
[17:48:10] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Awesome :-)" [puppet] - 10https://gerrit.wikimedia.org/r/946951 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan)
[17:56:11] <icinga-wm>	 PROBLEM - SSH on config-master2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[17:57:13] <icinga-wm>	 RECOVERY - SSH on config-master2001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[18:05:37] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:06:01] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:06:19] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:06:37] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1190.eqiad.wmnet with reason: Maintenance
[18:06:50] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1190.eqiad.wmnet with reason: Maintenance
[18:06:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1190 (T342617)', diff saved to https://phabricator.wikimedia.org/P50426 and previous config saved to /var/cache/conftool/dbconfig/20230810-180656-ladsgroup.json
[18:06:59] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[18:07:31] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.275 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:08:19] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 4.438 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:08:37] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:09:45] <icinga-wm>	 RECOVERY - BGP status on asw1-b13-drmrs.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:10:25] <icinga-wm>	 RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:10:40] <wikibugs>	 (03PS1) 10Urbanecm: ltwiki: Disable Growth features [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947903
[18:10:56] <urbanecm>	 jouncebot: nowandnext
[18:10:56] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 49 minute(s)
[18:10:56] <jouncebot>	 In 1 hour(s) and 49 minute(s): UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230810T2000)
[18:11:19] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ltwiki: Disable Growth features [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947903 (owner: 10Urbanecm)
[18:12:01] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum6002.drmrs.wmnet with OS bookworm
[18:12:38] <wikibugs>	 (03PS2) 10Urbanecm: ltwiki: Disable Growth features [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947903
[18:14:01] <icinga-wm>	 PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:14:11] <wikibugs>	 (03PS3) 10Urbanecm: ltwiki: Disable Growth features [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947903
[18:14:19] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] ltwiki: Disable Growth features [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947903 (owner: 10Urbanecm)
[18:15:11] <wikibugs>	 (03Merged) 10jenkins-bot: ltwiki: Disable Growth features [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947903 (owner: 10Urbanecm)
[18:15:37] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:947903|ltwiki: Disable Growth features]]
[18:16:33] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job cloudlb_harpoxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:17:08] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:947903|ltwiki: Disable Growth features]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[18:18:57] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm: Continuing with sync
[18:21:32] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs2007.codfw.wmnet with OS bullseye
[18:24:25] <icinga-wm>	 PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:25:43] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:947903|ltwiki: Disable Growth features]] (duration: 10m 05s)
[18:26:01] * urbanecm done
[18:26:22] <wikibugs>	 (03PS1) 10Legoktm: admin: Temporarily disable legoktm's access [puppet] - 10https://gerrit.wikimedia.org/r/947905
[18:35:13] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] admin: Temporarily disable legoktm's access [puppet] - 10https://gerrit.wikimedia.org/r/947905 (owner: 10Legoktm)
[18:36:44] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T342806) (owner: 10JHathaway)
[18:38:22] <icinga-wm>	 PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:40:27] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2007.codfw.wmnet with reason: host reimage
[18:43:26] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[18:43:43] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2007.codfw.wmnet with reason: host reimage
[18:46:10] <icinga-wm>	 PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:46:57] <sukhe>	 the uncommitted DNS changes are for ganeti
[18:47:00] <sukhe>	 +ganeti02                                 1H IN A 10.80.1.18
[18:47:03] <sukhe>	 +18  1H IN PTR ganeti02.svc.esams.wmnet.
[18:47:58] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:55:35] <logmsgbot>	 !log htriedman@deploy1002 Started deploy [airflow-dags/platform_eng@4312d99]: (no justification provided)
[18:55:56] <logmsgbot>	 !log htriedman@deploy1002 Finished deploy [airflow-dags/platform_eng@4312d99]: (no justification provided) (duration: 00m 20s)
[18:59:10] <icinga-wm>	 PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:00:08] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:02:06] <icinga-wm>	 PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:03:04] <moritzm>	 sukhe: ganeti02 is fine to merge, just prep work for the upcoming knams installation, Cathal created it earlier
[19:08:28] <wikibugs>	 (03PS5) 10AOkoth: vrts: add test VM to site [puppet] - 10https://gerrit.wikimedia.org/r/939349 (https://phabricator.wikimedia.org/T340027)
[19:13:00] <wikibugs>	 (03PS1) 10Bking: query_service: install git-fat [puppet] - 10https://gerrit.wikimedia.org/r/947928 (https://phabricator.wikimedia.org/T331300)
[19:14:42] <sukhe>	 moritzm: ok merging
[19:14:56] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.dns.netbox
[19:15:45] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/947928 (https://phabricator.wikimedia.org/T331300) (owner: 10Bking)
[19:16:49] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: merge ganeti changes - sukhe@cumin2002"
[19:18:33] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: merge ganeti changes - sukhe@cumin2002"
[19:18:34] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:19:04] <wikibugs>	 (03PS2) 10Bking: query_service: install git-fat [puppet] - 10https://gerrit.wikimedia.org/r/947928 (https://phabricator.wikimedia.org/T331300)
[19:22:11] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/947928 (https://phabricator.wikimedia.org/T331300) (owner: 10Bking)
[19:22:34] <icinga-wm>	 PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:23:18] <jinxer-wm>	 (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[19:24:18] <logmsgbot>	 !log htriedman@deploy1002 Started deploy [airflow-dags/platform_eng@b5a1d04]: (no justification provided)
[19:24:28] <logmsgbot>	 !log htriedman@deploy1002 Finished deploy [airflow-dags/platform_eng@b5a1d04]: (no justification provided) (duration: 00m 09s)
[19:26:32] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+1] query_service: install git-fat [puppet] - 10https://gerrit.wikimedia.org/r/947928 (https://phabricator.wikimedia.org/T331300) (owner: 10Bking)
[19:26:37] <wikibugs>	 (03CR) 10Bking: [C: 03+2] query_service: install git-fat [puppet] - 10https://gerrit.wikimedia.org/r/947928 (https://phabricator.wikimedia.org/T331300) (owner: 10Bking)
[19:28:18] <jinxer-wm>	 (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[19:28:35] <wikibugs>	 (03CR) 10JHathaway: "@jbond this is ready to merge, if you could take another pass, that would be appreciated!" [puppet] - 10https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T342806) (owner: 10JHathaway)
[19:29:45] <wikibugs>	 (03PS1) 10Bking: wdqs.data-transfer: ensure data_loaded file is created [cookbooks] - 10https://gerrit.wikimedia.org/r/947930 (https://phabricator.wikimedia.org/T331300)
[19:30:10] <icinga-wm>	 PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:31:34] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[19:32:18] <wikibugs>	 (03PS2) 10Bking: wdqs.data-transfer: ensure data_loaded file is created [cookbooks] - 10https://gerrit.wikimedia.org/r/947930 (https://phabricator.wikimedia.org/T331300)
[19:32:25] <wikibugs>	 (03CR) 10Urbanecm: [C: 04-1] [tests] Ensure each config has at most one value per wiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947365 (owner: 10Urbanecm)
[19:33:12] <icinga-wm>	 PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:33:18] <jinxer-wm>	 (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[19:37:28] <icinga-wm>	 PROBLEM - Disk space on cloudbackup2002 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=93%): /tmp 0 MB (0% inode=93%): /var/tmp 0 MB (0% inode=93%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=cloudbackup2002&var-datasource=codfw+prometheus/ops
[19:38:25] <jinxer-wm>	 (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[19:43:18] <jinxer-wm>	 (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[19:49:25] <wikibugs>	 (03PS1) 10Majavah: Disable access to grafana-labs/cloud.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/947933 (https://phabricator.wikimedia.org/T307465)
[19:52:08] <wikibugs>	 (03PS2) 10Majavah: Disable access to grafana-labs/cloud.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/947933 (https://phabricator.wikimedia.org/T307465)
[19:53:31] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42830/console" [puppet] - 10https://gerrit.wikimedia.org/r/947933 (https://phabricator.wikimedia.org/T307465) (owner: 10Majavah)
[19:54:32] <wikibugs>	 (03PS3) 10Majavah: Disable access to grafana-labs/cloud.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/947933 (https://phabricator.wikimedia.org/T307465)
[19:56:20] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42831/console" [puppet] - 10https://gerrit.wikimedia.org/r/947933 (https://phabricator.wikimedia.org/T307465) (owner: 10Majavah)
[19:58:18] <jinxer-wm>	 (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[20:00:04] <jouncebot>	 brennen and TheresNoTime: Time to snap out of that daydream and deploy UTC late backport and config training. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230810T2000).
[20:01:29] <TheresNoTime>	 (nothing to deploy)
[20:01:34] <brennen>	 (yay)
[20:01:43] <RhinosF1>	 Enjoy your evening then
[20:02:42] <wikibugs>	 (03CR) 10Majavah: [tests] Ensure each config has at most one value per wiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947365 (owner: 10Urbanecm)
[20:03:01] <taavi>	 urbanecm: did you get a chance to review the centralauth patch yet? I was hoping to backport that one today too
[20:03:24] <jinxer-wm>	 (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[20:03:26] <icinga-wm>	 PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:08:18] <jinxer-wm>	 (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[20:09:30] <icinga-wm>	 PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:13:18] <jinxer-wm>	 (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[20:16:08] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint2002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[20:18:18] <jinxer-wm>	 (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[20:18:43] <wikibugs>	 10SRE, 10Data-Platform-SRE, 10Infrastructure-Foundations, 10vm-requests: codfw: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T343715 (10bking) In the interest of moving this forward, I'm going to go ahead and start provisioning these VMs.  If there is a resource shortage in CODFW (or o...
[20:19:02] <jinxer-wm>	 (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[20:23:25] <jinxer-wm>	 (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[20:28:18] <jinxer-wm>	 (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[20:33:18] <icinga-wm>	 PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:33:18] <jinxer-wm>	 (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[20:34:03] <jinxer-wm>	 (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[20:34:07] <logmsgbot>	 !log bking@deploy1002 Started deploy [wdqs/wdqs@f1a6177]: f1a6177
[20:34:23] <logmsgbot>	 !log bking@deploy1002 Finished deploy [wdqs/wdqs@f1a6177]: f1a6177 (duration: 00m 16s)
[20:37:24] <logmsgbot>	 !log bking@deploy1002 Started deploy [wdqs/wdqs@f1a6177]: f1a6177
[20:38:07] <logmsgbot>	 !log bking@deploy1002 Finished deploy [wdqs/wdqs@f1a6177]: f1a6177 (duration: 00m 42s)
[20:38:24] <jinxer-wm>	 (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[20:39:09] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to  releasers-wikibase for RickiJay-WMDE - https://phabricator.wikimedia.org/T343508 (10Eevans)
[20:40:56] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer
[20:41:44] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to  releasers-wikibase for RickiJay-WMDE - https://phabricator.wikimedia.org/T343508 (10Eevans)
[20:42:56] <icinga-wm>	 PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,fstrim.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:52:20] <icinga-wm>	 PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:59:42] <icinga-wm>	 RECOVERY - Disk space on cloudbackup2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=cloudbackup2002&var-datasource=codfw+prometheus/ops
[21:02:50] <urbanecm>	 taavi: sorry, not yet. I'll look in 20 mins.
[21:03:35] <jinxer-wm>	 (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[21:06:24] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to  releasers-wikibase for RickiJay-WMDE - https://phabricator.wikimedia.org/T343508 (10Eevans) SSH key verified against [[ https://meta.wikimedia.org/w/index.php?title=User:Ricki_Jay_(WMDE)&oldid=25435044 | https://meta.wikimedia.org/w/index.php?title=User:Ricki...
[21:07:23] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to  releasers-wikibase for RickiJay-WMDE - https://phabricator.wikimedia.org/T343508 (10Eevans)
[21:08:24] <jinxer-wm>	 (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[21:08:40] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to  releasers-wikibase for RickiJay-WMDE - https://phabricator.wikimedia.org/T343508 (10Eevans) @KFrancis can you confirm we have an NDA on file?
[21:13:18] <jinxer-wm>	 (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[21:18:18] <jinxer-wm>	 (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[21:18:35] <wikibugs>	 (03PS1) 10Cathal Mooney: Depool esams for duration of esams -> knams migration [dns] - 10https://gerrit.wikimedia.org/r/947945 (https://phabricator.wikimedia.org/T329219)
[21:21:44] <logmsgbot>	 !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs2007.codfw.wmnet with OS bullseye
[21:22:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T342617)', diff saved to https://phabricator.wikimedia.org/P50428 and previous config saved to /var/cache/conftool/dbconfig/20230810-212241-ladsgroup.json
[21:22:44] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[21:33:10] <icinga-wm>	 PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:33:18] <jinxer-wm>	 (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[21:37:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P50429 and previous config saved to /var/cache/conftool/dbconfig/20230810-213747-ladsgroup.json
[21:38:18] <jinxer-wm>	 (ConfdResourceFailed) resolved: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[21:39:21] <jinxer-wm>	 (ConfdResourceFailed) firing: (64) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[21:40:51] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to restricted for Tsevener - https://phabricator.wikimedia.org/T343596 (10Tsevener) @Eevans Here you go, thanks!  https://www.mediawiki.org/wiki/User:TSevener_(WMF)
[21:44:02] <jinxer-wm>	 (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[21:45:24] <icinga-wm>	 PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:45:32] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to restricted for Tsevener - https://phabricator.wikimedia.org/T343596 (10Eevans) a:05Tsevener→03Eevans
[21:49:26] <jinxer-wm>	 (ConfdResourceFailed) firing: (223) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[21:52:28] <wikibugs>	 (03PS1) 10Eevans: admin: add user tsev to group restricted [puppet] - 10https://gerrit.wikimedia.org/r/947957 (https://phabricator.wikimedia.org/T343596)
[21:52:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P50430 and previous config saved to /var/cache/conftool/dbconfig/20230810-215253-ladsgroup.json
[21:54:02] <jinxer-wm>	 (ConfdResourceFailed) firing: (223) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[21:54:29] <wikibugs>	 (03PS1) 10Urbanecm: GlobalRenameUser: Ensure old username is in canonical form [extensions/CentralAuth] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/947910 (https://phabricator.wikimedia.org/T343958)
[21:54:39] <taavi>	 jouncebot: nowandnext
[21:54:40] <jouncebot>	 No deployments scheduled for the next 8 hour(s) and 5 minute(s)
[21:54:40] <jouncebot>	 In 8 hour(s) and 5 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230811T0600)
[21:54:46] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] GlobalRenameUser: Ensure old username is in canonical form [extensions/CentralAuth] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/947910 (https://phabricator.wikimedia.org/T343958) (owner: 10Urbanecm)
[21:55:01] <urbanecm>	 taavi: i'm backporting it
[21:55:08] <taavi>	 thanks
[21:55:14] <urbanecm>	 thanks for writing the fix!
[21:56:24] <icinga-wm>	 PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:59:02] <jinxer-wm>	 (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[21:59:47] <wikibugs>	 (03Merged) 10jenkins-bot: GlobalRenameUser: Ensure old username is in canonical form [extensions/CentralAuth] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/947910 (https://phabricator.wikimedia.org/T343958) (owner: 10Urbanecm)
[22:00:05] <urbanecm>	 that was quick
[22:00:35] <urbanecm>	 (well, it's CA)
[22:00:36] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:947910|GlobalRenameUser: Ensure old username is in canonical form (T343958)]]
[22:00:46] <stashbot>	 T343958: Renaming one account multiple times creates duplicate global accounts - https://phabricator.wikimedia.org/T343958
[22:00:59] <taavi>	 it does not run the gate, that's the normal CI speed for it :P
[22:01:06] <icinga-wm>	 PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:01:52] <urbanecm>	 yeah, it's CA :))
[22:02:08] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:947910|GlobalRenameUser: Ensure old username is in canonical form (T343958)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[22:03:52] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm: Continuing with sync
[22:08:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T342617)', diff saved to https://phabricator.wikimedia.org/P50431 and previous config saved to /var/cache/conftool/dbconfig/20230810-220759-ladsgroup.json
[22:08:01] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance
[22:08:06] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[22:08:15] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance
[22:08:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2179 (T342617)', diff saved to https://phabricator.wikimedia.org/P50432 and previous config saved to /var/cache/conftool/dbconfig/20230810-220820-ladsgroup.json
[22:10:24] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:947910|GlobalRenameUser: Ensure old username is in canonical form (T343958)]] (duration: 09m 48s)
[22:10:28] <stashbot>	 T343958: Renaming one account multiple times creates duplicate global accounts - https://phabricator.wikimedia.org/T343958
[22:12:13] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[22:14:16] * urbanecm done
[22:15:26] <icinga-wm>	 PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:16:34] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job cloudlb_harpoxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:26:04] <icinga-wm>	 PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:29:11] <jinxer-wm>	 (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[22:34:01] <wikibugs>	 (03PS1) 10BCornwall: Release 9.2.1-1wm2 to target Bookworm [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/947963 (https://phabricator.wikimedia.org/T342154)
[22:34:02] <jinxer-wm>	 (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[22:34:32] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Release 9.2.1-1wm2 to target Bookworm [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/947963 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall)
[22:35:10] <wikibugs>	 (03PS2) 10BCornwall: Release 9.2.1-1wm2 to target Bookworm [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/947963 (https://phabricator.wikimedia.org/T342154)
[22:38:26] <icinga-wm>	 PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:44:07] <jinxer-wm>	 (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[22:44:23] <jinxer-wm>	 (ConfdResourceFailed) resolved: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[22:45:20] <jinxer-wm>	 (ConfdResourceFailed) firing: (64) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[22:47:54] <icinga-wm>	 PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:48:30] <jinxer-wm>	 (Primary outbound port utilisation over 80%  #page) firing: Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[22:49:02] <jinxer-wm>	 (ConfdResourceFailed) firing: (223) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[22:49:15] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1126.eqiad.wmnet with reason: Maintenance
[22:49:17] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1126.eqiad.wmnet with reason: Maintenance
[22:49:19] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2161.codfw.wmnet with reason: Maintenance
[22:49:22] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2161.codfw.wmnet with reason: Maintenance
[22:50:32] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on logstash1023 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f5a7ff82280: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wi
[22:50:32] <icinga-wm>	 org/wiki/Search%23Administration
[22:50:44] <icinga-wm>	 PROBLEM - Check systemd state on logstash1023 is CRITICAL: CRITICAL - degraded: The following units failed: opensearch_2@production-elk7-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:52:32] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2102.codfw.wmnet with reason: Maintenance
[22:52:38] <icinga-wm>	 PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:52:45] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2102.codfw.wmnet with reason: Maintenance
[22:52:49] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1106.eqiad.wmnet with reason: Maintenance
[22:53:03] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1106.eqiad.wmnet with reason: Maintenance
[22:53:30] <jinxer-wm>	 (Primary outbound port utilisation over 80%  #page) resolved: Device cr2-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[22:53:46] <icinga-wm>	 RECOVERY - Check systemd state on logstash1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:54:02] <jinxer-wm>	 (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[22:55:10] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on logstash1023 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: green, timed_out: False, number_of_nodes: 18, number_of_data_nodes: 12, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 619, active_shards: 1421, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shar
[22:55:10] <icinga-wm>	 umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:55:35] <logmsgbot>	 !log htriedman@deploy1002 Started deploy [airflow-dags/platform_eng@ff0a21b]: (no justification provided)
[22:55:55] <logmsgbot>	 !log htriedman@deploy1002 Finished deploy [airflow-dags/platform_eng@ff0a21b]: (no justification provided) (duration: 00m 20s)
[22:59:02] <jinxer-wm>	 (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[23:04:02] <jinxer-wm>	 (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[23:05:17] <jinxer-wm>	 (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[23:09:02] <jinxer-wm>	 (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[23:10:46] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1] "lintian is happy; piuparts is giving me trouble for something unrelated." [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/947963 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall)
[23:13:01] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Release 9.2.1-1wm2 to target Bookworm [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/947963 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall)
[23:19:02] <jinxer-wm>	 (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[23:24:09] <jinxer-wm>	 (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[23:25:37] <jinxer-wm>	 (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[23:29:02] <jinxer-wm>	 (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[23:30:17] <jinxer-wm>	 (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[23:39:02] <jinxer-wm>	 (ConfdResourceFailed) resolved: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[23:39:32] <jinxer-wm>	 (ConfdResourceFailed) firing: (64) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[23:40:17] <jinxer-wm>	 (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[23:44:44] <icinga-wm>	 PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@11984.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:46:20] <icinga-wm>	 PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service,fstrim.service,logrotate.service,man-db.service,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@0.service,user-runtime-dir@499.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:47:04] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint2002 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[23:48:17] <wikibugs>	 (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/947393
[23:49:32] <jinxer-wm>	 (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[23:50:22] <TheresNoTime>	 jinxer-wm: hush
[23:52:36] <wikibugs>	 (03PS1) 10Krinkle: clientError: Investigate when mw.util is compromised by third-party script [extensions/WikimediaEvents] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/947912 (https://phabricator.wikimedia.org/T343944)
[23:53:46] <wikibugs>	 (03PS1) 10Krinkle: mediawiki.util: Investigate when mw.util is compromised by third-party script [core] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/947913
[23:53:59] <wikibugs>	 (03PS2) 10Krinkle: mediawiki.util: Investigate when mw.util is compromised by third-party script [core] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/947913 (https://phabricator.wikimedia.org/T343944)
[23:54:41] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] clientError: Investigate when mw.util is compromised by third-party script [extensions/WikimediaEvents] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/947912 (https://phabricator.wikimedia.org/T343944) (owner: 10Krinkle)
[23:54:47] <jinxer-wm>	 (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[23:55:17] <jinxer-wm>	 (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[23:56:09] <wikibugs>	 (03PS2) 10Krinkle: clientError: Investigate when mw.util is compromised by third-party script [extensions/WikimediaEvents] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/947912 (https://phabricator.wikimedia.org/T343944)