[00:23:42] <wikibugs>	 (PS1) Clare Ming: Add stream config for Android article instruments [mediawiki-config] - https://gerrit.wikimedia.org/r/980963 (https://phabricator.wikimedia.org/T351292)
[00:29:50] <wikibugs>	 (PS1) Jforrester: [BETA CLUSTER] Switch log host from mwlog01 to mwlog02 [mediawiki-config] - https://gerrit.wikimedia.org/r/980965 (https://phabricator.wikimedia.org/T345566)
[00:30:12] <wikibugs>	 (CR) Clare Ming: "hi all - just need a single +1 if this lgtu" [mediawiki-config] - https://gerrit.wikimedia.org/r/980963 (https://phabricator.wikimedia.org/T351292) (owner: Clare Ming)
[00:33:28] <wikibugs>	 (PS1) Jforrester: deployment-prep: Switch mwlog01 to mwlog02 [puppet] - https://gerrit.wikimedia.org/r/980966 (https://phabricator.wikimedia.org/T345566)
[00:38:28] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/980832
[00:38:30] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/980832 (owner: TrainBranchBot)
[00:39:53] <icinga-wm>	 PROBLEM - Check systemd state on centrallog2002 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:41:47] <icinga-wm>	 PROBLEM - Check systemd state on centrallog1002 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:42:59] <icinga-wm>	 RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:51:23] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:53:12] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[00:53:14] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[00:53:17] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1080.eqiad.wmnet with OS bullseye
[00:53:19] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1081.eqiad.wmnet with OS bullseye
[00:53:24] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host ms-be1080.eqiad.wmnet with OS bullseye completed: - ms-be...
[00:53:27] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host ms-be1081.eqiad.wmnet with OS bullseye completed: - ms-be...
[00:53:42] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1082.eqiad.wmnet with OS bullseye
[00:53:49] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host ms-be1082.eqiad.wmnet with OS bullseye
[00:57:37] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/980832 (owner: TrainBranchBot)
[01:14:53] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:22:31] <wikibugs>	 (CR) Jforrester: [C: +2] [BETA CLUSTER] Switch log host from mwlog01 to mwlog02 [mediawiki-config] - https://gerrit.wikimedia.org/r/980965 (https://phabricator.wikimedia.org/T345566) (owner: Jforrester)
[01:23:45] <wikibugs>	 (Merged) jenkins-bot: [BETA CLUSTER] Switch log host from mwlog01 to mwlog02 [mediawiki-config] - https://gerrit.wikimedia.org/r/980965 (https://phabricator.wikimedia.org/T345566) (owner: Jforrester)
[01:31:45] <icinga-wm>	 RECOVERY - cassandra-b CQL 10.192.16.241:9042 on restbase2029 is OK: TCP OK - 0.035 second response time on 10.192.16.241 port 9042 https://phabricator.wikimedia.org/T93886
[01:44:41] <icinga-wm>	 RECOVERY - cassandra-c SSL 10.192.16.242:7000 on restbase2029 is OK: SSL OK - Certificate restbase2029-c valid until 2025-12-05 16:11:15 +0000 (expires in 729 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[01:44:49] <icinga-wm>	 RECOVERY - cassandra-c service on restbase2029 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[01:49:59] <wikibugs>	 (CR) Ori: [V: +2 C: +2] "Cherry-picked on beta cluster. Verified by running tcpdump on a beta MediaWiki host:" [puppet] - https://gerrit.wikimedia.org/r/980966 (https://phabricator.wikimedia.org/T345566) (owner: Jforrester)
[01:56:03] <jinxer-wm>	 (ConfdResourceFailed) firing: (2) confd resource _etc_ferm_conf.d_00_defs_requestctl.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[02:05:39] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[02:22:08] <wikibugs>	 (CR) Sharvaniharan: [C: +1] "Looks good to me!" [mediawiki-config] - https://gerrit.wikimedia.org/r/980963 (https://phabricator.wikimedia.org/T351292) (owner: Clare Ming)
[02:39:06] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:53:51] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/980465 (owner: Andrew Bogott)
[02:59:16] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Keystone: add logic to manage credential keys [puppet] - https://gerrit.wikimedia.org/r/980465 (owner: Andrew Bogott)
[03:00:05] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[03:00:24] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[03:01:44] <andrewbogott>	 mutante, James_F, I merged your puppet patches
[03:09:06] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:19:06] <jinxer-wm>	 (ProbeDown) firing: (2) Service miscweb1003:443 has failed probes (http_commons_query_wikimedia_org_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:21:11] <wikibugs>	 (PS3) Andrew Bogott: Horizon: allow image uploading via horizon for users with glance admin [puppet] - https://gerrit.wikimedia.org/r/980021 (https://phabricator.wikimedia.org/T326818)
[03:21:13] <wikibugs>	 (PS3) Andrew Bogott: cloud-init: make puppet optional [puppet] - https://gerrit.wikimedia.org/r/980079 (https://phabricator.wikimedia.org/T326818)
[03:21:25] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/980079 (https://phabricator.wikimedia.org/T326818) (owner: Andrew Bogott)
[03:45:53] <icinga-wm>	 PROBLEM - Mediawiki CirrusSearch Saneitizer Weekly Fix Rate on alert1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [250000.0] https://wikitech.wikimedia.org/wiki/Search%23Saneitizer_%28background_repair_process%29 https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=35&orgId=1&from=now-6M&to=now
[03:56:09] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s5 on clouddb1016 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 368.09 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:31:58] <wikibugs>	 (PS4) Andrew Bogott: cloud-init: make puppet optional [puppet] - https://gerrit.wikimedia.org/r/980079 (https://phabricator.wikimedia.org/T326818)
[04:35:49] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s5 on clouddb1016 is OK: OK slave_sql_lag Replication lag: 0.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:34:11] <icinga-wm>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-wikifunctions_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:35:45] <icinga-wm>	 RECOVERY - cassandra-c CQL 10.192.16.242:9042 on restbase2029 is OK: TCP OK - 0.032 second response time on 10.192.16.242 port 9042 https://phabricator.wikimedia.org/T93886
[05:38:09] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[05:56:03] <jinxer-wm>	 (ConfdResourceFailed) firing: (2) confd resource _etc_ferm_conf.d_00_defs_requestctl.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[06:05:21] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqord is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:08:37] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:22:59] <jinxer-wm>	 (PuppetZeroResources) firing: Puppet has failed generate resources on elastic1107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[06:23:53] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:31:51] <wikibugs>	 (PS1) Marostegui: Revert "wmnet: Failover m2-master" [dns] - https://gerrit.wikimedia.org/r/980484
[06:32:00] <wikibugs>	 (PS2) Marostegui: Revert "wmnet: Failover m2-master" [dns] - https://gerrit.wikimedia.org/r/980484
[06:32:15] <icinga-wm>	 RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:33:14] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "wmnet: Failover m2-master" [dns] - https://gerrit.wikimedia.org/r/980484 (owner: Marostegui)
[06:35:01] <wikibugs>	 (PS1) Marostegui: wmnet: Failover m5-master to dbproxy1027 [dns] - https://gerrit.wikimedia.org/r/980978 (https://phabricator.wikimedia.org/T351864)
[06:35:55] <marostegui>	 !log Failover m5-master from dbproxy1021 to dbproxy1027 T351864
[06:35:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:35:59] <stashbot>	 T351864: Migrate dbproxy hosts to Bookworm - https://phabricator.wikimedia.org/T351864
[06:36:33] <wikibugs>	 (CR) Marostegui: [C: +2] wmnet: Failover m5-master to dbproxy1027 [dns] - https://gerrit.wikimedia.org/r/980978 (https://phabricator.wikimedia.org/T351864) (owner: Marostegui)
[06:37:59] <jinxer-wm>	 (PuppetZeroResources) resolved: Puppet has failed generate resources on elastic1107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[06:41:03] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[06:41:09] <Amir1>	 Thanks for the failover marostegui I'm around for a bit
[06:44:32] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] changeprop: refactor templating for Kafka producer/consumer settings [deployment-charts] - https://gerrit.wikimedia.org/r/971113 (https://phabricator.wikimedia.org/T348950) (owner: Elukey)
[06:44:38] <wikibugs>	 (PS1) Marostegui: mariadb: Decommission db1119 [puppet] - https://gerrit.wikimedia.org/r/980979 (https://phabricator.wikimedia.org/T337206)
[06:44:53] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1119.eqiad.wmnet
[06:48:01] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Decommission db1119 [puppet] - https://gerrit.wikimedia.org/r/980979 (https://phabricator.wikimedia.org/T337206) (owner: Marostegui)
[06:50:19] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.dns.netbox
[06:52:16] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1119.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001"
[06:53:21] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1119.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001"
[06:53:21] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[06:53:22] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1119.eqiad.wmnet
[06:55:29] <wikibugs>	 ops-eqiad, DBA, DC-Ops, decommission-hardware, Patch-For-Review: decommission db1119.eqiad.wmnet - https://phabricator.wikimedia.org/T337206 (Marostegui) a:Marostegui→None
[07:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231207T0700)
[07:00:05] <jouncebot>	 kormat, marostegui, and Amir1: I, the Bot under the Fountain, call upon thee, The Deployer, to do Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231207T0700).
[07:00:24] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:02:26] <wikibugs>	 ops-eqiad, DBA, DC-Ops, decommission-hardware, Patch-For-Review: decommission db1119.eqiad.wmnet - https://phabricator.wikimedia.org/T337206 (Marostegui) This is ready for DC-Ops
[07:02:32] <wikibugs>	 Puppet, SRE, Patch-For-Review: Add humorous redirect for fox.wikimedia.org - https://phabricator.wikimedia.org/T352870 (Joe) Open→Declined Or not :)
[07:03:39] <wikibugs>	 ops-eqiad, DC-Ops, decommission-hardware, Patch-For-Review: decommission db1119.eqiad.wmnet - https://phabricator.wikimedia.org/T337206 (Marostegui)
[07:05:24] <wikibugs>	 (CR) KartikMistry: Provide python3-bookworm image (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/980860 (https://phabricator.wikimedia.org/T352733) (owner: KartikMistry)
[07:17:08] <wikibugs>	 (PS1) Marostegui: mariadb: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/980984 (https://phabricator.wikimedia.org/T343674)
[07:18:24] <wikibugs>	 (CR) Arnaudb: [V: +1 C: +1] mariadb: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/980984 (https://phabricator.wikimedia.org/T343674) (owner: Marostegui)
[07:18:40] <wikibugs>	 (CR) Marostegui: "db2131 is pooled, it should have notifications enabled" [puppet] - https://gerrit.wikimedia.org/r/980984 (https://phabricator.wikimedia.org/T343674) (owner: Marostegui)
[07:19:07] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/980984 (https://phabricator.wikimedia.org/T343674) (owner: Marostegui)
[07:20:10] <jinxer-wm>	 (ProbeDown) firing: (2) Service miscweb1003:443 has failed probes (http_commons_query_wikimedia_org_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:31:17] <wikibugs>	 (PS2) Ayounsi: BGPPeers: add codfw racks A1 to B8 [deployment-charts] - https://gerrit.wikimedia.org/r/980925 (https://phabricator.wikimedia.org/T352893)
[07:37:42] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:42:00] <icinga-wm>	 PROBLEM - SSH on wdqs1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:58:20] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1023 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:00:05] <jouncebot>	 Amir1, apergos, and jnuche: Dear deployers, time to do the UTC morning backport and config training deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231207T0800).
[08:00:17] <apergos>	 morning.  no patches scheduled for deployment, no trainees signed up to learn, and that's a wrap.  see you next time... 
[08:01:43] <jinxer-wm>	 (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:06:05] <wikibugs>	 (PS1) Slyngshede: Add instance to summary for NTP [alerts] - https://gerrit.wikimedia.org/r/981175
[08:07:06] <icinga-wm>	 PROBLEM - SSH on wdqs1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:13:57] <wikibugs>	 (CR) Ayounsi: "https://puppet-compiler.wmflabs.org/output/980927/846/" [puppet] - https://gerrit.wikimedia.org/r/980927 (https://phabricator.wikimedia.org/T352893) (owner: Ayounsi)
[08:16:21] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet
[08:18:00] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:19:28] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:21:06] <wikibugs>	 (PS3) Muehlenhoff: Enable requestctl-based block list for nftables on sretest1001 [puppet] - https://gerrit.wikimedia.org/r/980868 (https://phabricator.wikimedia.org/T348734)
[08:21:15] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/980868 (https://phabricator.wikimedia.org/T348734) (owner: Muehlenhoff)
[08:21:43] <jinxer-wm>	 (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:22:07] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: -2] "I don't like the idea to allow people to do this as it's a terrible footgun, but also I think in this case the implementation is wrong - y" [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/980864 (https://phabricator.wikimedia.org/T352003) (owner: Jelto)
[08:22:16] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet
[08:25:02] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:25:24] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:25:59] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org
[08:27:52] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqord is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:29:26] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1023 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:31:43] <jinxer-wm>	 (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:32:07] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org
[08:32:25] <wikibugs>	 (CR) Ilias Sarantopoulos: [C: +1] api-gateway: Add entry for recommendation-api-ng on Lift Wing [deployment-charts] - https://gerrit.wikimedia.org/r/980865 (https://phabricator.wikimedia.org/T347263) (owner: Klausman)
[08:32:28] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:33:52] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Enable requestctl-based block list for nftables on sretest1001 [puppet] - https://gerrit.wikimedia.org/r/980868 (https://phabricator.wikimedia.org/T348734) (owner: Muehlenhoff)
[08:34:22] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:35:38] <wikibugs>	 SRE, Traffic, Patch-For-Review: Investigate IPVS IPIP encapsulation support - https://phabricator.wikimedia.org/T348837 (Vgutierrez) Open→Resolved
[08:37:22] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:38:22] <wikibugs>	 (CR) Brouberol: [C: +1] "Thank you for spearheading the work on this!" [puppet] - https://gerrit.wikimedia.org/r/979470 (https://phabricator.wikimedia.org/T352604) (owner: JHathaway)
[08:40:48] <wikibugs>	 (PS1) Giuseppe Lavagetto: wmf-debci: also install recommended dependencies [docker-images/production-images] - https://gerrit.wikimedia.org/r/981178
[08:42:38] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:43:58] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.386 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:44:46] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:45:02] <icinga-wm>	 PROBLEM - Check systemd state on sretest1001 is CRITICAL: CRITICAL - degraded: The following units failed: nftables.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:45:33] <wikibugs>	 (CR) MVernon: [C: +1] "This looks like a good approach to me, thanks." [docker-images/production-images] - https://gerrit.wikimedia.org/r/981178 (owner: Giuseppe Lavagetto)
[08:46:18] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:46:43] <jinxer-wm>	 (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:50:06] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:50:10] <icinga-wm>	 RECOVERY - SSH on wdqs1023 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:51:14] <wikibugs>	 (PS2) Giuseppe Lavagetto: Add asyncio implementation [software/python-poolcounter] - https://gerrit.wikimedia.org/r/980918 (https://phabricator.wikimedia.org/T338297)
[08:52:16] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 31 days, 0:00:00 on sretest1001.eqiad.wmnet with reason: WIP nftables
[08:52:33] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 31 days, 0:00:00 on sretest1001.eqiad.wmnet with reason: WIP nftables
[08:53:58] <icinga-wm>	 PROBLEM - Disk space on titan2001 is CRITICAL: DISK CRITICAL - free space: /srv 27433MiB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=titan2001&var-datasource=codfw+prometheus/ops
[08:57:14] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1022 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:01:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:07:22] <wikibugs>	 (Abandoned) Jelto: add optional install_recommends to apt_install [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/980864 (https://phabricator.wikimedia.org/T352003) (owner: Jelto)
[09:07:25] <icinga-wm>	 RECOVERY - SSH on wdqs1022 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:09:09] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:09:23] <icinga-wm>	 RECOVERY - Check systemd state on sretest1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:11:43] <jinxer-wm>	 (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:12:01] <icinga-wm>	 PROBLEM - SSH on wdqs1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:14:11] <icinga-wm>	 RECOVERY - SSH on wdqs1022 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:20:32] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] ircecho: Migrate the ircecho script from Python 2 to Python 3 [puppet] - https://gerrit.wikimedia.org/r/972929 (https://phabricator.wikimedia.org/T333615) (owner: Andrea Denisse)
[09:20:42] <wikibugs>	 SRE, MW-on-K8s, Traffic, serviceops, Release-Engineering-Team (Seen): Move MediaWiki jobs to mw-on-k8s - https://phabricator.wikimedia.org/T349796 (JMeybohm)
[09:21:12] <wikibugs>	 (PS13) Elukey: changeprop: refactor templating for Kafka producer/consumer settings [deployment-charts] - https://gerrit.wikimedia.org/r/971113 (https://phabricator.wikimedia.org/T348950)
[09:21:14] <wikibugs>	 (PS1) Muehlenhoff: nftables requestctl: Some tweaks and fixes [puppet] - https://gerrit.wikimedia.org/r/981280
[09:21:38] <wikibugs>	 (CR) Filippo Giunchedi: "See inline" [puppet] - https://gerrit.wikimedia.org/r/980921 (https://phabricator.wikimedia.org/T333615) (owner: Andrea Denisse)
[09:21:53] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] Add instance to summary for NTP [alerts] - https://gerrit.wikimedia.org/r/981175 (owner: Slyngshede)
[09:23:15] <wikibugs>	 (PS2) Muehlenhoff: nftables requestctl: Some tweaks and fixes [puppet] - https://gerrit.wikimedia.org/r/981280 (https://phabricator.wikimedia.org/T348734)
[09:29:21] <icinga-wm>	 RECOVERY - Check systemd state on centrallog2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:30:07] <icinga-wm>	 RECOVERY - Check systemd state on centrallog1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:30:11] <wikibugs>	 (PS1) Alexandros Kosiaris: cirrusSearchCheckerJob: Revert to baremetal [deployment-charts] - https://gerrit.wikimedia.org/r/981282 (https://phabricator.wikimedia.org/T352906)
[09:31:54] <wikibugs>	 (CR) JMeybohm: [C: +1] cirrusSearchCheckerJob: Revert to baremetal [deployment-charts] - https://gerrit.wikimedia.org/r/981282 (https://phabricator.wikimedia.org/T352906) (owner: Alexandros Kosiaris)
[09:34:33] <icinga-wm>	 RECOVERY - Disk space on titan2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=titan2001&var-datasource=codfw+prometheus/ops
[09:35:03] <wikibugs>	 (03CR) 10Jelto: "one comment regarding the templating" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/981178 (owner: 10Giuseppe Lavagetto)
[09:35:42] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/981280 (https://phabricator.wikimedia.org/T348734) (owner: 10Muehlenhoff)
[09:36:31] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] cirrusSearchCheckerJob: Revert to baremetal [deployment-charts] - 10https://gerrit.wikimedia.org/r/981282 (https://phabricator.wikimedia.org/T352906) (owner: 10Alexandros Kosiaris)
[09:37:27] <wikibugs>	 (03Merged) 10jenkins-bot: cirrusSearchCheckerJob: Revert to baremetal [deployment-charts] - 10https://gerrit.wikimedia.org/r/981282 (https://phabricator.wikimedia.org/T352906) (owner: 10Alexandros Kosiaris)
[09:37:28] <wikibugs>	 10SRE-swift-storage, 10ops-codfw, 10ops-eqiad, 10DC-Ops: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 (10Volans) @Papaul another test we could do is use the dhcp cookbook and then try to reboot into PXE using remote IPMI like the cookbook does. The co...
[09:39:55] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply
[09:40:12] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply
[09:40:41] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
[09:41:05] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
[09:41:34] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Netbox: remove SECURE_PROXY_SSL_HEADER [puppet] - 10https://gerrit.wikimedia.org/r/980815 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi)
[09:42:00] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
[09:42:15] <wikibugs>	 (03Abandoned) 10Ayounsi: netbox, profile::netbox: Switch to CAS authentication [puppet] - 10https://gerrit.wikimedia.org/r/668753 (https://phabricator.wikimedia.org/T244849) (owner: 10CRusnov)
[09:42:22] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
[09:42:26] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove now obsolete cergen Ganeti certs [puppet] - 10https://gerrit.wikimedia.org/r/981285 (https://phabricator.wikimedia.org/T350686)
[09:48:19] <wikibugs>	 (03CR) 10Btullis: [V: 03+1 C: 03+2] Update the refinery version used by the refine test jobs [puppet] - 10https://gerrit.wikimedia.org/r/980445 (https://phabricator.wikimedia.org/T349121) (owner: 10Btullis)
[09:48:32] <wikibugs>	 (03CR) 10Btullis: [V: 03+1 C: 03+2] Update the refinery version used by the refine test jobs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/980445 (https://phabricator.wikimedia.org/T349121) (owner: 10Btullis)
[09:48:59] <icinga-wm>	 RECOVERY - Mediawiki CirrusSearch Saneitizer Weekly Fix Rate on alert1001 is OK: OK: Less than 1.00% above the threshold [100000.0] https://wikitech.wikimedia.org/wiki/Search%23Saneitizer_%28background_repair_process%29 https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=35&orgId=1&from=now-6M&to=now
[09:50:25] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/981285 (https://phabricator.wikimedia.org/T350686) (owner: 10Muehlenhoff)
[09:56:04] <jinxer-wm>	 (ConfdResourceFailed) firing: (2) confd resource _etc_ferm_conf.d_00_defs_requestctl.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[09:56:10] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, thanks for the fixes" [puppet] - 10https://gerrit.wikimedia.org/r/972929 (https://phabricator.wikimedia.org/T333615) (owner: 10Andrea Denisse)
[10:02:09] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+1] [admin] Add ecarg shell account [puppet] - 10https://gerrit.wikimedia.org/r/980060 (https://phabricator.wikimedia.org/T350918) (owner: 10EoghanGaffney)
[10:02:19] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+1] [admin] Add ehughes ldap account [puppet] - 10https://gerrit.wikimedia.org/r/980358 (https://phabricator.wikimedia.org/T351387) (owner: 10EoghanGaffney)
[10:04:57] <wikibugs>	 (03PS1) 10Filippo Giunchedi: rsyslog: add receiver action names [puppet] - 10https://gerrit.wikimedia.org/r/981287 (https://phabricator.wikimedia.org/T351710)
[10:05:47] <jinxer-wm>	 (ConfdResourceFailed) resolved: (2) confd resource _etc_ferm_conf.d_00_defs_requestctl.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[10:06:09] <wikibugs>	 (03CR) 10Ayounsi: Netbox module: add get/set for primary IPs and access vlan (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/979040 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi)
[10:06:28] <wikibugs>	 (03PS5) 10Ayounsi: Netbox module: add get/set for primary IPs and access vlan [software/spicerack] - 10https://gerrit.wikimedia.org/r/979040 (https://phabricator.wikimedia.org/T350152)
[10:10:26] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+1] [admin] Add user account for xiaoxiao to data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/979389 (https://phabricator.wikimedia.org/T352098) (owner: 10EoghanGaffney)
[10:13:55] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+2] [admin] Add user account for xiaoxiao to data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/979389 (https://phabricator.wikimedia.org/T352098) (owner: 10EoghanGaffney)
[10:16:21] <wikibugs>	 (03PS1) 10Muehlenhoff: Fix handling of Ferm's 00_defs_requestctl when changing firewall provider [puppet] - 10https://gerrit.wikimedia.org/r/981288 (https://phabricator.wikimedia.org/T348734)
[10:16:24] <wikibugs>	 (03PS1) 10Slyngshede: Blackbox alerting for urldownloaders [alerts] - 10https://gerrit.wikimedia.org/r/981289 (https://phabricator.wikimedia.org/T350694)
[10:20:09] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to "researchers" and "analytics-privatedata-users" for Xiao Xiao - https://phabricator.wikimedia.org/T352098 (10eoghan)
[10:20:15] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] rsyslog: add receiver action names [puppet] - 10https://gerrit.wikimedia.org/r/981287 (https://phabricator.wikimedia.org/T351710) (owner: 10Filippo Giunchedi)
[10:21:52] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/981288 (https://phabricator.wikimedia.org/T348734) (owner: 10Muehlenhoff)
[10:22:07] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1173.eqiad.wmnet with reason: Maintenance
[10:22:21] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1173.eqiad.wmnet with reason: Maintenance
[10:22:47] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Maintenance
[10:23:01] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Maintenance
[10:23:36] <wikibugs>	 (03CR) 10Ayounsi: Netbox: add generic function to execute a Netbox script (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/979121 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi)
[10:24:56] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to "researchers" and "analytics-privatedata-users" for Xiao Xiao - https://phabricator.wikimedia.org/T352098 (10eoghan) 05Open→03Resolved Hi @XiaoXiao-WMF , this should be done! Please reach out if you're having any problems!
[10:27:43] <logmsgbot>	 !log brouberol@cumin1001 START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop test cluster: Restart of jvm daemons.
[10:28:08] <wikibugs>	 (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/980834
[10:28:37] <wikibugs>	 (03PS3) 10Ayounsi: Netbox: add generic function to execute a Netbox script [software/spicerack] - 10https://gerrit.wikimedia.org/r/979121 (https://phabricator.wikimedia.org/T350152)
[10:30:52] <wikibugs>	 (03CR) 10Jgiannelos: [C: 03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/980834 (owner: 10PipelineBot)
[10:31:42] <wikibugs>	 (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/980834 (owner: 10PipelineBot)
[10:32:05] <wikibugs>	 (03CR) 10Kamila Součková: [C: 03+2] mw-api-int: increase replicas by 33% [deployment-charts] - 10https://gerrit.wikimedia.org/r/980888 (https://phabricator.wikimedia.org/T350846) (owner: 10Kamila Součková)
[10:32:55] <wikibugs>	 (03Merged) 10jenkins-bot: mw-api-int: increase replicas by 33% [deployment-charts] - 10https://gerrit.wikimedia.org/r/980888 (https://phabricator.wikimedia.org/T350846) (owner: 10Kamila Součková)
[10:32:56] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1183.eqiad.wmnet with reason: Maintenance
[10:33:11] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1183.eqiad.wmnet with reason: Maintenance
[10:33:22] <wikibugs>	 (03CR) 10Klausman: [V: 03+2 C: 03+2] api-gateway: Add entry for recommendation-api-ng on Lift Wing [deployment-charts] - 10https://gerrit.wikimedia.org/r/980865 (https://phabricator.wikimedia.org/T347263) (owner: 10Klausman)
[10:33:29] <klausman>	 ⎋/query hnowlan 
[10:33:32] <klausman>	 oops :)
[10:33:38] <logmsgbot>	 !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
[10:33:53] <logmsgbot>	 !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
[10:34:08] <logmsgbot>	 !log kamila@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
[10:34:36] <logmsgbot>	 !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply
[10:34:52] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2113.codfw.wmnet with reason: Maintenance
[10:35:07] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2113.codfw.wmnet with reason: Maintenance
[10:35:34] <wikibugs>	 (03Merged) 10jenkins-bot: api-gateway: Add entry for recommendation-api-ng on Lift Wing [deployment-charts] - 10https://gerrit.wikimedia.org/r/980865 (https://phabricator.wikimedia.org/T347263) (owner: 10Klausman)
[10:36:35] <wikibugs>	 (03CR) 10Kamila Součková: [C: 03+2] mobileapps: 60% to mw-api-int [puppet] - 10https://gerrit.wikimedia.org/r/976222 (https://phabricator.wikimedia.org/T350846) (owner: 10Giuseppe Lavagetto)
[10:36:51] <wikibugs>	 (03Abandoned) 10Elukey: profile::thanos: improve istio sli recording rule [puppet] - 10https://gerrit.wikimedia.org/r/974486 (owner: 10Elukey)
[10:38:38] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply
[10:38:55] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[10:42:16] <wikibugs>	 (03PS1) 10Muehlenhoff: Bump standards version [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/981293
[10:44:31] <wikibugs>	 10SRE, 10Cloud-VPS, 10observability, 10Patch-For-Review, and 2 others: ossl rsyslog errors post-migration - https://phabricator.wikimedia.org/T351710 (10fgiunchedi) I've added some more visibility into how much we are writing to local files, compared to the amount of logs we are receiving.  Turns out we re...
[10:45:19] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: cluster::management
[10:47:30] <wikibugs>	 10SRE-swift-storage, 10observability, 10SRE Observability (FY2023/2024-Q2), 10User-fgiunchedi: Stop sending swift access logs to centrallog for non state-changing requests - https://phabricator.wikimedia.org/T352968 (10fgiunchedi)
[10:47:31] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch cluster::management to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/981294 (https://phabricator.wikimedia.org/T349619)
[10:49:11] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch cluster::management to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/981294 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[10:50:37] <wikibugs>	 (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/output/980817/847/build2001.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/980817 (https://phabricator.wikimedia.org/T320555) (owner: 10Filippo Giunchedi)
[10:50:56] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[10:51:10] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[10:51:39] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2118.codfw.wmnet with reason: Maintenance
[10:51:53] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2118.codfw.wmnet with reason: Maintenance
[10:52:06] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] docker_pkg: install convenience symlink [puppet] - 10https://gerrit.wikimedia.org/r/980817 (https://phabricator.wikimedia.org/T320555) (owner: 10Filippo Giunchedi)
[10:52:40] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] docker_pkg: install convenience symlink [puppet] - 10https://gerrit.wikimedia.org/r/980817 (https://phabricator.wikimedia.org/T320555) (owner: 10Filippo Giunchedi)
[10:53:16] <logmsgbot>	 !log brouberol@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-masters (exit_code=0) restart masters for Hadoop test cluster: Restart of jvm daemons.
[10:54:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: cluster::management
[10:58:11] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[10:58:16] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[11:00:04] <jouncebot>	 mvolz: OwO what's this, a deployment window?? Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231207T1100). nyaa~
[11:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231207T1100)
[11:00:25] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[11:01:03] <logmsgbot>	 !log brouberol@cumin1001 START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop test cluster: Restart of jvm daemons.
[11:03:15] <wikibugs>	 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (FY2023/2024-Q1-Q2): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10dcaro)
[11:03:55] <wikibugs>	 (03PS1) 10Muehlenhoff: Revert "Switch cluster::management to Puppet 7" [puppet] - 10https://gerrit.wikimedia.org/r/981296
[11:05:01] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Revert "Switch cluster::management to Puppet 7" [puppet] - 10https://gerrit.wikimedia.org/r/981296 (owner: 10Muehlenhoff)
[11:09:26] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] "Thanks, and LGTM.  However I'm unsure exactly when we should merge, does this just affect initial config or will it cause current systems " [deployment-charts] - 10https://gerrit.wikimedia.org/r/980925 (https://phabricator.wikimedia.org/T352893) (owner: 10Ayounsi)
[11:10:47] <logmsgbot>	 !log brouberol@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-masters (exit_code=0) restart masters for Hadoop test cluster: Restart of jvm daemons.
[11:10:49] <logmsgbot>	 !log aikochou@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[11:12:26] <wikibugs>	 (03CR) 10Clément Goubert: [C: 04-1] Provide python3-bookworm image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/980860 (https://phabricator.wikimedia.org/T352733) (owner: 10KartikMistry)
[11:12:56] <logmsgbot>	 !log klausman@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: apply
[11:13:26] <logmsgbot>	 !log klausman@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply
[11:13:55] <logmsgbot>	 !log klausman@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply
[11:14:09] <wikibugs>	 (03PS5) 10EoghanGaffney: [admin] Add ecarg shell account [puppet] - 10https://gerrit.wikimedia.org/r/980060 (https://phabricator.wikimedia.org/T350918)
[11:14:14] <logmsgbot>	 !log klausman@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply
[11:14:26] <logmsgbot>	 !log aikochou@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[11:14:51] <wikibugs>	 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 3 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (10Marostegui)
[11:14:59] <wikibugs>	 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 3 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (10Marostegui) p:05Triage→03Medium
[11:17:05] <logmsgbot>	 !log klausman@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply
[11:17:27] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+1] [admin] Add ecarg shell account [puppet] - 10https://gerrit.wikimedia.org/r/980060 (https://phabricator.wikimedia.org/T350918) (owner: 10EoghanGaffney)
[11:17:27] <logmsgbot>	 !log klausman@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply
[11:18:17] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+2] [admin] Add ecarg shell account [puppet] - 10https://gerrit.wikimedia.org/r/980060 (https://phabricator.wikimedia.org/T350918) (owner: 10EoghanGaffney)
[11:19:58] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[11:20:10] <jinxer-wm>	 (ProbeDown) firing: (2) Service miscweb1003:443 has failed probes (http_commons_query_wikimedia_org_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:20:46] <wikibugs>	 (03CR) 10Effie Mouzeli: mcrouter: add chart (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli)
[11:21:42] <wikibugs>	 (03PS28) 10Effie Mouzeli: mcrouter: add chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690)
[11:21:48] <wikibugs>	 (03CR) 10Effie Mouzeli: mcrouter: add chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli)
[11:22:13] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[11:25:16] <wikibugs>	 (03PS1) 10Filippo Giunchedi: swift: write to local files and ban before centrallog [puppet] - 10https://gerrit.wikimedia.org/r/981298 (https://phabricator.wikimedia.org/T351710)
[11:25:24] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/981293 (owner: 10Muehlenhoff)
[11:30:24] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[11:30:28] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[11:30:30] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2118.codfw.wmnet with reason: Maintenance
[11:30:34] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2118.codfw.wmnet with reason: Maintenance
[11:33:23] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply
[11:33:53] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[11:34:03] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] Enable ingress for the spark-history server services via the dse ingress gw [dns] - 10https://gerrit.wikimedia.org/r/979892 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol)
[11:34:24] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to WMF LDAP group and deployment and analytics-privatedata-users shell access group for Grace (ecarg) - https://phabricator.wikimedia.org/T350918 (10eoghan)
[11:35:39] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to WMF LDAP group and deployment and analytics-privatedata-users shell access group for Grace (ecarg) - https://phabricator.wikimedia.org/T350918 (10eoghan) 05Open→03Resolved This is done, please reach out if there's any issues!
[11:40:59] <wikibugs>	 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 3 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (10MoritzMuehlenhoff) cumin1001 has been reverted to Puppet 5, but cumin2002 is on Puppet 7 and can be used to reproduce.
[11:45:39] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove now obsolete cergen Ganeti certs [puppet] - 10https://gerrit.wikimedia.org/r/981285 (https://phabricator.wikimedia.org/T350686) (owner: 10Muehlenhoff)
[11:48:20] <logmsgbot>	 !log btullis@deploy2002 Started deploy [analytics/refinery@b6499b1] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@b6499b17]
[11:48:28] <wikibugs>	 (03CR) 10Volans: "replies inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/979040 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi)
[11:50:34] <wikibugs>	 (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/977227 (owner: 10PipelineBot)
[11:50:40] <wikibugs>	 (03Abandoned) 10Jgiannelos: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/976958 (owner: 10PipelineBot)
[11:50:49] <wikibugs>	 (03PS1) 10Slyngshede: Package Debmonitor server as .deb [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/981300
[11:51:28] <wikibugs>	 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 3 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (10Marostegui) db1124 can be used for testing. It is a test host running puppet 7. It can be restarted, rebooted, reimaged, whatever is needed
[11:51:37] <logmsgbot>	 !log btullis@deploy2002 Finished deploy [analytics/refinery@b6499b1] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@b6499b17] (duration: 03m 17s)
[11:57:39] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove obsolete dummy certs [labs/private] - 10https://gerrit.wikimedia.org/r/981301 (https://phabricator.wikimedia.org/T350686)
[12:01:35] <wikibugs>	 (03PS1) 10Muehlenhoff: Package Debmonitor server as .deb [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/981300 (owner: 10Slyngshede)
[12:04:33] <wikibugs>	 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 3 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (10Marostegui) Just took a quick look: ` # db-mysql db1133 ERROR 2026 (HY000): SSL connection error: self signed certificate in certificate chain `...
[12:08:26] <wikibugs>	 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 3 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (10Marostegui) This has more implications, as orchestrator cannot see these hosts (db1124, db1133) (with the changed cert). So this really needs lo...
[12:09:00] <wikibugs>	 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 3 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (10Marostegui) ` 15 dborch1001 orchestrator[425]: 2023-12-07 12:07:15 ERROR ReadTopologyInstance(db1124.eqiad.wmnet:3306) show global status like '...
[12:12:15] <wikibugs>	 (03PS1) 10Jgiannelos: Revert "mobileapps: pipeline bot promote" [deployment-charts] - 10https://gerrit.wikimedia.org/r/981188
[12:12:25] <wikibugs>	 (03CR) 10KartikMistry: Provide python3-bookworm image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/980860 (https://phabricator.wikimedia.org/T352733) (owner: 10KartikMistry)
[12:13:28] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cloudcephosd1001.eqiad.wmnet
[12:14:20] <wikibugs>	 (03CR) 10Jgiannelos: [C: 03+2] Revert "mobileapps: pipeline bot promote" [deployment-charts] - 10https://gerrit.wikimedia.org/r/981188 (owner: 10Jgiannelos)
[12:15:11] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "mobileapps: pipeline bot promote" [deployment-charts] - 10https://gerrit.wikimedia.org/r/981188 (owner: 10Jgiannelos)
[12:16:17] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch cloudcephosd1001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/981304 (https://phabricator.wikimedia.org/T349619)
[12:16:49] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply
[12:17:15] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[12:17:23] <logmsgbot>	 !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[12:17:56] <logmsgbot>	 !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[12:18:06] <logmsgbot>	 !log kamila@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[12:18:44] <logmsgbot>	 !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[12:23:24] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch cloudcephosd1001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/981304 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[12:27:36] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: service_proxy/mesh: Bump to newer version globally [puppet] - 10https://gerrit.wikimedia.org/r/981309 (https://phabricator.wikimedia.org/T352906)
[12:30:03] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!  Thanks for the good explanation I didn't grok the reason for it looking at the change alone." [puppet] - 10https://gerrit.wikimedia.org/r/981288 (https://phabricator.wikimedia.org/T348734) (owner: 10Muehlenhoff)
[12:36:00] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/981312 (owner: 10L10n-bot)
[12:38:06] <wikibugs>	 (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/980838
[12:38:26] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cloudcephosd1001.eqiad.wmnet
[12:45:17] <wikibugs>	 (03PS2) 10EoghanGaffney: [admin] Add ehughes ldap account [puppet] - 10https://gerrit.wikimedia.org/r/980358 (https://phabricator.wikimedia.org/T351387)
[12:46:09] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] [admin] Add ehughes ldap account [puppet] - 10https://gerrit.wikimedia.org/r/980358 (https://phabricator.wikimedia.org/T351387) (owner: 10EoghanGaffney)
[12:46:23] <wikibugs>	 (03CR) 10JMeybohm: "We should test with one service first to be sure nothing breaks" [puppet] - 10https://gerrit.wikimedia.org/r/981309 (https://phabricator.wikimedia.org/T352906) (owner: 10Alexandros Kosiaris)
[12:46:50] <wikibugs>	 (03PS3) 10EoghanGaffney: [admin] Add ehughes ldap account [puppet] - 10https://gerrit.wikimedia.org/r/980358 (https://phabricator.wikimedia.org/T351387)
[12:47:14] <logmsgbot>	 !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[12:47:17] <logmsgbot>	 !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[12:48:44] <logmsgbot>	 !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[12:48:57] <logmsgbot>	 !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[12:49:33] <wikibugs>	 (03PS4) 10EoghanGaffney: [admin] Add ehughes shell account with no ssh key [puppet] - 10https://gerrit.wikimedia.org/r/980358 (https://phabricator.wikimedia.org/T351387)
[12:52:23] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply
[12:52:26] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply
[12:55:39] <wikibugs>	 (03PS1) 10JMeybohm: ml-staging: Enable certmanager for mesh certs by default [puppet] - 10https://gerrit.wikimedia.org/r/981325 (https://phabricator.wikimedia.org/T300033)
[13:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231207T1300)
[13:04:30] <wikibugs>	 (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Remove obsolete dummy certs [labs/private] - 10https://gerrit.wikimedia.org/r/981301 (https://phabricator.wikimedia.org/T350686) (owner: 10Muehlenhoff)
[13:07:08] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply
[13:07:36] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply
[13:08:41] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply
[13:09:36] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply
[13:09:44] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply
[13:09:46] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply
[13:09:51] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply
[13:10:21] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply
[13:13:06] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "Code LGTM! Time to write the tests now ;)" [software/spicerack] - 10https://gerrit.wikimedia.org/r/979121 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi)
[13:14:43] <wikibugs>	 (03PS1) 10Jgiannelos: mobileapps: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/981326
[13:16:41] <wikibugs>	 (03CR) 10Jgiannelos: [C: 03+2] mobileapps: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/981326 (owner: 10Jgiannelos)
[13:17:35] <wikibugs>	 (03Merged) 10jenkins-bot: mobileapps: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/981326 (owner: 10Jgiannelos)
[13:18:47] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply
[13:18:57] <Amir1>	 jouncebot: nowandnext
[13:18:58] <jouncebot>	 For the next 0 hour(s) and 41 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231207T1300)
[13:18:58] <jouncebot>	 In 0 hour(s) and 41 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231207T1400)
[13:19:03] <Amir1>	 awesome
[13:19:29] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] api: Only force backlink namespace index when there is one ns only [core] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/980483 (https://phabricator.wikimedia.org/T351237) (owner: 10Jforrester)
[13:19:52] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[13:21:58] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-staging: Enable certmanager for mesh certs by default [puppet] - 10https://gerrit.wikimedia.org/r/981325 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[13:24:42] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply
[13:24:46] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[13:24:51] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: sync
[13:25:06] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[13:25:09] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[13:25:10] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: sync
[13:27:30] <wikibugs>	 (03PS3) 10Ladsgroup: [WIP] Add compare tables periodic job [puppet] - 10https://gerrit.wikimedia.org/r/979390 (https://phabricator.wikimedia.org/T207253)
[13:27:31] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' .
[13:27:51] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[13:29:58] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] [WIP] Add compare tables periodic job [puppet] - 10https://gerrit.wikimedia.org/r/979390 (https://phabricator.wikimedia.org/T207253) (owner: 10Ladsgroup)
[13:31:40] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[13:31:52] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[13:32:10] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[13:32:13] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[13:33:04] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Let me know if you want me to puppet-merge this when you feel it's ready." [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt)
[13:33:42] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[13:34:07] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[13:34:18] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[13:34:39] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[13:36:58] <wikibugs>	 (03Merged) 10jenkins-bot: api: Only force backlink namespace index when there is one ns only [core] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/980483 (https://phabricator.wikimedia.org/T351237) (owner: 10Jforrester)
[13:37:24] <wikibugs>	 (03PS1) 10Ladsgroup: Drop python2 from tox [puppet] - 10https://gerrit.wikimedia.org/r/981329
[13:38:50] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:980483|api: Only force backlink namespace index when there is one ns only (T351237)]]
[13:38:51] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Fix handling of Ferm's 00_defs_requestctl when changing firewall provider [puppet] - 10https://gerrit.wikimedia.org/r/981288 (https://phabricator.wikimedia.org/T348734) (owner: 10Muehlenhoff)
[13:38:54] <stashbot>	 T351237: Set beta and production to read new for pagelinks migration - https://phabricator.wikimedia.org/T351237
[13:40:25] <logmsgbot>	 !log ladsgroup@deploy2002 jforrester and ladsgroup: Backport for [[gerrit:980483|api: Only force backlink namespace index when there is one ns only (T351237)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:42:22] <wikibugs>	 (03PS1) 10Jgiannelos: mobileapps: Use parsoid via restbase on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/981330
[13:42:56] <logmsgbot>	 !log ladsgroup@deploy2002 jforrester and ladsgroup: Continuing with sync
[13:46:10] <wikibugs>	 (03PS4) 10Ladsgroup: [WIP] Add compare tables periodic job [puppet] - 10https://gerrit.wikimedia.org/r/979390 (https://phabricator.wikimedia.org/T207253)
[13:48:42] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] [WIP] Add compare tables periodic job [puppet] - 10https://gerrit.wikimedia.org/r/979390 (https://phabricator.wikimedia.org/T207253) (owner: 10Ladsgroup)
[13:48:51] <wikibugs>	 (03CR) 10Jgiannelos: "This only affects staging. Production uses restbase either way." [deployment-charts] - 10https://gerrit.wikimedia.org/r/981330 (owner: 10Jgiannelos)
[13:49:46] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:980483|api: Only force backlink namespace index when there is one ns only (T351237)]] (duration: 10m 55s)
[13:49:49] <stashbot>	 T351237: Set beta and production to read new for pagelinks migration - https://phabricator.wikimedia.org/T351237
[13:51:04] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: citoid: Set service_mesh version to 1.23.10-2-s4-20231203 [deployment-charts] - 10https://gerrit.wikimedia.org/r/981331 (https://phabricator.wikimedia.org/T352906)
[13:51:51] <wikibugs>	 (03PS1) 10JMeybohm: Add new istio module version [deployment-charts] - 10https://gerrit.wikimedia.org/r/981332 (https://phabricator.wikimedia.org/T300033)
[13:51:53] <wikibugs>	 (03Abandoned) 10Ladsgroup: Drop python2 from tox [puppet] - 10https://gerrit.wikimedia.org/r/981329 (owner: 10Ladsgroup)
[13:51:55] <wikibugs>	 (03PS1) 10JMeybohm: ingress.istio: Remove trust for every SAN but the default [deployment-charts] - 10https://gerrit.wikimedia.org/r/981333 (https://phabricator.wikimedia.org/T300033)
[13:52:08] <xSavitar>	 TheresNoTime, sorry I wasn't around for the deployment yesterday. I've rescheduled for today
[13:52:12] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: service_proxy/mesh: Bump to newer version globally [puppet] - 10https://gerrit.wikimedia.org/r/981309 (https://phabricator.wikimedia.org/T352906)
[13:52:24] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "Good point, doing so in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/981331" [puppet] - 10https://gerrit.wikimedia.org/r/981309 (https://phabricator.wikimedia.org/T352906) (owner: 10Alexandros Kosiaris)
[13:52:30] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[13:52:56] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[13:54:59] <wikibugs>	 (03PS2) 10JMeybohm: ingress.istio: Remove trust for every SAN but the default [deployment-charts] - 10https://gerrit.wikimedia.org/r/981333 (https://phabricator.wikimedia.org/T300033)
[13:55:31] <wikibugs>	 (03Abandoned) 10Majavah: openstack: codfw1dev: designate: listen-on only the new address [puppet] - 10https://gerrit.wikimedia.org/r/929740 (https://phabricator.wikimedia.org/T338938) (owner: 10Arturo Borrero Gonzalez)
[13:58:13] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2105.codfw.wmnet with reason: Maintenance
[13:58:27] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2105.codfw.wmnet with reason: Maintenance
[13:58:49] <icinga-wm>	 PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: upload_puppet_facts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:00:04] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231207T1400).
[14:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[14:04:53] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] citoid: Set service_mesh version to 1.23.10-2-s4-20231203 [deployment-charts] - 10https://gerrit.wikimedia.org/r/981331 (https://phabricator.wikimedia.org/T352906) (owner: 10Alexandros Kosiaris)
[14:15:17] <wikibugs>	 (03PS3) 10JMeybohm: ingress.istio: Remove trust for every SAN but the default [deployment-charts] - 10https://gerrit.wikimedia.org/r/981333 (https://phabricator.wikimedia.org/T300033)
[14:15:19] <wikibugs>	 (03PS1) 10JMeybohm: function-orchestrator: Update to ingress.istio:1.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/981336 (https://phabricator.wikimedia.org/T300033)
[14:16:23] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] nftables requestctl: Some tweaks and fixes [puppet] - 10https://gerrit.wikimedia.org/r/981280 (https://phabricator.wikimedia.org/T348734) (owner: 10Muehlenhoff)
[14:19:31] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: [beta] ores-extension: enable revertrisk model for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981337 (https://phabricator.wikimedia.org/T348298)
[14:24:03] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] mobileapps: Use parsoid via restbase on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/981330 (owner: 10Jgiannelos)
[14:24:34] <wikibugs>	 (03CR) 10Jgiannelos: [C: 03+2] mobileapps: Use parsoid via restbase on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/981330 (owner: 10Jgiannelos)
[14:25:28] <wikibugs>	 (03Merged) 10jenkins-bot: mobileapps: Use parsoid via restbase on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/981330 (owner: 10Jgiannelos)
[14:26:07] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply
[14:26:36] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply
[14:26:59] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[14:27:34] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: wmf-debci: also install recommended dependencies [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/981178
[14:27:38] <wikibugs>	 (03PS2) 10Filippo Giunchedi: swift: write to local files and ban before centrallog [puppet] - 10https://gerrit.wikimedia.org/r/981298 (https://phabricator.wikimedia.org/T352968)
[14:27:40] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: wmf-debci: also install recommended dependencies (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/981178 (owner: 10Giuseppe Lavagetto)
[14:29:32] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[14:30:15] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[14:31:32] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[14:32:40] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[14:34:12] <wikibugs>	 (03PS1) 10Muehlenhoff: Fix syntax for drop rule [puppet] - 10https://gerrit.wikimedia.org/r/981339
[14:36:00] <wikibugs>	 (03CR) 10Filippo Giunchedi: "This functionality is provided out of the (black)box by prometheus::blackbox::check::http, see 'team' parameter" [alerts] - 10https://gerrit.wikimedia.org/r/981289 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede)
[14:38:25] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Fix syntax for drop rule [puppet] - 10https://gerrit.wikimedia.org/r/981339 (owner: 10Muehlenhoff)
[14:39:06] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:39:38] <wikibugs>	 (03PS3) 10EoghanGaffney: [apt-staging] Add script to pull artifacts from gitlab [puppet] - 10https://gerrit.wikimedia.org/r/979912
[14:39:40] <wikibugs>	 (03CR) 10EoghanGaffney: [apt-staging] Add script to pull artifacts from gitlab (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/979912 (owner: 10EoghanGaffney)
[14:41:04] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host cp4037.ulsfo.wmnet
[14:42:11] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Our current scaffolding system allows you to only select components you need. This patch includes the service mesh Service that is definit" [deployment-charts] - 10https://gerrit.wikimedia.org/r/979107 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli)
[14:42:28] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: citoid: Set service_mesh version to 1.23.10-2-s4-20231203 [deployment-charts] - 10https://gerrit.wikimedia.org/r/981331 (https://phabricator.wikimedia.org/T352906)
[14:42:30] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: mesh: Ship new configuration templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/981340 (https://phabricator.wikimedia.org/T352906)
[14:42:32] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: mesh: Use ca-certificates instead of wmf-ca-certificates [deployment-charts] - 10https://gerrit.wikimedia.org/r/981341 (https://phabricator.wikimedia.org/T352906)
[14:43:55] <wikibugs>	 (03PS4) 10Ayounsi: Get the server's BGP peer info from Netbox [homer/public] - 10https://gerrit.wikimedia.org/r/979381 (https://phabricator.wikimedia.org/T306649)
[14:46:06] <wikibugs>	 10SRE-swift-storage, 10ops-codfw, 10ops-eqiad, 10DC-Ops: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 (10Papaul) @Volans after I enter the mgmt password the only line I get is ` Set Boot Device to force_pxe `
[14:48:55] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host sessionstore2004.codfw.wmnet with OS bullseye
[14:48:55] <wikibugs>	 (03PS1) 10Klausman: API GW: Add ingress endpoints on Lift Wing to allowed destinations [deployment-charts] - 10https://gerrit.wikimedia.org/r/981342 (https://phabricator.wikimedia.org/T347263)
[14:48:56] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host sessionstore2005.codfw.wmnet with OS bullseye
[14:48:58] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host sessionstore2006.codfw.wmnet with OS bullseye
[14:49:02] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host sessionstore2004.codfw.wmnet with OS bullseye
[14:49:06] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host sessionstore2005.codfw.wmnet with OS bullseye
[14:49:12] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host sessionstore2006.codfw.wmnet with OS bullseye
[14:49:16] <wikibugs>	 (03PS6) 10Ayounsi: Expose Netbox's BGP servers to Homer [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/976749 (https://phabricator.wikimedia.org/T306649)
[14:50:36] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cephosd2002.codfw.wmnet with OS bullseye
[14:50:42] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cephosd2002.codfw.wmnet with OS bullseye
[14:50:47] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cephosd2002.codfw.wmnet with OS bullseye
[14:50:48] <wikibugs>	 (03CR) 10Kosta Harlan: [C: 03+1] [beta] ores-extension: enable revertrisk model for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981337 (https://phabricator.wikimedia.org/T348298) (owner: 10Ilias Sarantopoulos)
[14:50:51] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cephosd2002.codfw.wmnet with OS bullseye executed with errors: - ce...
[14:51:07] <icinga-wm>	 RECOVERY - Check systemd state on mw1350 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:51:11] <icinga-wm>	 PROBLEM - Disk space on titan2001 is CRITICAL: DISK CRITICAL - free space: /srv 27308MiB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=titan2001&var-datasource=codfw+prometheus/ops
[14:51:26] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] API GW: Add ingress endpoints on Lift Wing to allowed destinations [deployment-charts] - 10https://gerrit.wikimedia.org/r/981342 (https://phabricator.wikimedia.org/T347263) (owner: 10Klausman)
[14:52:03] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] API GW: Add ingress endpoints on Lift Wing to allowed destinations [deployment-charts] - 10https://gerrit.wikimedia.org/r/981342 (https://phabricator.wikimedia.org/T347263) (owner: 10Klausman)
[14:52:55] <wikibugs>	 (03Merged) 10jenkins-bot: API GW: Add ingress endpoints on Lift Wing to allowed destinations [deployment-charts] - 10https://gerrit.wikimedia.org/r/981342 (https://phabricator.wikimedia.org/T347263) (owner: 10Klausman)
[14:53:02] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cephosd2002.codfw.wmnet with OS bullseye
[14:53:08] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cephosd2002.codfw.wmnet with OS bullseye
[14:53:14] <logmsgbot>	 !log klausman@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply
[14:53:16] <logmsgbot>	 !log klausman@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply
[14:53:30] <logmsgbot>	 !log klausman@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply
[14:53:45] <logmsgbot>	 !log klausman@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply
[14:54:06] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:55:16] <wikibugs>	 (03CR) 10Herron: [C: 03+1] swift: write to local files and ban before centrallog [puppet] - 10https://gerrit.wikimedia.org/r/981298 (https://phabricator.wikimedia.org/T352968) (owner: 10Filippo Giunchedi)
[14:55:53] <wikibugs>	 (03PS1) 10Klausman: APIGW: add missing /32 to egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/981344 (https://phabricator.wikimedia.org/T347263)
[14:57:31] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] citoid: Set service_mesh version to 1.23.10-2-s4-20231203 [deployment-charts] - 10https://gerrit.wikimedia.org/r/981331 (https://phabricator.wikimedia.org/T352906) (owner: 10Alexandros Kosiaris)
[14:58:19] <wikibugs>	 (03Merged) 10jenkins-bot: citoid: Set service_mesh version to 1.23.10-2-s4-20231203 [deployment-charts] - 10https://gerrit.wikimedia.org/r/981331 (https://phabricator.wikimedia.org/T352906) (owner: 10Alexandros Kosiaris)
[15:00:25] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:00:58] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/citoid: apply
[15:01:14] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/citoid: apply
[15:01:40] <wikibugs>	 (03CR) 10Elukey: API GW: Add ingress endpoints on Lift Wing to allowed destinations (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/981342 (https://phabricator.wikimedia.org/T347263) (owner: 10Klausman)
[15:01:49] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/citoid: apply
[15:02:14] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/citoid: apply
[15:02:49] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] APIGW: add missing /32 to egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/981344 (https://phabricator.wikimedia.org/T347263) (owner: 10Klausman)
[15:03:07] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (10Jhancock.wm)
[15:03:55] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/citoid: apply
[15:03:56] <wikibugs>	 (03PS2) 10Klausman: APIGW: add missing /32 to egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/981344 (https://phabricator.wikimedia.org/T347263)
[15:04:16] <wikibugs>	 (03PS3) 10Klausman: APIGW: add missing /32 to egress rules and fix port [deployment-charts] - 10https://gerrit.wikimedia.org/r/981344 (https://phabricator.wikimedia.org/T347263)
[15:04:24] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply
[15:04:43] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] APIGW: add missing /32 to egress rules and fix port [deployment-charts] - 10https://gerrit.wikimedia.org/r/981344 (https://phabricator.wikimedia.org/T347263) (owner: 10Klausman)
[15:05:05] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] APIGW: add missing /32 to egress rules and fix port [deployment-charts] - 10https://gerrit.wikimedia.org/r/981344 (https://phabricator.wikimedia.org/T347263) (owner: 10Klausman)
[15:05:56] <wikibugs>	 (03Merged) 10jenkins-bot: APIGW: add missing /32 to egress rules and fix port [deployment-charts] - 10https://gerrit.wikimedia.org/r/981344 (https://phabricator.wikimedia.org/T347263) (owner: 10Klausman)
[15:06:26] <logmsgbot>	 !log klausman@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply
[15:06:40] <logmsgbot>	 !log klausman@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply
[15:07:25] <wikibugs>	 (03CR) 10Ayounsi: BGPPeers: add codfw racks A1 to B8 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/980925 (https://phabricator.wikimedia.org/T352893) (owner: 10Ayounsi)
[15:07:26] <logmsgbot>	 !log klausman@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: apply
[15:07:29] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2104.codfw.wmnet with reason: Maintenance
[15:07:33] <logmsgbot>	 !log klausman@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply
[15:07:44] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2104.codfw.wmnet with reason: Maintenance
[15:07:50] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2104 (T348183)', diff saved to https://phabricator.wikimedia.org/P54270 and previous config saved to /var/cache/conftool/dbconfig/20231207-150750-arnaudb.json
[15:07:58] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[15:08:01] <logmsgbot>	 !log klausman@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply
[15:08:11] <logmsgbot>	 !log klausman@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply
[15:09:09] <wikibugs>	 (03PS1) 10Ayounsi: Move git search related classes to __init__ [cookbooks] - 10https://gerrit.wikimedia.org/r/981349 (https://phabricator.wikimedia.org/T350152)
[15:10:00] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] API GW: Add ingress endpoints on Lift Wing to allowed destinations (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/981342 (https://phabricator.wikimedia.org/T347263) (owner: 10Klausman)
[15:10:44] <wikibugs>	 (03PS2) 10Ayounsi: Move git search related classes to __init__ [cookbooks] - 10https://gerrit.wikimedia.org/r/981349 (https://phabricator.wikimedia.org/T350152)
[15:11:37] <icinga-wm>	 RECOVERY - Disk space on titan2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=titan2001&var-datasource=codfw+prometheus/ops
[15:11:53] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T348183)', diff saved to https://phabricator.wikimedia.org/P54271 and previous config saved to /var/cache/conftool/dbconfig/20231207-151152-arnaudb.json
[15:13:20] <wikibugs>	 (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/980839
[15:14:59] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Move git search related classes to __init__ [cookbooks] - 10https://gerrit.wikimedia.org/r/981349 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi)
[15:17:25] <wikibugs>	 (03PS3) 10Ayounsi: Move git search related classes to __init__ [cookbooks] - 10https://gerrit.wikimedia.org/r/981349 (https://phabricator.wikimedia.org/T350152)
[15:18:00] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+1] tegola: Bump to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/980885 (owner: 10Jgiannelos)
[15:18:49] <wikibugs>	 (03PS5) 10Ladsgroup: [WIP] Add compare tables periodic job [puppet] - 10https://gerrit.wikimedia.org/r/979390 (https://phabricator.wikimedia.org/T207253)
[15:19:01] <wikibugs>	 10SRE-swift-storage, 10Traffic: OpenSSL 3.x performance issues - https://phabricator.wikimedia.org/T352744 (10Ottomata) Recent flink-app based deployments should use envoy.  Not sure about the older rdf-streaming-updater, but there are plans to move that to flink-app chart.
[15:19:09] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host cp4037.ulsfo.wmnet
[15:20:10] <jinxer-wm>	 (ProbeDown) firing: (2) Service miscweb1003:443 has failed probes (http_commons_query_wikimedia_org_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:21:21] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] [WIP] Add compare tables periodic job [puppet] - 10https://gerrit.wikimedia.org/r/979390 (https://phabricator.wikimedia.org/T207253) (owner: 10Ladsgroup)
[15:21:49] <bawolff>	 arnaudb: I just wanted to say thank you for all the work on the img_size schema change :)
[15:22:16] <wikibugs>	 (03CR) 10Jgiannelos: [C: 03+2] tegola: Bump to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/980885 (owner: 10Jgiannelos)
[15:23:22] <wikibugs>	 (03Merged) 10jenkins-bot: tegola: Bump to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/980885 (owner: 10Jgiannelos)
[15:23:26] <arnaudb>	 thanks bawolff ! <3 
[15:24:16] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: apply
[15:26:12] <wikibugs>	 (03PS3) 10Cathal Mooney: Move lvs2011 from private1-a-codfw to private1-a2-codfw vlan [puppet] - 10https://gerrit.wikimedia.org/r/980954 (https://phabricator.wikimedia.org/T352920)
[15:26:59] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P54272 and previous config saved to /var/cache/conftool/dbconfig/20231207-152659-arnaudb.json
[15:27:04] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Move lvs2011 from private1-a-codfw to private1-a2-codfw vlan [puppet] - 10https://gerrit.wikimedia.org/r/980954 (https://phabricator.wikimedia.org/T352920) (owner: 10Cathal Mooney)
[15:27:20] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: apply
[15:27:44] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: apply
[15:27:49] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: apply
[15:28:47] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: apply
[15:28:54] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: apply
[15:29:28] <wikibugs>	 (03PS6) 10Ladsgroup: [WIP] Add compare tables periodic job [puppet] - 10https://gerrit.wikimedia.org/r/979390 (https://phabricator.wikimedia.org/T207253)
[15:29:31] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: apply
[15:31:31] <wikibugs>	 (03PS4) 10Cathal Mooney: Move lvs2011 from private1-a-codfw to private1-a2-codfw vlan [puppet] - 10https://gerrit.wikimedia.org/r/980954 (https://phabricator.wikimedia.org/T352920)
[15:32:21] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Move lvs2011 from private1-a-codfw to private1-a2-codfw vlan [puppet] - 10https://gerrit.wikimedia.org/r/980954 (https://phabricator.wikimedia.org/T352920) (owner: 10Cathal Mooney)
[15:34:56] <wikibugs>	 (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/979390 (https://phabricator.wikimedia.org/T207253) (owner: 10Ladsgroup)
[15:35:05] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/980499 (https://phabricator.wikimedia.org/T347355) (owner: 10Ryan Kemper)
[15:36:39] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sessionstore2005.codfw.wmnet with OS bullseye
[15:36:45] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host sessionstore2005.codfw.wmnet with OS bullseye executed with errors: -...
[15:36:50] <wikibugs>	 (PS5) Cathal Mooney: Move lvs2011 from private1-a-codfw to private1-a2-codfw vlan [puppet] - https://gerrit.wikimedia.org/r/980954 (https://phabricator.wikimedia.org/T352920)
[15:37:02] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sessionstore2006.codfw.wmnet with OS bullseye
[15:37:10] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host sessionstore2006.codfw.wmnet with OS bullseye executed with errors: -...
[15:37:43] <wikibugs>	 (CR) Ottomata: [C: +1] profile::cache::kafka::webrequest: allow to customize the format [puppet] - https://gerrit.wikimedia.org/r/980911 (https://phabricator.wikimedia.org/T346463) (owner: Elukey)
[15:38:51] <wikibugs>	 SRE, ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (phaultfinder)
[15:39:57] <wikibugs>	 (PS7) Ladsgroup: Add compare tables periodic job [puppet] - https://gerrit.wikimedia.org/r/979390 (https://phabricator.wikimedia.org/T207253)
[15:42:06] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P54273 and previous config saved to /var/cache/conftool/dbconfig/20231207-154205-arnaudb.json
[15:42:17] <wikibugs>	 (PS4) Cathal Mooney: Move lvs2012 from private1-b-codfw to private1-b2-codfw vlan [puppet] - https://gerrit.wikimedia.org/r/980948 (https://phabricator.wikimedia.org/T352918)
[15:42:58] <wikibugs>	 (PS1) Ottomata: webrequest varnishkafka - Add to X-Analytics the Sec-Purpose HTTP header [puppet] - https://gerrit.wikimedia.org/r/981352 (https://phabricator.wikimedia.org/T346463)
[15:43:50] <wikibugs>	 (CR) Ladsgroup: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/979390 (https://phabricator.wikimedia.org/T207253) (owner: Ladsgroup)
[15:44:03] <logmsgbot>	 !log klausman@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply
[15:44:21] <logmsgbot>	 !log klausman@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply
[15:44:58] <logmsgbot>	 !log klausman@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply
[15:45:15] <logmsgbot>	 !log klausman@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply
[15:45:30] <logmsgbot>	 !log milimetric@deploy2002 Started deploy [analytics/refinery@8b8f178]: hotfix: sqoop
[15:48:01] <sukhe>	 !log clear out dns6001 resolv.conf to check for SSH config-based authdns-update
[15:48:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:49:06] <wikibugs>	 (PS8) Ladsgroup: Add compare tables periodic job [puppet] - https://gerrit.wikimedia.org/r/979390 (https://phabricator.wikimedia.org/T207253)
[15:50:17] <wikibugs>	 (CR) Ladsgroup: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/979390 (https://phabricator.wikimedia.org/T207253) (owner: Ladsgroup)
[15:50:26] <wikibugs>	 SRE, Cloud-VPS, observability, Patch-For-Review, and 2 others: ossl rsyslog errors post-migration - https://phabricator.wikimedia.org/T351710 (fgiunchedi) >>! In T351710#9385748, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://sal.toolforge.org/log/I...
[15:50:32] <wikibugs>	 SRE, SRE-tools, DBA, Infrastructure-Foundations, and 3 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (ABran-WMF) a:ABran-WMF
[15:50:50] <wikibugs>	 (CR) Jelto: [C: +1] "lgtm now" [docker-images/production-images] - https://gerrit.wikimedia.org/r/981178 (owner: Giuseppe Lavagetto)
[15:51:19] <wikibugs>	 SRE, SRE-tools, DBA, Infrastructure-Foundations, and 3 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (ABran-WMF) Open→In progress
[15:51:29] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (ABran-WMF)
[15:52:30] <wikibugs>	 (CR) Alexandros Kosiaris: service_proxy/mesh: Bump to newer version globally (1 comment) [puppet] - https://gerrit.wikimedia.org/r/981309 (https://phabricator.wikimedia.org/T352906) (owner: Alexandros Kosiaris)
[15:53:15] <icinga-wm>	 PROBLEM - AuthDNS-over-TLS Works on dns6001 is CRITICAL: CRITICAL: ns[012] kdig DoTLS check failure https://wikitech.wikimedia.org/wiki/DNS
[15:53:23] <wikibugs>	 (CR) Clément Goubert: [C: -1] Provide python3-bookworm image (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/980860 (https://phabricator.wikimedia.org/T352733) (owner: KartikMistry)
[15:53:24] <sukhe>	 ^ expected
[15:53:32] <sukhe>	 !log running authdns-update with broken resolv.conf on dns6001
[15:53:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:54:58] <wikibugs>	 (CR) Ottomata: "Alternative:" [puppet] - https://gerrit.wikimedia.org/r/980912 (https://phabricator.wikimedia.org/T346463) (owner: Elukey)
[15:55:38] <logmsgbot>	 !log milimetric@deploy2002 Finished deploy [analytics/refinery@8b8f178]: hotfix: sqoop (duration: 10m 08s)
[15:55:38] <wikibugs>	 SRE, Observability-Alerting: Icinga check for ipv6 host reachability - https://phabricator.wikimedia.org/T163996 (fgiunchedi) Note for later and reworked for an alertmanager/prometheus world: we should extend `netops::prometheus::hosts` to also probe for ipv6, this way we'll have smoke probes also testin...
[15:55:54] <wikibugs>	 SRE, Observability-Alerting: Probe for ipv6 host reachability - https://phabricator.wikimedia.org/T163996 (fgiunchedi)
[15:56:07] <icinga-wm>	 RECOVERY - AuthDNS-over-TLS Works on dns6001 is OK: OK: ns[012] kdig DoTLS check success https://wikitech.wikimedia.org/wiki/DNS
[15:57:13] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T348183)', diff saved to https://phabricator.wikimedia.org/P54274 and previous config saved to /var/cache/conftool/dbconfig/20231207-155712-arnaudb.json
[15:57:17] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[15:57:38] <wikibugs>	 (CR) Clément Goubert: [C: +1] "LGTM, thanks!" [deployment-charts] - https://gerrit.wikimedia.org/r/980925 (https://phabricator.wikimedia.org/T352893) (owner: Ayounsi)
[15:58:10] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, Patch-For-Review: GeoIP mapping experiments - https://phabricator.wikimedia.org/T332024 (CDanis)
[15:59:54] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (dcaro)
[16:00:30] <logmsgbot>	 !log milimetric@deploy2002 Started deploy [analytics/refinery@8b8f178] (thin): hotfix: sqoop
[16:00:37] <logmsgbot>	 !log milimetric@deploy2002 Finished deploy [analytics/refinery@8b8f178] (thin): hotfix: sqoop (duration: 00m 07s)
[16:01:39] <wikibugs>	 (PS12) Bking: wdqs: monitor ldf endpoint [puppet] - https://gerrit.wikimedia.org/r/980499 (https://phabricator.wikimedia.org/T347355) (owner: Ryan Kemper)
[16:01:56] <wikibugs>	 (PS1) Majavah: netops: prometheus::hosts: also probe ipv6 if available [puppet] - https://gerrit.wikimedia.org/r/981358 (https://phabricator.wikimedia.org/T163996)
[16:02:45] <sukhe>	 !log run dummy authdns-update on dns6001
[16:02:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:05:57] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/981358 (https://phabricator.wikimedia.org/T163996) (owner: Majavah)
[16:08:20] <wikibugs>	 (CR) JMeybohm: [C: +1] service_proxy/mesh: Bump to newer version globally [puppet] - https://gerrit.wikimedia.org/r/981309 (https://phabricator.wikimedia.org/T352906) (owner: Alexandros Kosiaris)
[16:09:13] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply
[16:09:27] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply
[16:16:18] <wikibugs>	 SRE, Observability-Alerting: Reminders for unhandled/unacked alerts - https://phabricator.wikimedia.org/T307958 (fgiunchedi) This is essentially what https://alerts.wikimedia.org/triage/ displays now, for `hide_alerts_older_than: '1200h'` alerts. The app also offers the user a button to open a task
[16:16:29] <wikibugs>	 (PS10) Brouberol: Define the spark-history chart [deployment-charts] - https://gerrit.wikimedia.org/r/978629 (https://phabricator.wikimedia.org/T351722)
[16:16:55] <wikibugs>	 (CR) CI reject: [V: -1] Define the spark-history chart [deployment-charts] - https://gerrit.wikimedia.org/r/978629 (https://phabricator.wikimedia.org/T351722) (owner: Brouberol)
[16:17:23] <wikibugs>	 (PS11) Brouberol: Define the spark-history chart [deployment-charts] - https://gerrit.wikimedia.org/r/978629 (https://phabricator.wikimedia.org/T351722)
[16:17:32] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to wmf and analytics-privatedata-users for EHughes (superset access with no server access) - https://phabricator.wikimedia.org/T351387 (jijiki) @ehughes please sign the L3 Acknowledgement of Wikimedia Server Access Responsibilities Documen...
[16:18:03] <wikibugs>	 (CR) CI reject: [V: -1] Define the spark-history chart [deployment-charts] - https://gerrit.wikimedia.org/r/978629 (https://phabricator.wikimedia.org/T351722) (owner: Brouberol)
[16:20:20] <wikibugs>	 (PS12) Brouberol: Define the spark-history chart [deployment-charts] - https://gerrit.wikimedia.org/r/978629 (https://phabricator.wikimedia.org/T351722)
[16:20:36] <wikibugs>	 SRE, CAS-SSO, Infrastructure-Foundations, User-jbond: Thanos and Grafana lose the session after an hour - https://phabricator.wikimedia.org/T268233 (fgiunchedi) Untagging o11y here, since we moved thanos to oauth2-proxy I believe this should not apply to thanos anymore (though might still apply t...
[16:22:33] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refinery-sqoop-whole-mediawiki.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:23:56] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply
[16:24:00] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply
[16:24:49] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply
[16:25:03] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply
[16:26:23] <wikibugs>	 (PS11) Herron: grafana: add dashboard datasource usage (graphite) exporter [puppet] - https://gerrit.wikimedia.org/r/980048 (https://phabricator.wikimedia.org/T350591)
[16:26:35] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply
[16:26:48] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply
[16:27:20] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply
[16:27:33] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply
[16:28:27] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:30:31] <wikibugs>	 (PS1) Hnowlan: api-gateway: set host and ingress for recommendation-api-ng [deployment-charts] - https://gerrit.wikimedia.org/r/981359 (https://phabricator.wikimedia.org/T347263)
[16:37:04] <wikibugs>	 (CR) Volans: Move git search related classes to __init__ (4 comments) [cookbooks] - https://gerrit.wikimedia.org/r/981349 (https://phabricator.wikimedia.org/T350152) (owner: Ayounsi)
[16:38:27] <logmsgbot>	 !log brouberol@cumin1001 START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop test cluster: Restart of jvm daemons.
[16:39:06] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cephosd2002.codfw.wmnet with OS bullseye
[16:39:11] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cephosd2002.codfw.wmnet with OS bullseye executed with errors: - ce...
[16:39:47] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cephosd2002.codfw.wmnet with OS bullseye
[16:39:52] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cephosd2002.codfw.wmnet with OS bullseye
[16:43:57] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[16:44:12] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[16:44:47] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2107.codfw.wmnet with reason: Maintenance
[16:45:01] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2107.codfw.wmnet with reason: Maintenance
[16:52:41] <wikibugs>	 (Abandoned) Jdlrobson: References previews is no longer a beta feature [mediawiki-config] - https://gerrit.wikimedia.org/r/980901 (https://phabricator.wikimedia.org/T282999) (owner: Jdlrobson)
[16:52:53] <wikibugs>	 (CR) Jdlrobson: [C: +1] Remove BetaFeature code related to ReferencePreviews [mediawiki-config] - https://gerrit.wikimedia.org/r/976650 (https://phabricator.wikimedia.org/T351708) (owner: WMDE-Fisch)
[16:54:57] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to wmf and analytics-privatedata-users for EHughes (superset access with no server access) - https://phabricator.wikimedia.org/T351387 (jijiki) a:eoghan→ehughes
[17:00:04] <jouncebot>	 jhathaway and rzl: Dear deployers, time to do the Puppet request window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231207T1700).
[17:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[17:03:36] <wikibugs>	 (PS1) Brouberol: Define an echoserver namespace for the dse-k8s-eqiad cluster [deployment-charts] - https://gerrit.wikimedia.org/r/981363 (https://phabricator.wikimedia.org/T353004)
[17:05:51] <logmsgbot>	 !log herron@cumin1001 START - Cookbook sre.dns.netbox
[17:08:41] <logmsgbot>	 !log herron@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cleanup logstash/kibana records T299700 - herron@cumin1001"
[17:08:45] <stashbot>	 T299700: Remove legacy ELK LVS entries - https://phabricator.wikimedia.org/T299700
[17:09:35] <logmsgbot>	 !log herron@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cleanup logstash/kibana records T299700 - herron@cumin1001"
[17:09:35] <logmsgbot>	 !log herron@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:13:49] <wikibugs>	 SRE, SRE Observability (FY2021/2022-Q3): Decommission old ELK5 Logstash cluster - https://phabricator.wikimedia.org/T281266 (herron)
[17:13:53] <wikibugs>	 SRE, Traffic, Patch-For-Review, SRE Observability (FY2023/2024-Q2): Remove legacy ELK LVS entries - https://phabricator.wikimedia.org/T299700 (herron) Open→Resolved >>! In T299700#9242375, @Volans wrote: > FYI the service IPs are still allocated in Netbox: > https://netbox.wikimedia.org/i...
[17:18:01] <wikibugs>	 (PS2) Brouberol: Define an echoserver namespace for the dse-k8s-eqiad cluster [deployment-charts] - https://gerrit.wikimedia.org/r/981363 (https://phabricator.wikimedia.org/T353004)
[17:18:03] <wikibugs>	 (PS1) Brouberol: Define a simple echoserver chart [deployment-charts] - https://gerrit.wikimedia.org/r/981367 (https://phabricator.wikimedia.org/T353004)
[17:18:05] <wikibugs>	 (PS1) Brouberol: Define deployment helmfiles for echoserver in dse-k8s-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/981368 (https://phabricator.wikimedia.org/T353004)
[17:18:57] <wikibugs>	 (CR) CI reject: [V: -1] Define a simple echoserver chart [deployment-charts] - https://gerrit.wikimedia.org/r/981367 (https://phabricator.wikimedia.org/T353004) (owner: Brouberol)
[17:18:59] <wikibugs>	 (CR) CI reject: [V: -1] Define an echoserver namespace for the dse-k8s-eqiad cluster [deployment-charts] - https://gerrit.wikimedia.org/r/981363 (https://phabricator.wikimedia.org/T353004) (owner: Brouberol)
[17:19:05] <wikibugs>	 (CR) CI reject: [V: -1] Define deployment helmfiles for echoserver in dse-k8s-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/981368 (https://phabricator.wikimedia.org/T353004) (owner: Brouberol)
[17:23:08] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cephosd2002.codfw.wmnet with OS bullseye
[17:23:13] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cephosd2002.codfw.wmnet with OS bullseye executed with errors: - ce...
[17:23:15] <wikibugs>	 (PS2) Brouberol: Define a simple echoserver chart [deployment-charts] - https://gerrit.wikimedia.org/r/981367 (https://phabricator.wikimedia.org/T353004)
[17:23:17] <wikibugs>	 (PS3) Brouberol: Define an echoserver namespace for the dse-k8s-eqiad cluster [deployment-charts] - https://gerrit.wikimedia.org/r/981363 (https://phabricator.wikimedia.org/T353004)
[17:23:19] <wikibugs>	 (PS2) Brouberol: Define deployment helmfiles for echoserver in dse-k8s-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/981368 (https://phabricator.wikimedia.org/T353004)
[17:29:24] <wikibugs>	 (PS1) Eevans: restbase: set production role and add config for restbase2030 [puppet] - https://gerrit.wikimedia.org/r/981371 (https://phabricator.wikimedia.org/T352468)
[17:29:25] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (Jhancock.wm) having an issue with cephosd2001 and 2002.  cephosd2001 fails at this part.  [39/50, retrying in 117.00s] Attempt to run 'cookbooks.sre.hosts.reimage....
[17:34:02] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (Jhancock.wm) having an issue with all the new sessionstore servers that I think stems from the HBA355i Fnt card.  When the install gets to partitioning the drives, I...
[17:34:06] <wikibugs>	 (CR) Klausman: [C: +1] api-gateway: set host and ingress for recommendation-api-ng [deployment-charts] - https://gerrit.wikimedia.org/r/981359 (https://phabricator.wikimedia.org/T347263) (owner: Hnowlan)
[17:34:21] <wikibugs>	 (PS2) Hnowlan: api-gateway: set host and ingress for recommendation-api-ng [deployment-charts] - https://gerrit.wikimedia.org/r/981359 (https://phabricator.wikimedia.org/T347263)
[17:36:18] <wikibugs>	 (CR) Hnowlan: [C: +2] api-gateway: set host and ingress for recommendation-api-ng [deployment-charts] - https://gerrit.wikimedia.org/r/981359 (https://phabricator.wikimedia.org/T347263) (owner: Hnowlan)
[17:37:25] <wikibugs>	 (Merged) jenkins-bot: api-gateway: set host and ingress for recommendation-api-ng [deployment-charts] - https://gerrit.wikimedia.org/r/981359 (https://phabricator.wikimedia.org/T347263) (owner: Hnowlan)
[17:37:46] <wikibugs>	 SRE, SRE Observability: certspotter failures on alert1001 - https://phabricator.wikimedia.org/T318911 (herron) Open→Resolved a:herron I'm reviewing the backlog today (almost exactly one year since the last update!) and I think we're ok to close this since certspotter failures were addressed,...
[17:38:42] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply
[17:39:07] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply
[17:40:09] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: apply
[17:40:28] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply
[17:43:06] <wikibugs>	 (CR) BCornwall: [C: +1] wmf-debci: also install recommended dependencies [docker-images/production-images] - https://gerrit.wikimedia.org/r/981178 (owner: Giuseppe Lavagetto)
[17:45:00] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2017:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[17:52:59] <jinxer-wm>	 (PuppetZeroResources) firing: Puppet has failed generate resources on elastic1107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[17:57:57] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.remove-downtime for wdqs1024.eqiad.wmnet
[17:57:58] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wdqs1024.eqiad.wmnet
[18:00:05] <jouncebot>	 bd808: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231207T1800).
[18:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231207T1800)
[18:02:00] <bd808>	 nothing from me today
[18:03:47] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[18:04:02] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[18:04:03] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[18:04:21] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[18:04:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T343198)', diff saved to https://phabricator.wikimedia.org/P54277 and previous config saved to /var/cache/conftool/dbconfig/20231207-180427-ladsgroup.json
[18:04:31] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[18:05:05] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[18:05:08] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[18:05:10] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[18:05:18] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[18:08:52] <wikibugs>	 SRE, ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (phaultfinder)
[18:19:08] <wikibugs>	 (CR) Bernard Wang: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/980951 (https://phabricator.wikimedia.org/T351339) (owner: Jdlrobson)
[18:31:18] <wikibugs>	 (CR) Dzahn: [C: +1] "lgtm! (for now, if possible we should fix TLS later though)" [puppet] - https://gerrit.wikimedia.org/r/980499 (https://phabricator.wikimedia.org/T347355) (owner: Ryan Kemper)
[18:33:17] <wikibugs>	 (CR) Bking: [C: +2] wdqs: monitor ldf endpoint [puppet] - https://gerrit.wikimedia.org/r/980499 (https://phabricator.wikimedia.org/T347355) (owner: Ryan Kemper)
[18:37:59] <jinxer-wm>	 (PuppetZeroResources) resolved: Puppet has failed generate resources on elastic1107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[18:38:26] <wikibugs>	 (PS1) Bking: Revert "wdqs: monitor ldf endpoint" [puppet] - https://gerrit.wikimedia.org/r/981190
[18:39:26] <wikibugs>	 (PS1) Ebernhardson: cirrus updater: Update container image [deployment-charts] - https://gerrit.wikimedia.org/r/981380
[18:42:03] <mutante>	 !log puppetmaster1001 - revoke cert for miscweb.discovery.wmnet
[18:42:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:43:40] <wikibugs>	 (CR) Ebernhardson: [C: +2] cirrus updater: Update container image [deployment-charts] - https://gerrit.wikimedia.org/r/981380 (owner: Ebernhardson)
[18:44:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T343198)', diff saved to https://phabricator.wikimedia.org/P54278 and previous config saved to /var/cache/conftool/dbconfig/20231207-184406-ladsgroup.json
[18:44:14] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[18:44:27] <wikibugs>	 (Merged) jenkins-bot: cirrus updater: Update container image [deployment-charts] - https://gerrit.wikimedia.org/r/981380 (owner: Ebernhardson)
[18:45:32] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[18:45:56] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[18:59:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P54279 and previous config saved to /var/cache/conftool/dbconfig/20231207-185913-ladsgroup.json
[19:00:25] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[19:14:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P54280 and previous config saved to /var/cache/conftool/dbconfig/20231207-191420-ladsgroup.json
[19:16:44] <jinxer-wm>	 (SystemdUnitFailed) firing: wdqs-blazegraph.service Failed on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:21:43] <jinxer-wm>	 (SystemdUnitFailed) resolved: wdqs-blazegraph.service Failed on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:27:31] <wikibugs>	 (PS1) Ebernhardson: cirrus updater: Update container image [deployment-charts] - https://gerrit.wikimedia.org/r/981385
[19:28:19] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s6 on dbstore1005 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 940.53 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[19:29:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T343198)', diff saved to https://phabricator.wikimedia.org/P54281 and previous config saved to /var/cache/conftool/dbconfig/20231207-192926-ladsgroup.json
[19:29:29] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[19:29:31] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[19:29:43] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[19:29:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T343198)', diff saved to https://phabricator.wikimedia.org/P54282 and previous config saved to /var/cache/conftool/dbconfig/20231207-192949-ladsgroup.json
[19:32:00] <wikibugs>	 (CR) Ebernhardson: [C: +2] cirrus updater: Update container image [deployment-charts] - https://gerrit.wikimedia.org/r/981385 (owner: Ebernhardson)
[19:32:57] <wikibugs>	 (Merged) jenkins-bot: cirrus updater: Update container image [deployment-charts] - https://gerrit.wikimedia.org/r/981385 (owner: Ebernhardson)
[19:35:21] <wikibugs>	 (PS1) DDesouza: research-landing-page: bump version [deployment-charts] - https://gerrit.wikimedia.org/r/981386 (https://phabricator.wikimedia.org/T219903)
[19:35:24] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on wdqs[1022-1024].eqiad.wmnet with reason: graph split experiments T350106
[19:35:31] <stashbot>	 T350106: Implement a spark job that converts a RDF triples table into a RDF file format - https://phabricator.wikimedia.org/T350106
[19:35:40] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on wdqs[1022-1024].eqiad.wmnet with reason: graph split experiments T350106
[19:38:21] <wikibugs>	 (03PS1) 10Bking: wdqs: monitor ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/981387 (https://phabricator.wikimedia.org/T347355)
[19:38:22] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[19:38:31] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[19:38:46] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/981387 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking)
[19:40:58] <wikibugs>	 (03CR) 10Bking: [C: 03+2] Revert "wdqs: monitor ldf endpoint" [puppet] - 10https://gerrit.wikimedia.org/r/981190 (owner: 10Bking)
[19:43:21] <wikibugs>	 (03PS1) 10Ebernhardson: cirrus updater: Allow non restored state in consumer-search-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/981388
[19:44:32] <wikibugs>	 (03PS2) 10Bking: wdqs: monitor ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/981387 (https://phabricator.wikimedia.org/T347355)
[19:45:52] <wikibugs>	 (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Allow non restored state in consumer-search-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/981388 (owner: 10Ebernhardson)
[19:46:37] <wikibugs>	 (03Merged) 10jenkins-bot: cirrus updater: Allow non restored state in consumer-search-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/981388 (owner: 10Ebernhardson)
[19:48:02] <wikibugs>	 (03PS1) 10Gmodena: mw-page-content-enrich: version bump. [deployment-charts] - 10https://gerrit.wikimedia.org/r/980359 (https://phabricator.wikimedia.org/T345806)
[19:49:10] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1082.eqiad.wmnet with OS bullseye
[19:49:17] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host ms-be1082.eqiad.wmnet with OS bullseye
[19:49:42] <wikibugs>	 (03PS3) 10Bking: wdqs: monitor ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/981387 (https://phabricator.wikimedia.org/T347355)
[19:50:18] <wikibugs>	 (03CR) 10Eevans: [C: 03+2] restbase: set production role and add config for restbase2030 [puppet] - 10https://gerrit.wikimedia.org/r/981371 (https://phabricator.wikimedia.org/T352468) (owner: 10Eevans)
[19:51:41] <wikibugs>	 (03PS4) 10Bking: wdqs: monitor ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/981387 (https://phabricator.wikimedia.org/T347355)
[19:52:20] <wikibugs>	 (03PS1) 10Ebernhardson: cirrus updater: Allow non restored state in consumer-search-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/981390
[19:52:28] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/981387 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking)
[19:55:05] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/979470 (https://phabricator.wikimedia.org/T352604) (owner: 10JHathaway)
[19:58:10] <wikibugs>	 (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Allow non restored state in consumer-search-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/981390 (owner: 10Ebernhardson)
[19:58:58] <wikibugs>	 (03Merged) 10jenkins-bot: cirrus updater: Allow non restored state in consumer-search-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/981390 (owner: 10Ebernhardson)
[19:59:17] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[19:59:26] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[20:01:23] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[20:01:35] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[20:02:16] <wikibugs>	 (03PS1) 10Dzahn: Revert "microsites/query_service: enable TLS when monitoring commons-query" [puppet] - 10https://gerrit.wikimedia.org/r/981191
[20:02:44] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1082.eqiad.wmnet with reason: host reimage
[20:02:46] <wikibugs>	 (03PS2) 10Dzahn: Revert "microsites/query_service: enable TLS when monitoring commons-query" [puppet] - 10https://gerrit.wikimedia.org/r/981191 (https://phabricator.wikimedia.org/T352941)
[20:05:51] <urandom>	 !log bootstrap Cassandra/restbase2030-a — T352468
[20:05:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:06:05] <stashbot>	 T352468: Provision new RESTBase cluster nodes: restbase20[28-35] - https://phabricator.wikimedia.org/T352468
[20:06:07] <icinga-wm>	 ACKNOWLEDGEMENT - MegaRAID on db1168 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T353020 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[20:06:11] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1082.eqiad.wmnet with reason: host reimage
[20:06:12] <wikibugs>	 10SRE, 10ops-eqiad: Degraded RAID on db1168 - https://phabricator.wikimedia.org/T353020 (10ops-monitoring-bot)
[20:07:12] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.192.16.243:9042 on restbase2030 is CRITICAL: connect to address 10.192.16.243 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[20:09:20] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.192.16.243:7000 on restbase2030 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[20:09:48] <icinga-wm>	 PROBLEM - Check systemd state on restbase2030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:11:44] <icinga-wm>	 PROBLEM - cassandra-a service on restbase2030 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:12:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T343198)', diff saved to https://phabricator.wikimedia.org/P54283 and previous config saved to /var/cache/conftool/dbconfig/20231207-201234-ladsgroup.json
[20:12:52] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[20:14:04] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.192.16.244:9042 on restbase2030 is CRITICAL: connect to address 10.192.16.244 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[20:14:54] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-int_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:16:30] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.192.16.244:7000 on restbase2030 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[20:16:30] <wikibugs>	 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10observability, and 3 others: Upgrade Kafka to from 1.x to later version - https://phabricator.wikimedia.org/T300102 (10Ottomata)
[20:18:06] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:18:16] <icinga-wm>	 RECOVERY - cassandra-a service on restbase2030 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:18:26] <icinga-wm>	 RECOVERY - Check systemd state on restbase2030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:19:00] <icinga-wm>	 PROBLEM - cassandra-b service on restbase2030 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:19:04] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.192.16.243:7000 on restbase2030 is OK: SSL OK - Certificate restbase2030-a valid until 2025-12-06 17:50:13 +0000 (expires in 729 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[20:20:02] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs2017:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[20:20:30] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2017:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[20:21:26] <icinga-wm>	 PROBLEM - cassandra-c CQL 10.192.16.245:9042 on restbase2030 is CRITICAL: connect to address 10.192.16.245 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[20:22:28] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1168 - https://phabricator.wikimedia.org/T353020 (10Marostegui)
[20:23:29] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1168 - https://phabricator.wikimedia.org/T353020 (10Marostegui) @wiki_willy I guess this host isn't under warranty anymore? Still can we get a disk for it? Thanks!
[20:23:58] <icinga-wm>	 PROBLEM - cassandra-c SSL 10.192.16.245:7000 on restbase2030 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[20:25:44] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] Revert "microsites/query_service: enable TLS when monitoring commons-query" [puppet] - 10https://gerrit.wikimedia.org/r/981191 (https://phabricator.wikimedia.org/T352941) (owner: 10Dzahn)
[20:25:52] <wikibugs>	 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10Event-Platform: Discovery for Kafka cluster brokers - https://phabricator.wikimedia.org/T213561 (10Ottomata) Pasting some relevant discussion points [[ https://wikimedia.slack.com/archives/C055QGPTC69/p1700728492406809 | from slack ]]:  @brouberol  > I was...
[20:26:24] <icinga-wm>	 PROBLEM - cassandra-c service on restbase2030 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:27:02] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[20:27:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P54285 and previous config saved to /var/cache/conftool/dbconfig/20231207-202740-ladsgroup.json
[20:30:16] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host cp4037.ulsfo.wmnet
[20:38:53] <wikibugs>	 10SRE, 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10phaultfinder)
[20:40:15] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs2017:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[20:42:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P54286 and previous config saved to /var/cache/conftool/dbconfig/20231207-204247-ladsgroup.json
[20:43:00] <wikibugs>	 (03PS3) 10Ottomata: Remove eventlogging_FeaturePolicyViolation and _SpecialMuteSubmit event streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977075 (https://phabricator.wikimedia.org/T329718) (owner: 10Phuedx)
[20:46:21] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Remove eventlogging_FeaturePolicyViolation and _SpecialMuteSubmit event streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977075 (https://phabricator.wikimedia.org/T329718) (owner: 10Phuedx)
[20:47:04] <wikibugs>	 (03Merged) 10jenkins-bot: Remove eventlogging_FeaturePolicyViolation and _SpecialMuteSubmit event streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977075 (https://phabricator.wikimedia.org/T329718) (owner: 10Phuedx)
[20:47:26] <wikibugs>	 (03PS3) 10Kimberly Sarabia: Remove readability survey tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980063 (https://phabricator.wikimedia.org/T349337)
[20:47:55] <wikibugs>	 (03CR) 10Kimberly Sarabia: Remove readability survey tool (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980063 (https://phabricator.wikimedia.org/T349337) (owner: 10Kimberly Sarabia)
[20:49:00] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2017:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[20:50:51] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1168 - https://phabricator.wikimedia.org/T353020 (10wiki_willy) Definitely.  @Jclark-ctr & @VRiley-WMF - can you check if we have any spare drives from a decommissioned host?  If not, we'll purchase one via @RobH).   Thanks, Willy  >>! In T353020#9391723, @Marost...
[20:54:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs2017:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[20:54:32] <ottomata>	 It is really too bad we can't scap sync-file more than one file at once...especially since it takes so long to do sync-file now!
[20:56:07] <logmsgbot>	 !log otto@deploy2002 Synchronized wmf-config/ext-EventLogging.php: Config: [[gerrit:977075|Remove eventlogging_FeaturePolicyViolation and _SpecialMuteSubmit EventLogging config (T329718)]] (duration: 07m 07s)
[20:56:13] <stashbot>	 T329718: Decommission the SpecialMuteSubmit instrument - https://phabricator.wikimedia.org/T329718
[20:57:15] <RhinosF1>	 ottomata: probably better to ask releng
[20:57:41] <ottomata>	 you sure i shouldn't just gripe at the wall here?  
[20:57:47] <ottomata>	 :)
[20:57:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T343198)', diff saved to https://phabricator.wikimedia.org/P54287 and previous config saved to /var/cache/conftool/dbconfig/20231207-205753-ladsgroup.json
[20:57:56] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[20:57:57] <taavi>	 ottomata: `scap sync-world` (or even `scap backport`) is the future!
[20:57:58] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[20:58:00] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2017:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[20:58:06] <RhinosF1>	 ottomata: nah it's a genuine question
[20:58:11] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[20:58:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T343198)', diff saved to https://phabricator.wikimedia.org/P54288 and previous config saved to /var/cache/conftool/dbconfig/20231207-205817-ladsgroup.json
[20:58:27] <RhinosF1>	 That might be solved by something like scap backport as taavi pointed out
[20:58:50] <ottomata>	 well I look forward to it! :D 
[20:59:19] <taavi>	 why look forward when you can use it today?
[20:59:20] <RhinosF1>	 I just fear jinxer-wm and wikibugs have better ideas than helping you get an answer if people are busy and don't see here
[21:00:06] <jouncebot>	 TheresNoTime: gettimeofday() says it's time for UTC late backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231207T2100)
[21:00:06] <jouncebot>	 jan_Drewniak and kostajh: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:22] <kostajh>	 hi
[21:00:34] <jan_drewniak>	 o/
[21:01:15] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] "Deployed!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977075 (https://phabricator.wikimedia.org/T329718) (owner: 10Phuedx)
[21:01:53] <kostajh>	 ottomata: I believe you could do `scap backport 977075` for those two files to go out at the same time
[21:02:27] <ottomata>	 wut
[21:02:29] <ottomata>	 really?!
[21:02:29] <kostajh>	 TheresNoTime: are you around?
[21:02:45] <ottomata>	 btw, i have a scap sync-file currently running. should be done in a couple of mins.  i'm done then.
[21:02:56] <dcausse>	 !log restarting blazegraph on wdqs2017 (BlazegraphFreeAllocatorsDecreasingRapidly) 
[21:02:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:03:19] <jan_drewniak>	 cool. kostajh if it's just us two for the backport, I can deploy the config changes.
[21:03:19] <ottomata>	 taavi: kostajh i was following these docs https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers#operations/mediawiki-config_2
[21:03:23] <ottomata>	 do they need to be updated?
[21:03:29] <kostajh>	 ottomata: https://wikitech.wikimedia.org/wiki/Scap#Backport_Deployments 
[21:03:55] <ottomata>	 wow
[21:04:06] <jinxer-wm>	 (ProbeDown) resolved: (2) Service miscweb1003:443 has failed probes (http_commons_query_wikimedia_org_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:04:12] <kostajh>	 jan_drewniak: sounds good
[21:04:16] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:04:35] <jan_drewniak>	 ottomata: you can even do `scap backport 980951 981337` and do two at once
[21:04:45] <ottomata>	 my jaw is getting closer to the floor
[21:05:02] <Kizule>	 Hi, can I add my patch to this deployment window? :)
[21:05:48] <Kizule>	 It's config related, for enabling action blocks in Serbian Wikipedia.
[21:05:52] <jan_drewniak>	 Kizule: yeah if it's something simple
[21:05:59] <Kizule>	 Niharika has approved it. Yup, totally simple.
[21:06:31] <logmsgbot>	 !log otto@deploy2002 Synchronized wmf-config/ext-EventStreamConfig.php: Config: [[gerrit:977075|Remove eventlogging_FeaturePolicyViolation and _SpecialMuteSubmit EventStreamConfig (T329718)]] (duration: 09m 16s)
[21:06:40] <stashbot>	 T329718: Decommission the SpecialMuteSubmit instrument - https://phabricator.wikimedia.org/T329718
[21:07:52] <taavi>	 ottomata: T353024
[21:07:52] <stashbot>	 T353024: Update [[wikitech:Backport_windows/Deployers]] for scap backport - https://phabricator.wikimedia.org/T353024
[21:07:56] <ottomata>	 my scap sync file is done
[21:08:07] <ottomata>	 taavi: thank you!
[21:08:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:08:54] <Kizule>	 jan_drewniak: I've added my patch to the Deployments page.
[21:10:33] <jan_drewniak>	 Kizule kostajh okydoke. I'll do all three at once,  `scap backport 980951 981337  976911` I'll let you know when they're ready to test
[21:11:14] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980951 (https://phabricator.wikimedia.org/T351339) (owner: 10Jdlrobson)
[21:11:16] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981337 (https://phabricator.wikimedia.org/T348298) (owner: 10Ilias Sarantopoulos)
[21:11:18] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976911 (https://phabricator.wikimedia.org/T351873) (owner: 10Zoranzoki21)
[21:11:35] <wikibugs>	 (03PS3) 10Jdrewniak: Enable action blocks in Serbian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976911 (https://phabricator.wikimedia.org/T351873) (owner: 10Zoranzoki21)
[21:11:35] <kostajh>	 thx
[21:11:38] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980951 (https://phabricator.wikimedia.org/T351339) (owner: 10Jdlrobson)
[21:11:40] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981337 (https://phabricator.wikimedia.org/T348298) (owner: 10Ilias Sarantopoulos)
[21:11:42] <wikibugs>	 (03CR) 10TrainBranchBot: "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976911 (https://phabricator.wikimedia.org/T351873) (owner: 10Zoranzoki21)
[21:11:52] <kostajh>	 jan_drewniak: mine will just show up in beta labs whenever that sync happens, so no need to wait for me
[21:12:02] <wikibugs>	 (03Merged) 10jenkins-bot: Enable Vector beta feature for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980951 (https://phabricator.wikimedia.org/T351339) (owner: 10Jdlrobson)
[21:12:05] <wikibugs>	 (03Merged) 10jenkins-bot: [beta] ores-extension: enable revertrisk model for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981337 (https://phabricator.wikimedia.org/T348298) (owner: 10Ilias Sarantopoulos)
[21:12:45] <wikibugs>	 (03PS4) 10Jdrewniak: Enable action blocks in Serbian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976911 (https://phabricator.wikimedia.org/T351873) (owner: 10Zoranzoki21)
[21:12:53] <wikibugs>	 (03CR) 10TrainBranchBot: "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976911 (https://phabricator.wikimedia.org/T351873) (owner: 10Zoranzoki21)
[21:12:54] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:13:23] <jan_drewniak>	 (scap backport not as great when each patch requires a rebase)
[21:13:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:13:35] <wikibugs>	 (03Merged) 10jenkins-bot: Enable action blocks in Serbian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976911 (https://phabricator.wikimedia.org/T351873) (owner: 10Zoranzoki21)
[21:13:52] <logmsgbot>	 !log jdrewniak@deploy2002 Started scap: Backport for [[gerrit:980951|Enable Vector beta feature for all wikis (T351339)]], [[gerrit:981337|[beta] ores-extension: enable revertrisk model for enwiki (T348298)]], [[gerrit:976911|Enable action blocks in Serbian Wikipedia (T351873)]]
[21:13:58] <stashbot>	 T351339: Deploy client preferences to production beta features - https://phabricator.wikimedia.org/T351339
[21:13:59] <stashbot>	 T348298: Add revertrisk-language-agnostic to RecentChanges filters - https://phabricator.wikimedia.org/T348298
[21:13:59] <stashbot>	 T351873: Enable action blocks in Serbian Wikipedia - https://phabricator.wikimedia.org/T351873
[21:15:13] <logmsgbot>	 !log jdrewniak@deploy2002 zoranzoki21 and isaranto and jdlrobson and jdrewniak: Backport for [[gerrit:980951|Enable Vector beta feature for all wikis (T351339)]], [[gerrit:981337|[beta] ores-extension: enable revertrisk model for enwiki (T348298)]], [[gerrit:976911|Enable action blocks in Serbian Wikipedia (T351873)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:15:31] <jan_drewniak>	 Kizule: the change is ready for testing on mwdebug
[21:15:49] <Kizule>	 jan_drewniak: Alright, give me a moment.
[21:16:04] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:16:41] <Kizule>	 Good to go
[21:17:12] <jan_drewniak>	 Kizule: alrighty, we're syncing :) 
[21:17:15] <logmsgbot>	 !log jdrewniak@deploy2002 zoranzoki21 and isaranto and jdlrobson and jdrewniak: Continuing with sync
[21:19:28] <Kizule>	 Thanks!
[21:20:56] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:23:46] <logmsgbot>	 !log jdrewniak@deploy2002 Finished scap: Backport for [[gerrit:980951|Enable Vector beta feature for all wikis (T351339)]], [[gerrit:981337|[beta] ores-extension: enable revertrisk model for enwiki (T348298)]], [[gerrit:976911|Enable action blocks in Serbian Wikipedia (T351873)]] (duration: 09m 54s)
[21:23:53] <stashbot>	 T351339: Deploy client preferences to production beta features - https://phabricator.wikimedia.org/T351339
[21:23:53] <stashbot>	 T348298: Add revertrisk-language-agnostic to RecentChanges filters - https://phabricator.wikimedia.org/T348298
[21:23:53] <stashbot>	 T351873: Enable action blocks in Serbian Wikipedia - https://phabricator.wikimedia.org/T351873
[21:24:33] <Kizule>	 Works in production, thanks!
[21:25:16] <jan_drewniak>	 alright, Kizule kostajh changes have been deployed :) 
[21:25:40] <kostajh>	 jan_drewniak: thank you
[21:25:47] <Kizule>	 Thank you!
[21:26:24] <Kizule>	 Now that we have ~30 minutes, I think we could use this.. Amir1: Can we test namespaceDupes.php on smaller Serbian projects? My task for running it still needs to be done. :)
[21:26:38] <kostajh>	 jan_drewniak: I need to revert mine, after it synced to beta, I can see it is causing errors.
[21:26:52] <kostajh>	 jan_drewniak: https://beta-logs.wmcloud.org/app/discover#/doc/5f0c9be0-0b6f-11ec-9cde-3f4490e09a26/logstash-mediawiki-1-7.0.0-1-2023.12.07?id=RlcrRowBhFLoKHIVRB8m
[21:27:28] <jan_drewniak>	 kostajh: no problem, we can revert that
[21:27:29] <kostajh>	 jan_drewniak: can you do `scap backport --revert 981337`?
[21:28:42] <kostajh>	 jan_drewniak: hmm, maybe we just need to run update.php on beta?
[21:28:43] <kostajh>	 I am not sure
[21:29:19] <Kizule>	 I don't think that we have ever run it?
[21:29:37] <jan_drewniak>	 kostajh: I don't know how to run update scripts on beta (though I feel like that should be done automatically)
[21:30:57] <RhinosF1>	 It's done automagically
[21:31:13] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[21:31:15] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1082.eqiad.wmnet with OS bullseye
[21:31:23] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host ms-be1082.eqiad.wmnet with OS bullseye completed: - ms-be...
[21:31:41] <jan_drewniak>	 kostajh: since that's the case, maybe I'll revert? 
[21:31:50] <RhinosF1>	 https://integration.wikimedia.org/ci/view/Beta/job/beta-update-databases-eqiad/
[21:31:57] <RhinosF1>	 jan_drewniak: ^
[21:32:09] <RhinosF1>	 Last run was 10 minutes ago
[21:32:14] <RhinosF1>	 One due in 50
[21:32:26] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (10Jclark-ctr)
[21:32:27] <RhinosF1>	 Someone might be able to trigger manually
[21:32:42] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (10Jclark-ctr)
[21:33:06] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (10Jclark-ctr) 05Open→03Resolved
[21:34:56] <kostajh>	 jan_drewniak: yeah, let's just revert
[21:35:09] <kostajh>	 I am too tired to debug that now :) we can try again next week
[21:35:11] <kostajh>	 thanks!
[21:35:31] <jan_drewniak>	 kostajh: or we can wait an hour and see if the errors subside (I won't be here to revert in an hour if that's necessary though)
[21:36:09] <kostajh>	 jan_drewniak: I suspect we are missing some config
[21:36:25] <RhinosF1>	 I suggest a revert if unsure, it's the last window of the week
[21:36:49] <kostajh>	 yeah, let's revert please.
[21:36:54] <wikibugs>	 (03CR) 10Eevans: [C: 03+2] install_server: partman recipe for new sessionstore hosts [puppet] - 10https://gerrit.wikimedia.org/r/975913 (https://phabricator.wikimedia.org/T349875) (owner: 10Eevans)
[21:37:37] <wikibugs>	 (03PS1) 10TrainBranchBot: Revert "[beta] ores-extension: enable revertrisk model for enwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981395
[21:37:39] <wikibugs>	 (03CR) 10TrainBranchBot: "jdrewniak@deploy2002 created a revert of this change as Id628e04a35adbd748824437c8cc921f1e08e9371" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981337 (https://phabricator.wikimedia.org/T348298) (owner: 10Ilias Sarantopoulos)
[21:37:53] <logmsgbot>	 !log xcollazo@deploy2002 Started deploy [airflow-dags/analytics@049cf03]: (no justification provided)
[21:38:00] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981395 (owner: 10TrainBranchBot)
[21:38:21] <logmsgbot>	 !log xcollazo@deploy2002 Finished deploy [airflow-dags/analytics@049cf03]: (no justification provided) (duration: 00m 28s)
[21:38:42] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "[beta] ores-extension: enable revertrisk model for enwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981395 (owner: 10TrainBranchBot)
[21:39:39] <wikibugs>	 10SRE, 10ops-eqiad, 10decommission-hardware: decommission rdb1009, rdb1010 - https://phabricator.wikimedia.org/T352547 (10Jclark-ctr) a:03Jclark-ctr
[21:39:42] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1168 - https://phabricator.wikimedia.org/T353020 (10Jclark-ctr) a:03Jclark-ctr Server is out of warranty. I can check tomorrow morning; I am pretty sure I have a spare drive from a decommissioned host, but will verify.
[21:39:45] <jan_drewniak>	 alrighty, kostajh change reverted :) 
[21:40:18] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1119.eqiad.wmnet - https://phabricator.wikimedia.org/T337206 (10Jclark-ctr) a:03Jclark-ctr
[21:41:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T343198)', diff saved to https://phabricator.wikimedia.org/P54289 and previous config saved to /var/cache/conftool/dbconfig/20231207-214114-ladsgroup.json
[21:41:18] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[21:42:42] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (10Eevans) >>! In T349876#9391164, @Jhancock.wm wrote: > having an issue with all the new sessionstore servers that I think stems from the HBA355i Fnt card. >  > When t...
[21:43:35] <wikibugs>	 10SRE, 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10Jclark-ctr) @ayounsi if you wouldn't mind messaging me a time that works best with you so we can fix this
[21:46:42] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q2:rack/setup/install 3 sessionstore hosts (eqiad) - https://phabricator.wikimedia.org/T349875 (10Jclark-ctr)
[21:51:04] <wikibugs>	 (03PS1) 10Jclark-ctr: Add ganeti103[5-8] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/981396 (https://phabricator.wikimedia.org/T349925)
[21:51:38] <wikibugs>	 (03CR) 10Jclark-ctr: [C: 03+2] Add ganeti103[5-8] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/981396 (https://phabricator.wikimedia.org/T349925) (owner: 10Jclark-ctr)
[21:51:42] <wikibugs>	 (03CR) 10JHathaway: [C: 03+2] apt_repo: move hiera data into module, to allow for validation [puppet] - 10https://gerrit.wikimedia.org/r/979470 (https://phabricator.wikimedia.org/T352604) (owner: 10JHathaway)
[21:52:59] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10Patch-For-Review: Q2:rack/setup/install ganeti103[5-8] - https://phabricator.wikimedia.org/T349925 (10Jclark-ctr)
[21:56:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P54290 and previous config saved to /var/cache/conftool/dbconfig/20231207-215620-ladsgroup.json
[21:57:05] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1059.mgmt.eqiad.wmnet with reboot policy FORCED
[21:57:07] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1060.mgmt.eqiad.wmnet with reboot policy FORCED
[21:57:09] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1061.mgmt.eqiad.wmnet with reboot policy FORCED
[21:57:11] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1062.mgmt.eqiad.wmnet with reboot policy FORCED
[21:57:35] <wikibugs>	 (03PS7) 10JHathaway: apt_repo: move hiera data into module, to allow for validation [puppet] - 10https://gerrit.wikimedia.org/r/979470 (https://phabricator.wikimedia.org/T352604)
[21:58:14] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (10Jclark-ctr)
[22:01:18] <wikibugs>	 (03CR) 10JHathaway: [C: 03+2] apt_repo: move hiera data into module, to allow for validation [puppet] - 10https://gerrit.wikimedia.org/r/979470 (https://phabricator.wikimedia.org/T352604) (owner: 10JHathaway)
[22:03:56] <wikibugs>	 (03PS1) 10Jclark-ctr: add kubernetes10[59-62] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/981399 (https://phabricator.wikimedia.org/T349874)
[22:04:31] <wikibugs>	 (03CR) 10Jclark-ctr: [C: 03+2] add kubernetes10[59-62] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/981399 (https://phabricator.wikimedia.org/T349874) (owner: 10Jclark-ctr)
[22:05:42] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (10Jclark-ctr)
[22:05:48] <wikibugs>	 (03PS1) 10Ebernhardson: cirrus updater: Enable consumer-devnull in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/981401
[22:07:20] <wikibugs>	 (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Enable consumer-devnull in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/981401 (owner: 10Ebernhardson)
[22:08:00] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs2017:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[22:08:11] <wikibugs>	 (03Merged) 10jenkins-bot: cirrus updater: Enable consumer-devnull in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/981401 (owner: 10Ebernhardson)
[22:09:48] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[22:10:20] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[22:10:28] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[22:11:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P54291 and previous config saved to /var/cache/conftool/dbconfig/20231207-221127-ladsgroup.json
[22:13:10] <wikibugs>	 (03PS1) 10Ebernhardson: cirrus updater: Ensure correct image name is provided to consumer-devnull [deployment-charts] - 10https://gerrit.wikimedia.org/r/981402
[22:14:18] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1059.mgmt.eqiad.wmnet with reboot policy FORCED
[22:14:23] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1062.mgmt.eqiad.wmnet with reboot policy FORCED
[22:14:30] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1060.mgmt.eqiad.wmnet with reboot policy FORCED
[22:14:35] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1061.mgmt.eqiad.wmnet with reboot policy FORCED
[22:15:37] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1059.eqiad.wmnet with OS bullseye
[22:15:38] <wikibugs>	 (03PS2) 10Ebernhardson: cirrus updater: Ensure correct image name is provided to consumer-devnull [deployment-charts] - 10https://gerrit.wikimedia.org/r/981402
[22:15:44] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1059.eqiad.wmnet with OS bullseye
[22:16:28] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1060.eqiad.wmnet with OS bullseye
[22:16:32] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1061.eqiad.wmnet with OS bullseye
[22:16:34] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1062.eqiad.wmnet with OS bullseye
[22:16:35] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1060.eqiad.wmnet with OS bullseye
[22:16:39] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1061.eqiad.wmnet with OS bullseye
[22:16:41] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1062.eqiad.wmnet with OS bullseye
[22:17:16] <wikibugs>	 (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Ensure correct image name is provided to consumer-devnull [deployment-charts] - 10https://gerrit.wikimedia.org/r/981402 (owner: 10Ebernhardson)
[22:18:05] <wikibugs>	 (03Merged) 10jenkins-bot: cirrus updater: Ensure correct image name is provided to consumer-devnull [deployment-charts] - 10https://gerrit.wikimedia.org/r/981402 (owner: 10Ebernhardson)
[22:19:16] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[22:19:16] <wikibugs>	 (03PS4) 10Jdlrobson: Remove readability survey tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980063 (https://phabricator.wikimedia.org/T349337) (owner: 10Kimberly Sarabia)
[22:19:32] <wikibugs>	 (03CR) 10Jdlrobson: [C: 03+1] "No rush on this one. Whenever we can get round to it!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980063 (https://phabricator.wikimedia.org/T349337) (owner: 10Kimberly Sarabia)
[22:19:32] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[22:19:50] <wikibugs>	 (03CR) 10Jdlrobson: [C: 04-1] Remove readability survey tool (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980063 (https://phabricator.wikimedia.org/T349337) (owner: 10Kimberly Sarabia)
[22:20:22] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[22:20:32] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[22:22:26] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[22:22:33] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[22:26:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T343198)', diff saved to https://phabricator.wikimedia.org/P54292 and previous config saved to /var/cache/conftool/dbconfig/20231207-222633-ladsgroup.json
[22:26:36] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1187.eqiad.wmnet with reason: Maintenance
[22:26:38] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[22:26:50] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1187.eqiad.wmnet with reason: Maintenance
[22:26:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1187 (T343198)', diff saved to https://phabricator.wikimedia.org/P54293 and previous config saved to /var/cache/conftool/dbconfig/20231207-222656-ladsgroup.json
[22:29:31] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1059.eqiad.wmnet with reason: host reimage
[22:30:23] <wikibugs>	 (03PS1) 10Ebernhardson: admin_ng: Use same limits in cirrus-streaming-updater as rdf-streaming-updater [deployment-charts] - 10https://gerrit.wikimedia.org/r/981403
[22:30:33] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1060.eqiad.wmnet with reason: host reimage
[22:30:50] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1062.eqiad.wmnet with reason: host reimage
[22:31:10] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1061.eqiad.wmnet with reason: host reimage
[22:33:10] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1059.eqiad.wmnet with reason: host reimage
[22:35:32] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes1062.eqiad.wmnet with reason: host reimage
[22:35:56] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1060.eqiad.wmnet with reason: host reimage
[22:37:32] <wikibugs>	 (03PS5) 10Kimberly Sarabia: Remove readability survey tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980063 (https://phabricator.wikimedia.org/T349337)
[22:38:53] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1061.eqiad.wmnet with reason: host reimage
[22:48:37] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[22:53:36] <wikibugs>	 (03PS5) 10Bking: wdqs: monitor ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/981387 (https://phabricator.wikimedia.org/T347355)
[22:53:53] <wikibugs>	 (03CR) 10Kimberly Sarabia: Remove readability survey tool (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980063 (https://phabricator.wikimedia.org/T349337) (owner: 10Kimberly Sarabia)
[22:53:53] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[22:55:37] <wikibugs>	 10SRE-swift-storage, 10ops-codfw, 10ops-eqiad, 10DC-Ops: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 (10Papaul) @Volans did the test 4 times. The first 2 times the server did PXE boot, but the last 2 times it didn't.
[22:55:47] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host cp4037.ulsfo.wmnet
[22:58:59] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[23:00:03] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10Papaul) @Jhancock.wm did you read my comment on Wed, Dec 6, 2:53 PM?
[23:00:25] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[23:05:12] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[23:07:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T343198)', diff saved to https://phabricator.wikimedia.org/P54294 and previous config saved to /var/cache/conftool/dbconfig/20231207-230749-ladsgroup.json
[23:07:54] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[23:08:38] <wikibugs>	 (03PS2) 10Ryan Kemper: admin_ng: Use same limits in cirrus-streaming-updater as rdf-streaming-updater [deployment-charts] - 10https://gerrit.wikimedia.org/r/981403 (owner: 10Ebernhardson)
[23:09:07] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+1] admin_ng: Use same limits in cirrus-streaming-updater as rdf-streaming-updater [deployment-charts] - 10https://gerrit.wikimedia.org/r/981403 (owner: 10Ebernhardson)
[23:09:51] <wikibugs>	 (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] admin_ng: Use same limits in cirrus-streaming-updater as rdf-streaming-updater [deployment-charts] - 10https://gerrit.wikimedia.org/r/981403 (owner: 10Ebernhardson)
[23:12:32] <wikibugs>	 (03Merged) 10jenkins-bot: admin_ng: Use same limits in cirrus-streaming-updater as rdf-streaming-updater [deployment-charts] - 10https://gerrit.wikimedia.org/r/981403 (owner: 10Ebernhardson)
[23:15:22] <logmsgbot>	 !log ryankemper@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[23:17:18] <logmsgbot>	 !log ryankemper@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[23:21:05] <logmsgbot>	 !log ryankemper@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[23:21:06] <logmsgbot>	 !log ryankemper@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[23:21:44] <logmsgbot>	 !log ryankemper@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[23:22:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P54295 and previous config saved to /var/cache/conftool/dbconfig/20231207-232256-ladsgroup.json
[23:23:17] <logmsgbot>	 !log ryankemper@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[23:23:43] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[23:23:54] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[23:35:39] <wikibugs>	 (03PS1) 10Andrea Denisse: prometheus: Ensure prometheus-icinga has a listening address [puppet] - 10https://gerrit.wikimedia.org/r/981407 (https://phabricator.wikimedia.org/T333615)
[23:38:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P54296 and previous config saved to /var/cache/conftool/dbconfig/20231207-233802-ladsgroup.json
[23:38:57] <wikibugs>	 (03PS3) 10Andrea Denisse: klaxon: Ensure the klaxon user has a home directory [puppet] - 10https://gerrit.wikimedia.org/r/980921 (https://phabricator.wikimedia.org/T333615)
[23:39:21] <wikibugs>	 (03CR) 10Andrea Denisse: klaxon: Ensure the klaxon user has a home directory (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/980921 (https://phabricator.wikimedia.org/T333615) (owner: 10Andrea Denisse)
[23:39:35] <wikibugs>	 (03PS1) 10Ebernhardson: cirrus updater: Undeploy consumer-devnull from staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/981409
[23:42:34] <wikibugs>	 (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Undeploy consumer-devnull from staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/981409 (owner: 10Ebernhardson)
[23:43:18] <wikibugs>	 (03Merged) 10jenkins-bot: cirrus updater: Undeploy consumer-devnull from staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/981409 (owner: 10Ebernhardson)
[23:46:21] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/980048 (https://phabricator.wikimedia.org/T350591) (owner: 10Herron)
[23:47:16] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[23:47:23] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/981298 (https://phabricator.wikimedia.org/T352968) (owner: 10Filippo Giunchedi)
[23:47:37] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[23:52:16] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[23:52:18] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[23:52:21] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1060.eqiad.wmnet with OS bullseye
[23:52:21] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[23:52:24] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1059.eqiad.wmnet with OS bullseye
[23:52:27] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[23:52:27] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1062.eqiad.wmnet with OS bullseye
[23:52:27] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1060.eqiad.wmnet with OS bullseye completed: - kubernetes1060 (**WARN**)...
[23:52:30] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1059.eqiad.wmnet with OS bullseye completed: - kubernetes1059 (**PASS**)...
[23:52:32] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1061.eqiad.wmnet with OS bullseye
[23:52:32] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1062.eqiad.wmnet with OS bullseye completed: - kubernetes1062 (**WARN**)...
[23:52:37] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1061.eqiad.wmnet with OS bullseye completed: - kubernetes1061 (**WARN**)...
[23:52:41] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.192.16.243:9042 on restbase2030 is OK: TCP OK - 0.040 second response time on 10.192.16.243 port 9042 https://phabricator.wikimedia.org/T93886
[23:53:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T343198)', diff saved to https://phabricator.wikimedia.org/P54297 and previous config saved to /var/cache/conftool/dbconfig/20231207-235310-ladsgroup.json
[23:53:12] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1201.eqiad.wmnet with reason: Maintenance
[23:53:14] <stashbot>	 T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[23:53:27] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1201.eqiad.wmnet with reason: Maintenance
[23:53:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1201 (T343198)', diff saved to https://phabricator.wikimedia.org/P54298 and previous config saved to /var/cache/conftool/dbconfig/20231207-235333-ladsgroup.json
[23:53:59] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (10Jclark-ctr) a:05VRiley-WMF→03Jclark-ctr
[23:54:04] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (10Jclark-ctr) 05Open→03Resolved
[23:56:54] <wikibugs>	 (03PS1) 10Papaul: Rename ceph to cephosd [puppet] - 10https://gerrit.wikimedia.org/r/981413 (https://phabricator.wikimedia.org/T349934)