[00:23:42] (PS1) Clare Ming: Add stream config for Android article instruments [mediawiki-config] - https://gerrit.wikimedia.org/r/980963 (https://phabricator.wikimedia.org/T351292)
[00:29:50] (PS1) Jforrester: [BETA CLUSTER] Switch log host from mwlog01 to mwlog02 [mediawiki-config] - https://gerrit.wikimedia.org/r/980965 (https://phabricator.wikimedia.org/T345566)
[00:30:12] (CR) Clare Ming: "hi all - just need a single +1 if this lgtu" [mediawiki-config] - https://gerrit.wikimedia.org/r/980963 (https://phabricator.wikimedia.org/T351292) (owner: Clare Ming)
[00:33:28] (PS1) Jforrester: deployment-prep: Switch mwlog01 to mwlog02 [puppet] - https://gerrit.wikimedia.org/r/980966 (https://phabricator.wikimedia.org/T345566)
[00:38:28] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/980832
[00:38:30] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/980832 (owner: TrainBranchBot)
[00:39:53] PROBLEM - Check systemd state on centrallog2002 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:41:47] PROBLEM - Check systemd state on centrallog1002 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:42:59] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:51:23] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:53:12] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[00:53:14] !log jclark@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[00:53:17] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1080.eqiad.wmnet with OS bullseye
[00:53:19] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1081.eqiad.wmnet with OS bullseye
[00:53:24] SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host ms-be1080.eqiad.wmnet with OS bullseye completed: - ms-be...
[00:53:27] SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host ms-be1081.eqiad.wmnet with OS bullseye completed: - ms-be...
[00:53:42] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1082.eqiad.wmnet with OS bullseye
[00:53:49] SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host ms-be1082.eqiad.wmnet with OS bullseye
[00:57:37] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/980832 (owner: TrainBranchBot)
[01:14:53] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:22:31] (CR) Jforrester: [C: +2] [BETA CLUSTER] Switch log host from mwlog01 to mwlog02 [mediawiki-config] - https://gerrit.wikimedia.org/r/980965 (https://phabricator.wikimedia.org/T345566) (owner: Jforrester)
[01:23:45] (Merged) jenkins-bot: [BETA CLUSTER] Switch log host from mwlog01 to mwlog02 [mediawiki-config] - https://gerrit.wikimedia.org/r/980965 (https://phabricator.wikimedia.org/T345566) (owner: Jforrester)
[01:31:45] RECOVERY - cassandra-b CQL 10.192.16.241:9042 on restbase2029 is OK: TCP OK - 0.035 second response time on 10.192.16.241 port 9042 https://phabricator.wikimedia.org/T93886
[01:44:41] RECOVERY - cassandra-c SSL 10.192.16.242:7000 on restbase2029 is OK: SSL OK - Certificate restbase2029-c valid until 2025-12-05 16:11:15 +0000 (expires in 729 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[01:44:49] RECOVERY - cassandra-c service on restbase2029 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[01:49:59] (CR) Ori: [V: +2 C: +2] "Cherry-picked on beta cluster. Verified by running tcpdump on a beta MediaWiki host:" [puppet] - https://gerrit.wikimedia.org/r/980966 (https://phabricator.wikimedia.org/T345566) (owner: Jforrester)
[01:56:03] (ConfdResourceFailed) firing: (2) confd resource _etc_ferm_conf.d_00_defs_requestctl.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[02:05:39] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[02:22:08] (CR) Sharvaniharan: [C: +1] "Looks good to me!" [mediawiki-config] - https://gerrit.wikimedia.org/r/980963 (https://phabricator.wikimedia.org/T351292) (owner: Clare Ming)
[02:39:06] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:53:51] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/980465 (owner: Andrew Bogott)
[02:59:16] (CR) Andrew Bogott: [C: +2] Keystone: add logic to manage credential keys [puppet] - https://gerrit.wikimedia.org/r/980465 (owner: Andrew Bogott)
[03:00:05] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[03:00:24] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[03:01:44] mutante, James_F, I merged your puppet patches
[03:09:06] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:19:06] (ProbeDown) firing: (2) Service miscweb1003:443 has failed probes (http_commons_query_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:21:11] (PS3) Andrew Bogott: Horizon: allow image uploading via horizon for users with glance admin [puppet] - https://gerrit.wikimedia.org/r/980021 (https://phabricator.wikimedia.org/T326818)
[03:21:13] (PS3) Andrew Bogott: cloud-init: make puppet optional [puppet] - https://gerrit.wikimedia.org/r/980079 (https://phabricator.wikimedia.org/T326818)
[03:21:25] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/980079 (https://phabricator.wikimedia.org/T326818) (owner: Andrew Bogott)
[03:45:53] PROBLEM - Mediawiki CirrusSearch Saneitizer Weekly Fix Rate on alert1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [250000.0] https://wikitech.wikimedia.org/wiki/Search%23Saneitizer_%28background_repair_process%29 https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=35&orgId=1&from=now-6M&to=now
[03:56:09] PROBLEM - MariaDB Replica Lag: s5 on clouddb1016 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 368.09 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:31:58] (PS4) Andrew Bogott: cloud-init: make puppet optional [puppet] - https://gerrit.wikimedia.org/r/980079 (https://phabricator.wikimedia.org/T326818)
[04:35:49] RECOVERY - MariaDB Replica Lag: s5 on clouddb1016 is OK: OK slave_sql_lag Replication lag: 0.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:34:11] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-wikifunctions_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:35:45] RECOVERY - cassandra-c CQL 10.192.16.242:9042 on restbase2029 is OK: TCP OK - 0.032 second response time on 10.192.16.242 port 9042 https://phabricator.wikimedia.org/T93886
[05:38:09] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[05:56:03] (ConfdResourceFailed) firing: (2) confd resource _etc_ferm_conf.d_00_defs_requestctl.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[06:05:21] RECOVERY - OSPF status on cr2-eqord is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:08:37] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:22:59] (PuppetZeroResources) firing: Puppet has failed generate resources on elastic1107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[06:23:53] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:31:51] (PS1) Marostegui: Revert "wmnet: Failover m2-master" [dns] - https://gerrit.wikimedia.org/r/980484
[06:32:00] (PS2) Marostegui: Revert "wmnet: Failover m2-master" [dns] - https://gerrit.wikimedia.org/r/980484
[06:32:15] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:33:14] (CR) Marostegui: [C: +2] Revert "wmnet: Failover m2-master" [dns] - https://gerrit.wikimedia.org/r/980484 (owner: Marostegui)
[06:35:01] (PS1) Marostegui: wmnet: Failover m5-master to dbproxy1027 [dns] - https://gerrit.wikimedia.org/r/980978 (https://phabricator.wikimedia.org/T351864)
[06:35:55] !log Failover m5-master from dbproxy1021 to dbproxy1027 T351864
[06:35:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:35:59] T351864: Migrate dbproxy hosts to Bookworm - https://phabricator.wikimedia.org/T351864
[06:36:33] (CR) Marostegui: [C: +2] wmnet: Failover m5-master to dbproxy1027 [dns] - https://gerrit.wikimedia.org/r/980978 (https://phabricator.wikimedia.org/T351864) (owner: Marostegui)
[06:37:59] (PuppetZeroResources) resolved: Puppet has failed generate resources on elastic1107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[06:41:03] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[06:41:09] Thanks for the failover marostegui I'm around for a bit
[06:44:32] (CR) Giuseppe Lavagetto: [C: +1] changeprop: refactor templating for Kafka producer/consumer settings [deployment-charts] - https://gerrit.wikimedia.org/r/971113 (https://phabricator.wikimedia.org/T348950) (owner: Elukey)
[06:44:38] (PS1) Marostegui: mariadb: Decommission db1119 [puppet] - https://gerrit.wikimedia.org/r/980979 (https://phabricator.wikimedia.org/T337206)
[06:44:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1119.eqiad.wmnet
[06:48:01] (CR) Marostegui: [C: +2] mariadb: Decommission db1119 [puppet] - https://gerrit.wikimedia.org/r/980979 (https://phabricator.wikimedia.org/T337206) (owner: Marostegui)
[06:50:19] !log marostegui@cumin1001 START - Cookbook sre.dns.netbox
[06:52:16] !log marostegui@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1119.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001"
[06:53:21] !log marostegui@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1119.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001"
[06:53:21] !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[06:53:22] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1119.eqiad.wmnet
[06:55:29] ops-eqiad, DBA, DC-Ops, decommission-hardware, Patch-For-Review: decommission db1119.eqiad.wmnet - https://phabricator.wikimedia.org/T337206 (Marostegui) a:Marostegui→None
[07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231207T0700)
[07:00:05] kormat, marostegui, and Amir1: I, the Bot under the Fountain, call upon thee, The Deployer, to do Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231207T0700).
[07:00:24] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:02:26] ops-eqiad, DBA, DC-Ops, decommission-hardware, Patch-For-Review: decommission db1119.eqiad.wmnet - https://phabricator.wikimedia.org/T337206 (Marostegui) This is ready for DC-Ops
[07:02:32] Puppet, SRE, Patch-For-Review: Add humorous redirect for fox.wikimedia.org - https://phabricator.wikimedia.org/T352870 (Joe) Open→Declined Or not :)
[07:03:39] ops-eqiad, DC-Ops, decommission-hardware, Patch-For-Review: decommission db1119.eqiad.wmnet - https://phabricator.wikimedia.org/T337206 (Marostegui)
[07:05:24] (CR) KartikMistry: Provide python3-bookworm image (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/980860 (https://phabricator.wikimedia.org/T352733) (owner: KartikMistry)
[07:17:08] (PS1) Marostegui: mariadb: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/980984 (https://phabricator.wikimedia.org/T343674)
[07:18:24] (CR) Arnaudb: [V: +1 C: +1] mariadb: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/980984 (https://phabricator.wikimedia.org/T343674) (owner: Marostegui)
[07:18:40] (CR) Marostegui: "db2131 is pooled, it should have notifications enabled" [puppet] - https://gerrit.wikimedia.org/r/980984 (https://phabricator.wikimedia.org/T343674) (owner: Marostegui)
[07:19:07] (CR) Marostegui: [C: +2] mariadb: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/980984 (https://phabricator.wikimedia.org/T343674) (owner: Marostegui)
[07:20:10] (ProbeDown) firing: (2) Service miscweb1003:443 has failed probes (http_commons_query_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:31:17] (PS2) Ayounsi: BGPPeers: add codfw racks A1 to B8 [deployment-charts] - https://gerrit.wikimedia.org/r/980925 (https://phabricator.wikimedia.org/T352893)
[07:37:42] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:42:00] PROBLEM - SSH on wdqs1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:58:20] PROBLEM - Check systemd state on wdqs1023 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:00:05] Amir1, apergos, and jnuche: Dear deployers, time to do the UTC morning backport and config training deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231207T0800).
[08:00:17] morning. no patches scheduled for deployment, no trainees signed up to learn, and that's a wrap. see you next time...
[08:01:43] (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:06:05] (PS1) Slyngshede: Add instance to summary for NTP [alerts] - https://gerrit.wikimedia.org/r/981175
[08:07:06] PROBLEM - SSH on wdqs1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:13:57] (CR) Ayounsi: "https://puppet-compiler.wmflabs.org/output/980927/846/" [puppet] - https://gerrit.wikimedia.org/r/980927 (https://phabricator.wikimedia.org/T352893) (owner: Ayounsi)
[08:16:21] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet
[08:18:00] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:19:28] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:21:06] (PS3) Muehlenhoff: Enable requestctl-based block list for nftables on sretest1001 [puppet] - https://gerrit.wikimedia.org/r/980868 (https://phabricator.wikimedia.org/T348734)
[08:21:15] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/980868 (https://phabricator.wikimedia.org/T348734) (owner: Muehlenhoff)
[08:21:43] (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:22:07] (CR) Giuseppe Lavagetto: [C: -2] "I don't like the idea to allow people to do this as it's a terrible footgun, but also I think in this case the implementation is wrong - y" [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/980864 (https://phabricator.wikimedia.org/T352003) (owner: Jelto)
[08:22:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet
[08:25:02] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:25:24] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:25:59] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org
[08:27:52] PROBLEM - OSPF status on cr2-eqord is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:29:26] PROBLEM - Check systemd state on wdqs1023 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:31:43] (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:32:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org
[08:32:25] (CR) Ilias Sarantopoulos: [C: +1] api-gateway: Add entry for recommendation-api-ng on Lift Wing [deployment-charts] - https://gerrit.wikimedia.org/r/980865 (https://phabricator.wikimedia.org/T347263) (owner: Klausman)
[08:32:28] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:33:52] (CR) Muehlenhoff: [C: +2] Enable requestctl-based block list for nftables on sretest1001 [puppet] - https://gerrit.wikimedia.org/r/980868 (https://phabricator.wikimedia.org/T348734) (owner: Muehlenhoff)
[08:34:22] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:35:38] SRE, Traffic, Patch-For-Review: Investigate IPVS IPIP encapsulation support - https://phabricator.wikimedia.org/T348837 (Vgutierrez) Open→Resolved
[08:37:22] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:38:22] (CR) Brouberol: [C: +1] "Thank you for spearheading the work on this!" [puppet] - https://gerrit.wikimedia.org/r/979470 (https://phabricator.wikimedia.org/T352604) (owner: JHathaway)
[08:40:48] (PS1) Giuseppe Lavagetto: wmf-debci: also install recommended dependencies [docker-images/production-images] - https://gerrit.wikimedia.org/r/981178
[08:42:38] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:43:58] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.386 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:44:46] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:45:02] PROBLEM - Check systemd state on sretest1001 is CRITICAL: CRITICAL - degraded: The following units failed: nftables.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:45:33] (CR) MVernon: [C: +1] "This looks like a good approach to me, thanks." [docker-images/production-images] - https://gerrit.wikimedia.org/r/981178 (owner: Giuseppe Lavagetto)
[08:46:18] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:46:43] (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:50:06] RECOVERY - Check systemd state on wdqs1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:50:10] RECOVERY - SSH on wdqs1023 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:51:14] (PS2) Giuseppe Lavagetto: Add asyncio implementation [software/python-poolcounter] - https://gerrit.wikimedia.org/r/980918 (https://phabricator.wikimedia.org/T338297)
[08:52:16] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 31 days, 0:00:00 on sretest1001.eqiad.wmnet with reason: WIP nftables
[08:52:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 31 days, 0:00:00 on sretest1001.eqiad.wmnet with reason: WIP nftables
[08:53:58] PROBLEM - Disk space on titan2001 is CRITICAL: DISK CRITICAL - free space: /srv 27433MiB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=titan2001&var-datasource=codfw+prometheus/ops
[08:57:14] PROBLEM - Check systemd state on wdqs1022 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:01:43] (SystemdUnitFailed) firing: (2) systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:07:22] (Abandoned) Jelto: add optional install_recommends to apt_install [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/980864 (https://phabricator.wikimedia.org/T352003) (owner: Jelto)
[09:07:25] RECOVERY - SSH on wdqs1022 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:09:09] RECOVERY - Check systemd state on wdqs1022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:09:23] RECOVERY - Check systemd state on sretest1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:11:43] (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:12:01] PROBLEM - SSH on wdqs1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:14:11] RECOVERY - SSH on wdqs1022 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:20:32] (CR) Filippo Giunchedi: [C: +1] ircecho: Migrate the ircecho script from Python 2 to Python 3 [puppet] - https://gerrit.wikimedia.org/r/972929 (https://phabricator.wikimedia.org/T333615) (owner: Andrea Denisse)
[09:20:42] SRE, MW-on-K8s, Traffic, serviceops, Release-Engineering-Team (Seen): Move MediaWiki jobs to mw-on-k8s - https://phabricator.wikimedia.org/T349796 (JMeybohm)
[09:21:12] (PS13) Elukey: changeprop: refactor templating for Kafka producer/consumer settings [deployment-charts] - https://gerrit.wikimedia.org/r/971113 (https://phabricator.wikimedia.org/T348950)
[09:21:14] (PS1) Muehlenhoff: nftables requestctl: Some tweaks and fixes [puppet] - https://gerrit.wikimedia.org/r/981280
[09:21:38] (CR) Filippo Giunchedi: "See inline" [puppet] - https://gerrit.wikimedia.org/r/980921 (https://phabricator.wikimedia.org/T333615) (owner: Andrea Denisse)
[09:21:53] (CR) Filippo Giunchedi: [C: +1] Add instance to summary for NTP [alerts] - https://gerrit.wikimedia.org/r/981175 (owner: Slyngshede)
[09:23:15] (PS2) Muehlenhoff: nftables requestctl: Some tweaks and fixes [puppet] - https://gerrit.wikimedia.org/r/981280 (https://phabricator.wikimedia.org/T348734)
[09:29:21] RECOVERY - Check systemd state on centrallog2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:30:07] RECOVERY - Check systemd state on centrallog1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:30:11] (PS1) Alexandros Kosiaris: cirrusSearchCheckerJob: Revert to baremetal [deployment-charts] - https://gerrit.wikimedia.org/r/981282 (https://phabricator.wikimedia.org/T352906)
[09:31:54] (CR) JMeybohm: [C: +1] cirrusSearchCheckerJob: Revert to baremetal [deployment-charts] - https://gerrit.wikimedia.org/r/981282 (https://phabricator.wikimedia.org/T352906) (owner: Alexandros Kosiaris)
[09:34:33] RECOVERY - Disk space on titan2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=titan2001&var-datasource=codfw+prometheus/ops
[09:35:03] (CR) Jelto: "one comment regarding the templating" [docker-images/production-images] - https://gerrit.wikimedia.org/r/981178 (owner: Giuseppe Lavagetto)
[09:35:42] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/981280 (https://phabricator.wikimedia.org/T348734) (owner: Muehlenhoff)
[09:36:31] (CR) Alexandros Kosiaris: [C: +2] cirrusSearchCheckerJob: Revert to baremetal [deployment-charts] - https://gerrit.wikimedia.org/r/981282 (https://phabricator.wikimedia.org/T352906) (owner: Alexandros Kosiaris)
[09:37:27] (Merged) jenkins-bot: cirrusSearchCheckerJob: Revert to baremetal [deployment-charts] - https://gerrit.wikimedia.org/r/981282 (https://phabricator.wikimedia.org/T352906) (owner: Alexandros Kosiaris)
[09:37:28] SRE-swift-storage, ops-codfw, ops-eqiad, DC-Ops: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 (Volans) @Papaul another test we could do is use the dhcp cookbook and then try to reboot into PXE using remote IPMI like the cookbook does. The co...
[09:39:55] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply
[09:40:12] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply
[09:40:41] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
[09:41:05] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
[09:41:34] (CR) Ayounsi: [C: +2] Netbox: remove SECURE_PROXY_SSL_HEADER [puppet] - https://gerrit.wikimedia.org/r/980815 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi)
[09:42:00] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
[09:42:15] (Abandoned) Ayounsi: netbox, profile::netbox: Switch to CAS authentication [puppet] - https://gerrit.wikimedia.org/r/668753 (https://phabricator.wikimedia.org/T244849) (owner: CRusnov)
[09:42:22] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
[09:42:26] (PS1) Muehlenhoff: Remove now obsolete cergen Ganeti certs [puppet] - https://gerrit.wikimedia.org/r/981285 (https://phabricator.wikimedia.org/T350686)
[09:48:19] (CR) Btullis: [V: +1 C: +2] Update the refinery version used by the refine test jobs [puppet] - https://gerrit.wikimedia.org/r/980445 (https://phabricator.wikimedia.org/T349121) (owner: Btullis)
[09:48:32] (CR) Btullis: [V: +1 C: +2] Update the refinery version used by the refine test jobs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/980445 (https://phabricator.wikimedia.org/T349121) (owner: Btullis)
[09:48:59] RECOVERY - Mediawiki CirrusSearch Saneitizer Weekly Fix Rate on alert1001 is OK: OK: Less than 1.00% above the threshold [100000.0] https://wikitech.wikimedia.org/wiki/Search%23Saneitizer_%28background_repair_process%29 https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=35&orgId=1&from=now-6M&to=now
[09:50:25] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/981285 (https://phabricator.wikimedia.org/T350686) (owner: Muehlenhoff)
[09:56:04] (ConfdResourceFailed) firing: (2) confd resource _etc_ferm_conf.d_00_defs_requestctl.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[09:56:10] (CR) Volans: [C: +1] "LGTM, thanks for the fixes" [puppet] - https://gerrit.wikimedia.org/r/972929 (https://phabricator.wikimedia.org/T333615) (owner: Andrea Denisse)
[10:02:09] (CR) Effie Mouzeli: [C: +1] [admin] Add ecarg shell account [puppet] - https://gerrit.wikimedia.org/r/980060 (https://phabricator.wikimedia.org/T350918) (owner: EoghanGaffney)
[10:02:19] (CR) Effie Mouzeli: [C: +1] [admin] Add ehughes ldap account [puppet] - https://gerrit.wikimedia.org/r/980358 (https://phabricator.wikimedia.org/T351387) (owner: EoghanGaffney)
[10:04:57] (PS1) Filippo Giunchedi: rsyslog: add receiver action names [puppet] - https://gerrit.wikimedia.org/r/981287 (https://phabricator.wikimedia.org/T351710)
[10:05:47] (ConfdResourceFailed) resolved: (2) confd resource _etc_ferm_conf.d_00_defs_requestctl.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[10:06:09] (CR) Ayounsi: Netbox module: add get/set for primary IPs and access vlan (4 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/979040 (https://phabricator.wikimedia.org/T350152) (owner: Ayounsi)
[10:06:28] (PS5) Ayounsi: Netbox module: add get/set for primary IPs and access vlan [software/spicerack] - https://gerrit.wikimedia.org/r/979040 (https://phabricator.wikimedia.org/T350152)
[10:10:26] (CR) Effie Mouzeli: [C: +1] [admin] Add user account for xiaoxiao to data.yaml [puppet] - https://gerrit.wikimedia.org/r/979389 (https://phabricator.wikimedia.org/T352098) (owner: EoghanGaffney)
[10:13:55] (CR) EoghanGaffney: [C: +2] [admin] Add user account for xiaoxiao to data.yaml [puppet] - https://gerrit.wikimedia.org/r/979389 (https://phabricator.wikimedia.org/T352098) (owner: EoghanGaffney)
[10:16:21] (PS1) Muehlenhoff: Fix handling of Ferm's 00_defs_requestctl when changing firewall provider [puppet] - https://gerrit.wikimedia.org/r/981288 (https://phabricator.wikimedia.org/T348734)
[10:16:24] (PS1) Slyngshede: Blackbox alerting for urldownloaders [alerts] - https://gerrit.wikimedia.org/r/981289 (https://phabricator.wikimedia.org/T350694)
[10:20:09] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to "researchers" and "analytics-privatedata-users" for Xiao Xiao - https://phabricator.wikimedia.org/T352098 (eoghan)
[10:20:15] (CR) Filippo Giunchedi: [C: +2] rsyslog: add receiver action names [puppet] - https://gerrit.wikimedia.org/r/981287 (https://phabricator.wikimedia.org/T351710) (owner:
10Filippo Giunchedi) [10:21:52] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/981288 (https://phabricator.wikimedia.org/T348734) (owner: 10Muehlenhoff) [10:22:07] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1173.eqiad.wmnet with reason: Maintenance [10:22:21] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1173.eqiad.wmnet with reason: Maintenance [10:22:47] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Maintenance [10:23:01] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Maintenance [10:23:36] (03CR) 10Ayounsi: Netbox: add generic function to execute a Netbox script (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/979121 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [10:24:56] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to "researchers" and "analytics-privatedata-users" for Xiao Xiao - https://phabricator.wikimedia.org/T352098 (10eoghan) 05Open→03Resolved Hi @XiaoXiao-WMF , this should be done! Please reach out if you're having any problems! [10:27:43] !log brouberol@cumin1001 START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop test cluster: Restart of jvm daemons. 
[10:28:08] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/980834 [10:28:37] (03PS3) 10Ayounsi: Netbox: add generic function to execute a Netbox script [software/spicerack] - 10https://gerrit.wikimedia.org/r/979121 (https://phabricator.wikimedia.org/T350152) [10:30:52] (03CR) 10Jgiannelos: [C: 03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/980834 (owner: 10PipelineBot) [10:31:42] (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/980834 (owner: 10PipelineBot) [10:32:05] (03CR) 10Kamila Součková: [C: 03+2] mw-api-int: increase replicas by 33% [deployment-charts] - 10https://gerrit.wikimedia.org/r/980888 (https://phabricator.wikimedia.org/T350846) (owner: 10Kamila Součková) [10:32:55] (03Merged) 10jenkins-bot: mw-api-int: increase replicas by 33% [deployment-charts] - 10https://gerrit.wikimedia.org/r/980888 (https://phabricator.wikimedia.org/T350846) (owner: 10Kamila Součková) [10:32:56] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1183.eqiad.wmnet with reason: Maintenance [10:33:11] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1183.eqiad.wmnet with reason: Maintenance [10:33:22] (03CR) 10Klausman: [V: 03+2 C: 03+2] api-gateway: Add entry for recommendation-api-ng on Lift Wing [deployment-charts] - 10https://gerrit.wikimedia.org/r/980865 (https://phabricator.wikimedia.org/T347263) (owner: 10Klausman) [10:33:29] ⎋/query hnowlan [10:33:32] oops :) [10:33:38] !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [10:33:53] !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [10:34:08] !log kamila@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [10:34:36] !log kamila@deploy2002 helmfile [codfw] DONE 
helmfile.d/services/mw-api-int: apply [10:34:52] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2113.codfw.wmnet with reason: Maintenance [10:35:07] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2113.codfw.wmnet with reason: Maintenance [10:35:34] (03Merged) 10jenkins-bot: api-gateway: Add entry for recommendation-api-ng on Lift Wing [deployment-charts] - 10https://gerrit.wikimedia.org/r/980865 (https://phabricator.wikimedia.org/T347263) (owner: 10Klausman) [10:36:35] (03CR) 10Kamila Součková: [C: 03+2] mobileapps: 60% to mw-api-int [puppet] - 10https://gerrit.wikimedia.org/r/976222 (https://phabricator.wikimedia.org/T350846) (owner: 10Giuseppe Lavagetto) [10:36:51] (03Abandoned) 10Elukey: profile::thanos: improve istio sli recording rule [puppet] - 10https://gerrit.wikimedia.org/r/974486 (owner: 10Elukey) [10:38:38] !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [10:38:55] !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [10:42:16] (03PS1) 10Muehlenhoff: Bump standards version [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/981293 [10:44:31] 10SRE, 10Cloud-VPS, 10observability, 10Patch-For-Review, and 2 others: ossl rsyslog errors post-migration - https://phabricator.wikimedia.org/T351710 (10fgiunchedi) I've added some more visibility into how much we are writing to local files, compared to the amount of logs we are receiving. Turns out we re... 
[10:45:19] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: cluster::management [10:47:30] 10SRE-swift-storage, 10observability, 10SRE Observability (FY2023/2024-Q2), 10User-fgiunchedi: Stop sending swift access logs to centrallog for non state-changing requests - https://phabricator.wikimedia.org/T352968 (10fgiunchedi) [10:47:31] (03PS1) 10Muehlenhoff: Switch cluster::management to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/981294 (https://phabricator.wikimedia.org/T349619) [10:49:11] (03CR) 10Muehlenhoff: [C: 03+2] Switch cluster::management to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/981294 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:50:37] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/output/980817/847/build2001.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/980817 (https://phabricator.wikimedia.org/T320555) (owner: 10Filippo Giunchedi) [10:50:56] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance [10:51:10] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance [10:51:39] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2118.codfw.wmnet with reason: Maintenance [10:51:53] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2118.codfw.wmnet with reason: Maintenance [10:52:06] (03CR) 10Giuseppe Lavagetto: [C: 03+1] docker_pkg: install convenience symlink [puppet] - 10https://gerrit.wikimedia.org/r/980817 (https://phabricator.wikimedia.org/T320555) (owner: 10Filippo Giunchedi) [10:52:40] (03CR) 10Filippo Giunchedi: [C: 03+2] docker_pkg: install convenience symlink [puppet] - 10https://gerrit.wikimedia.org/r/980817 (https://phabricator.wikimedia.org/T320555) (owner: 10Filippo Giunchedi) [10:53:16] !log 
brouberol@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-masters (exit_code=0) restart masters for Hadoop test cluster: Restart of jvm daemons. [10:54:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: cluster::management [10:58:11] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance [10:58:16] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance [11:00:04] mvolz: OwO what's this, a deployment window?? Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231207T1100). nyaa~ [11:00:04] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231207T1100) [11:00:25] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:01:03] !log brouberol@cumin1001 START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop test cluster: Restart of jvm daemons. [11:03:15] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (FY2023/2024-Q1-Q2): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10dcaro) [11:03:55] (03PS1) 10Muehlenhoff: Revert "Switch cluster::management to Puppet 7" [puppet] - 10https://gerrit.wikimedia.org/r/981296 [11:05:01] (03CR) 10Muehlenhoff: [C: 03+2] Revert "Switch cluster::management to Puppet 7" [puppet] - 10https://gerrit.wikimedia.org/r/981296 (owner: 10Muehlenhoff) [11:09:26] (03CR) 10Cathal Mooney: [C: 03+1] "Thanks, and LGTM. 
However I'm unsure exactly when we should merge, does this just affect initial config or will it cause current systems " [deployment-charts] - 10https://gerrit.wikimedia.org/r/980925 (https://phabricator.wikimedia.org/T352893) (owner: 10Ayounsi) [11:10:47] !log brouberol@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-masters (exit_code=0) restart masters for Hadoop test cluster: Restart of jvm daemons. [11:10:49] !log aikochou@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [11:12:26] (03CR) 10Clément Goubert: [C: 04-1] Provide python3-bookworm image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/980860 (https://phabricator.wikimedia.org/T352733) (owner: 10KartikMistry) [11:12:56] !log klausman@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: apply [11:13:26] !log klausman@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [11:13:55] !log klausman@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [11:14:09] (03PS5) 10EoghanGaffney: [admin] Add ecarg shell account [puppet] - 10https://gerrit.wikimedia.org/r/980060 (https://phabricator.wikimedia.org/T350918) [11:14:14] !log klausman@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [11:14:26] !log aikochou@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[11:14:51] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 3 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (10Marostegui) [11:14:59] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 3 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (10Marostegui) p:05Triage→03Medium [11:17:05] !log klausman@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [11:17:27] (03CR) 10Effie Mouzeli: [C: 03+1] [admin] Add ecarg shell account [puppet] - 10https://gerrit.wikimedia.org/r/980060 (https://phabricator.wikimedia.org/T350918) (owner: 10EoghanGaffney) [11:17:27] !log klausman@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [11:18:17] (03CR) 10EoghanGaffney: [C: 03+2] [admin] Add ecarg shell account [puppet] - 10https://gerrit.wikimedia.org/r/980060 (https://phabricator.wikimedia.org/T350918) (owner: 10EoghanGaffney) [11:19:58] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [11:20:10] (ProbeDown) firing: (2) Service miscweb1003:443 has failed probes (http_commons_query_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:20:46] (03CR) 10Effie Mouzeli: mcrouter: add chart (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [11:21:42] (03PS28) 10Effie Mouzeli: mcrouter: add chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) [11:21:48] (03CR) 10Effie Mouzeli: mcrouter: add chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie 
Mouzeli) [11:22:13] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [11:25:16] (03PS1) 10Filippo Giunchedi: swift: write to local files and ban before centrallog [puppet] - 10https://gerrit.wikimedia.org/r/981298 (https://phabricator.wikimedia.org/T351710) [11:25:24] (03CR) 10Volans: [C: 03+1] "LGTM" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/981293 (owner: 10Muehlenhoff) [11:30:24] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance [11:30:28] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance [11:30:30] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2118.codfw.wmnet with reason: Maintenance [11:30:34] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2118.codfw.wmnet with reason: Maintenance [11:33:23] !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [11:33:53] !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [11:34:03] (03CR) 10Btullis: [C: 03+1] Enable ingress for the spark-history server services via the dse ingress gw [dns] - 10https://gerrit.wikimedia.org/r/979892 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol) [11:34:24] 10SRE, 10SRE-Access-Requests: Requesting access to WMF LDAP group and deployment and analytics-privatedata-users shell access group for Grace (ecarg) - https://phabricator.wikimedia.org/T350918 (10eoghan) [11:35:39] 10SRE, 10SRE-Access-Requests: Requesting access to WMF LDAP group and deployment and analytics-privatedata-users shell access group for Grace (ecarg) - https://phabricator.wikimedia.org/T350918 (10eoghan) 05Open→03Resolved This is done, please reach 
out if there's any issues! [11:40:59] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 3 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (10MoritzMuehlenhoff) cumin1001 has been reverted to Puppet 5, but cumin2002 is on Puppet 7 and can be used to reproduce. [11:45:39] (03CR) 10Muehlenhoff: [C: 03+2] Remove now obsolete cergen Ganeti certs [puppet] - 10https://gerrit.wikimedia.org/r/981285 (https://phabricator.wikimedia.org/T350686) (owner: 10Muehlenhoff) [11:48:20] !log btullis@deploy2002 Started deploy [analytics/refinery@b6499b1] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@b6499b17] [11:48:28] (03CR) 10Volans: "replies inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/979040 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [11:50:34] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/977227 (owner: 10PipelineBot) [11:50:40] (03Abandoned) 10Jgiannelos: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/976958 (owner: 10PipelineBot) [11:50:49] (03PS1) 10Slyngshede: Package Debmonitor server as .deb [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/981300 [11:51:28] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 3 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (10Marostegui) db1124 can be used for testing. It is a test host running puppet 7. 
It can be restarted, rebooted, reimaged, whatever is needed [11:51:37] !log btullis@deploy2002 Finished deploy [analytics/refinery@b6499b1] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@b6499b17] (duration: 03m 17s) [11:57:39] (03PS1) 10Muehlenhoff: Remove obsolete dummy certs [labs/private] - 10https://gerrit.wikimedia.org/r/981301 (https://phabricator.wikimedia.org/T350686) [12:01:35] (03PS1) 10Muehlenhoff: Package Debmonitor server as .deb [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/981300 (owner: 10Slyngshede) [12:04:33] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 3 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (10Marostegui) Just took a quick look: ` # db-mysql db1133 ERROR 2026 (HY000): SSL connection error: self signed certificate in certificate chain `... [12:08:26] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 3 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (10Marostegui) This has more implications, as orchestrator cannot see these hosts (db1124, db1133) (with the changed cert). So this really needs lo... [12:09:00] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 3 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (10Marostegui) ` 15 dborch1001 orchestrator[425]: 2023-12-07 12:07:15 ERROR ReadTopologyInstance(db1124.eqiad.wmnet:3306) show global status like '... 
[12:12:15] (03PS1) 10Jgiannelos: Revert "mobileapps: pipeline bot promote" [deployment-charts] - 10https://gerrit.wikimedia.org/r/981188 [12:12:25] (03CR) 10KartikMistry: Provide python3-bookworm image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/980860 (https://phabricator.wikimedia.org/T352733) (owner: 10KartikMistry) [12:13:28] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cloudcephosd1001.eqiad.wmnet [12:14:20] (03CR) 10Jgiannelos: [C: 03+2] Revert "mobileapps: pipeline bot promote" [deployment-charts] - 10https://gerrit.wikimedia.org/r/981188 (owner: 10Jgiannelos) [12:15:11] (03Merged) 10jenkins-bot: Revert "mobileapps: pipeline bot promote" [deployment-charts] - 10https://gerrit.wikimedia.org/r/981188 (owner: 10Jgiannelos) [12:16:17] (03PS1) 10Muehlenhoff: Switch cloudcephosd1001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/981304 (https://phabricator.wikimedia.org/T349619) [12:16:49] !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [12:17:15] !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [12:17:23] !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [12:17:56] !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [12:18:06] !log kamila@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [12:18:44] !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [12:23:24] (03CR) 10Muehlenhoff: [C: 03+2] Switch cloudcephosd1001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/981304 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:27:36] (03PS1) 10Alexandros Kosiaris: service_proxy/mesh: Bump to newer version globally [puppet] - 10https://gerrit.wikimedia.org/r/981309 (https://phabricator.wikimedia.org/T352906) [12:30:03] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM! 
Thanks for the good explanation I didn't grok the reason for it looking at the change alone." [puppet] - 10https://gerrit.wikimedia.org/r/981288 (https://phabricator.wikimedia.org/T348734) (owner: 10Muehlenhoff) [12:36:00] (03CR) 10CI reject: [V: 04-1] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/981312 (owner: 10L10n-bot) [12:38:06] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/980838 [12:38:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cloudcephosd1001.eqiad.wmnet [12:45:17] (03PS2) 10EoghanGaffney: [admin] Add ehughes ldap account [puppet] - 10https://gerrit.wikimedia.org/r/980358 (https://phabricator.wikimedia.org/T351387) [12:46:09] (03CR) 10CI reject: [V: 04-1] [admin] Add ehughes ldap account [puppet] - 10https://gerrit.wikimedia.org/r/980358 (https://phabricator.wikimedia.org/T351387) (owner: 10EoghanGaffney) [12:46:23] (03CR) 10JMeybohm: "We should test with one service first to be sure nothing breaks" [puppet] - 10https://gerrit.wikimedia.org/r/981309 (https://phabricator.wikimedia.org/T352906) (owner: 10Alexandros Kosiaris) [12:46:50] (03PS3) 10EoghanGaffney: [admin] Add ehughes ldap account [puppet] - 10https://gerrit.wikimedia.org/r/980358 (https://phabricator.wikimedia.org/T351387) [12:47:14] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [12:47:17] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:48:44] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [12:48:57] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:49:33] (03PS4) 10EoghanGaffney: [admin] Add ehughes shell account with no ssh key [puppet] - 10https://gerrit.wikimedia.org/r/980358 
(https://phabricator.wikimedia.org/T351387) [12:52:23] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [12:52:26] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [12:55:39] (03PS1) 10JMeybohm: ml-staging: Enable certmanager for mesh certs by default [puppet] - 10https://gerrit.wikimedia.org/r/981325 (https://phabricator.wikimedia.org/T300033) [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231207T1300) [13:04:30] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Remove obsolete dummy certs [labs/private] - 10https://gerrit.wikimedia.org/r/981301 (https://phabricator.wikimedia.org/T350686) (owner: 10Muehlenhoff) [13:07:08] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [13:07:36] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [13:08:41] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [13:09:36] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [13:09:44] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [13:09:46] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [13:09:51] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [13:10:21] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [13:13:06] (03CR) 10Volans: [C: 03+1] "Code LGTM! 
Time to write the tests now ;)" [software/spicerack] - 10https://gerrit.wikimedia.org/r/979121 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [13:14:43] (03PS1) 10Jgiannelos: mobileapps: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/981326 [13:16:41] (03CR) 10Jgiannelos: [C: 03+2] mobileapps: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/981326 (owner: 10Jgiannelos) [13:17:35] (03Merged) 10jenkins-bot: mobileapps: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/981326 (owner: 10Jgiannelos) [13:18:47] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [13:18:57] jouncebot: nowandnext [13:18:58] For the next 0 hour(s) and 41 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231207T1300) [13:18:58] In 0 hour(s) and 41 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231207T1400) [13:19:03] awesome [13:19:29] (03CR) 10Ladsgroup: [C: 03+2] api: Only force backlink namespace index when there is one ns only [core] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/980483 (https://phabricator.wikimedia.org/T351237) (owner: 10Jforrester) [13:19:52] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [13:21:58] (03CR) 10Elukey: [C: 03+2] ml-staging: Enable certmanager for mesh certs by default [puppet] - 10https://gerrit.wikimedia.org/r/981325 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [13:24:42] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [13:24:46] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [13:24:51] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: sync [13:25:06] !log dcausse@deploy2002 helmfile [staging] 
START helmfile.d/services/cirrus-streaming-updater: apply [13:25:09] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:25:10] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: sync [13:27:30] (03PS3) 10Ladsgroup: [WIP] Add compare tables periodic job [puppet] - 10https://gerrit.wikimedia.org/r/979390 (https://phabricator.wikimedia.org/T207253) [13:27:31] !log elukey@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [13:27:51] !log elukey@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [13:29:58] (03CR) 10CI reject: [V: 04-1] [WIP] Add compare tables periodic job [puppet] - 10https://gerrit.wikimedia.org/r/979390 (https://phabricator.wikimedia.org/T207253) (owner: 10Ladsgroup) [13:31:40] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [13:31:52] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:32:10] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [13:32:13] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:33:04] (03CR) 10Muehlenhoff: [C: 03+1] "Let me know if you want to me to puppet-merge this when you feel it's ready." 
[puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [13:33:42] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [13:34:07] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:34:18] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [13:34:39] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:36:58] (03Merged) 10jenkins-bot: api: Only force backlink namespace index when there is one ns only [core] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/980483 (https://phabricator.wikimedia.org/T351237) (owner: 10Jforrester) [13:37:24] (03PS1) 10Ladsgroup: Drop python2 from tox [puppet] - 10https://gerrit.wikimedia.org/r/981329 [13:38:50] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:980483|api: Only force backlink namespace index when there is one ns only (T351237)]] [13:38:51] (03CR) 10Muehlenhoff: [C: 03+2] Fix handling of Ferm's 00_defs_requestctl when changing firewall provider [puppet] - 10https://gerrit.wikimedia.org/r/981288 (https://phabricator.wikimedia.org/T348734) (owner: 10Muehlenhoff) [13:38:54] T351237: Set beta and production to read new for pagelinks migration - https://phabricator.wikimedia.org/T351237 [13:40:25] !log ladsgroup@deploy2002 jforrester and ladsgroup: Backport for [[gerrit:980483|api: Only force backlink namespace index when there is one ns only (T351237)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:42:22] (03PS1) 10Jgiannelos: mobileapps: Use parsoid via restbase on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/981330 [13:42:56] !log ladsgroup@deploy2002 jforrester and ladsgroup: Continuing with sync [13:46:10] (03PS4) 10Ladsgroup: [WIP] Add compare tables periodic job [puppet] - 
10https://gerrit.wikimedia.org/r/979390 (https://phabricator.wikimedia.org/T207253) [13:48:42] (03CR) 10CI reject: [V: 04-1] [WIP] Add compare tables periodic job [puppet] - 10https://gerrit.wikimedia.org/r/979390 (https://phabricator.wikimedia.org/T207253) (owner: 10Ladsgroup) [13:48:51] (03CR) 10Jgiannelos: "This only affects staging. Production uses restbase either way." [deployment-charts] - 10https://gerrit.wikimedia.org/r/981330 (owner: 10Jgiannelos) [13:49:46] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:980483|api: Only force backlink namespace index when there is one ns only (T351237)]] (duration: 10m 55s) [13:49:49] T351237: Set beta and production to read new for pagelinks migration - https://phabricator.wikimedia.org/T351237 [13:51:04] (03PS1) 10Alexandros Kosiaris: citoid: Set service_mesh version to 1.23.10-2-s4-20231203 [deployment-charts] - 10https://gerrit.wikimedia.org/r/981331 (https://phabricator.wikimedia.org/T352906) [13:51:51] (03PS1) 10JMeybohm: Add new istio module version [deployment-charts] - 10https://gerrit.wikimedia.org/r/981332 (https://phabricator.wikimedia.org/T300033) [13:51:53] (03Abandoned) 10Ladsgroup: Drop python2 from tox [puppet] - 10https://gerrit.wikimedia.org/r/981329 (owner: 10Ladsgroup) [13:51:55] (03PS1) 10JMeybohm: ingress.istio: Remove trust for every SAN but the default [deployment-charts] - 10https://gerrit.wikimedia.org/r/981333 (https://phabricator.wikimedia.org/T300033) [13:52:08] TheresNoTime, sorry I wasn't around for the deployment yesterday. 
I've rescheduled for today [13:52:12] (03PS2) 10Alexandros Kosiaris: service_proxy/mesh: Bump to newer version globally [puppet] - 10https://gerrit.wikimedia.org/r/981309 (https://phabricator.wikimedia.org/T352906) [13:52:24] (03CR) 10Alexandros Kosiaris: "Good point, doing so in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/981331" [puppet] - 10https://gerrit.wikimedia.org/r/981309 (https://phabricator.wikimedia.org/T352906) (owner: 10Alexandros Kosiaris) [13:52:30] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance [13:52:56] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance [13:54:59] (03PS2) 10JMeybohm: ingress.istio: Remove trust for every SAN but the default [deployment-charts] - 10https://gerrit.wikimedia.org/r/981333 (https://phabricator.wikimedia.org/T300033) [13:55:31] (03Abandoned) 10Majavah: openstack: codfw1dev: designate: listen-on only the new address [puppet] - 10https://gerrit.wikimedia.org/r/929740 (https://phabricator.wikimedia.org/T338938) (owner: 10Arturo Borrero Gonzalez) [13:58:13] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2105.codfw.wmnet with reason: Maintenance [13:58:27] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2105.codfw.wmnet with reason: Maintenance [13:58:49] PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: upload_puppet_facts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231207T1400). [14:00:04] No Gerrit patches in the queue for this window AFAICS. 
[14:04:53] (03CR) 10Hnowlan: [C: 03+1] citoid: Set service_mesh version to 1.23.10-2-s4-20231203 [deployment-charts] - 10https://gerrit.wikimedia.org/r/981331 (https://phabricator.wikimedia.org/T352906) (owner: 10Alexandros Kosiaris) [14:15:17] (03PS3) 10JMeybohm: ingress.istio: Remove trust for every SAN but the default [deployment-charts] - 10https://gerrit.wikimedia.org/r/981333 (https://phabricator.wikimedia.org/T300033) [14:15:19] (03PS1) 10JMeybohm: function-orchestrator: Update to ingress.istio:1.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/981336 (https://phabricator.wikimedia.org/T300033) [14:16:23] (03CR) 10Muehlenhoff: [C: 03+2] nftables requestctl: Some tweaks and fixes [puppet] - 10https://gerrit.wikimedia.org/r/981280 (https://phabricator.wikimedia.org/T348734) (owner: 10Muehlenhoff) [14:19:31] (03PS1) 10Ilias Sarantopoulos: [beta] ores-extension: enable revertrisk model for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981337 (https://phabricator.wikimedia.org/T348298) [14:24:03] (03CR) 10Alexandros Kosiaris: [C: 03+1] mobileapps: Use parsoid via restbase on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/981330 (owner: 10Jgiannelos) [14:24:34] (03CR) 10Jgiannelos: [C: 03+2] mobileapps: Use parsoid via restbase on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/981330 (owner: 10Jgiannelos) [14:25:28] (03Merged) 10jenkins-bot: mobileapps: Use parsoid via restbase on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/981330 (owner: 10Jgiannelos) [14:26:07] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [14:26:36] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [14:26:59] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [14:27:34] (03PS2) 10Giuseppe Lavagetto: wmf-debci: also install recommended dependencies [docker-images/production-images] - 
10https://gerrit.wikimedia.org/r/981178 [14:27:38] (03PS2) 10Filippo Giunchedi: swift: write to local files and ban before centrallog [puppet] - 10https://gerrit.wikimedia.org/r/981298 (https://phabricator.wikimedia.org/T352968) [14:27:40] (03CR) 10Giuseppe Lavagetto: wmf-debci: also install recommended dependencies (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/981178 (owner: 10Giuseppe Lavagetto) [14:29:32] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [14:30:15] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [14:31:32] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [14:32:40] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [14:34:12] (03PS1) 10Muehlenhoff: Fix syntax for drop rule [puppet] - 10https://gerrit.wikimedia.org/r/981339 [14:36:00] (03CR) 10Filippo Giunchedi: "This functionality is provided out of the (black)box by prometheus::blackbox::check::http, see 'team' parameter" [alerts] - 10https://gerrit.wikimedia.org/r/981289 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [14:38:25] (03CR) 10Muehlenhoff: [C: 03+2] Fix syntax for drop rule [puppet] - 10https://gerrit.wikimedia.org/r/981339 (owner: 10Muehlenhoff) [14:39:06] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:38] (03PS3) 10EoghanGaffney: [apt-staging] Add script to pull artifacts from gitlab [puppet] - 10https://gerrit.wikimedia.org/r/979912 [14:39:40] (03CR) 10EoghanGaffney: [apt-staging] Add script to pull artifacts from gitlab (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/979912 (owner: 10EoghanGaffney) [14:41:04] !log pt1979@cumin2002 
START - Cookbook sre.hosts.dhcp for host cp4037.ulsfo.wmnet [14:42:11] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Our current scaffolding system allows you to only select components you need. This patch includes the service mesh Service that is definit" [deployment-charts] - 10https://gerrit.wikimedia.org/r/979107 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [14:42:28] (03PS2) 10Alexandros Kosiaris: citoid: Set service_mesh version to 1.23.10-2-s4-20231203 [deployment-charts] - 10https://gerrit.wikimedia.org/r/981331 (https://phabricator.wikimedia.org/T352906) [14:42:30] (03PS1) 10Alexandros Kosiaris: mesh: Ship new configuration templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/981340 (https://phabricator.wikimedia.org/T352906) [14:42:32] (03PS1) 10Alexandros Kosiaris: mesh: Use ca-certificates instead of wmf-ca-certificates [deployment-charts] - 10https://gerrit.wikimedia.org/r/981341 (https://phabricator.wikimedia.org/T352906) [14:43:55] (03PS4) 10Ayounsi: Get the server's BGP peer info from Netbox [homer/public] - 10https://gerrit.wikimedia.org/r/979381 (https://phabricator.wikimedia.org/T306649) [14:46:06] 10SRE-swift-storage, 10ops-codfw, 10ops-eqiad, 10DC-Ops: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 (10Papaul) @Volans after i enter the mgmt password the only line i get it ` Set Boot Device to force_pxe ` [14:48:55] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host sessionstore2004.codfw.wmnet with OS bullseye [14:48:55] (03PS1) 10Klausman: API GW: Add ingress endpoints on Lift Wing to allowed destinations [deployment-charts] - 10https://gerrit.wikimedia.org/r/981342 (https://phabricator.wikimedia.org/T347263) [14:48:56] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host sessionstore2005.codfw.wmnet with OS bullseye [14:48:58] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host sessionstore2006.codfw.wmnet with OS 
bullseye [14:49:02] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host sessionstore2004.codfw.wmnet with OS bullseye [14:49:06] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host sessionstore2005.codfw.wmnet with OS bullseye [14:49:12] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host sessionstore2006.codfw.wmnet with OS bullseye [14:49:16] (03PS6) 10Ayounsi: Expose Netbox's BGP servers to Homer [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/976749 (https://phabricator.wikimedia.org/T306649) [14:50:36] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cephosd2002.codfw.wmnet with OS bullseye [14:50:42] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cephosd2002.codfw.wmnet with OS bullseye [14:50:47] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cephosd2002.codfw.wmnet with OS bullseye [14:50:48] (03CR) 10Kosta Harlan: [C: 03+1] [beta] ores-extension: enable revertrisk model for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981337 (https://phabricator.wikimedia.org/T348298) (owner: 10Ilias Sarantopoulos) [14:50:51] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - 
https://phabricator.wikimedia.org/T349934 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cephosd2002.codfw.wmnet with OS bullseye executed with errors: - ce... [14:51:07] RECOVERY - Check systemd state on mw1350 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:51:11] PROBLEM - Disk space on titan2001 is CRITICAL: DISK CRITICAL - free space: /srv 27308MiB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=titan2001&var-datasource=codfw+prometheus/ops [14:51:26] (03CR) 10Hnowlan: [C: 03+1] API GW: Add ingress endpoints on Lift Wing to allowed destinations [deployment-charts] - 10https://gerrit.wikimedia.org/r/981342 (https://phabricator.wikimedia.org/T347263) (owner: 10Klausman) [14:52:03] (03CR) 10Klausman: [C: 03+2] API GW: Add ingress endpoints on Lift Wing to allowed destinations [deployment-charts] - 10https://gerrit.wikimedia.org/r/981342 (https://phabricator.wikimedia.org/T347263) (owner: 10Klausman) [14:52:55] (03Merged) 10jenkins-bot: API GW: Add ingress endpoints on Lift Wing to allowed destinations [deployment-charts] - 10https://gerrit.wikimedia.org/r/981342 (https://phabricator.wikimedia.org/T347263) (owner: 10Klausman) [14:53:02] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cephosd2002.codfw.wmnet with OS bullseye [14:53:08] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cephosd2002.codfw.wmnet with OS bullseye [14:53:14] !log klausman@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [14:53:16] !log klausman@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [14:53:30] !log 
klausman@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [14:53:45] !log klausman@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [14:54:06] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:55:16] (03CR) 10Herron: [C: 03+1] swift: write to local files and ban before centrallog [puppet] - 10https://gerrit.wikimedia.org/r/981298 (https://phabricator.wikimedia.org/T352968) (owner: 10Filippo Giunchedi) [14:55:53] (03PS1) 10Klausman: APIGW: add missing /32 to egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/981344 (https://phabricator.wikimedia.org/T347263) [14:57:31] (03CR) 10Alexandros Kosiaris: [C: 03+2] citoid: Set service_mesh version to 1.23.10-2-s4-20231203 [deployment-charts] - 10https://gerrit.wikimedia.org/r/981331 (https://phabricator.wikimedia.org/T352906) (owner: 10Alexandros Kosiaris) [14:58:19] (03Merged) 10jenkins-bot: citoid: Set service_mesh version to 1.23.10-2-s4-20231203 [deployment-charts] - 10https://gerrit.wikimedia.org/r/981331 (https://phabricator.wikimedia.org/T352906) (owner: 10Alexandros Kosiaris) [15:00:25] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:00:58] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/citoid: apply [15:01:14] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/citoid: apply [15:01:40] (03CR) 10Elukey: API GW: Add ingress endpoints on Lift Wing to allowed destinations (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/981342 (https://phabricator.wikimedia.org/T347263) (owner: 
10Klausman) [15:01:49] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/citoid: apply [15:02:14] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/citoid: apply [15:02:49] (03CR) 10Hnowlan: [C: 03+1] APIGW: add missing /32 to egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/981344 (https://phabricator.wikimedia.org/T347263) (owner: 10Klausman) [15:03:07] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (10Jhancock.wm) [15:03:55] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/citoid: apply [15:03:56] (03PS2) 10Klausman: APIGW: add missing /32 to egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/981344 (https://phabricator.wikimedia.org/T347263) [15:04:16] (03PS3) 10Klausman: APIGW: add missing /32 to egress rules and fix port [deployment-charts] - 10https://gerrit.wikimedia.org/r/981344 (https://phabricator.wikimedia.org/T347263) [15:04:24] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [15:04:43] (03CR) 10Elukey: [C: 03+1] APIGW: add missing /32 to egress rules and fix port [deployment-charts] - 10https://gerrit.wikimedia.org/r/981344 (https://phabricator.wikimedia.org/T347263) (owner: 10Klausman) [15:05:05] (03CR) 10Klausman: [C: 03+2] APIGW: add missing /32 to egress rules and fix port [deployment-charts] - 10https://gerrit.wikimedia.org/r/981344 (https://phabricator.wikimedia.org/T347263) (owner: 10Klausman) [15:05:56] (03Merged) 10jenkins-bot: APIGW: add missing /32 to egress rules and fix port [deployment-charts] - 10https://gerrit.wikimedia.org/r/981344 (https://phabricator.wikimedia.org/T347263) (owner: 10Klausman) [15:06:26] !log klausman@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [15:06:40] !log klausman@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [15:07:25] (03CR) 10Ayounsi: 
BGPPeers: add codfw racks A1 to B8 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/980925 (https://phabricator.wikimedia.org/T352893) (owner: 10Ayounsi) [15:07:26] !log klausman@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: apply [15:07:29] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2104.codfw.wmnet with reason: Maintenance [15:07:33] !log klausman@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [15:07:44] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2104.codfw.wmnet with reason: Maintenance [15:07:50] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2104 (T348183)', diff saved to https://phabricator.wikimedia.org/P54270 and previous config saved to /var/cache/conftool/dbconfig/20231207-150750-arnaudb.json [15:07:58] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [15:08:01] !log klausman@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [15:08:11] !log klausman@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [15:09:09] (03PS1) 10Ayounsi: Move git search related classes to __init__ [cookbooks] - 10https://gerrit.wikimedia.org/r/981349 (https://phabricator.wikimedia.org/T350152) [15:10:00] (03CR) 10Klausman: [C: 03+2] API GW: Add ingress endpoints on Lift Wing to allowed destinations (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/981342 (https://phabricator.wikimedia.org/T347263) (owner: 10Klausman) [15:10:44] (03PS2) 10Ayounsi: Move git search related classes to __init__ [cookbooks] - 10https://gerrit.wikimedia.org/r/981349 (https://phabricator.wikimedia.org/T350152) [15:11:37] RECOVERY - Disk space on titan2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/d/000000377/host-overview?var-server=titan2001&var-datasource=codfw+prometheus/ops [15:11:53] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T348183)', diff saved to https://phabricator.wikimedia.org/P54271 and previous config saved to /var/cache/conftool/dbconfig/20231207-151152-arnaudb.json [15:13:20] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/980839 [15:14:59] (03CR) 10CI reject: [V: 04-1] Move git search related classes to __init__ [cookbooks] - 10https://gerrit.wikimedia.org/r/981349 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [15:17:25] (03PS3) 10Ayounsi: Move git search related classes to __init__ [cookbooks] - 10https://gerrit.wikimedia.org/r/981349 (https://phabricator.wikimedia.org/T350152) [15:18:00] (03CR) 10Effie Mouzeli: [C: 03+1] tegola: Bump to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/980885 (owner: 10Jgiannelos) [15:18:49] (03PS5) 10Ladsgroup: [WIP] Add compare tables periodic job [puppet] - 10https://gerrit.wikimedia.org/r/979390 (https://phabricator.wikimedia.org/T207253) [15:19:01] 10SRE-swift-storage, 10Traffic: OpenSSL 3.x performance issues - https://phabricator.wikimedia.org/T352744 (10Ottomata) Recent flink-app based deployments should use envoy. Not sure about the older rdf-streaming-updater, but there are plans to move that to flink-app chart. 
[15:19:09] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host cp4037.ulsfo.wmnet [15:20:10] (ProbeDown) firing: (2) Service miscweb1003:443 has failed probes (http_commons_query_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:21:21] (03CR) 10CI reject: [V: 04-1] [WIP] Add compare tables periodic job [puppet] - 10https://gerrit.wikimedia.org/r/979390 (https://phabricator.wikimedia.org/T207253) (owner: 10Ladsgroup) [15:21:49] arnaudb: I just wanted to say thank you for all the work on the img_size schema change :) [15:22:16] (03CR) 10Jgiannelos: [C: 03+2] tegola: Bump to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/980885 (owner: 10Jgiannelos) [15:23:22] (03Merged) 10jenkins-bot: tegola: Bump to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/980885 (owner: 10Jgiannelos) [15:23:26] thanks bawolff ! 
<3 [15:24:16] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: apply [15:26:12] (03PS3) 10Cathal Mooney: Move lvs2011 from private1-a-codfw to private1-a2-codfw vlan [puppet] - 10https://gerrit.wikimedia.org/r/980954 (https://phabricator.wikimedia.org/T352920) [15:26:59] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P54272 and previous config saved to /var/cache/conftool/dbconfig/20231207-152659-arnaudb.json [15:27:04] (03CR) 10CI reject: [V: 04-1] Move lvs2011 from private1-a-codfw to private1-a2-codfw vlan [puppet] - 10https://gerrit.wikimedia.org/r/980954 (https://phabricator.wikimedia.org/T352920) (owner: 10Cathal Mooney) [15:27:20] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: apply [15:27:44] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: apply [15:27:49] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: apply [15:28:47] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: apply [15:28:54] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: apply [15:29:28] (03PS6) 10Ladsgroup: [WIP] Add compare tables periodic job [puppet] - 10https://gerrit.wikimedia.org/r/979390 (https://phabricator.wikimedia.org/T207253) [15:29:31] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: apply [15:31:31] (03PS4) 10Cathal Mooney: Move lvs2011 from private1-a-codfw to private1-a2-codfw vlan [puppet] - 10https://gerrit.wikimedia.org/r/980954 (https://phabricator.wikimedia.org/T352920) [15:32:21] (03CR) 10CI reject: [V: 04-1] Move lvs2011 from private1-a-codfw to private1-a2-codfw vlan [puppet] - 10https://gerrit.wikimedia.org/r/980954 (https://phabricator.wikimedia.org/T352920) (owner: 10Cathal Mooney) 
[15:34:56] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/979390 (https://phabricator.wikimedia.org/T207253) (owner: 10Ladsgroup) [15:35:05] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/980499 (https://phabricator.wikimedia.org/T347355) (owner: 10Ryan Kemper) [15:36:39] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sessionstore2005.codfw.wmnet with OS bullseye [15:36:45] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host sessionstore2005.codfw.wmnet with OS bullseye executed with errors: -... [15:36:50] (03PS5) 10Cathal Mooney: Move lvs2011 from private1-a-codfw to private1-a2-codfw vlan [puppet] - 10https://gerrit.wikimedia.org/r/980954 (https://phabricator.wikimedia.org/T352920) [15:37:02] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sessionstore2006.codfw.wmnet with OS bullseye [15:37:10] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host sessionstore2006.codfw.wmnet with OS bullseye executed with errors: -... 
[15:37:43] (03CR) 10Ottomata: [C: 03+1] profile::cache::kafka::webrequest: allow to customize the format [puppet] - 10https://gerrit.wikimedia.org/r/980911 (https://phabricator.wikimedia.org/T346463) (owner: 10Elukey) [15:38:51] 10SRE, 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10phaultfinder) [15:39:57] (03PS7) 10Ladsgroup: Add compare tables periodic job [puppet] - 10https://gerrit.wikimedia.org/r/979390 (https://phabricator.wikimedia.org/T207253) [15:42:06] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P54273 and previous config saved to /var/cache/conftool/dbconfig/20231207-154205-arnaudb.json [15:42:17] (03PS4) 10Cathal Mooney: Move lvs2012 from private1-b-codfw to private1-b2-codfw vlan [puppet] - 10https://gerrit.wikimedia.org/r/980948 (https://phabricator.wikimedia.org/T352918) [15:42:58] (03PS1) 10Ottomata: webrequest varnishkafka - Add to X-Analytics the Sec-Purpose HTTP header [puppet] - 10https://gerrit.wikimedia.org/r/981352 (https://phabricator.wikimedia.org/T346463) [15:43:50] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/979390 (https://phabricator.wikimedia.org/T207253) (owner: 10Ladsgroup) [15:44:03] !log klausman@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [15:44:21] !log klausman@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [15:44:58] !log klausman@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [15:45:15] !log klausman@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [15:45:30] !log milimetric@deploy2002 Started deploy [analytics/refinery@8b8f178]: hotfix: sqoop [15:48:01] !log clear out dns6001 resolv.conf to check for SSH config-based authdns-update [15:48:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:06] (03PS8) 10Ladsgroup: Add 
compare tables periodic job [puppet] - 10https://gerrit.wikimedia.org/r/979390 (https://phabricator.wikimedia.org/T207253) [15:50:17] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/979390 (https://phabricator.wikimedia.org/T207253) (owner: 10Ladsgroup) [15:50:26] 10SRE, 10Cloud-VPS, 10observability, 10Patch-For-Review, and 2 others: ossl rsyslog errors post-migration - https://phabricator.wikimedia.org/T351710 (10fgiunchedi) >>! In T351710#9385748, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://sal.toolforge.org/log/I... [15:50:32] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 3 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (10ABran-WMF) a:03ABran-WMF [15:50:50] (03CR) 10Jelto: [C: 03+1] "lgtm now" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/981178 (owner: 10Giuseppe Lavagetto) [15:51:19] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 3 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (10ABran-WMF) 05Open→03In progress [15:51:29] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10ABran-WMF) [15:52:30] (03CR) 10Alexandros Kosiaris: service_proxy/mesh: Bump to newer version globally (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/981309 (https://phabricator.wikimedia.org/T352906) (owner: 10Alexandros Kosiaris) [15:53:15] PROBLEM - AuthDNS-over-TLS Works on dns6001 is CRITICAL: CRITICAL: ns[012] kdig DoTLS check failure https://wikitech.wikimedia.org/wiki/DNS [15:53:23] (03CR) 10Clément Goubert: [C: 04-1] Provide python3-bookworm image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/980860 (https://phabricator.wikimedia.org/T352733) (owner: 10KartikMistry) [15:53:24] ^ expected [15:53:32] 
!log running authdns-update with broken resolv.conf on dns6001 [15:53:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:58] (03CR) 10Ottomata: "Alternative:" [puppet] - 10https://gerrit.wikimedia.org/r/980912 (https://phabricator.wikimedia.org/T346463) (owner: 10Elukey) [15:55:38] !log milimetric@deploy2002 Finished deploy [analytics/refinery@8b8f178]: hotfix: sqoop (duration: 10m 08s) [15:55:38] 10SRE, 10Observability-Alerting: Icinga check for ipv6 host reachability - https://phabricator.wikimedia.org/T163996 (10fgiunchedi) Note for later and reworked for an alertmanager/prometheus world: we should extend `netops::prometheus::hosts` to also probe for ipv6, this way we'll have smoke probes also testin... [15:55:54] 10SRE, 10Observability-Alerting: Probe for ipv6 host reachability - https://phabricator.wikimedia.org/T163996 (10fgiunchedi) [15:56:07] RECOVERY - AuthDNS-over-TLS Works on dns6001 is OK: OK: ns[012] kdig DoTLS check success https://wikitech.wikimedia.org/wiki/DNS [15:57:13] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T348183)', diff saved to https://phabricator.wikimedia.org/P54274 and previous config saved to /var/cache/conftool/dbconfig/20231207-155712-arnaudb.json [15:57:17] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [15:57:38] (03CR) 10Clément Goubert: [C: 03+1] "LGTM, thanks!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/980925 (https://phabricator.wikimedia.org/T352893) (owner: 10Ayounsi) [15:58:10] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10Patch-For-Review: GeoIP mapping experiments - https://phabricator.wikimedia.org/T332024 (10CDanis) [15:59:54] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10dcaro) [16:00:30] !log milimetric@deploy2002 Started deploy [analytics/refinery@8b8f178] (thin): hotfix: sqoop [16:00:37] !log milimetric@deploy2002 Finished deploy [analytics/refinery@8b8f178] (thin): hotfix: sqoop (duration: 00m 07s) [16:01:39] (03PS12) 10Bking: wdqs: monitor ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/980499 (https://phabricator.wikimedia.org/T347355) (owner: 10Ryan Kemper) [16:01:56] (03PS1) 10Majavah: netops: prometheus::hosts: also probe ipv6 if available [puppet] - 10https://gerrit.wikimedia.org/r/981358 (https://phabricator.wikimedia.org/T163996) [16:02:45] !log run dummy authdns-update on dns6001 [16:02:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:57] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/981358 (https://phabricator.wikimedia.org/T163996) (owner: 10Majavah) [16:08:20] (03CR) 10JMeybohm: [C: 03+1] service_proxy/mesh: Bump to newer version globally [puppet] - 10https://gerrit.wikimedia.org/r/981309 (https://phabricator.wikimedia.org/T352906) (owner: 10Alexandros Kosiaris) [16:09:13] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [16:09:27] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [16:16:18] 10SRE, 10Observability-Alerting: Reminders for unhandled/unacked alerts - 
https://phabricator.wikimedia.org/T307958 (10fgiunchedi) This is essentially what https://alerts.wikimedia.org/triage/ displays now, for `hide_alerts_older_than: '1200h'` alerts. The app also offers the user a button to open a task [16:16:29] (03PS10) 10Brouberol: Define the spark-history chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/978629 (https://phabricator.wikimedia.org/T351722) [16:16:55] (03CR) 10CI reject: [V: 04-1] Define the spark-history chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/978629 (https://phabricator.wikimedia.org/T351722) (owner: 10Brouberol) [16:17:23] (03PS11) 10Brouberol: Define the spark-history chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/978629 (https://phabricator.wikimedia.org/T351722) [16:17:32] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to wmf and analytics-privatedata-users for EHughes (superset access with no server access) - https://phabricator.wikimedia.org/T351387 (10jijiki) @ehughes please sign the L3 Acknowledgement of Wikimedia Server Access Responsibilities Documen... [16:18:03] (03CR) 10CI reject: [V: 04-1] Define the spark-history chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/978629 (https://phabricator.wikimedia.org/T351722) (owner: 10Brouberol) [16:20:20] (03PS12) 10Brouberol: Define the spark-history chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/978629 (https://phabricator.wikimedia.org/T351722) [16:20:36] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10User-jbond: Thanos and Grafana lose the session after an hour - https://phabricator.wikimedia.org/T268233 (10fgiunchedi) Untagging o11y here, since we moved thanos to oauth2-proxy I believe this should not apply to thanos anymore (though might still apply t... 
[16:22:33] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refinery-sqoop-whole-mediawiki.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:23:56] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [16:24:00] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [16:24:49] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [16:25:03] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [16:26:23] (03PS11) 10Herron: grafana: add dashboard datasource usage (graphite) exporter [puppet] - 10https://gerrit.wikimedia.org/r/980048 (https://phabricator.wikimedia.org/T350591) [16:26:35] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [16:26:48] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [16:27:20] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [16:27:33] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [16:28:27] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:30:31] (03PS1) 10Hnowlan: api-gateway: set host and ingress for recommendation-api-ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/981359 (https://phabricator.wikimedia.org/T347263) [16:37:04] (03CR) 10Volans: Move git search related classes to __init__ (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/981349 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [16:38:27] !log brouberol@cumin1001 START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop test cluster: Restart of jvm daemons. 
[16:39:06] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cephosd2002.codfw.wmnet with OS bullseye [16:39:11] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cephosd2002.codfw.wmnet with OS bullseye executed with errors: - ce... [16:39:47] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cephosd2002.codfw.wmnet with OS bullseye [16:39:52] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cephosd2002.codfw.wmnet with OS bullseye [16:43:57] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance [16:44:12] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance [16:44:47] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2107.codfw.wmnet with reason: Maintenance [16:45:01] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2107.codfw.wmnet with reason: Maintenance [16:52:41] (03Abandoned) 10Jdlrobson: References previews is no longer a beta feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980901 (https://phabricator.wikimedia.org/T282999) (owner: 10Jdlrobson) [16:52:53] (03CR) 10Jdlrobson: [C: 03+1] Remove BetaFeature code related to ReferencePreviews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976650 (https://phabricator.wikimedia.org/T351708) (owner: 10WMDE-Fisch) [16:54:57] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to wmf and 
analytics-privatedata-users for EHughes (superset access with no server access) - https://phabricator.wikimedia.org/T351387 (10jijiki) a:05eoghan→03ehughes [17:00:04] jhathaway and rzl: Dear deployers, time to do the Puppet request window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231207T1700). [17:00:04] No Gerrit patches in the queue for this window AFAICS. [17:03:36] (03PS1) 10Brouberol: Define an echoserver namespace for the dse-k8s-eqiad cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/981363 (https://phabricator.wikimedia.org/T353004) [17:05:51] !log herron@cumin1001 START - Cookbook sre.dns.netbox [17:08:41] !log herron@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cleanup logstash/kibana records T299700 - herron@cumin1001" [17:08:45] T299700: Remove legacy ELK LVS entries - https://phabricator.wikimedia.org/T299700 [17:09:35] !log herron@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cleanup logstash/kibana records T299700 - herron@cumin1001" [17:09:35] !log herron@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:13:49] 10SRE, 10SRE Observability (FY2021/2022-Q3): Decommission old ELK5 Logstash cluster - https://phabricator.wikimedia.org/T281266 (10herron) [17:13:53] 10SRE, 10Traffic, 10Patch-For-Review, 10SRE Observability (FY2023/2024-Q2): Remove legacy ELK LVS entries - https://phabricator.wikimedia.org/T299700 (10herron) 05Open→03Resolved >>! In T299700#9242375, @Volans wrote: > FYI the service IPs are still allocated in Netbox: > https://netbox.wikimedia.org/i... 
[17:18:01] (03PS2) 10Brouberol: Define an echoserver namespace for the dse-k8s-eqiad cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/981363 (https://phabricator.wikimedia.org/T353004) [17:18:03] (03PS1) 10Brouberol: Define a simple echoserver chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/981367 (https://phabricator.wikimedia.org/T353004) [17:18:05] (03PS1) 10Brouberol: Define deployment helmfiles for echoserver in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/981368 (https://phabricator.wikimedia.org/T353004) [17:18:57] (03CR) 10CI reject: [V: 04-1] Define a simple echoserver chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/981367 (https://phabricator.wikimedia.org/T353004) (owner: 10Brouberol) [17:18:59] (03CR) 10CI reject: [V: 04-1] Define an echoserver namespace for the dse-k8s-eqiad cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/981363 (https://phabricator.wikimedia.org/T353004) (owner: 10Brouberol) [17:19:05] (03CR) 10CI reject: [V: 04-1] Define deployment helmfiles for echoserver in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/981368 (https://phabricator.wikimedia.org/T353004) (owner: 10Brouberol) [17:23:08] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cephosd2002.codfw.wmnet with OS bullseye [17:23:13] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cephosd2002.codfw.wmnet with OS bullseye executed with errors: - ce... 
[17:23:15] (03PS2) 10Brouberol: Define a simple echoserver chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/981367 (https://phabricator.wikimedia.org/T353004) [17:23:17] (03PS3) 10Brouberol: Define an echoserver namespace for the dse-k8s-eqiad cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/981363 (https://phabricator.wikimedia.org/T353004) [17:23:19] (03PS2) 10Brouberol: Define deployment helmfiles for echoserver in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/981368 (https://phabricator.wikimedia.org/T353004) [17:29:24] (03PS1) 10Eevans: restbase: set production role and add config for restbase2030 [puppet] - 10https://gerrit.wikimedia.org/r/981371 (https://phabricator.wikimedia.org/T352468) [17:29:25] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10Jhancock.wm) having an issue with cephosd2001 and 2002. cephosd2001 fails at this part. [39/50, retrying in 117.00s] Attempt to run 'cookbooks.sre.hosts.reimage.... [17:34:02] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (10Jhancock.wm) having an issue with all the new sessionstore servers that I think stems from the HBA355i Fnt card. When the install gets to partitioning the drives, I... 
[17:34:06] (03CR) 10Klausman: [C: 03+1] api-gateway: set host and ingress for recommendation-api-ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/981359 (https://phabricator.wikimedia.org/T347263) (owner: 10Hnowlan) [17:34:21] (03PS2) 10Hnowlan: api-gateway: set host and ingress for recommendation-api-ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/981359 (https://phabricator.wikimedia.org/T347263) [17:36:18] (03CR) 10Hnowlan: [C: 03+2] api-gateway: set host and ingress for recommendation-api-ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/981359 (https://phabricator.wikimedia.org/T347263) (owner: 10Hnowlan) [17:37:25] (03Merged) 10jenkins-bot: api-gateway: set host and ingress for recommendation-api-ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/981359 (https://phabricator.wikimedia.org/T347263) (owner: 10Hnowlan) [17:37:46] 10SRE, 10SRE Observability: certspotter failures on alert1001 - https://phabricator.wikimedia.org/T318911 (10herron) 05Open→03Resolved a:03herron I'm reviewing the backlog today (almost exactly one year since the last update!) and I think we're ok to close this since certspotter failures were addressed,... 
[17:38:42] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [17:39:07] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [17:40:09] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: apply [17:40:28] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [17:43:06] (03CR) 10BCornwall: [C: 03+1] wmf-debci: also install recommended dependencies [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/981178 (owner: 10Giuseppe Lavagetto) [17:45:00] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2017:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [17:52:59] (PuppetZeroResources) firing: Puppet has failed generate resources on elastic1107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [17:57:57] !log bking@cumin1001 START - Cookbook sre.hosts.remove-downtime for wdqs1024.eqiad.wmnet [17:57:58] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wdqs1024.eqiad.wmnet [18:00:05] bd808: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231207T1800). 
[18:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231207T1800) [18:02:00] nothing from me today [18:03:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance [18:04:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance [18:04:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [18:04:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [18:04:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T343198)', diff saved to https://phabricator.wikimedia.org/P54277 and previous config saved to /var/cache/conftool/dbconfig/20231207-180427-ladsgroup.json [18:04:31] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [18:05:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance [18:05:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance [18:05:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [18:05:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [18:08:52] 10SRE, 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10phaultfinder) [18:19:08] (03CR) 10Bernard Wang: 
[C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980951 (https://phabricator.wikimedia.org/T351339) (owner: 10Jdlrobson) [18:31:18] (03CR) 10Dzahn: [C: 03+1] "lgtm! (for now, if possible we should fix TLS later though)" [puppet] - 10https://gerrit.wikimedia.org/r/980499 (https://phabricator.wikimedia.org/T347355) (owner: 10Ryan Kemper) [18:33:17] (03CR) 10Bking: [C: 03+2] wdqs: monitor ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/980499 (https://phabricator.wikimedia.org/T347355) (owner: 10Ryan Kemper) [18:37:59] (PuppetZeroResources) resolved: Puppet has failed generate resources on elastic1107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [18:38:26] (03PS1) 10Bking: Revert "wdqs: monitor ldf endpoint" [puppet] - 10https://gerrit.wikimedia.org/r/981190 [18:39:26] (03PS1) 10Ebernhardson: cirrus updater: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/981380 [18:42:03] !log puppetmaster1001 - revoke cert for miscweb.discovery.wmnet [18:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:40] (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/981380 (owner: 10Ebernhardson) [18:44:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T343198)', diff saved to https://phabricator.wikimedia.org/P54278 and previous config saved to /var/cache/conftool/dbconfig/20231207-184406-ladsgroup.json [18:44:14] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [18:44:27] (03Merged) 10jenkins-bot: cirrus updater: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/981380 (owner: 10Ebernhardson) [18:45:32] !log ebernhardson@deploy2002 helmfile [staging] START 
helmfile.d/services/cirrus-streaming-updater: apply [18:45:56] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:59:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P54279 and previous config saved to /var/cache/conftool/dbconfig/20231207-185913-ladsgroup.json [19:00:25] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [19:14:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P54280 and previous config saved to /var/cache/conftool/dbconfig/20231207-191420-ladsgroup.json [19:16:44] (SystemdUnitFailed) firing: wdqs-blazegraph.service Failed on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:21:43] (SystemdUnitFailed) resolved: wdqs-blazegraph.service Failed on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:27:31] (03PS1) 10Ebernhardson: cirrus updater: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/981385 [19:28:19] PROBLEM - MariaDB Replica Lag: s6 on dbstore1005 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 940.53 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:29:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T343198)', diff saved to https://phabricator.wikimedia.org/P54281 and previous config saved to 
/var/cache/conftool/dbconfig/20231207-192926-ladsgroup.json [19:29:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance [19:29:31] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [19:29:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance [19:29:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T343198)', diff saved to https://phabricator.wikimedia.org/P54282 and previous config saved to /var/cache/conftool/dbconfig/20231207-192949-ladsgroup.json [19:32:00] (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/981385 (owner: 10Ebernhardson) [19:32:57] (03Merged) 10jenkins-bot: cirrus updater: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/981385 (owner: 10Ebernhardson) [19:35:21] (03PS1) 10DDesouza: research-landing-page: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/981386 (https://phabricator.wikimedia.org/T219903) [19:35:24] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on wdqs[1022-1024].eqiad.wmnet with reason: graph split experiments T350106 [19:35:31] T350106: Implement a spark job that converts a RDF triples table into a RDF file format - https://phabricator.wikimedia.org/T350106 [19:35:40] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on wdqs[1022-1024].eqiad.wmnet with reason: graph split experiments T350106 [19:38:21] (03PS1) 10Bking: wdqs: monitor ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/981387 (https://phabricator.wikimedia.org/T347355) [19:38:22] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [19:38:31] !log 
ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:38:46] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/981387 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [19:40:58] (03CR) 10Bking: [C: 03+2] Revert "wdqs: monitor ldf endpoint" [puppet] - 10https://gerrit.wikimedia.org/r/981190 (owner: 10Bking) [19:43:21] (03PS1) 10Ebernhardson: cirrus updater: Allow non restored state in consumer-search-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/981388 [19:44:32] (03PS2) 10Bking: wdqs: monitor ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/981387 (https://phabricator.wikimedia.org/T347355) [19:45:52] (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Allow non restored state in consumer-search-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/981388 (owner: 10Ebernhardson) [19:46:37] (03Merged) 10jenkins-bot: cirrus updater: Allow non restored state in consumer-search-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/981388 (owner: 10Ebernhardson) [19:48:02] (03PS1) 10Gmodena: mw-page-content-enrich: version bump. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/980359 (https://phabricator.wikimedia.org/T345806) [19:49:10] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1082.eqiad.wmnet with OS bullseye [19:49:17] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host ms-be1082.eqiad.wmnet with OS bullseye [19:49:42] (03PS3) 10Bking: wdqs: monitor ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/981387 (https://phabricator.wikimedia.org/T347355) [19:50:18] (03CR) 10Eevans: [C: 03+2] restbase: set production role and add config for restbase2030 [puppet] - 10https://gerrit.wikimedia.org/r/981371 (https://phabricator.wikimedia.org/T352468) (owner: 10Eevans) [19:51:41] (03PS4) 10Bking: wdqs: monitor ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/981387 (https://phabricator.wikimedia.org/T347355) [19:52:20] (03PS1) 10Ebernhardson: cirrus updater: Allow non restored state in consumer-search-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/981390 [19:52:28] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/981387 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [19:55:05] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/979470 (https://phabricator.wikimedia.org/T352604) (owner: 10JHathaway) [19:58:10] (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Allow non restored state in consumer-search-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/981390 (owner: 10Ebernhardson) [19:58:58] (03Merged) 10jenkins-bot: cirrus updater: Allow non restored state in consumer-search-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/981390 (owner: 10Ebernhardson) [19:59:17] !log ebernhardson@deploy2002 helmfile [staging] START 
helmfile.d/services/cirrus-streaming-updater: apply [19:59:26] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:01:23] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [20:01:35] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:02:16] (03PS1) 10Dzahn: Revert "microsites/query_service: enable TLS when monitoring commons-query" [puppet] - 10https://gerrit.wikimedia.org/r/981191 [20:02:44] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1082.eqiad.wmnet with reason: host reimage [20:02:46] (03PS2) 10Dzahn: Revert "microsites/query_service: enable TLS when monitoring commons-query" [puppet] - 10https://gerrit.wikimedia.org/r/981191 (https://phabricator.wikimedia.org/T352941) [20:05:51] !log bootstrap Cassandra/restbase2030-a — T352468 [20:05:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:05] T352468: Provision new RESTBase cluster nodes: restbase20[28-35] - https://phabricator.wikimedia.org/T352468 [20:06:07] ACKNOWLEDGEMENT - MegaRAID on db1168 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T353020 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:06:11] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1082.eqiad.wmnet with reason: host reimage [20:06:12] 10SRE, 10ops-eqiad: Degraded RAID on db1168 - https://phabricator.wikimedia.org/T353020 (10ops-monitoring-bot) [20:07:12] PROBLEM - cassandra-a CQL 10.192.16.243:9042 on restbase2030 is CRITICAL: connect to address 10.192.16.243 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [20:09:20] PROBLEM - cassandra-a SSL 10.192.16.243:7000 on restbase2030 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection 
refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [20:09:48] PROBLEM - Check systemd state on restbase2030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:11:44] PROBLEM - cassandra-a service on restbase2030 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:12:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T343198)', diff saved to https://phabricator.wikimedia.org/P54283 and previous config saved to /var/cache/conftool/dbconfig/20231207-201234-ladsgroup.json [20:12:52] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [20:14:04] PROBLEM - cassandra-b CQL 10.192.16.244:9042 on restbase2030 is CRITICAL: connect to address 10.192.16.244 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [20:14:54] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-int_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:16:30] PROBLEM - cassandra-b SSL 10.192.16.244:7000 on restbase2030 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [20:16:30] 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10observability, and 3 others: Upgrade Kafka to from 1.x to later version - https://phabricator.wikimedia.org/T300102 (10Ottomata) [20:18:06] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:18:16] RECOVERY - cassandra-a service 
on restbase2030 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:18:26] RECOVERY - Check systemd state on restbase2030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:19:00] PROBLEM - cassandra-b service on restbase2030 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:19:04] RECOVERY - cassandra-a SSL 10.192.16.243:7000 on restbase2030 is OK: SSL OK - Certificate restbase2030-a valid until 2025-12-06 17:50:13 +0000 (expires in 729 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [20:20:02] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs2017:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [20:20:30] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2017:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [20:21:26] PROBLEM - cassandra-c CQL 10.192.16.245:9042 on restbase2030 is CRITICAL: connect to address 10.192.16.245 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [20:22:28] 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1168 - https://phabricator.wikimedia.org/T353020 (10Marostegui) [20:23:29] 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1168 - https://phabricator.wikimedia.org/T353020 (10Marostegui) 
@wiki_willy I guess this host isn't under warranty anymore? Still can we get a disk for it? Thanks! [20:23:58] PROBLEM - cassandra-c SSL 10.192.16.245:7000 on restbase2030 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [20:25:44] (03CR) 10Dzahn: [C: 03+2] Revert "microsites/query_service: enable TLS when monitoring commons-query" [puppet] - 10https://gerrit.wikimedia.org/r/981191 (https://phabricator.wikimedia.org/T352941) (owner: 10Dzahn) [20:25:52] 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10Event-Platform: Discovery for Kafka cluster brokers - https://phabricator.wikimedia.org/T213561 (10Ottomata) Pasting some relevant discussion points [[ https://wikimedia.slack.com/archives/C055QGPTC69/p1700728492406809 | from slack ]]: @brouberol > I was... [20:26:24] PROBLEM - cassandra-c service on restbase2030 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:27:02] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [20:27:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P54285 and previous config saved to /var/cache/conftool/dbconfig/20231207-202740-ladsgroup.json [20:30:16] !log pt1979@cumin2002 START - Cookbook sre.hosts.dhcp for host cp4037.ulsfo.wmnet [20:38:53] 10SRE, 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10phaultfinder) [20:40:15] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs2017:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - 
https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [20:42:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P54286 and previous config saved to /var/cache/conftool/dbconfig/20231207-204247-ladsgroup.json [20:43:00] (03PS3) 10Ottomata: Remove eventlogging_FeaturePolicyViolation and _SpecialMuteSubmit event streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977075 (https://phabricator.wikimedia.org/T329718) (owner: 10Phuedx) [20:46:21] (03CR) 10Ottomata: [C: 03+2] Remove eventlogging_FeaturePolicyViolation and _SpecialMuteSubmit event streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977075 (https://phabricator.wikimedia.org/T329718) (owner: 10Phuedx) [20:47:04] (03Merged) 10jenkins-bot: Remove eventlogging_FeaturePolicyViolation and _SpecialMuteSubmit event streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977075 (https://phabricator.wikimedia.org/T329718) (owner: 10Phuedx) [20:47:26] (03PS3) 10Kimberly Sarabia: Remove readability survey tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980063 (https://phabricator.wikimedia.org/T349337) [20:47:55] (03CR) 10Kimberly Sarabia: Remove readability survey tool (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980063 (https://phabricator.wikimedia.org/T349337) (owner: 10Kimberly Sarabia) [20:49:00] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2017:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [20:50:51] 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1168 - 
https://phabricator.wikimedia.org/T353020 (10wiki_willy) Definitely. @Jclark-ctr & @VRiley-WMF - can you check if we have any spare drives from a decommissioned host? If not, we'll purchase one via @RobH). Thanks, Willy >>! In T353020#9391723, @Marost... [20:54:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs2017:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [20:54:32] It is really too bad we can't scap sync-file more than one file at once...especially since it takes so long to do sync-file now! [20:56:07] !log otto@deploy2002 Synchronized wmf-config/ext-EventLogging.php: Config: [[gerrit:977075|Remove eventlogging_FeaturePolicyViolation and _SpecialMuteSubmit EventLogging config (T329718)]] (duration: 07m 07s) [20:56:13] T329718: Decommission the SpecialMuteSubmit instrument - https://phabricator.wikimedia.org/T329718 [20:57:15] ottomata: probably better to ask releng [20:57:41] you sure i shouldn't just gripe at the wall here? [20:57:47] :) [20:57:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T343198)', diff saved to https://phabricator.wikimedia.org/P54287 and previous config saved to /var/cache/conftool/dbconfig/20231207-205753-ladsgroup.json [20:57:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance [20:57:57] ottomata: `scap sync-world` (or even `scap backport`) is the future! 
[20:57:58] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [20:58:00] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2017:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [20:58:06] ottomata: nah it's a genuine question [20:58:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance [20:58:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T343198)', diff saved to https://phabricator.wikimedia.org/P54288 and previous config saved to /var/cache/conftool/dbconfig/20231207-205817-ladsgroup.json [20:58:27] That might be solved by something like scap backport as taavi pointed out [20:58:50] well I look forward to it! :D [20:59:19] why look forward when you can use it today? [20:59:20] I just fear jinxer-wm and wikibugs have better ideas than helping you get an answer if people are busy and don't see here [21:00:06] TheresNoTime: gettimeofday() says it's time for UTC late backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231207T2100) [21:00:06] jan_Drewniak and kostajh: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:22] hi [21:00:34] o/ [21:01:15] (03CR) 10Ottomata: [C: 03+2] "Deployed!"
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/977075 (https://phabricator.wikimedia.org/T329718) (owner: 10Phuedx) [21:01:53] ottomata: I believe you could do `scap backport 977075` for those two files to go out at the same time [21:02:27] wut [21:02:29] really?! [21:02:29] TheresNoTime: are you around? [21:02:45] btw, i have a scap sync-file currently running. should be done in a couple of mins. i'm done then. [21:02:56] !log restarting blazegraph on wdqs2017 (BlazegraphFreeAllocatorsDecreasingRapidly) [21:02:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:19] cool. kostajh if it's just us two for the backport, I can deploy the config changes. [21:03:19] taavi: kostajh i was following these docs https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers#operations/mediawiki-config_2 [21:03:23] do they need updated? [21:03:29] ottomata: https://wikitech.wikimedia.org/wiki/Scap#Backport_Deployments [21:03:55] wow [21:04:06] (ProbeDown) resolved: (2) Service miscweb1003:443 has failed probes (http_commons_query_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:04:12] jan_drewniak: sounds good [21:04:16] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:04:35] ottomata: you can even do `scap backport 980951 981337` and do two at once [21:04:45] my jaw is getting closer to the floor [21:05:02] Hi, can I add my patch to this deployment window? :) [21:05:48] It's config related, for enabling action blocks in Serbian Wikipedia. [21:05:52] Kizule: yeah if it's something simple [21:05:59] Niharika has approved it. Yup, totally simple. 
[21:06:31] !log otto@deploy2002 Synchronized wmf-config/ext-EventStreamConfig.php: Config: [[gerrit:977075|Remove eventlogging_FeaturePolicyViolation and _SpecialMuteSubmit EventStreamConfig (T329718)]] (duration: 09m 16s) [21:06:40] T329718: Decommission the SpecialMuteSubmit instrument - https://phabricator.wikimedia.org/T329718 [21:07:52] ottomata: T353024 [21:07:52] T353024: Update [[wikitech:Backport_windows/Deployers]] for scap backport - https://phabricator.wikimedia.org/T353024 [21:07:56] my scap sync file is done [21:08:07] taavi: thank you! [21:08:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:08:54] jan_drewniak: I've added my patch to the Deployments page. [21:10:33] Kizule kostajh okydoke. 
I'll do all three at once, `scap backport 980951 981337 976911` I'll let you know when they're ready to test [21:11:14] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980951 (https://phabricator.wikimedia.org/T351339) (owner: 10Jdlrobson) [21:11:16] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981337 (https://phabricator.wikimedia.org/T348298) (owner: 10Ilias Sarantopoulos) [21:11:18] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976911 (https://phabricator.wikimedia.org/T351873) (owner: 10Zoranzoki21) [21:11:35] (03PS3) 10Jdrewniak: Enable action blocks in Serbian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976911 (https://phabricator.wikimedia.org/T351873) (owner: 10Zoranzoki21) [21:11:35] thx [21:11:38] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980951 (https://phabricator.wikimedia.org/T351339) (owner: 10Jdlrobson) [21:11:40] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981337 (https://phabricator.wikimedia.org/T348298) (owner: 10Ilias Sarantopoulos) [21:11:42] (03CR) 10TrainBranchBot: "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976911 (https://phabricator.wikimedia.org/T351873) (owner: 10Zoranzoki21) [21:11:52] jan_drewniak: mine will just show up in beta labs whenever that sync happens, so no need to wait for me [21:12:02] (03Merged) 10jenkins-bot: Enable Vector beta feature for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980951 (https://phabricator.wikimedia.org/T351339) 
(owner: 10Jdlrobson) [21:12:05] (03Merged) 10jenkins-bot: [beta] ores-extension: enable revertrisk model for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981337 (https://phabricator.wikimedia.org/T348298) (owner: 10Ilias Sarantopoulos) [21:12:45] (03PS4) 10Jdrewniak: Enable action blocks in Serbian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976911 (https://phabricator.wikimedia.org/T351873) (owner: 10Zoranzoki21) [21:12:53] (03CR) 10TrainBranchBot: "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976911 (https://phabricator.wikimedia.org/T351873) (owner: 10Zoranzoki21) [21:12:54] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:13:23] (scap backport not as great when each patch requires a rebase) [21:13:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:13:35] (03Merged) 10jenkins-bot: Enable action blocks in Serbian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976911 (https://phabricator.wikimedia.org/T351873) (owner: 10Zoranzoki21) [21:13:52] !log jdrewniak@deploy2002 Started scap: Backport for [[gerrit:980951|Enable Vector beta feature for all wikis (T351339)]], [[gerrit:981337|[beta] ores-extension: enable revertrisk model for enwiki (T348298)]], [[gerrit:976911|Enable action blocks in Serbian Wikipedia (T351873)]] [21:13:58] T351339: Deploy client preferences to production beta features - https://phabricator.wikimedia.org/T351339 [21:13:59] T348298: Add revertrisk-language-agnostic to RecentChanges filters - 
https://phabricator.wikimedia.org/T348298 [21:13:59] T351873: Enable action blocks in Serbian Wikipedia - https://phabricator.wikimedia.org/T351873 [21:15:13] !log jdrewniak@deploy2002 zoranzoki21 and isaranto and jdlrobson and jdrewniak: Backport for [[gerrit:980951|Enable Vector beta feature for all wikis (T351339)]], [[gerrit:981337|[beta] ores-extension: enable revertrisk model for enwiki (T348298)]], [[gerrit:976911|Enable action blocks in Serbian Wikipedia (T351873)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:15:31] Kizule: the change is ready for testing on mwdebug [21:15:49] jan_drewniak: Alright, give me a moment. [21:16:04] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:16:41] Good to go [21:17:12] Kizule: alrighty, we're syncing :) [21:17:15] !log jdrewniak@deploy2002 zoranzoki21 and isaranto and jdlrobson and jdrewniak: Continuing with sync [21:19:28] Thanks! [21:20:56] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:23:46] !log jdrewniak@deploy2002 Finished scap: Backport for [[gerrit:980951|Enable Vector beta feature for all wikis (T351339)]], [[gerrit:981337|[beta] ores-extension: enable revertrisk model for enwiki (T348298)]], [[gerrit:976911|Enable action blocks in Serbian Wikipedia (T351873)]] (duration: 09m 54s) [21:23:53] T351339: Deploy client preferences to production beta features - https://phabricator.wikimedia.org/T351339 [21:23:53] T348298: Add revertrisk-language-agnostic to RecentChanges filters - https://phabricator.wikimedia.org/T348298 [21:23:53] T351873: Enable action blocks in Serbian Wikipedia - https://phabricator.wikimedia.org/T351873 [21:24:33] Works in production, thanks! 
[21:25:16] alright, Kizule kostajh changes have been deployed :) [21:25:40] jan_drewniak: thank you [21:25:47] Thank you! [21:26:24] Now that we have ~30 minutes, I think we could use this.. Amir1: Can we test namespaceDupes.php on smaller Serbian projects? My task for running it still needs to be done. :) [21:26:38] jan_drewniak: I need to revert mine, after it synced to beta, I can see it is causing errors. [21:26:52] jan_drewniak: https://beta-logs.wmcloud.org/app/discover#/doc/5f0c9be0-0b6f-11ec-9cde-3f4490e09a26/logstash-mediawiki-1-7.0.0-1-2023.12.07?id=RlcrRowBhFLoKHIVRB8m [21:27:28] kostajh: no problem, we can revert that [21:27:29] jan_drewniak: can you do `scap backport --revert 981337`? [21:28:42] jan_drewniak: hmm, maybe we just need to run update.php on beta? [21:28:43] I am not sure [21:29:19] I don't think that we have ever run it? [21:29:37] kostajh: I don't know how to run update scripts on beta (though I feel like that should be done automatically) [21:30:57] It's done automagically [21:31:13] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [21:31:15] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1082.eqiad.wmnet with OS bullseye [21:31:23] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host ms-be1082.eqiad.wmnet with OS bullseye completed: - ms-be... [21:31:41] kostajh: since that's the case, maybe I'll revert? 
[21:31:50] https://integration.wikimedia.org/ci/view/Beta/job/beta-update-databases-eqiad/ [21:31:57] jan_drewniak: ^ [21:32:09] Last run was 10 minutes ago [21:32:14] One due in 50 [21:32:26] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (10Jclark-ctr) [21:32:27] Someone might be able to trigger manually [21:32:42] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (10Jclark-ctr) [21:33:06] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (10Jclark-ctr) 05Open→03Resolved [21:34:56] jan_drewniak: yeah, let's just revert [21:35:09] I am too tired to debug that now :) we can try again next week [21:35:11] thanks! [21:35:31] kostajh: or we can wait an hour and see if the errors subside (I won't be here to revert in an hour if that's necessary though) [21:36:09] jan_drewniak: I suspect we are missing some config [21:36:25] I suggest a revert if unsure, it's the last window of the week [21:36:49] yeah, let's revert please. 
[21:36:54] (03CR) 10Eevans: [C: 03+2] install_server: partman recipe for new sessionstore hosts [puppet] - 10https://gerrit.wikimedia.org/r/975913 (https://phabricator.wikimedia.org/T349875) (owner: 10Eevans) [21:37:37] (03PS1) 10TrainBranchBot: Revert "[beta] ores-extension: enable revertrisk model for enwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981395 [21:37:39] (03CR) 10TrainBranchBot: "jdrewniak@deploy2002 created a revert of this change as Id628e04a35adbd748824437c8cc921f1e08e9371" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981337 (https://phabricator.wikimedia.org/T348298) (owner: 10Ilias Sarantopoulos) [21:37:53] !log xcollazo@deploy2002 Started deploy [airflow-dags/analytics@049cf03]: (no justification provided) [21:38:00] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981395 (owner: 10TrainBranchBot) [21:38:21] !log xcollazo@deploy2002 Finished deploy [airflow-dags/analytics@049cf03]: (no justification provided) (duration: 00m 28s) [21:38:42] (03Merged) 10jenkins-bot: Revert "[beta] ores-extension: enable revertrisk model for enwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/981395 (owner: 10TrainBranchBot) [21:39:39] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission rdb1009, rdb1010 - https://phabricator.wikimedia.org/T352547 (10Jclark-ctr) a:03Jclark-ctr [21:39:42] 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1168 - https://phabricator.wikimedia.org/T353020 (10Jclark-ctr) a:03Jclark-ctr Server is out of warranty. 
I can check tomorrow morning. I am pretty sure I have a spare drive from a decommissioned host, but I will verify. [21:39:45] alrighty, kostajh change reverted :) [21:40:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission db1119.eqiad.wmnet - https://phabricator.wikimedia.org/T337206 (10Jclark-ctr) a:03Jclark-ctr [21:41:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T343198)', diff saved to https://phabricator.wikimedia.org/P54289 and previous config saved to /var/cache/conftool/dbconfig/20231207-214114-ladsgroup.json [21:41:18] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [21:42:42] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (10Eevans) >>! In T349876#9391164, @Jhancock.wm wrote: > having an issue with all the new sessionstore servers that I think stems from the HBA355i Fnt card. > > When t...
[21:43:35] 10SRE, 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10Jclark-ctr) @ayounsi if you wouldn't mind messaging me a time that works best with you so we can fix this [21:46:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q2:rack/setup/install 3 sessionstore hosts (eqiad) - https://phabricator.wikimedia.org/T349875 (10Jclark-ctr) [21:51:04] (03PS1) 10Jclark-ctr: Add ganeti103[5-8] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/981396 (https://phabricator.wikimedia.org/T349925) [21:51:38] (03CR) 10Jclark-ctr: [C: 03+2] Add ganeti103[5-8] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/981396 (https://phabricator.wikimedia.org/T349925) (owner: 10Jclark-ctr) [21:51:42] (03CR) 10JHathaway: [C: 03+2] apt_repo: move hiera data into module, to allow for validation [puppet] - 10https://gerrit.wikimedia.org/r/979470 (https://phabricator.wikimedia.org/T352604) (owner: 10JHathaway) [21:52:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10Patch-For-Review: Q2:rack/setup/install ganeti103[5-8] - https://phabricator.wikimedia.org/T349925 (10Jclark-ctr) [21:56:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P54290 and previous config saved to /var/cache/conftool/dbconfig/20231207-215620-ladsgroup.json [21:57:05] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1059.mgmt.eqiad.wmnet with reboot policy FORCED [21:57:07] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1060.mgmt.eqiad.wmnet with reboot policy FORCED [21:57:09] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1061.mgmt.eqiad.wmnet with reboot policy FORCED [21:57:11] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1062.mgmt.eqiad.wmnet with reboot policy FORCED [21:57:35] (03PS7) 10JHathaway: apt_repo: move hiera data into 
module, to allow for validation [puppet] - 10https://gerrit.wikimedia.org/r/979470 (https://phabricator.wikimedia.org/T352604) [21:58:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (10Jclark-ctr) [22:01:18] (03CR) 10JHathaway: [C: 03+2] apt_repo: move hiera data into module, to allow for validation [puppet] - 10https://gerrit.wikimedia.org/r/979470 (https://phabricator.wikimedia.org/T352604) (owner: 10JHathaway) [22:03:56] (03PS1) 10Jclark-ctr: add kubernetes10[59-62] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/981399 (https://phabricator.wikimedia.org/T349874) [22:04:31] (03CR) 10Jclark-ctr: [C: 03+2] add kubernetes10[59-62] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/981399 (https://phabricator.wikimedia.org/T349874) (owner: 10Jclark-ctr) [22:05:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (10Jclark-ctr) [22:05:48] (03PS1) 10Ebernhardson: cirrus updater: Enable consumer-devnull in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/981401 [22:07:20] (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Enable consumer-devnull in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/981401 (owner: 10Ebernhardson) [22:08:00] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs2017:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [22:08:11] (03Merged) 10jenkins-bot: cirrus updater: Enable consumer-devnull in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/981401 (owner: 10Ebernhardson) [22:09:48] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP 
CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:10:20] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [22:10:28] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:11:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P54291 and previous config saved to /var/cache/conftool/dbconfig/20231207-221127-ladsgroup.json [22:13:10] (03PS1) 10Ebernhardson: cirrus updater: Ensure correct image name is provided to consumer-devnull [deployment-charts] - 10https://gerrit.wikimedia.org/r/981402 [22:14:18] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1059.mgmt.eqiad.wmnet with reboot policy FORCED [22:14:23] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1062.mgmt.eqiad.wmnet with reboot policy FORCED [22:14:30] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1060.mgmt.eqiad.wmnet with reboot policy FORCED [22:14:35] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1061.mgmt.eqiad.wmnet with reboot policy FORCED [22:15:37] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1059.eqiad.wmnet with OS bullseye [22:15:38] (03PS2) 10Ebernhardson: cirrus updater: Ensure correct image name is provided to consumer-devnull [deployment-charts] - 10https://gerrit.wikimedia.org/r/981402 [22:15:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1059.eqiad.wmnet with OS bullseye [22:16:28] 
!log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1060.eqiad.wmnet with OS bullseye [22:16:32] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1061.eqiad.wmnet with OS bullseye [22:16:34] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1062.eqiad.wmnet with OS bullseye [22:16:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1060.eqiad.wmnet with OS bullseye [22:16:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1061.eqiad.wmnet with OS bullseye [22:16:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1062.eqiad.wmnet with OS bullseye [22:17:16] (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Ensure correct image name is provided to consumer-devnull [deployment-charts] - 10https://gerrit.wikimedia.org/r/981402 (owner: 10Ebernhardson) [22:18:05] (03Merged) 10jenkins-bot: cirrus updater: Ensure correct image name is provided to consumer-devnull [deployment-charts] - 10https://gerrit.wikimedia.org/r/981402 (owner: 10Ebernhardson) [22:19:16] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [22:19:16] (03PS4) 10Jdlrobson: Remove readability survey tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980063 (https://phabricator.wikimedia.org/T349337) (owner: 10Kimberly Sarabia) [22:19:32] (03CR) 10Jdlrobson: [C: 03+1] "No rush on this one. 
Whenever we can get round to it!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980063 (https://phabricator.wikimedia.org/T349337) (owner: 10Kimberly Sarabia) [22:19:32] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:19:50] (03CR) 10Jdlrobson: [C: 04-1] Remove readability survey tool (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980063 (https://phabricator.wikimedia.org/T349337) (owner: 10Kimberly Sarabia) [22:20:22] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [22:20:32] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:22:26] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [22:22:33] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:26:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T343198)', diff saved to https://phabricator.wikimedia.org/P54292 and previous config saved to /var/cache/conftool/dbconfig/20231207-222633-ladsgroup.json [22:26:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1187.eqiad.wmnet with reason: Maintenance [22:26:38] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [22:26:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1187.eqiad.wmnet with reason: Maintenance [22:26:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1187 (T343198)', diff saved to https://phabricator.wikimedia.org/P54293 and previous config saved to /var/cache/conftool/dbconfig/20231207-222656-ladsgroup.json [22:29:31] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1059.eqiad.wmnet with reason: host reimage 
[22:30:23] (03PS1) 10Ebernhardson: admin_ng: Use same limits in cirrus-streaming-updater as rdf-streaming-updater [deployment-charts] - 10https://gerrit.wikimedia.org/r/981403 [22:30:33] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1060.eqiad.wmnet with reason: host reimage [22:30:50] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1062.eqiad.wmnet with reason: host reimage [22:31:10] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1061.eqiad.wmnet with reason: host reimage [22:33:10] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1059.eqiad.wmnet with reason: host reimage [22:35:32] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes1062.eqiad.wmnet with reason: host reimage [22:35:56] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1060.eqiad.wmnet with reason: host reimage [22:37:32] (03PS5) 10Kimberly Sarabia: Remove readability survey tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980063 (https://phabricator.wikimedia.org/T349337) [22:38:53] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1061.eqiad.wmnet with reason: host reimage [22:48:37] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [22:53:36] (03PS5) 10Bking: wdqs: monitor ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/981387 (https://phabricator.wikimedia.org/T347355) [22:53:53] (03CR) 10Kimberly Sarabia: Remove readability survey tool (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980063 (https://phabricator.wikimedia.org/T349337) (owner: 10Kimberly Sarabia) [22:53:53] !log jclark@cumin1001 START - Cookbook 
sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [22:55:37] 10SRE-swift-storage, 10ops-codfw, 10ops-eqiad, 10DC-Ops: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 (10Papaul) @Volans did the test 4 times. the first 2 times the server did pxe boot but the last 2 times it didn't [22:55:47] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host cp4037.ulsfo.wmnet [22:58:59] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [23:00:03] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10Papaul) @Jhancock.wm did you read my comment on Wed, Dec 6, 2:53 PM? [23:00:25] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [23:05:12] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [23:07:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T343198)', diff saved to https://phabricator.wikimedia.org/P54294 and previous config saved to /var/cache/conftool/dbconfig/20231207-230749-ladsgroup.json [23:07:54] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [23:08:38] (03PS2) 10Ryan Kemper: admin_ng: Use same limits in cirrus-streaming-updater as rdf-streaming-updater [deployment-charts] - 10https://gerrit.wikimedia.org/r/981403 (owner: 10Ebernhardson) [23:09:07] (03CR) 10Ryan Kemper: [C: 03+1] admin_ng: Use same limits in 
cirrus-streaming-updater as rdf-streaming-updater [deployment-charts] - https://gerrit.wikimedia.org/r/981403 (owner: Ebernhardson) [23:09:51] (CR) Ryan Kemper: [V: +2 C: +2] admin_ng: Use same limits in cirrus-streaming-updater as rdf-streaming-updater [deployment-charts] - https://gerrit.wikimedia.org/r/981403 (owner: Ebernhardson) [23:12:32] (Merged) jenkins-bot: admin_ng: Use same limits in cirrus-streaming-updater as rdf-streaming-updater [deployment-charts] - https://gerrit.wikimedia.org/r/981403 (owner: Ebernhardson) [23:15:22] !log ryankemper@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [23:17:18] !log ryankemper@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:21:05] !log ryankemper@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [23:21:06] !log ryankemper@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [23:21:44] !log ryankemper@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [23:22:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P54295 and previous config saved to /var/cache/conftool/dbconfig/20231207-232256-ladsgroup.json [23:23:17] !log ryankemper@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[23:23:43] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [23:23:54] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:35:39] (PS1) Andrea Denisse: prometheus: Ensure prometheus-icinga has a listening address [puppet] - https://gerrit.wikimedia.org/r/981407 (https://phabricator.wikimedia.org/T333615) [23:38:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P54296 and previous config saved to /var/cache/conftool/dbconfig/20231207-233802-ladsgroup.json [23:38:57] (PS3) Andrea Denisse: klaxon: Ensure the klaxon user has a home directory [puppet] - https://gerrit.wikimedia.org/r/980921 (https://phabricator.wikimedia.org/T333615) [23:39:21] (CR) Andrea Denisse: klaxon: Ensure the klaxon user has a home directory (2 comments) [puppet] - https://gerrit.wikimedia.org/r/980921 (https://phabricator.wikimedia.org/T333615) (owner: Andrea Denisse) [23:39:35] (PS1) Ebernhardson: cirrus updater: Undeploy consumer-devnull from staging [deployment-charts] - https://gerrit.wikimedia.org/r/981409 [23:42:34] (CR) Ebernhardson: [C: +2] cirrus updater: Undeploy consumer-devnull from staging [deployment-charts] - https://gerrit.wikimedia.org/r/981409 (owner: Ebernhardson) [23:43:18] (Merged) jenkins-bot: cirrus updater: Undeploy consumer-devnull from staging [deployment-charts] - https://gerrit.wikimedia.org/r/981409 (owner: Ebernhardson) [23:46:21] (CR) Andrea Denisse: [C: +1] "LGTM, thank you!" [puppet] - https://gerrit.wikimedia.org/r/980048 (https://phabricator.wikimedia.org/T350591) (owner: Herron) [23:47:16] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [23:47:23] (CR) Andrea Denisse: [C: +1] "LGTM, thank you!"
[puppet] - https://gerrit.wikimedia.org/r/981298 (https://phabricator.wikimedia.org/T352968) (owner: Filippo Giunchedi) [23:47:37] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:52:16] !log jclark@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [23:52:18] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [23:52:21] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1060.eqiad.wmnet with OS bullseye [23:52:21] !log jclark@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [23:52:24] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1059.eqiad.wmnet with OS bullseye [23:52:27] !log jclark@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [23:52:27] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1062.eqiad.wmnet with OS bullseye [23:52:27] SRE, ops-eqiad, DC-Ops, serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1060.eqiad.wmnet with OS bullseye completed: - kubernetes1060 (**WARN**)...
[23:52:30] SRE, ops-eqiad, DC-Ops, serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1059.eqiad.wmnet with OS bullseye completed: - kubernetes1059 (**PASS**)... [23:52:32] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1061.eqiad.wmnet with OS bullseye [23:52:32] SRE, ops-eqiad, DC-Ops, serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1062.eqiad.wmnet with OS bullseye completed: - kubernetes1062 (**WARN**)... [23:52:37] SRE, ops-eqiad, DC-Ops, serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1061.eqiad.wmnet with OS bullseye completed: - kubernetes1061 (**WARN**)...
[23:52:41] RECOVERY - cassandra-a CQL 10.192.16.243:9042 on restbase2030 is OK: TCP OK - 0.040 second response time on 10.192.16.243 port 9042 https://phabricator.wikimedia.org/T93886 [23:53:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T343198)', diff saved to https://phabricator.wikimedia.org/P54297 and previous config saved to /var/cache/conftool/dbconfig/20231207-235310-ladsgroup.json [23:53:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1201.eqiad.wmnet with reason: Maintenance [23:53:14] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [23:53:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1201.eqiad.wmnet with reason: Maintenance [23:53:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1201 (T343198)', diff saved to https://phabricator.wikimedia.org/P54298 and previous config saved to /var/cache/conftool/dbconfig/20231207-235333-ladsgroup.json [23:53:59] SRE, ops-eqiad, DC-Ops, serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (Jclark-ctr) a:VRiley-WMF→Jclark-ctr [23:54:04] SRE, ops-eqiad, DC-Ops, serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (Jclark-ctr) Open→Resolved [23:56:54] (PS1) Papaul: Rename ceph to cephosd [puppet] - https://gerrit.wikimedia.org/r/981413 (https://phabricator.wikimedia.org/T349934)