Wikimedia IRC logs browser - #wikimedia-operations

Displaying 1098 items:

2024-08-07 00:00:10 <logmsgbot> !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1279.eqiad.wmnet with reason: host reimage
2024-08-07 00:07:07 <wikibugs> (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1060183 (owner: TrainBranchBot)
2024-08-07 00:10:03 <logmsgbot> !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on wikikube-worker1283.eqiad.wmnet with reason: host reimage
2024-08-07 00:16:45 <jinxer-wm> FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
2024-08-07 00:20:03 <logmsgbot> !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on wikikube-worker1282.eqiad.wmnet with reason: host reimage
2024-08-07 00:26:36 <logmsgbot> !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on wikikube-worker1281.eqiad.wmnet with reason: host reimage
2024-08-07 00:27:03 <logmsgbot> !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1280.eqiad.wmnet with reason: host reimage
2024-08-07 00:30:41 <jinxer-wm> FIRING: [4x] SystemdUnitFailed: netbox_report_coherence_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2024-08-07 00:32:02 <logmsgbot> !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on wikikube-worker1284.eqiad.wmnet with reason: host reimage
2024-08-07 00:33:19 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
2024-08-07 00:34:25 <jinxer-wm> FIRING: SystemdUnitFailed: prometheus-ipmi-exporter.service on wikikube-worker1282:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2024-08-07 00:37:46 <logmsgbot> !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
2024-08-07 00:37:47 <logmsgbot> !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1283.eqiad.wmnet with OS bullseye
2024-08-07 00:38:00 <wikibugs> ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047161 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1283.eqiad.wmnet with OS bullseye...
2024-08-07 00:38:04 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
2024-08-07 00:39:22 <logmsgbot> !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
2024-08-07 00:39:23 <logmsgbot> !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1279.eqiad.wmnet with OS bullseye
2024-08-07 00:39:25 <jinxer-wm> RESOLVED: SystemdUnitFailed: prometheus-ipmi-exporter.service on wikikube-worker1282:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2024-08-07 00:39:29 <wikibugs> ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047162 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1279.eqiad.wmnet with OS bullseye...
2024-08-07 00:41:07 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
2024-08-07 00:41:25 <logmsgbot> !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
2024-08-07 00:41:27 <logmsgbot> !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1282.eqiad.wmnet with OS bullseye
2024-08-07 00:41:32 <wikibugs> ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047163 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1282.eqiad.wmnet with OS bullseye...
2024-08-07 00:41:45 <jinxer-wm> RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
2024-08-07 00:43:54 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
2024-08-07 00:44:50 <logmsgbot> !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
2024-08-07 00:44:51 <logmsgbot> !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1281.eqiad.wmnet with OS bullseye
2024-08-07 00:44:57 <wikibugs> ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047164 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1281.eqiad.wmnet with OS bullseye...
2024-08-07 00:45:01 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
2024-08-07 00:46:19 <logmsgbot> !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
2024-08-07 00:46:20 <logmsgbot> !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1280.eqiad.wmnet with OS bullseye
2024-08-07 00:46:30 <wikibugs> ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047165 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1280.eqiad.wmnet with OS bullseye...
2024-08-07 00:47:52 <wikibugs> ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047178 (Jclark-ctr)
2024-08-07 00:48:22 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
2024-08-07 00:50:11 <wikibugs> ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047179 (Jclark-ctr)
2024-08-07 00:50:19 <logmsgbot> !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
2024-08-07 00:50:20 <logmsgbot> !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1284.eqiad.wmnet with OS bullseye
2024-08-07 00:50:28 <wikibugs> ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047180 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1284.eqiad.wmnet with OS bullseye...
2024-08-07 00:50:31 <wikibugs> ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047181 (Jclark-ctr)
2024-08-07 00:54:54 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1295.eqiad.wmnet with OS bullseye
2024-08-07 00:55:04 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1288.eqiad.wmnet with OS bullseye
2024-08-07 00:55:04 <wikibugs> ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047186 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1295.eqiad.wmnet with OS bull...
2024-08-07 00:55:10 <wikibugs> ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047187 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1288.eqiad.wmnet with OS bull...
2024-08-07 00:55:13 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1289.eqiad.wmnet with OS bullseye
2024-08-07 00:55:19 <wikibugs> ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047188 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1289.eqiad.wmnet with OS bull...
2024-08-07 00:56:06 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1290.eqiad.wmnet with OS bullseye
2024-08-07 00:56:09 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1291.eqiad.wmnet with OS bullseye
2024-08-07 00:56:11 <wikibugs> ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047189 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1290.eqiad.wmnet with OS bull...
2024-08-07 00:56:15 <wikibugs> ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047190 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1291.eqiad.wmnet with OS bull...
2024-08-07 00:56:51 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1292.eqiad.wmnet with OS bullseye
2024-08-07 00:57:01 <wikibugs> ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047191 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1292.eqiad.wmnet with OS bull...
2024-08-07 00:57:39 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1293.eqiad.wmnet with OS bullseye
2024-08-07 00:57:48 <wikibugs> ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047192 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1293.eqiad.wmnet with OS bull...
2024-08-07 00:58:14 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1294.eqiad.wmnet with OS bullseye
2024-08-07 00:58:25 <wikibugs> ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047193 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1294.eqiad.wmnet with OS bull...
2024-08-07 00:58:55 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1287.eqiad.wmnet with OS bullseye
2024-08-07 00:59:03 <wikibugs> ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047194 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1287.eqiad.wmnet with OS bull...
2024-08-07 01:01:48 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1286.mgmt.eqiad.wmnet with reboot policy FORCED
2024-08-07 01:02:06 <logmsgbot> !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1286.mgmt.eqiad.wmnet with reboot policy FORCED
2024-08-07 01:02:26 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1285.mgmt.eqiad.wmnet with reboot policy FORCED
2024-08-07 01:02:38 <logmsgbot> !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1285.mgmt.eqiad.wmnet with reboot policy FORCED
2024-08-07 01:11:45 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1295.eqiad.wmnet with reason: host reimage
2024-08-07 01:11:46 <wikibugs> (PS1) Arlolra: Enabled KartographerParsoidSupport on (cs|hi|shn|ps|tr)wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/1060186 (https://phabricator.wikimedia.org/T371936)
2024-08-07 01:11:50 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1288.eqiad.wmnet with reason: host reimage
2024-08-07 01:12:03 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1289.eqiad.wmnet with reason: host reimage
2024-08-07 01:13:02 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1290.eqiad.wmnet with reason: host reimage
2024-08-07 01:13:05 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1291.eqiad.wmnet with reason: host reimage
2024-08-07 01:13:46 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1292.eqiad.wmnet with reason: host reimage
2024-08-07 01:14:07 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1293.eqiad.wmnet with reason: host reimage
2024-08-07 01:14:59 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1294.eqiad.wmnet with reason: host reimage
2024-08-07 01:15:12 <logmsgbot> !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1295.eqiad.wmnet with reason: host reimage
2024-08-07 01:15:40 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1287.eqiad.wmnet with reason: host reimage
2024-08-07 01:18:00 <logmsgbot> !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1291.eqiad.wmnet with reason: host reimage
2024-08-07 01:19:41 <logmsgbot> !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1285.eqiad.wmnet with OS bullseye
2024-08-07 01:19:49 <wikibugs> ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047215 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1285.eqiad.wmnet with OS bullseye...
2024-08-07 01:19:50 <logmsgbot> !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1286.eqiad.wmnet with OS bullseye
2024-08-07 01:19:54 <wikibugs> ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047216 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1286.eqiad.wmnet with OS bullseye...
2024-08-07 01:21:30 <logmsgbot> !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1293.eqiad.wmnet with reason: host reimage
2024-08-07 01:24:39 <logmsgbot> !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1292.eqiad.wmnet with reason: host reimage
2024-08-07 01:25:43 <wikibugs> ops-eqiad, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T371949 (phaultfinder) NEW
2024-08-07 01:26:48 <logmsgbot> !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1287.eqiad.wmnet with reason: host reimage
2024-08-07 01:33:06 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
2024-08-07 01:33:22 <logmsgbot> !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1290.eqiad.wmnet with reason: host reimage
2024-08-07 01:33:39 <logmsgbot> !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
2024-08-07 01:33:40 <logmsgbot> !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1295.eqiad.wmnet with OS bullseye
2024-08-07 01:33:52 <wikibugs> ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047238 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1295.eqiad.wmnet with OS bullseye...
2024-08-07 01:35:23 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
2024-08-07 01:35:46 <wikibugs> ops-eqiad, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T371949#10047240 (phaultfinder)
2024-08-07 01:36:12 <logmsgbot> !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1289.eqiad.wmnet with reason: host reimage
2024-08-07 01:36:43 <logmsgbot> !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
2024-08-07 01:36:44 <logmsgbot> !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1291.eqiad.wmnet with OS bullseye
2024-08-07 01:36:57 <wikibugs> ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047241 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1291.eqiad.wmnet with OS bullseye...
2024-08-07 01:39:34 <logmsgbot> !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1294.eqiad.wmnet with reason: host reimage
2024-08-07 01:39:44 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
2024-08-07 01:40:01 <logmsgbot> !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
2024-08-07 01:40:01 <logmsgbot> !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1293.eqiad.wmnet with OS bullseye
2024-08-07 01:40:11 <wikibugs> ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047242 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1293.eqiad.wmnet with OS bullseye...
2024-08-07 01:41:13 <wikibugs> (PS1) Andrew Bogott: Make cloudcephosd103[578] into ceph osd nodes [puppet] - https://gerrit.wikimedia.org/r/1060188 (https://phabricator.wikimedia.org/T363344)
2024-08-07 01:41:41 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
2024-08-07 01:42:08 <logmsgbot> !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
2024-08-07 01:42:09 <logmsgbot> !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1292.eqiad.wmnet with OS bullseye
2024-08-07 01:42:19 <wikibugs> ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047247 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1292.eqiad.wmnet with OS bullseye...
2024-08-07 01:42:43 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1296.eqiad.wmnet with OS bullseye
2024-08-07 01:42:54 <wikibugs> ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047250 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1296.eqiad.wmnet with OS bull...
2024-08-07 01:43:11 <wikibugs> (PS2) Andrew Bogott: Make cloudcephosd103[578] into ceph osd nodes [puppet] - https://gerrit.wikimedia.org/r/1060188 (https://phabricator.wikimedia.org/T363344)
2024-08-07 01:43:22 <wikibugs> (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1060188 (https://phabricator.wikimedia.org/T363344) (owner: Andrew Bogott)
2024-08-07 01:43:37 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
2024-08-07 01:44:02 <logmsgbot> !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
2024-08-07 01:44:03 <logmsgbot> !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1287.eqiad.wmnet with OS bullseye
2024-08-07 01:44:13 <wikibugs> ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047251 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1287.eqiad.wmnet with OS bullseye...
2024-08-07 01:44:29 <logmsgbot> !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1288.eqiad.wmnet with reason: host reimage
2024-08-07 01:48:28 <wikibugs> (CR) Andrew Bogott: [C:+2] "pcc failures are because the systems are new" [puppet] - https://gerrit.wikimedia.org/r/1060188 (https://phabricator.wikimedia.org/T363344) (owner: Andrew Bogott)
2024-08-07 01:51:05 <wikibugs> (CR) Andrew Bogott: [C:+2] "<narrator> It wasn't." [puppet] - https://gerrit.wikimedia.org/r/1060188 (https://phabricator.wikimedia.org/T363344) (owner: Andrew Bogott)
2024-08-07 01:51:21 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
2024-08-07 01:52:08 <logmsgbot> !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
2024-08-07 01:52:09 <logmsgbot> !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1290.eqiad.wmnet with OS bullseye
2024-08-07 01:52:20 <wikibugs> ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047253 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1290.eqiad.wmnet with OS bullseye...
2024-08-07 01:53:07 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
2024-08-07 01:55:33 <logmsgbot> !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
2024-08-07 01:55:34 <logmsgbot> !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1289.eqiad.wmnet with OS bullseye
2024-08-07 01:55:43 <wikibugs> ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047254 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1289.eqiad.wmnet with OS bullseye...
2024-08-07 01:56:49 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
2024-08-07 01:57:47 <logmsgbot> !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
2024-08-07 01:57:48 <logmsgbot> !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1294.eqiad.wmnet with OS bullseye
2024-08-07 01:57:54 <wikibugs> ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047255 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1294.eqiad.wmnet with OS bullseye...
2024-08-07 02:02:16 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
2024-08-07 02:02:29 <wikibugs> (PS1) Andrew Bogott: Add ceph config for cloudcephosd103[5-8] [puppet] - https://gerrit.wikimedia.org/r/1060190 (https://phabricator.wikimedia.org/T363344)
2024-08-07 02:02:33 <logmsgbot> !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
2024-08-07 02:02:34 <logmsgbot> !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1288.eqiad.wmnet with OS bullseye
2024-08-07 02:02:46 <wikibugs> ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047258 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1288.eqiad.wmnet with OS bullseye...
2024-08-07 02:02:57 <wikibugs> ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047259 (Jclark-ctr)
2024-08-07 02:03:11 <wikibugs> (PS2) Andrew Bogott: Add ceph config for cloudcephosd103[5-8] [puppet] - https://gerrit.wikimedia.org/r/1060190 (https://phabricator.wikimedia.org/T363344)
2024-08-07 02:03:41 <wikibugs> (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1060190 (https://phabricator.wikimedia.org/T363344) (owner: Andrew Bogott)
2024-08-07 02:04:57 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1285.mgmt.eqiad.wmnet with reboot policy FORCED
2024-08-07 02:06:10 <logmsgbot> !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1285.mgmt.eqiad.wmnet with reboot policy FORCED
2024-08-07 02:06:42 <wikibugs> (CR) Andrew Bogott: [C:+2] Add ceph config for cloudcephosd103[5-8] [puppet] - https://gerrit.wikimedia.org/r/1060190 (https://phabricator.wikimedia.org/T363344) (owner: Andrew Bogott)
2024-08-07 02:21:44 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1296.mgmt.eqiad.wmnet with reboot policy FORCED
2024-08-07 02:21:59 <logmsgbot> !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1296.mgmt.eqiad.wmnet with reboot policy FORCED
2024-08-07 02:35:45 <wikibugs> ops-eqiad, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T371949#10047267 (phaultfinder)
2024-08-07 02:39:23 <jinxer-wm> FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2024-08-07 02:40:46 <wikibugs> ops-eqiad, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T371949#10047268 (phaultfinder)
2024-08-07 02:45:48 <wikibugs> ops-eqiad, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T371949#10047271 (phaultfinder)
2024-08-07 02:50:48 <wikibugs> ops-eqiad, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T371949#10047272 (phaultfinder)
2024-08-07 02:59:23 <jinxer-wm> FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2024-08-07 03:02:58 <logmsgbot> !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1296.eqiad.wmnet with OS bullseye
2024-08-07 03:03:10 <wikibugs> ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047274 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1296.eqiad.wmnet with OS bullseye...
2024-08-07 03:35:43 <wikibugs> ops-eqiad, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T371949#10047278 (phaultfinder)
2024-08-07 03:40:44 <wikibugs> ops-eqiad, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T371949#10047279 (phaultfinder)
2024-08-07 03:45:46 <wikibugs> ops-eqiad, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T371949#10047280 (phaultfinder)
2024-08-07 03:50:49 <wikibugs> ops-eqiad, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T371949#10047281 (phaultfinder)
2024-08-07 03:55:48 <wikibugs> ops-eqiad, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T371949#10047283 (phaultfinder)
2024-08-07 04:00:49 <wikibugs> ops-eqiad, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T371949#10047284 (phaultfinder)
2024-08-07 04:34:23 <jinxer-wm> FIRING: [4x] SystemdUnitFailed: netbox_report_coherence_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2024-08-07 05:15:52 <wikibugs> ('PS1) ''Giuseppe Lavagetto: haproxy: fallback to global requestctl rules [puppet] - ''https://gerrit.wikimedia.org/r/1060194'
2024-08-07 05:19:09 <wikibugs> ('CR) ''Giuseppe Lavagetto: [C:''+2] haproxy: fallback to global requestctl rules [puppet] - ''https://gerrit.wikimedia.org/r/1060194 (owner: ''Giuseppe Lavagetto)'
2024-08-07 05:23:33 <wikibugs> ('PS1) ''Giuseppe Lavagetto: haproxy: fix text/template [puppet] - ''https://gerrit.wikimedia.org/r/1060195'
2024-08-07 05:23:57 <wikibugs> ('CR) ''Giuseppe Lavagetto: [V:''+2 C:''+2] haproxy: fix text/template [puppet] - ''https://gerrit.wikimedia.org/r/1060195 (owner: ''Giuseppe Lavagetto)'
2024-08-07 05:35:10 <jinxer-wm> FIRING: ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
2024-08-07 05:49:15 <wikibugs> ('PS1) ''Giuseppe Lavagetto: haproxy: change behaviour for requestctl filters [puppet] - ''https://gerrit.wikimedia.org/r/1060198 (https://phabricator.wikimedia.org/T370745)'
2024-08-07 06:00:04 <jouncebot> Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240807T0600)
2024-08-07 06:04:21 <jinxer-wm> FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
2024-08-07 06:05:10 <jinxer-wm> RESOLVED: ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
2024-08-07 06:09:21 <jinxer-wm> RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
2024-08-07 06:24:25 <wikibugs> ('PS1) ''Jelto: gerrit: disable logging for nftables rules [puppet] - ''https://gerrit.wikimedia.org/r/1060334 (https://phabricator.wikimedia.org/T371951)'
2024-08-07 06:26:22 <wikibugs> ('CR) ''Jelto: [V:''+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3570/co" [puppet] - ''https://gerrit.wikimedia.org/r/1060334 (https://phabricator.wikimedia.org/T371951) (owner: ''Jelto)'
2024-08-07 06:28:00 <wikibugs> ('CR) ''Jelto: [V:''+1 C:''+2] gerrit: disable logging for nftables rules [puppet] - ''https://gerrit.wikimedia.org/r/1060334 (https://phabricator.wikimedia.org/T371951) (owner: ''Jelto)'
2024-08-07 06:40:08 <wikibugs> ('CR) ''Fabfur: "Do we need to define this in the http frontend too?" [puppet] - ''https://gerrit.wikimedia.org/r/1060198 (https://phabricator.wikimedia.org/T370745) (owner: ''Giuseppe Lavagetto)'
2024-08-07 06:44:10 <wikibugs> ('CR) ''Giuseppe Lavagetto: "I don't think so, the http frontend just does redirects, adding this would only add unneeded complexity IMHO. We can revisit later." [puppet] - ''https://gerrit.wikimedia.org/r/1060198 (https://phabricator.wikimedia.org/T370745) (owner: ''Giuseppe Lavagetto)'
2024-08-07 06:47:17 <wikibugs> ('CR) ''Fabfur: [C:''+1] haproxy: change behaviour for requestctl filters [puppet] - ''https://gerrit.wikimedia.org/r/1060198 (https://phabricator.wikimedia.org/T370745) (owner: ''Giuseppe Lavagetto)'
2024-08-07 06:48:04 <wikibugs> ('CR) ''Fabfur: [C:''+2] haproxy: remove template switch for benthos extended logging [puppet] - ''https://gerrit.wikimedia.org/r/1059358 (https://phabricator.wikimedia.org/T370741) (owner: ''Fabfur)'
2024-08-07 06:49:21 <wikibugs> ('CR) ''Ayounsi: [C:''+2] Netbox prometheus: replace exporter script with plugin [puppet] - ''https://gerrit.wikimedia.org/r/1060064 (https://phabricator.wikimedia.org/T311052) (owner: ''Ayounsi)'
2024-08-07 06:59:23 <jinxer-wm> FIRING: [2x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2024-08-07 06:59:52 <wikibugs> ('CR) ''David Caro: Add ceph config for cloudcephosd103[5-8] (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1060190 (https://phabricator.wikimedia.org/T363344) (owner: ''Andrew Bogott)'
2024-08-07 07:00:05 <jouncebot> Amir1 and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240807T0700).
2024-08-07 07:00:05 <jouncebot> No Gerrit patches in the queue for this window AFAICS.
2024-08-07 07:00:41 <jinxer-wm> FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2024-08-07 07:01:54 <wikibugs> ('CR) ''Vgutierrez: [C:''-1] haproxy: change behaviour for requestctl filters (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1060198 (https://phabricator.wikimedia.org/T370745) (owner: ''Giuseppe Lavagetto)'
2024-08-07 07:05:41 <jinxer-wm> FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2024-08-07 07:05:52 <wikibugs> ('CR) ''Fabfur: "thanks @slyngshede@wikimedia.org for taking care of this!" [puppet] - ''https://gerrit.wikimedia.org/r/1059371 (https://phabricator.wikimedia.org/T371689) (owner: ''Fabfur)'
2024-08-07 07:08:13 <wikibugs> ('PS1) ''David Caro: ceph.osd: move the new 103[5-8] nodes to the per-rack ip blocks [puppet] - ''https://gerrit.wikimedia.org/r/1060337 (https://phabricator.wikimedia.org/T363344)'
2024-08-07 07:08:43 <wikibugs> ('CR) ''David Caro: "check experimental" [puppet] - ''https://gerrit.wikimedia.org/r/1060337 (https://phabricator.wikimedia.org/T363344) (owner: ''David Caro)'
2024-08-07 07:09:23 <jinxer-wm> FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2024-08-07 07:09:59 <wikibugs> ('Abandoned) ''David Caro: ceph: add new cloudcephosd1035 [puppet] - ''https://gerrit.wikimedia.org/r/1060146 (https://phabricator.wikimedia.org/T363344) (owner: ''David Caro)'
2024-08-07 07:10:41 <jinxer-wm> FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2024-08-07 07:11:57 <wikibugs> ('CR) ''David Caro: [C:''+2] ceph.osd: move the new 103[5-8] nodes to the per-rack ip blocks [puppet] - ''https://gerrit.wikimedia.org/r/1060337 (https://phabricator.wikimedia.org/T363344) (owner: ''David Caro)'
2024-08-07 07:12:16 <wikibugs> ('CR) ''David Caro: [C:''+2] "PCC looks good, no more duplicated ips, and each host has ip on its own rack's block" [puppet] - ''https://gerrit.wikimedia.org/r/1060337 (https://phabricator.wikimedia.org/T363344) (owner: ''David Caro)'
2024-08-07 07:12:39 <wikibugs> ('CR) ''Fabfur: [C:''+1] "ok for me" [puppet] - ''https://gerrit.wikimedia.org/r/1059265 (https://phabricator.wikimedia.org/T371492) (owner: ''Filippo Giunchedi)'
2024-08-07 07:14:24 <jinxer-wm> FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2024-08-07 07:15:34 <wikibugs> ('CR) ''Filippo Giunchedi: [C:''+2] benthos: add ensure support [puppet] - ''https://gerrit.wikimedia.org/r/1059265 (https://phabricator.wikimedia.org/T371492) (owner: ''Filippo Giunchedi)'
2024-08-07 07:15:41 <jinxer-wm> FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2024-08-07 07:19:05 <wikibugs> ('PS2) ''Fabfur: hiera:benthos: partially revert benthos removal [puppet] - ''https://gerrit.wikimedia.org/r/1059355 (https://phabricator.wikimedia.org/T371492)'
2024-08-07 07:20:02 <wikibugs> ('CR) ''Ayounsi: [C:''+2] Netbox: use standard STORAGE_BACKEND/CONFIG keys [puppet] - ''https://gerrit.wikimedia.org/r/983716 (https://phabricator.wikimedia.org/T310717) (owner: ''Ayounsi)'
2024-08-07 07:20:40 <wikibugs> ('CR) ''Filippo Giunchedi: "check experimental" [puppet] - ''https://gerrit.wikimedia.org/r/1059355 (https://phabricator.wikimedia.org/T371492) (owner: ''Fabfur)'
2024-08-07 07:21:30 <wikibugs> 'SRE, ''SRE-Access-Requests: Requesting access to deployment shell access for toyofuku - https://phabricator.wikimedia.org/T371650#10047409 (''SLyngshede-WMF)'
2024-08-07 07:21:37 <wikibugs> ('CR) ''Fabfur: "check experimental" [puppet] - ''https://gerrit.wikimedia.org/r/1059355 (https://phabricator.wikimedia.org/T371492) (owner: ''Fabfur)'
2024-08-07 07:23:27 <wikibugs> ('PS1) ''Slyngshede: data.yaml: Add toyofuku to deployment group. [puppet] - ''https://gerrit.wikimedia.org/r/1060338 (https://phabricator.wikimedia.org/T371650)'
2024-08-07 07:24:23 <jinxer-wm> FIRING: [4x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2024-08-07 07:25:41 <jinxer-wm> FIRING: [4x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2024-08-07 07:28:40 <codders> hey hey. This is probably outside of ops scope, but does anyone here know where wikimedia.de stuff is hosted / runs? I can't seem to connect to anything wikimedia.de (wiki.wikimedia.de, mattermost.wikimedia.de, www.wikimedia.de)
2024-08-07 07:29:26 <wikibugs> ('CR) ''Giuseppe Lavagetto: haproxy: change behaviour for requestctl filters (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1060198 (https://phabricator.wikimedia.org/T370745) (owner: ''Giuseppe Lavagetto)'
2024-08-07 07:32:28 <wikibugs> ('PS1) ''Kevin Bazira: ml-services: use cxserver host header in rec-api [deployment-charts] - ''https://gerrit.wikimedia.org/r/1060377 (https://phabricator.wikimedia.org/T371465)'
2024-08-07 07:33:35 <wikibugs> 'SRE, ''SRE-Access-Requests, ''Patch-For-Review: Requesting access to deployment shell access for toyofuku - https://phabricator.wikimedia.org/T371650#10047442 (''SLyngshede-WMF) Open→In progress p:Triage→High'
2024-08-07 07:33:37 <godog> codders: yes definitely a question for WMDE folks
2024-08-07 07:35:10 <codders> k - thanks!
2024-08-07 07:36:05 <wikibugs> ('CR) ''Ayounsi: [C:''+2] Netbox-hiera: add device role to mgmt_hosts [cookbooks] - ''https://gerrit.wikimedia.org/r/1056880 (https://phabricator.wikimedia.org/T368513) (owner: ''Ayounsi)'
2024-08-07 07:36:31 <wikibugs> ('CR) ''Vgutierrez: [C:''-1] haproxy: change behaviour for requestctl filters (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1060198 (https://phabricator.wikimedia.org/T370745) (owner: ''Giuseppe Lavagetto)'
2024-08-07 07:39:23 <wikibugs> ('CR) ''Fabfur: [C:''+2] hiera:benthos: partially revert benthos removal [puppet] - ''https://gerrit.wikimedia.org/r/1059355 (https://phabricator.wikimedia.org/T371492) (owner: ''Fabfur)'
2024-08-07 07:40:12 <wikibugs> ('Merged) ''jenkins-bot: Netbox-hiera: add device role to mgmt_hosts [cookbooks] - ''https://gerrit.wikimedia.org/r/1056880 (https://phabricator.wikimedia.org/T368513) (owner: ''Ayounsi)'
2024-08-07 07:43:37 <wikibugs> ('PS1) ''Kevin Bazira: ml-services: langid from src dir [deployment-charts] - ''https://gerrit.wikimedia.org/r/1060378 (https://phabricator.wikimedia.org/T369344)'
2024-08-07 07:43:42 <logmsgbot> !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add role to mgmt devices - ayounsi@cumin1002"
2024-08-07 07:44:23 <logmsgbot> !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add role to mgmt devices - ayounsi@cumin1002"
2024-08-07 07:46:17 <wikibugs> ('CR) ''Cathal Mooney: [C:''+2] common: add dcaro user for access to cloudsw [homer/public] - ''https://gerrit.wikimedia.org/r/1060087 (owner: ''David Caro)'
2024-08-07 07:46:46 <wikibugs> ('Merged) ''jenkins-bot: common: add dcaro user for access to cloudsw [homer/public] - ''https://gerrit.wikimedia.org/r/1060087 (owner: ''David Caro)'
2024-08-07 07:49:38 <wikibugs> ('CR) ''Cathal Mooney: [C:''+2] common: add dcaro user for access to cloudsw (''2 comments) [homer/public] - ''https://gerrit.wikimedia.org/r/1060087 (owner: ''David Caro)'
2024-08-07 07:50:58 <wikibugs> ('CR) ''Filippo Giunchedi: [C:''+2] benthos: use fully qualified kafka cluster name [puppet] - ''https://gerrit.wikimedia.org/r/1060070 (owner: ''Filippo Giunchedi)'
2024-08-07 07:51:09 <wikibugs> ('CR) ''Filippo Giunchedi: [C:''+2] webperf: use fully qualified kafka cluster names [puppet] - ''https://gerrit.wikimedia.org/r/1060069 (owner: ''Filippo Giunchedi)'
2024-08-07 07:51:29 <wikibugs> ('PS1) ''Cathal Mooney: Remove taavi user from network devices [homer/public] - ''https://gerrit.wikimedia.org/r/1060379'
2024-08-07 07:51:40 <wikibugs> ('CR) ''Filippo Giunchedi: [C:''+2] pontoon: restore Benthos instances functionality [puppet] - ''https://gerrit.wikimedia.org/r/1060071 (owner: ''Filippo Giunchedi)'
2024-08-07 07:52:07 <wikibugs> ('CR) ''Ayounsi: [C:''+2] Prometheus SSH probe: ignore network devices [puppet] - ''https://gerrit.wikimedia.org/r/1056899 (https://phabricator.wikimedia.org/T368513) (owner: ''Ayounsi)'
2024-08-07 07:53:12 <wikibugs> ('CR) ''Cathal Mooney: [C:''+2] Remove taavi user from network devices [homer/public] - ''https://gerrit.wikimedia.org/r/1060379 (owner: ''Cathal Mooney)'
2024-08-07 07:53:43 <wikibugs> ('Merged) ''jenkins-bot: Remove taavi user from network devices [homer/public] - ''https://gerrit.wikimedia.org/r/1060379 (owner: ''Cathal Mooney)'
2024-08-07 07:56:01 <wikibugs> ('PS1) ''Fabfur: hiera:benthos: remove Benthos from ulsfo using benthos module [puppet] - ''https://gerrit.wikimedia.org/r/1060380 (https://phabricator.wikimedia.org/T371492)'
2024-08-07 07:56:47 <wikibugs> ('CR) ''Fabfur: "check experimental" [puppet] - ''https://gerrit.wikimedia.org/r/1060380 (https://phabricator.wikimedia.org/T371492) (owner: ''Fabfur)'
2024-08-07 08:00:05 <jouncebot> jnuche and brennen: That opportune time for a MediaWiki train - Utc-0+Utc-7 Version deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240807T0800).
2024-08-07 08:00:43 <wikibugs> ('CR) ''Filippo Giunchedi: [C:''+1] hiera:benthos: remove Benthos from ulsfo using benthos module [puppet] - ''https://gerrit.wikimedia.org/r/1060380 (https://phabricator.wikimedia.org/T371492) (owner: ''Fabfur)'
2024-08-07 08:00:58 <jnuche> hi, I'll be deploying the train in a few minutes
2024-08-07 08:01:36 <wikibugs> ('PS1) ''David Caro: cloudceph.osd: remove 1036 as we are not adding it yet [puppet] - ''https://gerrit.wikimedia.org/r/1060381 (https://phabricator.wikimedia.org/T363344)'
2024-08-07 08:02:39 <wikibugs> ('PS1) ''Ayounsi: Revert "Prometheus SSH probe: ignore network devices" [puppet] - ''https://gerrit.wikimedia.org/r/1060382'
2024-08-07 08:04:35 <wikibugs> ('CR) ''David Caro: [V:''+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3572/co" [puppet] - ''https://gerrit.wikimedia.org/r/1060381 (https://phabricator.wikimedia.org/T363344) (owner: ''David Caro)'
2024-08-07 08:05:39 <wikibugs> ('PS1) ''TrainBranchBot: group1 to 1.43.0-wmf.17 [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1060383 (https://phabricator.wikimedia.org/T366962)'
2024-08-07 08:05:41 <wikibugs> ('CR) ''TrainBranchBot: [C:''+2] group1 to 1.43.0-wmf.17 [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1060383 (https://phabricator.wikimedia.org/T366962) (owner: ''TrainBranchBot)'
2024-08-07 08:06:19 <wikibugs> ('Merged) ''jenkins-bot: group1 to 1.43.0-wmf.17 [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1060383 (https://phabricator.wikimedia.org/T366962) (owner: ''TrainBranchBot)'
2024-08-07 08:07:46 <wikibugs> ('CR) ''David Caro: [V:''+1] "pcc" [puppet] - ''https://gerrit.wikimedia.org/r/1060381 (https://phabricator.wikimedia.org/T363344) (owner: ''David Caro)'
2024-08-07 08:09:00 <wikibugs> ('PS2) ''Giuseppe Lavagetto: haproxy: change behaviour for requestctl filters [puppet] - ''https://gerrit.wikimedia.org/r/1060198 (https://phabricator.wikimedia.org/T370745)'
2024-08-07 08:09:07 <wikibugs> ('CR) ''David Caro: [C:''+2] "pcc looks good" [puppet] - ''https://gerrit.wikimedia.org/r/1060381 (https://phabricator.wikimedia.org/T363344) (owner: ''David Caro)'
2024-08-07 08:09:40 <wikibugs> ('PS1) ''Ayounsi: Add role to type Netbox::Device::Location::BareMetal [puppet] - ''https://gerrit.wikimedia.org/r/1060385 (https://phabricator.wikimedia.org/T368513)'
2024-08-07 08:10:03 <wikibugs> ('CR) ''Giuseppe Lavagetto: haproxy: change behaviour for requestctl filters (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1060198 (https://phabricator.wikimedia.org/T370745) (owner: ''Giuseppe Lavagetto)'
2024-08-07 08:10:28 <wikibugs> ('PS2) ''Ayounsi: Add role to type Netbox::Device::Location::BareMetal [puppet] - ''https://gerrit.wikimedia.org/r/1060385 (https://phabricator.wikimedia.org/T368513)'
2024-08-07 08:10:35 <wikibugs> ('CR) ''Ayounsi: "check experimental" [puppet] - ''https://gerrit.wikimedia.org/r/1060385 (https://phabricator.wikimedia.org/T368513) (owner: ''Ayounsi)'
2024-08-07 08:11:03 <wikibugs> ('CR) ''Filippo Giunchedi: [C:''+1] Add role to type Netbox::Device::Location::BareMetal [puppet] - ''https://gerrit.wikimedia.org/r/1060385 (https://phabricator.wikimedia.org/T368513) (owner: ''Ayounsi)'
2024-08-07 08:12:45 <wikibugs> ('CR) ''Giuseppe Lavagetto: [V:''+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3574/co" [puppet] - ''https://gerrit.wikimedia.org/r/1060198 (https://phabricator.wikimedia.org/T370745) (owner: ''Giuseppe Lavagetto)'
2024-08-07 08:13:32 <wikibugs> ('CR) ''Vgutierrez: haproxy: change behaviour for requestctl filters (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1060198 (https://phabricator.wikimedia.org/T370745) (owner: ''Giuseppe Lavagetto)'
2024-08-07 08:13:44 <wikibugs> ('CR) ''Giuseppe Lavagetto: [V:''+1] "PCC output: https://puppet-compiler.wmflabs.org/output/1060198/3574/cp4044.ulsfo.wmnet/index.html" [puppet] - ''https://gerrit.wikimedia.org/r/1060198 (https://phabricator.wikimedia.org/T370745) (owner: ''Giuseppe Lavagetto)'
2024-08-07 08:16:12 <wikibugs> ('PS9) ''Brouberol: cloudnative-pg-cluster: define a chart allowing users to provision a PG cluster [deployment-charts] - ''https://gerrit.wikimedia.org/r/1060376 (https://phabricator.wikimedia.org/T368240)'
2024-08-07 08:17:19 <wikibugs> ('PS3) ''Hashar: cumin: set git::clone umask to match requested file mode [puppet] - ''https://gerrit.wikimedia.org/r/1056985 (https://phabricator.wikimedia.org/T338277)'
2024-08-07 08:17:35 <wikibugs> ('CR) ''Ayounsi: [C:''+2] Add role to type Netbox::Device::Location::BareMetal [puppet] - ''https://gerrit.wikimedia.org/r/1060385 (https://phabricator.wikimedia.org/T368513) (owner: ''Ayounsi)'
2024-08-07 08:18:18 <logmsgbot> !log jnuche@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.43.0-wmf.17 refs T366962
2024-08-07 08:18:21 <stashbot> T366962: 1.43.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T366962
2024-08-07 08:18:45 <wikibugs> ('CR) ''Hashar: "I have rebased by mistake but there is no other change :)" [puppet] - ''https://gerrit.wikimedia.org/r/1056985 (https://phabricator.wikimedia.org/T338277) (owner: ''Hashar)'
2024-08-07 08:19:37 <wikibugs> ('PS1) ''Ayounsi: Revert "Add role to type Netbox::Device::Location::BareMetal" [puppet] - ''https://gerrit.wikimedia.org/r/1060386'
2024-08-07 08:20:19 <wikibugs> ('CR) ''Ayounsi: [C:''+2] Revert "Add role to type Netbox::Device::Location::BareMetal" [puppet] - ''https://gerrit.wikimedia.org/r/1060386 (owner: ''Ayounsi)'
2024-08-07 08:20:26 <wikibugs> ('CR) ''Ayounsi: [C:''+2] Revert "Prometheus SSH probe: ignore network devices" [puppet] - ''https://gerrit.wikimedia.org/r/1060382 (owner: ''Ayounsi)'
2024-08-07 08:20:45 <jinxer-wm> FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
2024-08-07 08:21:23 <wikibugs> ('PS23) ''Effie Mouzeli: cronjobs : update modules to job 2.0.0 [deployment-charts] - ''https://gerrit.wikimedia.org/r/1049573 (https://phabricator.wikimedia.org/T356885)'
2024-08-07 08:22:48 <wikibugs> ('CR) ''Giuseppe Lavagetto: [V:''+1] haproxy: change behaviour for requestctl filters (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1060198 (https://phabricator.wikimedia.org/T370745) (owner: ''Giuseppe Lavagetto)'
2024-08-07 08:25:10 <wikibugs> ('PS3) ''Giuseppe Lavagetto: haproxy: change behaviour for requestctl filters [puppet] - ''https://gerrit.wikimedia.org/r/1060198 (https://phabricator.wikimedia.org/T370745)'
2024-08-07 08:25:25 <wikibugs> ('PS10) ''Brouberol: cloudnative-pg-cluster: define a chart allowing users to provision a PG cluster [deployment-charts] - ''https://gerrit.wikimedia.org/r/1060376 (https://phabricator.wikimedia.org/T368240)'
2024-08-07 08:26:09 <wikibugs> ('CR) ''Giuseppe Lavagetto: [V:''+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3575/co" [puppet] - ''https://gerrit.wikimedia.org/r/1060198 (https://phabricator.wikimedia.org/T370745) (owner: ''Giuseppe Lavagetto)'
2024-08-07 08:31:26 <wikibugs> ('PS1) ''Ayounsi: Prometheus SSH probe: ignore network devices - try 2 [puppet] - ''https://gerrit.wikimedia.org/r/1060388 (https://phabricator.wikimedia.org/T368513)'
2024-08-07 08:31:47 <elukey> !log openjdk-11 upgrades for bullseye rolled out to prod
2024-08-07 08:31:52 <wikibugs> ('CR) ''CI reject: [V:''-1] Prometheus SSH probe: ignore network devices - try 2 [puppet] - ''https://gerrit.wikimedia.org/r/1060388 (https://phabricator.wikimedia.org/T368513) (owner: ''Ayounsi)'
2024-08-07 08:33:55 <wikibugs> ('CR) ''Vgutierrez: [C:''+1] haproxy: change behaviour for requestctl filters [puppet] - ''https://gerrit.wikimedia.org/r/1060198 (https://phabricator.wikimedia.org/T370745) (owner: ''Giuseppe Lavagetto)'
2024-08-07 08:34:23 <jinxer-wm> FIRING: [4x] SystemdUnitFailed: netbox_report_coherence_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2024-08-07 08:34:35 <logmsgbot> !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T367856)', diff saved to https://phabricator.wikimedia.org/P67237 and previous config saved to /var/cache/conftool/dbconfig/20240807-083434-marostegui.json
2024-08-07 08:35:11 <wikibugs> ('CR) ''Elukey: [C:''+2] cumin: set git::clone umask to match requested file mode [puppet] - ''https://gerrit.wikimedia.org/r/1056985 (https://phabricator.wikimedia.org/T338277) (owner: ''Hashar)'
2024-08-07 08:38:47 <wikibugs> ('PS27) ''Elukey: git: remove umask from git::clone [puppet] - ''https://gerrit.wikimedia.org/r/927986 (https://phabricator.wikimedia.org/T338277) (owner: ''Hashar)'
2024-08-07 08:39:23 <jinxer-wm> FIRING: [2x] ProbeDown: Service debmonitor1003:443 has failed probes (http_debmonitor_client_download_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2024-08-07 08:41:26 <wikibugs> 'SRE-tools, ''Infrastructure-Foundations: Allow debmonitor to store the Debian version-id in the OS field - https://phabricator.wikimedia.org/T368744#10047547 (''elukey) Rolled out the change to the hadoop cluster, this is the only error that I got: ` [2024-08-07T08:38:59] Unable to update host 'an-worker110...'
2024-08-07 08:41:28 <wikibugs> ('PS2) ''Ayounsi: Prometheus SSH probe: ignore network devices - try 2 [puppet] - ''https://gerrit.wikimedia.org/r/1060388 (https://phabricator.wikimedia.org/T368513)'
2024-08-07 08:42:01 <wikibugs> ('PS3) ''Ayounsi: Prometheus SSH probe: ignore network devices - try 2 [puppet] - ''https://gerrit.wikimedia.org/r/1060388 (https://phabricator.wikimedia.org/T368513)'
2024-08-07 08:42:20 <wikibugs> ('CR) ''Ayounsi: "check experimental" [puppet] - ''https://gerrit.wikimedia.org/r/1060388 (https://phabricator.wikimedia.org/T368513) (owner: ''Ayounsi)'
2024-08-07 08:44:05 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops, ''cloud-services-team (Hardware), ''Patch-For-Review: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#10047553 (''dcaro) cloudcephosd1035 has one drive that wrongly assigned as 'os raid': ` sdb...'
2024-08-07 08:45:28 <wikibugs> ('CR) ''Giuseppe Lavagetto: [V:''+1 C:''+2] haproxy: change behaviour for requestctl filters [puppet] - ''https://gerrit.wikimedia.org/r/1060198 (https://phabricator.wikimedia.org/T370745) (owner: ''Giuseppe Lavagetto)'
2024-08-07 08:46:22 <wikibugs> ('PS1) ''Ayounsi: Revert "Netbox-hiera: add device role to mgmt_hosts" [cookbooks] - ''https://gerrit.wikimedia.org/r/1060391'
2024-08-07 08:46:52 <wikibugs> ('CR) ''Elukey: Netbox script proxy: set to absent where possible (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1060074 (https://phabricator.wikimedia.org/T311052) (owner: ''Ayounsi)'
2024-08-07 08:47:57 <wikibugs> ('CR) ''Filippo Giunchedi: [C:''+1] Revert "Netbox-hiera: add device role to mgmt_hosts" [cookbooks] - ''https://gerrit.wikimedia.org/r/1060391 (owner: ''Ayounsi)'
2024-08-07 08:49:23 <jinxer-wm> RESOLVED: [2x] ProbeDown: Service debmonitor1003:443 has failed probes (http_debmonitor_client_download_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2024-08-07 08:49:42 <logmsgbot> !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P67238 and previous config saved to /var/cache/conftool/dbconfig/20240807-084942-marostegui.json
2024-08-07 08:50:11 <wikibugs> ('CR) ''Elukey: [C:''+1] "Left a couple of comments related to the #TODOs, but the rest looks good! Feel free to merge anytime" [software/netbox-extras] - ''https://gerrit.wikimedia.org/r/1059250 (owner: ''Ayounsi)'
2024-08-07 08:51:00 <wikibugs> 'SRE, ''collaboration-services, ''Continuous-Integration-Infrastructure, ''Jenkins, ''Release-Engineering-Team (Seen): Upgrade CI Jenkins ssh key to ecdsa - https://phabricator.wikimedia.org/T177826#10047576 (''hashar)'
2024-08-07 08:51:19 <wikibugs> ('CR) ''Elukey: [C:''+1] ImportPuppetDB: Run Validate on VMs too [software/netbox-extras] - ''https://gerrit.wikimedia.org/r/1059349 (owner: ''Ayounsi)'
2024-08-07 08:51:39 <wikibugs> ('CR) ''Ayounsi: [C:''+2] Revert "Netbox-hiera: add device role to mgmt_hosts" [cookbooks] - ''https://gerrit.wikimedia.org/r/1060391 (owner: ''Ayounsi)'
2024-08-07 08:53:18 <wikibugs> 'SRE, ''collaboration-services, ''Continuous-Integration-Infrastructure, ''Jenkins, ''Release-Engineering-Team (Seen): Upgrade CI Jenkins ssh key to ecdsa - https://phabricator.wikimedia.org/T177826#10047580 (''hashar) After discussing with Simon (`@SLyngshede-WMF`), the `jenkins-deploy` account hits so...'
2024-08-07 08:53:50 <wikibugs> ('CR) ''Ayounsi: "check experimental" [puppet] - ''https://gerrit.wikimedia.org/r/1060388 (https://phabricator.wikimedia.org/T368513) (owner: ''Ayounsi)'
2024-08-07 08:54:32 <elukey> !log upgrade debmonitor-client to 0.4.0 fleetwide - T368744
2024-08-07 08:55:34 <wikibugs> ('CR) ''Effie Mouzeli: cronjobs : update modules to job 2.0.0 (''5 comments) [deployment-charts] - ''https://gerrit.wikimedia.org/r/1049573 (https://phabricator.wikimedia.org/T356885) (owner: ''Effie Mouzeli)'
2024-08-07 08:55:37 <wikibugs> ('PS1) ''Btullis: Add a record of the kerberos enablement of ifrahkh [puppet] - ''https://gerrit.wikimedia.org/r/1060394 (https://phabricator.wikimedia.org/T371894)'
2024-08-07 08:55:45 <jinxer-wm> RESOLVED: [2x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
2024-08-07 08:57:08 <wikibugs> ('Merged) ''jenkins-bot: Revert "Netbox-hiera: add device role to mgmt_hosts" [cookbooks] - ''https://gerrit.wikimedia.org/r/1060391 (owner: ''Ayounsi)'
2024-08-07 08:58:54 <elukey> the debmonitor1003 failures are surely due to me rolling out the new debmonitor-client
2024-08-07 08:59:14 <logmsgbot> !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "rollback adding role to mgmt devices - ayounsi@cumin1002"
2024-08-07 08:59:14 <elukey> it is updating a lot of things in the db (first time only that runs) and the server may suffer a bit
2024-08-07 08:59:23 <jinxer-wm> FIRING: [2x] ProbeDown: Service debmonitor1003:443 has failed probes (http_debmonitor_client_download_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2024-08-07 08:59:40 <logmsgbot> !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "rollback adding role to mgmt devices - ayounsi@cumin1002"
2024-08-07 09:00:41 <jinxer-wm> RESOLVED: [2x] ProbeDown: Service debmonitor1003:443 has failed probes (http_debmonitor_client_download_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2024-08-07 09:02:32 <wikibugs> ('CR) ''Fabfur: [C:''+2] hiera:benthos: remove Benthos from ulsfo using benthos module [puppet] - ''https://gerrit.wikimedia.org/r/1060380 (https://phabricator.wikimedia.org/T371492) (owner: ''Fabfur)'
2024-08-07 09:03:26 <wikibugs> ('PS1) ''Slyngshede: P:idp More precise base_dn for user lookup [puppet] - ''https://gerrit.wikimedia.org/r/1060396 (https://phabricator.wikimedia.org/T371930)'
2024-08-07 09:04:49 <logmsgbot> !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P67239 and previous config saved to /var/cache/conftool/dbconfig/20240807-090449-marostegui.json
2024-08-07 09:12:13 <wikibugs> ('PS2) ''Slyngshede: P:idp More precise base_dn for user lookup [puppet] - ''https://gerrit.wikimedia.org/r/1060396 (https://phabricator.wikimedia.org/T371930)'
2024-08-07 09:13:10 <wikibugs> ('CR) ''Slyngshede: [V:''+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3577/co" [puppet] - ''https://gerrit.wikimedia.org/r/1060396 (https://phabricator.wikimedia.org/T371930) (owner: ''Slyngshede)'
2024-08-07 09:19:56 <logmsgbot> !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T367856)', diff saved to https://phabricator.wikimedia.org/P67240 and previous config saved to /var/cache/conftool/dbconfig/20240807-091956-marostegui.json
2024-08-07 09:19:58 <logmsgbot> !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 6:00:00 on db1219.eqiad.wmnet with reason: Maintenance
2024-08-07 09:20:11 <logmsgbot> !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 6:00:00 on db1219.eqiad.wmnet with reason: Maintenance
2024-08-07 09:20:18 <logmsgbot> !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1219 (T367856)', diff saved to https://phabricator.wikimedia.org/P67241 and previous config saved to /var/cache/conftool/dbconfig/20240807-092018-marostegui.json
2024-08-07 09:20:32 <wikibugs> ('CR) ''Slyngshede: [C:''+1] "LGTM" [puppet] - ''https://gerrit.wikimedia.org/r/1060394 (https://phabricator.wikimedia.org/T371894) (owner: ''Btullis)'
2024-08-07 09:20:43 <wikibugs> ('CR) ''Brouberol: [C:''+1] "Access was authorized in https://phabricator.wikimedia.org/T366558" [puppet] - ''https://gerrit.wikimedia.org/r/1060394 (https://phabricator.wikimedia.org/T371894) (owner: ''Btullis)'
2024-08-07 09:21:07 <wikibugs> ('PS11) ''Effie Mouzeli: (DNM WIP) wikitech: de-wikitech mediawiki-config [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1059339 (https://phabricator.wikimedia.org/T371537)'
2024-08-07 09:25:15 <wikibugs> ('CR) ''Stevemunene: [C:''+1] Add a record of the kerberos enablement of ifrahkh [puppet] - ''https://gerrit.wikimedia.org/r/1060394 (https://phabricator.wikimedia.org/T371894) (owner: ''Btullis)'
2024-08-07 09:27:56 <wikibugs> ('CR) ''Btullis: [V:''+2 C:''+2] Update the beta cluster scap targets for dumps [dumps/scap] - ''https://gerrit.wikimedia.org/r/1059891 (https://phabricator.wikimedia.org/T370465) (owner: ''Btullis)'
2024-08-07 09:28:13 <wikibugs> ('CR) ''Btullis: [C:''+2] Update the mediawiki-installation dsh group with new beta snapshot host [puppet] - ''https://gerrit.wikimedia.org/r/1059893 (https://phabricator.wikimedia.org/T370465) (owner: ''Btullis)'
2024-08-07 09:29:23 <jinxer-wm> FIRING: [2x] ProbeDown: Service debmonitor1003:443 has failed probes (http_debmonitor_client_download_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2024-08-07 09:32:33 <wikibugs> ('PS12) ''Effie Mouzeli: (DNM WIP) wikitech: de-wikitech mediawiki-config [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1059339 (https://phabricator.wikimedia.org/T371537)'
2024-08-07 09:33:01 <wikibugs> ('PS13) ''Effie Mouzeli: (DNM WIP) wikitech: de-wikitech mediawiki-config [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1059339 (https://phabricator.wikimedia.org/T371537)'
2024-08-07 09:33:52 <wikibugs> ('PS1) ''Klausman: hiera/manifest/partman: Add configuration for new ML hosts in codfw [puppet] - ''https://gerrit.wikimedia.org/r/1060399 (https://phabricator.wikimedia.org/T366521)'
2024-08-07 09:33:52 <wikibugs> ('CR) ''Klausman: "Feel free to redirect review to someone else." [puppet] - ''https://gerrit.wikimedia.org/r/1060399 (https://phabricator.wikimedia.org/T366521) (owner: ''Klausman)'
2024-08-07 09:34:23 <jinxer-wm> RESOLVED: [2x] ProbeDown: Service debmonitor1003:443 has failed probes (http_debmonitor_client_download_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2024-08-07 09:36:36 <wikibugs> ('PS11) ''Brouberol: cloudnative-pg-cluster: define a chart allowing users to provision a PG cluster [deployment-charts] - ''https://gerrit.wikimedia.org/r/1060376 (https://phabricator.wikimedia.org/T368240)'
2024-08-07 09:40:41 <jinxer-wm> FIRING: [2x] ProbeDown: Service debmonitor1003:443 has failed probes (http_debmonitor_client_download_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2024-08-07 09:41:54 <wikibugs> ('CR) ''Btullis: hiera/manifest/partman: Add configuration for new ML hosts in codfw (''2 comments) [puppet] - ''https://gerrit.wikimedia.org/r/1060399 (https://phabricator.wikimedia.org/T366521) (owner: ''Klausman)'
2024-08-07 09:43:32 <logmsgbot> !log ayounsi@cumin1002 START - Cookbook sre.hosts.decommission for hosts netbox2002.codfw.wmnet
2024-08-07 09:44:23 <jinxer-wm> RESOLVED: [2x] ProbeDown: Service debmonitor1003:443 has failed probes (http_debmonitor_client_download_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2024-08-07 09:44:50 <wikibugs> ('PS1) ''David Caro: parted: add a recipe to autouse the two smaller disks [puppet] - ''https://gerrit.wikimedia.org/r/1060402 (https://phabricator.wikimedia.org/T363344)'
2024-08-07 09:45:22 <wikibugs> ('CR) ''CI reject: [V:''-1] parted: add a recipe to autouse the two smaller disks [puppet] - ''https://gerrit.wikimedia.org/r/1060402 (https://phabricator.wikimedia.org/T363344) (owner: ''David Caro)'
2024-08-07 09:45:41 <jinxer-wm> FIRING: [2x] ProbeDown: Service debmonitor1003:443 has failed probes (http_debmonitor_client_download_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2024-08-07 09:46:04 <logmsgbot> !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker2036.mgmt.codfw.wmnet with reboot policy GRACEFUL
2024-08-07 09:46:47 <wikibugs> ('PS1) ''Ayounsi: Remove Netbox 3 from MariaDB ferm ACLs [puppet] - ''https://gerrit.wikimedia.org/r/1060403 (https://phabricator.wikimedia.org/T371957)'
2024-08-07 09:46:54 <wikibugs> ('PS2) ''Klausman: hiera/manifest/partman: Add configuration for new ML hosts in codfw [puppet] - ''https://gerrit.wikimedia.org/r/1060399 (https://phabricator.wikimedia.org/T366521)'
2024-08-07 09:48:02 <wikibugs> ('CR) ''Klausman: hiera/manifest/partman: Add configuration for new ML hosts in codfw (''2 comments) [puppet] - ''https://gerrit.wikimedia.org/r/1060399 (https://phabricator.wikimedia.org/T366521) (owner: ''Klausman)'
2024-08-07 09:49:16 <logmsgbot> !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2036.mgmt.codfw.wmnet with reboot policy GRACEFUL
2024-08-07 09:53:05 <wikibugs> ('CR) ''Elukey: [C:''+1] Remove Netbox 3 from MariaDB ferm ACLs [puppet] - ''https://gerrit.wikimedia.org/r/1060403 (https://phabricator.wikimedia.org/T371957) (owner: ''Ayounsi)'
2024-08-07 09:53:20 <wikibugs> ('CR) ''Ayounsi: [C:''+2] Remove Netbox 3 from MariaDB ferm ACLs [puppet] - ''https://gerrit.wikimedia.org/r/1060403 (https://phabricator.wikimedia.org/T371957) (owner: ''Ayounsi)'
2024-08-07 09:54:10 <logmsgbot> !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker2037.mgmt.codfw.wmnet with reboot policy GRACEFUL
2024-08-07 09:55:18 <wikibugs> ('PS1) ''Giuseppe Lavagetto: haproxy: make indentation from go template more readable [puppet] - ''https://gerrit.wikimedia.org/r/1060406'
2024-08-07 09:57:03 <logmsgbot> !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2037.mgmt.codfw.wmnet with reboot policy GRACEFUL
2024-08-07 09:57:51 <wikibugs> ('CR) ''Btullis: [C:''+2] Add a record of the kerberos enablement of ifrahkh [puppet] - ''https://gerrit.wikimedia.org/r/1060394 (https://phabricator.wikimedia.org/T371894) (owner: ''Btullis)'
2024-08-07 09:58:07 <wikibugs> ('CR) ''Giuseppe Lavagetto: [V:''+2 C:''+2] haproxy: make indentation from go template more readable [puppet] - ''https://gerrit.wikimedia.org/r/1060406 (owner: ''Giuseppe Lavagetto)'
2024-08-07 09:58:46 <wikibugs> ('PS2) ''David Caro: partman: add a recipe for using the smallest 2 drives for cloudceph [puppet] - ''https://gerrit.wikimedia.org/r/1060402 (https://phabricator.wikimedia.org/T363344)'
2024-08-07 09:59:59 <logmsgbot> !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox
2024-08-07 10:00:05 <jouncebot> Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240807T1000)
2024-08-07 10:00:26 <wikibugs> ('PS3) ''Klausman: hiera/manifest/partman: Add configuration for new ML hosts in codfw [puppet] - ''https://gerrit.wikimedia.org/r/1060399 (https://phabricator.wikimedia.org/T366521)'
2024-08-07 10:05:10 <logmsgbot> !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netbox2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1002"
2024-08-07 10:05:17 <logmsgbot> !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker2038.mgmt.codfw.wmnet with reboot policy GRACEFUL
2024-08-07 10:05:41 <jinxer-wm> FIRING: [3x] JobUnavailable: Reduced availability for job benthos in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2024-08-07 10:06:34 <wikibugs> 'SRE, ''LDAP-Access-Requests: Grant Access to wmf for Arthur taylor - https://phabricator.wikimedia.org/T371958 (''ArthurTaylor) ''NEW'
2024-08-07 10:06:53 <logmsgbot> !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netbox2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1002"
2024-08-07 10:06:53 <logmsgbot> !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
2024-08-07 10:06:53 <logmsgbot> !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netbox2002.codfw.wmnet
2024-08-07 10:07:41 <wikibugs> ('CR) ''Hashar: "check experimental" [puppet] - ''https://gerrit.wikimedia.org/r/927986 (https://phabricator.wikimedia.org/T338277) (owner: ''Hashar)'
2024-08-07 10:07:47 <_joe_> jouncebot: now
2024-08-07 10:07:47 <jouncebot> For the next 0 hour(s) and 52 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240807T1000)
2024-08-07 10:08:12 <logmsgbot> !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2038.mgmt.codfw.wmnet with reboot policy GRACEFUL
2024-08-07 10:09:11 <wikibugs> ('CR) ''Btullis: [C:''+1] "Looks good to me." [puppet] - ''https://gerrit.wikimedia.org/r/1060399 (https://phabricator.wikimedia.org/T366521) (owner: ''Klausman)'
2024-08-07 10:09:13 <logmsgbot> !log dcausse@deploy1003 Started deploy [airflow-dags/search@5569f85]: search: bump rdf artifact to 0.3.146
2024-08-07 10:09:34 <logmsgbot> !log dcausse@deploy1003 Finished deploy [airflow-dags/search@5569f85]: search: bump rdf artifact to 0.3.146 (duration: 00m 21s)
2024-08-07 10:09:48 <wikibugs> ('CR) ''Klausman: [C:''+2] hiera/manifest/partman: Add configuration for new ML hosts in codfw [puppet] - ''https://gerrit.wikimedia.org/r/1060399 (https://phabricator.wikimedia.org/T366521) (owner: ''Klausman)'
2024-08-07 10:11:13 <logmsgbot> !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker2039.mgmt.codfw.wmnet with reboot policy GRACEFUL
2024-08-07 10:11:55 <logmsgbot> !log ayounsi@cumin1002 START - Cookbook sre.hosts.decommission for hosts netbox1002.eqiad.wmnet
2024-08-07 10:11:59 <wikibugs> ('PS1) ''Filippo Giunchedi: mw-jobrunner: bump limit/request for statsd-exporter [deployment-charts] - ''https://gerrit.wikimedia.org/r/1060411 (https://phabricator.wikimedia.org/T371885)'
2024-08-07 10:12:06 <logmsgbot> !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2039.mgmt.codfw.wmnet with reboot policy GRACEFUL
2024-08-07 10:13:33 <wikibugs> ('PS1) ''Ayounsi: Remove netbox 3 references [puppet] - ''https://gerrit.wikimedia.org/r/1060412 (https://phabricator.wikimedia.org/T371957)'
2024-08-07 10:14:23 <jinxer-wm> RESOLVED: [2x] ProbeDown: Service debmonitor1003:443 has failed probes (http_debmonitor_client_download_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2024-08-07 10:14:53 <wikibugs> ('PS3) ''David Caro: partman: add a recipe for using the smallest 2 drives for cloudceph [puppet] - ''https://gerrit.wikimedia.org/r/1060402 (https://phabricator.wikimedia.org/T363344)'
2024-08-07 10:15:19 <wikibugs> ('CR) ''Giuseppe Lavagetto: [C:''+1] mw-jobrunner: bump limit/request for statsd-exporter [deployment-charts] - ''https://gerrit.wikimedia.org/r/1060411 (https://phabricator.wikimedia.org/T371885) (owner: ''Filippo Giunchedi)'
2024-08-07 10:17:34 <wikibugs> ('CR) ''Ayounsi: "check experimental" [puppet] - ''https://gerrit.wikimedia.org/r/1060412 (https://phabricator.wikimedia.org/T371957) (owner: ''Ayounsi)'
2024-08-07 10:18:25 <logmsgbot> !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox
2024-08-07 10:19:23 <jinxer-wm> FIRING: [3x] JobUnavailable: Reduced availability for job benthos in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2024-08-07 10:19:54 <wikibugs> ('CR) ''Btullis: [C:''+1] "Looks good to me." [puppet] - ''https://gerrit.wikimedia.org/r/1060402 (https://phabricator.wikimedia.org/T363344) (owner: ''David Caro)'
2024-08-07 10:20:20 <wikibugs> ('PS2) ''Kosta Harlan: AbuseFilter: Enable showcaptcha consequence everywhere [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1056146 (https://phabricator.wikimedia.org/T20110)'
2024-08-07 10:20:32 <wikibugs> ('CR) ''Dreamy Jazz: [C:''+1] AbuseFilter: Enable showcaptcha consequence everywhere [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1056146 (https://phabricator.wikimedia.org/T20110) (owner: ''Kosta Harlan)'
2024-08-07 10:20:41 <jinxer-wm> FIRING: [3x] JobUnavailable: Reduced availability for job benthos in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2024-08-07 10:22:46 <logmsgbot> !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netbox1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1002"
2024-08-07 10:22:49 <Dreamy_Jazz> jouncebot: nowandnext
2024-08-07 10:22:50 <jouncebot> For the next 0 hour(s) and 37 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240807T1000)
2024-08-07 10:22:50 <jouncebot> In 0 hour(s) and 37 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240807T1100)
2024-08-07 10:23:21 <Dreamy_Jazz> Anyone mind if I deploy now?
2024-08-07 10:24:11 <wikibugs> ('PS4) ''David Caro: partman: add a recipe for using the smallest 2 drives for cloudceph [puppet] - ''https://gerrit.wikimedia.org/r/1060402 (https://phabricator.wikimedia.org/T363344)'
2024-08-07 10:24:12 <wikibugs> ('PS1) ''Effie Mouzeli: mw-mcrouter: balance resources [deployment-charts] - ''https://gerrit.wikimedia.org/r/1060413'
2024-08-07 10:24:23 <jinxer-wm> FIRING: [3x] JobUnavailable: Reduced availability for job benthos in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2024-08-07 10:24:35 <logmsgbot> !log elukey@cumin1002 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-main-codfw cluster: Roll restart of jvm daemons.
2024-08-07 10:24:36 <wikibugs> ('CR) ''David Caro: "Tested on cloudcephosd1035, generates this:" [puppet] - ''https://gerrit.wikimedia.org/r/1060402 (https://phabricator.wikimedia.org/T363344) (owner: ''David Caro)'
2024-08-07 10:24:37 <wikibugs> ('PS1) ''Ayounsi: Remove "netbox4" upgrade flag [puppet] - ''https://gerrit.wikimedia.org/r/1060414 (https://phabricator.wikimedia.org/T371957)'
2024-08-07 10:24:48 <effie> Dreamy_Jazz: lol you and I have a history of bumping into each other :p
2024-08-07 10:24:57 <effie> I want to attempt to roll out a mcrouter change
2024-08-07 10:24:59 <wikibugs> ('CR) ''Ayounsi: "check experimental" [puppet] - ''https://gerrit.wikimedia.org/r/1060414 (https://phabricator.wikimedia.org/T371957) (owner: ''Ayounsi)'
2024-08-07 10:25:10 <Dreamy_Jazz> Mine isn't particularly urgent, but would like to deploy today
2024-08-07 10:25:32 <logmsgbot> !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netbox1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1002"
2024-08-07 10:25:32 <logmsgbot> !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
2024-08-07 10:25:32 <effie> let's see how mine will go
2024-08-07 10:25:33 <logmsgbot> !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts netbox1002.eqiad.wmnet
2024-08-07 10:25:41 <wikibugs> ('PS1) ''C. Scott Ananian: Turn on Parsoid support for Kartographer on Wikivoyage [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1060415 (https://phabricator.wikimedia.org/T371823)'
2024-08-07 10:25:50 <Dreamy_Jazz> So you'll want to do that first?
2024-08-07 10:26:37 <wikibugs> ('CR) ''ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 07 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1060415 (https://phabricator.wikimedia.org/T371823) (owner: ''C. Scott Ananian)'
2024-08-07 10:27:15 <wikibugs> ('CR) ''ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 07 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1056146 (https://phabricator.wikimedia.org/T20110) (owner: ''Kosta Harlan)'
2024-08-07 10:27:27 <Dreamy_Jazz> I'll schedule it in to the backport window in a few hours :)
2024-08-07 10:27:41 <logmsgbot> !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove netbox1002 - ayounsi@cumin1002"
2024-08-07 10:27:45 <logmsgbot> !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove netbox1002 - ayounsi@cumin1002"
2024-08-07 10:28:02 <effie> Dreamy_Jazz: yes please, it will take a while though
2024-08-07 10:28:44 <logmsgbot> !log ayounsi@cumin1002 START - Cookbook sre.hosts.decommission for hosts netboxdb1002.eqiad.wmnet
2024-08-07 10:30:56 <logmsgbot> !log elukey@cumin1002 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-main-codfw cluster: Roll restart of jvm daemons.
2024-08-07 10:31:26 <Dreamy_Jazz> Sure. I've placed my change into the backport window.
2024-08-07 10:33:14 <wikibugs> ('CR) ''Isabelle Hurbain-Palatin: [C:''+1] Turn on Parsoid support for Kartographer on Wikivoyage [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1060415 (https://phabricator.wikimedia.org/T371823) (owner: ''C. Scott Ananian)'
2024-08-07 10:34:19 <logmsgbot> !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox
2024-08-07 10:34:37 <wikibugs> ('PS2) ''Effie Mouzeli: mw-mcrouter: balance resources [deployment-charts] - ''https://gerrit.wikimedia.org/r/1060413'
2024-08-07 10:36:16 <wikibugs> ('CR) ''JMeybohm: [C:''+1] mw-mcrouter: balance resources [deployment-charts] - ''https://gerrit.wikimedia.org/r/1060413 (owner: ''Effie Mouzeli)'
2024-08-07 10:36:44 <wikibugs> ('CR) ''Effie Mouzeli: [C:''+2] mw-mcrouter: balance resources [deployment-charts] - ''https://gerrit.wikimedia.org/r/1060413 (owner: ''Effie Mouzeli)'
2024-08-07 10:37:35 <wikibugs> ('Merged) ''jenkins-bot: mw-mcrouter: balance resources [deployment-charts] - ''https://gerrit.wikimedia.org/r/1060413 (owner: ''Effie Mouzeli)'
2024-08-07 10:37:42 <logmsgbot> !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netboxdb1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1002"
2024-08-07 10:37:59 <logmsgbot> !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netboxdb1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1002"
2024-08-07 10:37:59 <logmsgbot> !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
2024-08-07 10:37:59 <logmsgbot> !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netboxdb1002.eqiad.wmnet
2024-08-07 10:38:17 <logmsgbot> !log ayounsi@cumin1002 START - Cookbook sre.hosts.decommission for hosts netboxdb2002.codfw.wmnet
2024-08-07 10:38:39 <logmsgbot> !log jiji@deploy1003 helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply
2024-08-07 10:43:15 <jinxer-wm> FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
2024-08-07 10:43:35 <logmsgbot> !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox
2024-08-07 10:43:39 <effie> this is me^
2024-08-07 10:47:01 <logmsgbot> !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netboxdb2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1002"
2024-08-07 10:48:15 <jinxer-wm> FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
2024-08-07 10:49:56 <logmsgbot> !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netboxdb2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1002"
2024-08-07 10:49:56 <logmsgbot> !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
2024-08-07 10:49:57 <logmsgbot> !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netboxdb2002.codfw.wmnet
2024-08-07 10:50:38 <wikibugs> ('PS1) ''Fabfur: hiera:benthos: finally removing all hiera relative to Benthos [puppet] - ''https://gerrit.wikimedia.org/r/1060416 (https://phabricator.wikimedia.org/T371492)'
2024-08-07 10:50:47 <jinxer-wm> FIRING: HelmReleaseBadStatus: Helm release mw-mcrouter/main on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
2024-08-07 10:52:17 <wikibugs> ('CR) ''Fabfur: "check experimental" [puppet] - ''https://gerrit.wikimedia.org/r/1060416 (https://phabricator.wikimedia.org/T371492) (owner: ''Fabfur)'
2024-08-07 10:52:24 <logmsgbot> !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply
2024-08-07 10:53:58 <logmsgbot> !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply
2024-08-07 10:54:05 <logmsgbot> !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply
2024-08-07 10:55:47 <jinxer-wm> RESOLVED: HelmReleaseBadStatus: Helm release mw-mcrouter/main on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
2024-08-07 10:57:26 <wikibugs> ('PS12) ''Brouberol: cloudnative-pg-cluster: define a chart allowing users to provision a PG cluster [deployment-charts] - ''https://gerrit.wikimedia.org/r/1060376 (https://phabricator.wikimedia.org/T368240)'
2024-08-07 10:59:46 <wikibugs> ('PS1) ''Giuseppe Lavagetto: haproxy: improve management of x-requestctl [puppet] - ''https://gerrit.wikimedia.org/r/1060417'
2024-08-07 11:00:04 <jouncebot> mvolz: OwO what's this, a deployment window?? Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240807T1100). nyaa~
2024-08-07 11:01:35 <logmsgbot> !log btullis@deploy1003 Started deploy [dumps/dumps@0d1f9be]: (no justification provided)
2024-08-07 11:01:36 <logmsgbot> !log btullis@deploy1003 Finished deploy [dumps/dumps@0d1f9be]: (no justification provided) (duration: 00m 00s)
2024-08-07 11:03:17 <wikibugs> 'SRE, ''LDAP-Access-Requests: Grant Access to wmf for Arthur taylor - https://phabricator.wikimedia.org/T371958#10047874 (''SLyngshede-WMF) p:''Triage''Medium'
2024-08-07 11:08:27 <wikibugs> 'SRE, ''LDAP-Access-Requests: Grant Access to wmf for Arthur taylor - https://phabricator.wikimedia.org/T371958#10047902 (''SLyngshede-WMF) You shouldn't need access to the WMF group to access or contribute to repos/mediawiki. @dancy / @Jelto is there a mechanism in Gitlab to grant that access, or some alte...'
2024-08-07 11:10:35 <effie> Dreamy_Jazz: I am done, so you may want to use the current window
2024-08-07 11:10:41 <effie> which is not mine :p
2024-08-07 11:12:22 <logmsgbot> !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply
2024-08-07 11:13:00 <logmsgbot> !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply
2024-08-07 11:15:03 <wikibugs> ('PS13) ''Brouberol: cloudnative-pg-cluster: define a chart allowing users to provision a PG cluster [deployment-charts] - ''https://gerrit.wikimedia.org/r/1060376 (https://phabricator.wikimedia.org/T368240)'
2024-08-07 11:16:11 <wikibugs> ('CR) ''Btullis: cloudnative-pg-cluster: define a chart allowing users to provision a PG cluster (''3 comments) [deployment-charts] - ''https://gerrit.wikimedia.org/r/1060376 (https://phabricator.wikimedia.org/T368240) (owner: ''Brouberol)'
2024-08-07 11:17:22 <wikibugs> ('PS14) ''Brouberol: cloudnative-pg-cluster: define a chart allowing users to provision a PG cluster [deployment-charts] - ''https://gerrit.wikimedia.org/r/1060376 (https://phabricator.wikimedia.org/T368240)'
2024-08-07 11:17:28 <wikibugs> ('CR) ''Brouberol: cloudnative-pg-cluster: define a chart allowing users to provision a PG cluster (''2 comments) [deployment-charts] - ''https://gerrit.wikimedia.org/r/1060376 (https://phabricator.wikimedia.org/T368240) (owner: ''Brouberol)'
2024-08-07 11:17:33 <wikibugs> 'SRE, ''LDAP-Access-Requests: Grant Access to wmf for Arthur taylor - https://phabricator.wikimedia.org/T371958#10047933 (''Jelto) afaik there is [automation](https://gitlab.wikimedia.org/repos/releng/gitlab-settings/-/blob/main/group-management/sync-gitlab-group-with-ldap?ref_type=heads) which syncs ldap use...'
2024-08-07 11:17:45 <jinxer-wm> RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
2024-08-07 11:19:14 <wikibugs> 'SRE, ''LDAP-Access-Requests: Grant Access to wmf for Arthur taylor - https://phabricator.wikimedia.org/T371958#10047936 (''SLyngshede-WMF) The WMF group is for staff and contractor, so I suspect there's another one.'
2024-08-07 11:29:30 <Dreamy_Jazz> Thanks!
2024-08-07 11:29:34 <Dreamy_Jazz> jouncebot: nowandnext
2024-08-07 11:29:34 <jouncebot> For the next 0 hour(s) and 30 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240807T1100)
2024-08-07 11:29:34 <jouncebot> In 1 hour(s) and 30 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240807T1300)
2024-08-07 11:34:16 <wikibugs> ('PS2) ''Ayounsi: Remove netbox 3 references [puppet] - ''https://gerrit.wikimedia.org/r/1060412 (https://phabricator.wikimedia.org/T371957)'
2024-08-07 11:34:19 <wikibugs> ('CR) ''Ayounsi: "check experimental" [puppet] - ''https://gerrit.wikimedia.org/r/1060412 (https://phabricator.wikimedia.org/T371957) (owner: ''Ayounsi)'
2024-08-07 11:34:27 <wikibugs> ('PS2) ''Ayounsi: Remove "netbox4" upgrade flag [puppet] - ''https://gerrit.wikimedia.org/r/1060414 (https://phabricator.wikimedia.org/T371957)'
2024-08-07 11:34:29 <wikibugs> ('CR) ''Ayounsi: "check experimental" [puppet] - ''https://gerrit.wikimedia.org/r/1060414 (https://phabricator.wikimedia.org/T371957) (owner: ''Ayounsi)'
2024-08-07 11:36:45 <wikibugs> ('CR) ''TrainBranchBot: [C:''+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1056146 (https://phabricator.wikimedia.org/T20110) (owner: ''Kosta Harlan)'
2024-08-07 11:37:09 <wikibugs> ('CR) ''Ayounsi: raise AbortScript when needed (''2 comments) [software/netbox-extras] - ''https://gerrit.wikimedia.org/r/1059250 (owner: ''Ayounsi)'
2024-08-07 11:37:25 <wikibugs> ('Merged) ''jenkins-bot: AbuseFilter: Enable showcaptcha consequence everywhere [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1056146 (https://phabricator.wikimedia.org/T20110) (owner: ''Kosta Harlan)'
2024-08-07 11:37:29 <wikibugs> ('CR) ''Ayounsi: [C:''+2] raise AbortScript when needed [software/netbox-extras] - ''https://gerrit.wikimedia.org/r/1059250 (owner: ''Ayounsi)'
2024-08-07 11:37:59 <logmsgbot> !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1056146|AbuseFilter: Enable showcaptcha consequence everywhere (T20110)]]
2024-08-07 11:38:42 <wikibugs> ('Merged) ''jenkins-bot: raise AbortScript when needed [software/netbox-extras] - ''https://gerrit.wikimedia.org/r/1059250 (owner: ''Ayounsi)'
2024-08-07 11:40:05 <logmsgbot> !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary
2024-08-07 11:40:19 <logmsgbot> !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary
2024-08-07 11:41:53 <logmsgbot> !log dreamyjazz@deploy1003 kharlan, dreamyjazz: Backport for [[gerrit:1056146|AbuseFilter: Enable showcaptcha consequence everywhere (T20110)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
2024-08-07 11:42:44 <logmsgbot> !log dreamyjazz@deploy1003 kharlan, dreamyjazz: Continuing with sync
2024-08-07 11:47:12 <logmsgbot> !log dreamyjazz@deploy1003 Finished scap: Backport for [[gerrit:1056146|AbuseFilter: Enable showcaptcha consequence everywhere (T20110)]] (duration: 09m 13s)
2024-08-07 11:47:18 <Dreamy_Jazz> Done my deploy
2024-08-07 11:49:02 <logmsgbot> !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox
2024-08-07 11:49:28 <logmsgbot> !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox
2024-08-07 11:49:38 <wikibugs> ('CR) ''Ayounsi: [C:''+2] ImportPuppetDB: Run Validate on VMs too [software/netbox-extras] - ''https://gerrit.wikimedia.org/r/1059349 (owner: ''Ayounsi)'
2024-08-07 11:49:44 <wikibugs> ('CR) ''CI reject: [V:''-1] ImportPuppetDB: Run Validate on VMs too [software/netbox-extras] - ''https://gerrit.wikimedia.org/r/1059349 (owner: ''Ayounsi)'
2024-08-07 11:49:58 <wikibugs> ('PS2) ''Ayounsi: ImportPuppetDB: Run Validate on VMs too [software/netbox-extras] - ''https://gerrit.wikimedia.org/r/1059349'
2024-08-07 11:52:17 <wikibugs> ('CR) ''Ayounsi: "recheck" [software/netbox-extras] - ''https://gerrit.wikimedia.org/r/1059349 (owner: ''Ayounsi)'
2024-08-07 11:53:24 <wikibugs> ('Merged) ''jenkins-bot: ImportPuppetDB: Run Validate on VMs too [software/netbox-extras] - ''https://gerrit.wikimedia.org/r/1059349 (owner: ''Ayounsi)'
2024-08-07 11:53:56 <logmsgbot> !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary
2024-08-07 11:54:10 <logmsgbot> !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary
2024-08-07 11:56:44 <logmsgbot> !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox
2024-08-07 11:57:11 <logmsgbot> !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox
2024-08-07 12:03:01 <wikibugs> ('CR) ''FNegri: [C:''-1] "I would rename "partman/raid1-2dev-autodetect.cfg" to "partman/custom/cloudcephosd.cfg", for consistency with "partman/custom/cephosd.cfg"" [puppet] - ''https://gerrit.wikimedia.org/r/1060402 (https://phabricator.wikimedia.org/T363344) (owner: ''David Caro)'
2024-08-07 12:14:36 <wikibugs> ('CR) ''FNegri: [C:''-1] partman: add a recipe for using the smallest 2 drives for cloudceph (''2 comments) [puppet] - ''https://gerrit.wikimedia.org/r/1060402 (https://phabricator.wikimedia.org/T363344) (owner: ''David Caro)'
2024-08-07 12:17:10 <wikibugs> ('CR) ''Hashar: "PCC https://puppet-compiler.wmflabs.org/output/927986/1626/" [puppet] - ''https://gerrit.wikimedia.org/r/927986 (https://phabricator.wikimedia.org/T338277) (owner: ''Hashar)'
2024-08-07 12:25:57 <wikibugs> ('CR) ''FNegri: [C:''-1] partman: add a recipe for using the smallest 2 drives for cloudceph (''2 comments) [puppet] - ''https://gerrit.wikimedia.org/r/1060402 (https://phabricator.wikimedia.org/T363344) (owner: ''David Caro)'
2024-08-07 12:34:23 <jinxer-wm> FIRING: SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2024-08-07 12:34:54 <wikibugs> ('PS5) ''David Caro: partman: add a recipe for using the smallest 2 drives for cloudceph [puppet] - ''https://gerrit.wikimedia.org/r/1060402 (https://phabricator.wikimedia.org/T363344)'
2024-08-07 12:34:55 <wikibugs> ('CR) ''David Caro: partman: add a recipe for using the smallest 2 drives for cloudceph (''4 comments) [puppet] - ''https://gerrit.wikimedia.org/r/1060402 (https://phabricator.wikimedia.org/T363344) (owner: ''David Caro)'
2024-08-07 12:40:48 <wikibugs> ('CR) ''FNegri: [C:''-1] "the path should be partman/custom/cloudcephosd.cfg instead of partman/cloudcephosd.cfg" [puppet] - ''https://gerrit.wikimedia.org/r/1060402 (https://phabricator.wikimedia.org/T363344) (owner: ''David Caro)'
2024-08-07 12:40:52 <logmsgbot> !log elukey@cumin1002 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-main-eqiad cluster: Roll restart of jvm daemons.
2024-08-07 12:40:57 <_joe_> !log adding conftool 3.2.2 to apt
2024-08-07 12:42:28 <_joe_> uhm !log not working
2024-08-07 12:42:33 <_joe_> is stashbot down?
2024-08-07 12:47:12 <logmsgbot> !log elukey@cumin1002 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-main-eqiad cluster: Roll restart of jvm daemons.
2024-08-07 12:48:00 <sukhe> seems to be. can someone with the right perms restart it?
2024-08-07 12:54:00 <wikibugs> ('PS15) ''Brouberol: cloudnative-pg-cluster: define a chart allowing users to provision a PG cluster [deployment-charts] - ''https://gerrit.wikimedia.org/r/1060376 (https://phabricator.wikimedia.org/T368240)'
2024-08-07 12:54:49 <wikibugs> ('PS2) ''Giuseppe Lavagetto: haproxy: improve management of x-requestctl [puppet] - ''https://gerrit.wikimedia.org/r/1060417'
2024-08-07 12:55:23 <wikibugs> ('CR) ''Giuseppe Lavagetto: [V:''+2 C:''+2] haproxy: improve management of x-requestctl [puppet] - ''https://gerrit.wikimedia.org/r/1060417 (owner: ''Giuseppe Lavagetto)'
2024-08-07 12:56:20 <wikibugs> ('CR) ''Elukey: [C:''+2] git: remove umask from git::clone [puppet] - ''https://gerrit.wikimedia.org/r/927986 (https://phabricator.wikimedia.org/T338277) (owner: ''Hashar)'
2024-08-07 12:57:08 <wikibugs> ('PS1) ''Brouberol: cloudnative-pg: ensure Pods inherit prometheus annotations [deployment-charts] - ''https://gerrit.wikimedia.org/r/1060424 (https://phabricator.wikimedia.org/T368240)'
2024-08-07 13:00:04 <jouncebot> Lucas_WMDE, Urbanecm, awight, and TheresNoTime: OwO what's this, a deployment window?? UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240807T1300). nyaa~
2024-08-07 13:00:04 <jouncebot> cscott and Dreamy_Jazz: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
2024-08-07 13:01:02 <wikibugs> ('PS1) ''Giuseppe Lavagetto: haproxy: remove redundant "end" stanza [puppet] - ''https://gerrit.wikimedia.org/r/1060425'
2024-08-07 13:01:13 <cscott> I'm here
2024-08-07 13:01:25 <hashar> I am going to restart jenkins
2024-08-07 13:01:47 <hashar> I am waiting for a couple jobs to finish ;)
2024-08-07 13:01:50 <wikibugs> ('PS1) ''Ayounsi: Add request argument to validate() method [software/netbox-extras] - ''https://gerrit.wikimedia.org/r/1060426 (https://phabricator.wikimedia.org/T371889)'
2024-08-07 13:01:51 <wikibugs> ('PS6) ''David Caro: partman: use the same recipe for cloudcephosd than cephosd [puppet] - ''https://gerrit.wikimedia.org/r/1060402 (https://phabricator.wikimedia.org/T363344)'
2024-08-07 13:02:10 <cscott> (I'm also at wikimania)
2024-08-07 13:02:11 <wikibugs> ('CR) ''David Caro: partman: use the same recipe for cloudcephosd than cephosd (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1060402 (https://phabricator.wikimedia.org/T363344) (owner: ''David Caro)'
2024-08-07 13:03:17 <wikibugs> ('PS2) ''Giuseppe Lavagetto: haproxy: remove redundant "end" stanza [puppet] - ''https://gerrit.wikimedia.org/r/1060425'
2024-08-07 13:03:59 <hashar> cscott: looks like your change is in merge conflict https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1060415
2024-08-07 13:04:01 <wikibugs> ('CR) ''FNegri: [C:''+1] partman: use the same recipe for cloudcephosd than cephosd [puppet] - ''https://gerrit.wikimedia.org/r/1060402 (https://phabricator.wikimedia.org/T363344) (owner: ''David Caro)'
2024-08-07 13:04:08 <hashar> most probably cause some other patch touched InitialiseSettings.php
2024-08-07 13:05:04 <wikibugs> ('CR) ''David Caro: [C:''+2] partman: use the same recipe for cloudcephosd than cephosd [puppet] - ''https://gerrit.wikimedia.org/r/1060402 (https://phabricator.wikimedia.org/T363344) (owner: ''David Caro)'
2024-08-07 13:05:16 <wikibugs> ('CR) ''CI reject: [V:''-1] Add request argument to validate() method [software/netbox-extras] - ''https://gerrit.wikimedia.org/r/1060426 (https://phabricator.wikimedia.org/T371889) (owner: ''Ayounsi)'
2024-08-07 13:05:40 <wikibugs> ('PS2) ''Brouberol: cloudnative-pg: ensure Pods inherit prometheus annotations [deployment-charts] - ''https://gerrit.wikimedia.org/r/1060424 (https://phabricator.wikimedia.org/T368240)'
2024-08-07 13:06:49 <wikibugs> ('PS2) ''C. Scott Ananian: Turn on Parsoid support for Kartographer on Wikivoyage [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1060415 (https://phabricator.wikimedia.org/T371823)'
2024-08-07 13:07:12 <wikibugs> ('CR) ''Hashar: "I have rebased the change since Gerrit marked it as being in conflict." [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1060415 (https://phabricator.wikimedia.org/T371823) (owner: ''C. Scott Ananian)'
2024-08-07 13:08:13 <logmsgbot> !log dcaro@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1037.eqiad.wmnet with OS bullseye
2024-08-07 13:08:23 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops, ''cloud-services-team (Hardware), ''Patch-For-Review: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#10048135 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dcaro@cumin1002 for host cloudcephosd1037.eqi...'
2024-08-07 13:08:58 <hashar> cscott: I am doing the backport
2024-08-07 13:09:02 <wikibugs> ('CR) ''TrainBranchBot: [C:''+2] "Approved by hashar@deploy1003 using scap backport" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1060415 (https://phabricator.wikimedia.org/T371823) (owner: ''C. Scott Ananian)'
2024-08-07 13:09:13 <logmsgbot> !log elukey@cumin1002 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-flink-eqiad cluster: Roll restart of jvm daemons.
2024-08-07 13:10:11 <elukey> !log rollout openjdk-17 upgrades to prod
2024-08-07 13:10:25 <wikibugs> ('Merged) ''jenkins-bot: Turn on Parsoid support for Kartographer on Wikivoyage [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1060415 (https://phabricator.wikimedia.org/T371823) (owner: ''C. Scott Ananian)'
2024-08-07 13:10:44 <logmsgbot> !log hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:1060415|Turn on Parsoid support for Kartographer on Wikivoyage (T371823)]]
2024-08-07 13:11:23 <hashar> !log Restarting CI Jenkins
2024-08-07 13:11:37 <sukhe> stashbot is broken
2024-08-07 13:11:47 <sukhe> I am looking for someone with access to restart it
2024-08-07 13:11:51 <sukhe> if you are that person, please do it :)
2024-08-07 13:12:03 <hashar> this job is never ending
2024-08-07 13:12:05 <hashar> I don't have access :/
2024-08-07 13:12:15 <sukhe> hashar: sadly me neither, I just requested it as well
2024-08-07 13:12:26 <hashar> but I guess people in #wikimedia-cloud-admin would be able?
2024-08-07 13:12:38 <sukhe> good idea, going there
2024-08-07 13:12:48 <hashar> dont tell them I have sent you! ;-]
2024-08-07 13:13:12 <sukhe> I will tell them hashar told me to not tell them that hashar sent me
2024-08-07 13:13:15 <sukhe> :]
2024-08-07 13:13:22 <hashar> grins
2024-08-07 13:14:14 <wikibugs> ('CR) ''Btullis: cloudnative-pg-cluster: define a chart allowing users to provision a PG cluster (''1 comment) [deployment-charts] - ''https://gerrit.wikimedia.org/r/1060376 (https://phabricator.wikimedia.org/T368240) (owner: ''Brouberol)'
2024-08-07 13:14:19 <logmsgbot> !log hashar@deploy1003 cscott, hashar: Backport for [[gerrit:1060415|Turn on Parsoid support for Kartographer on Wikivoyage (T371823)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
2024-08-07 13:14:35 <hashar> cscott: should be good now
2024-08-07 13:14:40 <hashar> well on debug servers
2024-08-07 13:14:47 <hashar> then I am not quite sure if anything has to be tested?
2024-08-07 13:15:02 <cscott> I can test hang on
2024-08-07 13:15:05 <hashar> !log Restarted CI Jenkins
2024-08-07 13:15:05 <stashbot> hashar: Failed to log message to wiki. Somebody should check the error logs.
2024-08-07 13:15:10 <sukhe> oops
2024-08-07 13:15:20 <sukhe> it did log it though, ha
2024-08-07 13:15:31 <hashar> https://sal.toolforge.org/log/HQ_6LJEBKFqumxvtlfYt
2024-08-07 13:15:32 <hashar> yeah
2024-08-07 13:15:33 <logmsgbot> !log elukey@cumin1002 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-flink-eqiad cluster: Roll restart of jvm daemons.
2024-08-07 13:15:33 <stashbot> elukey@cumin1002: Failed to log message to wiki. Somebody should check the error logs.
2024-08-07 13:15:41 <wikibugs> ('CR) ''Brouberol: cloudnative-pg-cluster: define a chart allowing users to provision a PG cluster (''1 comment) [deployment-charts] - ''https://gerrit.wikimedia.org/r/1060376 (https://phabricator.wikimedia.org/T368240) (owner: ''Brouberol)'
2024-08-07 13:15:46 <hashar> but we lost fivish hours of `!log`
2024-08-07 13:16:01 <hashar> which should probably be logged
2024-08-07 13:16:05 <sukhe> yeah
2024-08-07 13:16:45 <hashar> the other message from stash bot is that it apparently cant write to wikitech.wikimedia.org
2024-08-07 13:17:52 <logmsgbot> !log elukey@cumin1002 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-flink-codfw cluster: Roll restart of jvm daemons.
2024-08-07 13:17:52 <stashbot> elukey@cumin1002: Failed to log message to wiki. Somebody should check the error logs.
2024-08-07 13:18:05 <wikibugs> ('PS1) ''DCausse: search: index stems for mul labels [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1060430 (https://phabricator.wikimedia.org/T371401)'
2024-08-07 13:18:07 <hashar> !log stashbot got restarted since it was not processing anything
2024-08-07 13:18:07 <stashbot> hashar: Failed to log message to wiki. Somebody should check the error logs.
2024-08-07 13:18:58 <logmsgbot> !log sukhe@cumin1002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on P{cp3073*} and A:cp for 9.2.5-1wm2
2024-08-07 13:18:59 <stashbot> sukhe@cumin1002: Failed to log message to wiki. Somebody should check the error logs.
2024-08-07 13:19:03 <hashar> sukhe: interestingly I am a member of `stashbot` so I could have restarted it ;)
2024-08-07 13:19:08 <sukhe> hashar: haha
2024-08-07 13:19:27 <cscott> Hashar: is my patch in codfw now?
2024-08-07 13:19:34 <wikibugs> ('PS3) ''Brouberol: cloudnative-pg: ensure Pods inherit prometheus annotations [deployment-charts] - ''https://gerrit.wikimedia.org/r/1060424 (https://phabricator.wikimedia.org/T368240)'
2024-08-07 13:19:36 <wikibugs> ('CR) ''Btullis: [C:''+1] "Nice." [deployment-charts] - ''https://gerrit.wikimedia.org/r/1060424 (https://phabricator.wikimedia.org/T368240) (owner: ''Brouberol)'
2024-08-07 13:20:00 <wikibugs> ('CR) ''Brouberol: [C:''+2] cloudnative-pg-cluster: define a chart allowing users to provision a PG cluster (''1 comment) [deployment-charts] - ''https://gerrit.wikimedia.org/r/1060376 (https://phabricator.wikimedia.org/T368240) (owner: ''Brouberol)'
2024-08-07 13:20:23 <hashar> cscott: it is only on the mwdebug servers but there is one in codfw as well?
2024-08-07 13:20:41 <wikibugs> 'SRE-tools, ''Infrastructure-Foundations: Allow debmonitor to store the Debian version-id in the OS field - https://phabricator.wikimedia.org/T368744#10048179 (''elukey) Buster and Bookworm rollouts done, no big issues registered. The only drawback is that due to the high volume of writes to the db (since we...'
2024-08-07 13:21:38 <wikibugs> ('PS2) ''Ayounsi: Add request argument to validate() method [software/netbox-extras] - ''https://gerrit.wikimedia.org/r/1060426 (https://phabricator.wikimedia.org/T371889)'
2024-08-07 13:21:38 <wikibugs> ('PS1) ''Ayounsi: Add validators for console(server) and power ports [software/netbox-extras] - ''https://gerrit.wikimedia.org/r/1060431 (https://phabricator.wikimedia.org/T310590)'
2024-08-07 13:22:20 <logmsgbot> !log sukhe@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade/restart of Apache Traffic Server on P{cp3073*} and A:cp for 9.2.5-1wm2
2024-08-07 13:22:20 <stashbot> sukhe@cumin1002: Failed to log message to wiki. Somebody should check the error logs.
2024-08-07 13:22:28 <hashar> I am ignoring the stashbot error log, cause there is at least 3 tasks I could file as follows up
2024-08-07 13:22:36 <hashar> and well E_TOO_MANY_THINGS
2024-08-07 13:22:46 <sukhe> hashar: dhinus is on it
2024-08-07 13:22:53 <hashar> cool ;)
2024-08-07 13:23:05 <hashar> thank you!
2024-08-07 13:23:20 <cscott> Hashar: ok ship it, tested and looks good
2024-08-07 13:23:27 <logmsgbot> !log hashar@deploy1003 cscott, hashar: Continuing with sync
2024-08-07 13:23:27 <stashbot> hashar@deploy1003: Failed to log message to wiki. Somebody should check the error logs.
2024-08-07 13:23:30 <hashar> ships
2024-08-07 13:23:43 <dhinus> I restarted stashbot but it's only half alive :) -- it's now writing to sal.toolforge.org, but not to wiki SAL
2024-08-07 13:23:54 <wikibugs> ('PS1) ''DCausse: search: use the stem field when search mul labels [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1060433 (https://phabricator.wikimedia.org/T371401)'
2024-08-07 13:24:13 <logmsgbot> !log elukey@cumin1002 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-flink-codfw cluster: Roll restart of jvm daemons.
2024-08-07 13:24:13 <stashbot> elukey@cumin1002: Failed to log message to wiki. Somebody should check the error logs.
2024-08-07 13:24:24 <wikibugs> ('PS2) ''DCausse: search: use the stem field when searching mul labels [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1060433 (https://phabricator.wikimedia.org/T371401)'
2024-08-07 13:24:49 <wikibugs> ('CR) ''Giuseppe Lavagetto: [C:''+2] haproxy: remove redundant "end" stanza [puppet] - ''https://gerrit.wikimedia.org/r/1060425 (owner: ''Giuseppe Lavagetto)'
2024-08-07 13:25:26 <wikibugs> ('CR) ''CI reject: [V:''-1] Add request argument to validate() method [software/netbox-extras] - ''https://gerrit.wikimedia.org/r/1060426 (https://phabricator.wikimedia.org/T371889) (owner: ''Ayounsi)'
2024-08-07 13:25:33 <wikibugs> ('PS1) ''Ayounsi: Enable validators on Netbox-next for console(server) and power ports [puppet] - ''https://gerrit.wikimedia.org/r/1060435 (https://phabricator.wikimedia.org/T310590)'
2024-08-07 13:25:35 <wikibugs> ('PS1) ''Ayounsi: Enable validators on Netbox for console(server) and power ports [puppet] - ''https://gerrit.wikimedia.org/r/1060436 (https://phabricator.wikimedia.org/T310590)'
2024-08-07 13:25:36 <wikibugs> ('CR) ''CI reject: [V:''-1] Add validators for console(server) and power ports [software/netbox-extras] - ''https://gerrit.wikimedia.org/r/1060431 (https://phabricator.wikimedia.org/T310590) (owner: ''Ayounsi)'
2024-08-07 13:25:45 <dhinus> I checked the stashbot error logs and the exception is "mwclient.errors.NoWriteApi"
2024-08-07 13:26:22 <wikibugs> 'SRE, ''Infrastructure-Foundations: Request access to servers Dcops group - https://phabricator.wikimedia.org/T360356#10048193 (''elukey) ''Open''Resolved a:''elukey'
2024-08-07 13:26:34 <logmsgbot> !log dcaro@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1037.eqiad.wmnet with reason: host reimage
2024-08-07 13:26:34 <stashbot> dcaro@cumin1002: Failed to log message to wiki. Somebody should check the error logs.
2024-08-07 13:26:52 <godog> jouncebot: now and next
2024-08-07 13:26:52 <jouncebot> For the next 0 hour(s) and 33 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240807T1300)
2024-08-07 13:27:12 <sukhe> !log sudo cumin "lvs3009*" 'disable-puppet "rebooting" && systemctl stop pybal.service'
2024-08-07 13:27:12 <stashbot> sukhe: Failed to log message to wiki. Somebody should check the error logs.
2024-08-07 13:28:10 <logmsgbot> !log hashar@deploy1003 Finished scap: Backport for [[gerrit:1060415|Turn on Parsoid support for Kartographer on Wikivoyage (T371823)]] (duration: 17m 26s)
2024-08-07 13:28:11 <stashbot> hashar@deploy1003: Failed to log message to wiki. Somebody should check the error logs.
2024-08-07 13:28:13 <wikibugs> ('PS1) ''AikoChou: ml-services: update readability model [deployment-charts] - ''https://gerrit.wikimedia.org/r/1060437 (https://phabricator.wikimedia.org/T369712)'
2024-08-07 13:28:22 <stashbot> T371823: Turn on wgKartographerParsoidSupport on all wikivoyage wikis - https://phabricator.wikimedia.org/T371823
2024-08-07 13:28:51 <logmsgbot> !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1037.eqiad.wmnet with reason: host reimage
2024-08-07 13:28:51 <stashbot> dcaro@cumin1002: Failed to log message to wiki. Somebody should check the error logs.
2024-08-07 13:29:24 <wikibugs> ('PS1) ''Tiziano Fogli: icinga: add Tiziano Fogli to authorized_for_system_information, authorized_for_configuration_information, authorized_for_all_service_commands, authorized_for_all_host_commands [puppet] - ''https://gerrit.wikimedia.org/r/1060438'
2024-08-07 13:30:06 <wikibugs> ('CR) ''CI reject: [V:''-1] icinga: add Tiziano Fogli to authorized_for_system_information, authorized_for_configuration_information, authorized_for_all_service_commands, authorized_for_all_host_commands [puppet] - ''https://gerrit.wikimedia.org/r/1060438 (owner: ''Tiziano Fogli)'
2024-08-07 13:30:18 <wikibugs> ('CR) ''Filippo Giunchedi: [C:''+1] hiera:benthos: finally removing all hiera relative to Benthos [puppet] - ''https://gerrit.wikimedia.org/r/1060416 (https://phabricator.wikimedia.org/T371492) (owner: ''Fabfur)'
2024-08-07 13:30:45 <hashar> the other scheduled patch https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1056146 was deployed earlier today
2024-08-07 13:31:04 <hashar> by Dreamy_Jazz ;)
2024-08-07 13:31:15 <Dreamy_Jazz> Yeah. It was deployed already.
2024-08-07 13:31:19 <hashar> !log UTC afternoon backport window is completed
2024-08-07 13:31:20 <stashbot> hashar: Failed to log message to wiki. Somebody should check the error logs.
2024-08-07 13:31:23 <hashar> \o/
2024-08-07 13:32:56 <wikibugs> ('PS1) ''Fabfur: cache:benthos: remove Benthos references from cache files [puppet] - ''https://gerrit.wikimedia.org/r/1060441 (https://phabricator.wikimedia.org/T371492)'
2024-08-07 13:34:14 <godog> hnowlan and I have a statsd-exporter resource change to deploy to k8s then scap test, ok to do it now hashar even though the window hasn't closed yet technically ?
2024-08-07 13:34:23 <godog> this guy https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1060411?usp=email
2024-08-07 13:35:45 <godog> I'll take that as a yes
2024-08-07 13:36:01 <wikibugs> ('CR) ''Filippo Giunchedi: [C:''+2] mw-jobrunner: bump limit/request for statsd-exporter [deployment-charts] - ''https://gerrit.wikimedia.org/r/1060411 (https://phabricator.wikimedia.org/T371885) (owner: ''Filippo Giunchedi)'
2024-08-07 13:36:13 <wikibugs> ('CR) ''Fabfur: [C:''+2] hiera:benthos: finally removing all hiera relative to Benthos [puppet] - ''https://gerrit.wikimedia.org/r/1060416 (https://phabricator.wikimedia.org/T371492) (owner: ''Fabfur)'
2024-08-07 13:38:07 <wikibugs> ('PS2) ''Tiziano Fogli: icinga: add Tiziano Fogli to ctrl variables [puppet] - ''https://gerrit.wikimedia.org/r/1060438'
2024-08-07 13:38:48 <logmsgbot> !log filippo@deploy1003 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply
2024-08-07 13:38:49 <stashbot> filippo@deploy1003: Failed to log message to wiki. Somebody should check the error logs.
2024-08-07 13:39:00 <logmsgbot> !log filippo@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply
2024-08-07 13:39:00 <stashbot> filippo@deploy1003: Failed to log message to wiki. Somebody should check the error logs.
2024-08-07 13:39:07 <logmsgbot> !log filippo@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply
2024-08-07 13:39:08 <stashbot> filippo@deploy1003: Failed to log message to wiki. Somebody should check the error logs.
2024-08-07 13:39:17 <logmsgbot> !log filippo@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply
2024-08-07 13:39:17 <stashbot> filippo@deploy1003: Failed to log message to wiki. Somebody should check the error logs.
2024-08-07 13:39:45 <wikibugs> ('CR) ''Fabfur: "check experimental" [puppet] - ''https://gerrit.wikimedia.org/r/1060441 (https://phabricator.wikimedia.org/T371492) (owner: ''Fabfur)'
2024-08-07 13:39:48 <godog> ok stashbot is busted but we're good otherwise
2024-08-07 13:39:59 <wikibugs> ('CR) ''Klausman: [C:''+1] ml-services: update readability model [deployment-charts] - ''https://gerrit.wikimedia.org/r/1060437 (https://phabricator.wikimedia.org/T369712) (owner: ''AikoChou)'
2024-08-07 13:42:25 <logmsgbot> !log hnowlan@deploy1003 Started scap sync-world: sync to test mw-jobrunner resource increase
2024-08-07 13:42:25 <stashbot> hnowlan@deploy1003: Failed to log message to wiki. Somebody should check the error logs.
2024-08-07 13:43:17 <wikibugs> ('CR) ''Filippo Giunchedi: [C:''+1] cache:benthos: remove Benthos references from cache files [puppet] - ''https://gerrit.wikimedia.org/r/1060441 (https://phabricator.wikimedia.org/T371492) (owner: ''Fabfur)'
2024-08-07 13:43:55 <logmsgbot> !log hnowlan@deploy1003 Finished scap: sync to test mw-jobrunner resource increase (duration: 02m 22s)
2024-08-07 13:43:55 <stashbot> hnowlan@deploy1003: Failed to log message to wiki. Somebody should check the error logs.
2024-08-07 13:44:27 <wikibugs> 'Puppet, ''Release-Engineering-Team, ''Patch-For-Review: Puppet git::clone probably does not need `umask` parameter - https://phabricator.wikimedia.org/T338277#10048286 (''hashar) ''Open''Resolved The series of patch has led to the removal of `umask` from `git::clone` In roughly the order the patc...'
2024-08-07 13:45:57 <sukhe> grafana down?
2024-08-07 13:46:07 <sukhe> yeah
2024-08-07 13:46:20 <godog> curious, checking
2024-08-07 13:46:33 <godog> mmhh we're back ?
2024-08-07 13:46:39 <sukhe> back indeed yep
2024-08-07 13:46:54 <sukhe> spike in 503s
2024-08-07 13:46:57 <logmsgbot> !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1037.eqiad.wmnet with OS bullseye
2024-08-07 13:46:57 <stashbot> dcaro@cumin1002: Failed to log message to wiki. Somebody should check the error logs.
2024-08-07 13:47:13 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops, ''cloud-services-team (Hardware), ''Patch-For-Review: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#10048325 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dcaro@cumin1002 for host cloudcephosd1037.eqiad.w...'
2024-08-07 13:47:24 <wikibugs> ('CR) ''Fabfur: [C:''+2] cache:benthos: remove Benthos references from cache files [puppet] - ''https://gerrit.wikimedia.org/r/1060441 (https://phabricator.wikimedia.org/T371492) (owner: ''Fabfur)'
2024-08-07 13:47:34 <wikibugs> ('PS4) ''Brouberol: cloudnative-pg: ensure Pods inherit prometheus annotations [deployment-charts] - ''https://gerrit.wikimedia.org/r/1060424 (https://phabricator.wikimedia.org/T368240)'
2024-08-07 13:50:58 <wikibugs> ('PS5) ''Brouberol: cloudnative-pg: ensure Pods inherit prometheus annotations [deployment-charts] - ''https://gerrit.wikimedia.org/r/1060424 (https://phabricator.wikimedia.org/T368240)'
2024-08-07 13:51:02 <logmsgbot> !log sukhe@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs3009.esams.wmnet
2024-08-07 13:51:02 <stashbot> sukhe@cumin1002: Failed to log message to wiki. Somebody should check the error logs.
2024-08-07 13:54:17 <logmsgbot> !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs3009.esams.wmnet
2024-08-07 13:54:18 <stashbot> sukhe@cumin1002: Failed to log message to wiki. Somebody should check the error logs.
2024-08-07 13:55:43 <sukhe> !log start pybal on lvs3009
2024-08-07 13:55:43 <stashbot> sukhe: Failed to log message to wiki. Somebody should check the error logs.
2024-08-07 13:56:12 <wikibugs> ('PS1) ''Clare Ming: Fix labs config for Metrics Platform vars [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1060445 (https://phabricator.wikimedia.org/T366234)'
2024-08-07 13:59:21 <wikibugs> ('CR) ''Brouberol: [C:''+2] cloudnative-pg: ensure Pods inherit prometheus annotations [deployment-charts] - ''https://gerrit.wikimedia.org/r/1060424 (https://phabricator.wikimedia.org/T368240) (owner: ''Brouberol)'
2024-08-07 14:00:05 <wikibugs> ('PS1) ''DCausse: search: use mul fallback for manually-tuned search profiles [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1060449 (https://phabricator.wikimedia.org/T371401)'
2024-08-07 14:00:05 <jouncebot> Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240807T1400)
2024-08-07 14:00:21 <wikibugs> ('PS1) ''David Caro: cloudcephosd: use the new partitions on the new hosts [puppet] - ''https://gerrit.wikimedia.org/r/1060450 (https://phabricator.wikimedia.org/T363344)'
2024-08-07 14:00:56 <wikibugs> ('PS2) ''David Caro: cloudcephosd: use the new partitions on the new hosts [puppet] - ''https://gerrit.wikimedia.org/r/1060450 (https://phabricator.wikimedia.org/T363344)'
2024-08-07 14:01:40 <elukey> !log import Jenkins 2.462.1 on bullseye-wikimedia:thirdparty/ci
2024-08-07 14:01:41 <stashbot> elukey: Failed to log message to wiki. Somebody should check the error logs.
2024-08-07 14:01:51 <wikibugs> ('CR) ''David Caro: [C:''+2] cloudcephosd: use the new partitions on the new hosts [puppet] - ''https://gerrit.wikimedia.org/r/1060450 (https://phabricator.wikimedia.org/T363344) (owner: ''David Caro)'
2024-08-07 14:03:51 <logmsgbot> !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
2024-08-07 14:03:51 <stashbot> brouberol@deploy1003: Failed to log message to wiki. Somebody should check the error logs.
2024-08-07 14:04:02 <logmsgbot> !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
2024-08-07 14:04:04 <stashbot> brouberol@deploy1003: Failed to log message to wiki. Somebody should check the error logs.
2024-08-07 14:07:49 <Dreamy_Jazz> Why no server admin logs?
2024-08-07 14:10:19 <wikibugs> ('CR) ''Phuedx: [C:''+1] "Running" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1060445 (https://phabricator.wikimedia.org/T366234) (owner: ''Clare Ming)'
2024-08-07 14:13:40 <wikibugs> ('PS1) ''Klausman: knative-serving: Switch components to use Calic Netpolicies [deployment-charts] - ''https://gerrit.wikimedia.org/r/1060452'
2024-08-07 14:14:23 <wikibugs> ('PS2) ''Klausman: knative-serving: Switch components to use Calico Netpolicies [deployment-charts] - ''https://gerrit.wikimedia.org/r/1060452'
2024-08-07 14:21:28 <logmsgbot> !log jnuche@deploy1003 Started deploy [releng/jenkins-deploy@9b733de] (releasing): (no justification provided)
2024-08-07 14:21:29 <stashbot> jnuche@deploy1003: Failed to log message to wiki. Somebody should check the error logs.
2024-08-07 14:22:05 <wikibugs> 'Puppet, ''Infrastructure-Foundations, ''Release-Engineering-Team: Puppet git::clone should default mode to 0644 (read-only) instead of 0755 - https://phabricator.wikimedia.org/T371980 (''hashar) ''NEW'
2024-08-07 14:22:22 <logmsgbot> !log jnuche@deploy1003 Finished deploy [releng/jenkins-deploy@9b733de] (releasing): (no justification provided) (duration: 00m 53s)
2024-08-07 14:22:22 <stashbot> jnuche@deploy1003: Failed to log message to wiki. Somebody should check the error logs.
2024-08-07 14:22:56 <stw> Dreamy_Jazz: https://phabricator.wikimedia.org/T371977
2024-08-07 14:23:58 <Dreamy_Jazz> Thanks
2024-08-07 14:23:58 <wikibugs> ('Abandoned) ''Arlolra: Enabled KartographerParsoidSupport on (cs|hi|shn|ps|tr)wikivoyage [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1060186 (https://phabricator.wikimedia.org/T371936) (owner: ''Arlolra)'
2024-08-07 14:24:00 <sukhe> !log sudo cumin "lvs3008*" 'disable-puppet "rebooting" && systemctl stop pybal.service'
2024-08-07 14:24:00 <stashbot> sukhe: Failed to log message to wiki. Somebody should check the error logs.
2024-08-07 14:24:23 <jinxer-wm> FIRING: JobUnavailable: Reduced availability for job benthos in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2024-08-07 14:24:28 <wikibugs> ('PS8) ''Btullis: Enable the MariaDB binlog on the analytics mariadb replicas [puppet] - ''https://gerrit.wikimedia.org/r/1048385 (https://phabricator.wikimedia.org/T370354)'
2024-08-07 14:25:12 <logmsgbot> !log jnuche@deploy1003 Started deploy [releng/jenkins-deploy@9b733de] (releasing): (no justification provided)
2024-08-07 14:25:13 <stashbot> jnuche@deploy1003: Failed to log message to wiki. Somebody should check the error logs.
2024-08-07 14:26:24 <logmsgbot> !log jnuche@deploy1003 Finished deploy [releng/jenkins-deploy@9b733de] (releasing): (no justification provided) (duration: 01m 12s)
2024-08-07 14:26:25 <stashbot> jnuche@deploy1003: Failed to log message to wiki. Somebody should check the error logs.
2024-08-07 14:27:39 <wikibugs> ('CR) ''Btullis: "I have updated the patch so that it links to the relevant ticket for the current work." [puppet] - ''https://gerrit.wikimedia.org/r/1048385 (https://phabricator.wikimedia.org/T370354) (owner: ''Btullis)'
2024-08-07 14:29:23 <jinxer-wm> RESOLVED: JobUnavailable: Reduced availability for job benthos in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2024-08-07 14:29:27 <wikibugs> ('CR) ''Btullis: [V:''+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3578/co" [puppet] - ''https://gerrit.wikimedia.org/r/1048385 (https://phabricator.wikimedia.org/T370354) (owner: ''Btullis)'
2024-08-07 14:31:28 <wikibugs> 'SRE, ''Infrastructure-Foundations, ''Puppet-Infrastructure, ''Patch-For-Review: Move the private Puppet repository to puppetserver1001 - https://phabricator.wikimedia.org/T368023#10048445 (''elukey) Sent an email to all SREs, the move will happen on Aug 12th 13:00 UTC.'
2024-08-07 14:33:38 <logmsgbot> !log elukey@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs-eqiad: Openjdk upgrade - elukey@cumin1002
2024-08-07 14:33:39 <stashbot> elukey@cumin1002: Failed to log message to wiki. Somebody should check the error logs.
2024-08-07 14:39:23 <jinxer-wm> FIRING: [2x] JobUnavailable: Reduced availability for job benthos in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2024-08-07 14:44:42 <wikibugs> ('CR) ''Scott French: [C:''+1] mediawiki: Bump ttlSecondsAfterFinished for Jobs [deployment-charts] - ''https://gerrit.wikimedia.org/r/1060184 (owner: ''RLazarus)'
2024-08-07 14:49:06 <wikibugs> ('PS2) ''Filippo Giunchedi: data-engineering: fix MediawikiPageContentChangeEnrichAvailability matching [alerts] - ''https://gerrit.wikimedia.org/r/1060061 (https://phabricator.wikimedia.org/T354255)'
2024-08-07 14:50:11 <logmsgbot> !log sukhe@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs3008.esams.wmnet
2024-08-07 14:50:11 <stashbot> sukhe@cumin1002: Failed to log message to wiki. Somebody should check the error logs.
2024-08-07 14:50:51 <wikibugs> 'ops-codfw, ''SRE, ''DC-Ops: Port with no description on access switch - https://phabricator.wikimedia.org/T371923#10048515 (''Jhancock.wm) ''Open''Resolved a:''Jhancock.wm alerts cleared'
2024-08-07 14:53:19 <logmsgbot> !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs3008.esams.wmnet
2024-08-07 14:53:20 <stashbot> sukhe@cumin1002: Failed to log message to wiki. Somebody should check the error logs.
2024-08-07 14:54:29 <wikibugs> ('PS1) ''Papaul: Add new payments node to DNS file [dns] - ''https://gerrit.wikimedia.org/r/1060457'
2024-08-07 14:56:06 <wikibugs> ('CR) ''Papaul: [C:''+2] Add new payments node to DNS file [dns] - ''https://gerrit.wikimedia.org/r/1060457 (owner: ''Papaul)'
2024-08-07 14:57:14 <wikibugs> ('CR) ''Klausman: [C:''+1] ml-services: use cxserver host header in rec-api [deployment-charts] - ''https://gerrit.wikimedia.org/r/1060377 (https://phabricator.wikimedia.org/T371465) (owner: ''Kevin Bazira)'
2024-08-07 14:57:40 <wikibugs> ('CR) ''Klausman: [C:''+1] ml-services: langid from src dir [deployment-charts] - ''https://gerrit.wikimedia.org/r/1060378 (https://phabricator.wikimedia.org/T369344) (owner: ''Kevin Bazira)'
2024-08-07 14:58:08 <sukhe> !log start pybal on lvs3008
2024-08-07 14:58:08 <stashbot> sukhe: Failed to log message to wiki. Somebody should check the error logs.
2024-08-07 14:59:23 <jinxer-wm> RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2024-08-07 15:02:20 <wikibugs> ('PS1) ''Cathal Mooney: Add mtr to standard packages for WMF hosts [puppet] - ''https://gerrit.wikimedia.org/r/1060458'
2024-08-07 15:02:53 <wikibugs> ('CR) ''Brouberol: [C:''+1] Enable the MariaDB binlog on the analytics mariadb replicas [puppet] - ''https://gerrit.wikimedia.org/r/1048385 (https://phabricator.wikimedia.org/T370354) (owner: ''Btullis)'
2024-08-07 15:03:55 <wikibugs> 'ops-codfw, ''SRE, ''DC-Ops, ''fundraising-tech-ops, ''Patch-For-Review: Q#:rack/setup/install payments200[456] - https://phabricator.wikimedia.org/T369942#10048545 (''Papaul) a:''Papaul''Dwisehaupt @Dwisehaupt 2004 and 2005 are ready when. you get them online we can decom 2003 and rack/install 200...'
2024-08-07 15:11:06 <wikibugs> ('CR) ''ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, August 07 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1060445 (https://phabricator.wikimedia.org/T366234) (owner: ''Clare Ming)'
2024-08-07 15:12:17 <wikibugs> ('CR) ''Kevin Bazira: [C:''+2] ml-services: use cxserver host header in rec-api [deployment-charts] - ''https://gerrit.wikimedia.org/r/1060377 (https://phabricator.wikimedia.org/T371465) (owner: ''Kevin Bazira)'
2024-08-07 15:13:24 <wikibugs> ('Merged) ''jenkins-bot: ml-services: use cxserver host header in rec-api [deployment-charts] - ''https://gerrit.wikimedia.org/r/1060377 (https://phabricator.wikimedia.org/T371465) (owner: ''Kevin Bazira)'
2024-08-07 15:13:54 <wikibugs> 'ops-codfw, ''SRE, ''DC-Ops, ''Machine-Learning-Team: hw troubleshooting: ml-serve2001.codfw.wmnet: continued uncorrectable ECC errors - https://phabricator.wikimedia.org/T371872#10048594 (''Jhancock.wm) this one has been out of warranty for more than a half a year. We do have a spare DIMM on hand to repl...'
2024-08-07 15:14:55 <wikibugs> ('CR) ''Btullis: [V:''+1 C:''+2] Enable the MariaDB binlog on the analytics mariadb replicas [puppet] - ''https://gerrit.wikimedia.org/r/1048385 (https://phabricator.wikimedia.org/T370354) (owner: ''Btullis)'
2024-08-07 15:15:13 <logmsgbot> !log kevinbazira@deploy1003 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
2024-08-07 15:15:14 <stashbot> kevinbazira@deploy1003: Failed to log message to wiki. Somebody should check the error logs.
2024-08-07 15:20:30 <wikibugs> ('CR) ''Kevin Bazira: [C:''+2] ml-services: langid from src dir [deployment-charts] - ''https://gerrit.wikimedia.org/r/1060378 (https://phabricator.wikimedia.org/T369344) (owner: ''Kevin Bazira)'
2024-08-07 15:21:17 <logmsgbot> !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1038.eqiad.wmnet with OS bullseye
2024-08-07 15:21:17 <stashbot> andrew@cumin1002: Failed to log message to wiki. Somebody should check the error logs.
2024-08-07 15:21:29 <wikibugs> 'ops-codfw, ''Data-Persistence, ''Data-Persistence-Backup, ''DC-Ops: Q1:rack/setup/install backup2012 - https://phabricator.wikimedia.org/T371984 (''RobH) ''NEW'
2024-08-07 15:21:30 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops, ''cloud-services-team (Hardware), ''Patch-For-Review: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#10048620 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudcephosd1038.eq...'
2024-08-07 15:21:35 <wikibugs> ('Merged) ''jenkins-bot: ml-services: langid from src dir [deployment-charts] - ''https://gerrit.wikimedia.org/r/1060378 (https://phabricator.wikimedia.org/T369344) (owner: ''Kevin Bazira)'
2024-08-07 15:21:46 <wikibugs> 'ops-codfw, ''Data-Persistence, ''Data-Persistence-Backup, ''DC-Ops: Q1:rack/setup/install backup2012 - https://phabricator.wikimedia.org/T371984#10048637 (''RobH)'
2024-08-07 15:23:23 <wikibugs> 'ops-codfw, ''SRE, ''DC-Ops, ''Machine-Learning-Team: hw troubleshooting: ml-serve2001.codfw.wmnet: continued uncorrectable ECC errors - https://phabricator.wikimedia.org/T371872#10048641 (''Jhancock.wm) a:''Jhancock.wm'
2024-08-07 15:25:08 <wikibugs> 'ops-codfw, ''Data-Persistence, ''Data-Persistence-Backup, ''DC-Ops: Q1:rack/setup/install backup2012 - https://phabricator.wikimedia.org/T371984#10048645 (''RobH) a:''ABran-WMF @ABran-WMF, This racking task lists you as your teams point of contact. As this has now been escalated to order, the new wor...'
2024-08-07 15:25:11 <logmsgbot> !log kevinbazira@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'llm' for release 'main' .
2024-08-07 15:25:11 <stashbot> kevinbazira@deploy1003: Failed to log message to wiki. Somebody should check the error logs.
2024-08-07 15:33:24 <wikibugs> ('CR) ''AikoChou: [C:''+2] ml-services: update readability model [deployment-charts] - ''https://gerrit.wikimedia.org/r/1060437 (https://phabricator.wikimedia.org/T369712) (owner: ''AikoChou)'
2024-08-07 15:34:10 <wikibugs> 'ops-codfw, ''SRE, ''DC-Ops, ''Machine-Learning-Team: hw troubleshooting: ml-serve2001.codfw.wmnet: continued uncorrectable ECC errors - https://phabricator.wikimedia.org/T371872#10048669 (''klausman) @Jhancock.wm machine is drained, feel free to proceed.'
2024-08-07 15:34:23 <wikibugs> ('Merged) ''jenkins-bot: ml-services: update readability model [deployment-charts] - ''https://gerrit.wikimedia.org/r/1060437 (https://phabricator.wikimedia.org/T369712) (owner: ''AikoChou)'
2024-08-07 15:36:13 <logmsgbot> !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudcephosd1038.eqiad.wmnet with OS bullseye
2024-08-07 15:36:15 <stashbot> andrew@cumin1002: Failed to log message to wiki. Somebody should check the error logs.
2024-08-07 15:36:24 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops, ''cloud-services-team (Hardware), ''Patch-For-Review: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#10048674 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host cloudcephosd1038.eqiad....'
2024-08-07 15:36:26 <jinxer-wm> FIRING: RoutinatorRsyncErrors: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
2024-08-07 15:37:05 <logmsgbot> !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1038.eqiad.wmnet with OS bullseye
2024-08-07 15:37:05 <stashbot> andrew@cumin1002: Failed to log message to wiki. Somebody should check the error logs.
2024-08-07 15:37:19 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops, ''cloud-services-team (Hardware), ''Patch-For-Review: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#10048675 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudcephosd1038.eq...'
2024-08-07 15:40:18 <brett> !log stop pybal on lvs2013 for server reboot
2024-08-07 15:40:19 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2024-08-07 15:43:09 <wikibugs> 'SRE, ''collaboration-services, ''Continuous-Integration-Infrastructure, ''Jenkins, ''Release-Engineering-Team (Seen): Upgrade CI Jenkins ssh key to ecdsa - https://phabricator.wikimedia.org/T177826#10048688 (''Dzahn) All sounds good. Thank you! Also T371930#10047573 sounds like good progress is already...'
2024-08-07 15:43:40 <bd808> I have hacked stashbot to work around the problem from T371977 that this week's train has triggered in the mwclient python library. My hack is very hacky, but should be fine until a proper fix is introduced upstream.
2024-08-07 15:43:41 <stashbot> T371977: Removal of writeapi from siteinfo output breaks all mwclient-based bots, including stashbot (Server Admin Log) - https://phabricator.wikimedia.org/T371977
2024-08-07 15:47:50 <jinxer-wm> FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
2024-08-07 15:49:31 <wikibugs> ('CR) ''Dzahn: [C:''+1] "looks all good and approved to me!" [puppet] - ''https://gerrit.wikimedia.org/r/1060338 (https://phabricator.wikimedia.org/T371650) (owner: ''Slyngshede)'
2024-08-07 15:51:02 <wikibugs> ('CR) ''Elukey: "LGTM! Could you run the puppet compiler on some random nodes (including librenms etc..) so we double check that we are good?" [puppet] - ''https://gerrit.wikimedia.org/r/1060458 (owner: ''Cathal Mooney)'
2024-08-07 15:52:50 <jinxer-wm> RESOLVED: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
2024-08-07 15:54:16 <logmsgbot> !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1038.eqiad.wmnet with reason: host reimage
2024-08-07 15:54:41 <wikibugs> ('PS2) ''Dzahn: zuul: replace ferm::service with firewall::service [puppet] - ''https://gerrit.wikimedia.org/r/1057930 (https://phabricator.wikimedia.org/T370677)'
2024-08-07 15:56:32 <wikibugs> ('CR) ''Dzahn: [V:''-1] "https://puppet-compiler.wmflabs.org/output/1057930/3581/contint2002.wikimedia.org/change.contint2002.wikimedia.org.err" [puppet] - ''https://gerrit.wikimedia.org/r/1057930 (https://phabricator.wikimedia.org/T370677) (owner: ''Dzahn)'
2024-08-07 15:57:01 <logmsgbot> !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1038.eqiad.wmnet with reason: host reimage
2024-08-07 15:57:05 <wikibugs> ('PS1) ''Ahmon Dancy: mw-web: train-dev: Supply placeholder for STATSD_EXPORTER_PROMETHEUS_SERVICE_HOST [deployment-charts] - ''https://gerrit.wikimedia.org/r/1060464'
2024-08-07 15:57:42 <wikibugs> 'ops-codfw, ''SRE, ''DC-Ops, ''Machine-Learning-Team: hw troubleshooting: ml-serve2001.codfw.wmnet: continued uncorrectable ECC errors - https://phabricator.wikimedia.org/T371872#10048716 (''klausman) ''Open''Resolved Machine has had DIMM replaced and is back in service.'
2024-08-07 15:58:38 <wikibugs> ('PS3) ''Dzahn: zuul: replace ferm::service with firewall::service [puppet] - ''https://gerrit.wikimedia.org/r/1057930 (https://phabricator.wikimedia.org/T370677)'
2024-08-07 15:59:53 <wikibugs> 'ops-eqiad, ''DC-Ops, ''serviceops: Q1:rack/setup/install mc-misc100[12] - https://phabricator.wikimedia.org/T371987 (''RobH) ''NEW'
2024-08-07 16:00:42 <wikibugs> 'ops-codfw, ''SRE, ''DC-Ops, ''serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10048757 (''JMeybohm)'
2024-08-07 16:01:24 <logmsgbot> !log elukey@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:aqs-eqiad: Openjdk upgrade - elukey@cumin1002
2024-08-07 16:01:49 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops, ''serviceops: Install (2) 960GB SSDs each in kafka-main10[06-10] - https://phabricator.wikimedia.org/T371422#10048760 (''JMeybohm)'
2024-08-07 16:03:14 <wikibugs> 'ops-eqiad, ''DC-Ops, ''serviceops: Q1:rack/setup/install mc-misc100[12] - https://phabricator.wikimedia.org/T371987#10048763 (''RobH) a:''jijiki Effie, The workflow for racking tasks has changed this quarter, once I create the racking task I assign it to the SRE sub-teams point of contact (for this task...'
2024-08-07 16:03:22 <wikibugs> 'SRE, ''LDAP-Access-Requests: Grant Access to wmf for Arthur taylor - https://phabricator.wikimedia.org/T371958#10048767 (''dancy) >>! In T371958#10047933, @Jelto wrote: > afaik there is [automation](https://gitlab.wikimedia.org/repos/releng/gitlab-settings/-/blob/main/group-management/sync-gitlab-group-with-...'
2024-08-07 16:03:44 <wikibugs> 'ops-eqiad, ''DC-Ops, ''serviceops: Q1:rack/setup/install mc-misc100[12] - https://phabricator.wikimedia.org/T371987#10048782 (''RobH)'
2024-08-07 16:03:51 <wikibugs> 'ops-codfw, ''SRE, ''DC-Ops, ''serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10048783 (''JMeybohm) The nodes are not in service, so no need to schedule a maint-window from our side. Feel free to choose a time that suits you best.'
2024-08-07 16:03:54 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops, ''serviceops: Install (2) 960GB SSDs each in kafka-main10[06-10] - https://phabricator.wikimedia.org/T371422#10048784 (''JMeybohm) The nodes are not in service, so no need to schedule a maint-window from our side. Feel free to choose a time that suits you best.'
2024-08-07 16:05:14 <wikibugs> ('CR) ''Vgutierrez: ACMEChiefConfig: Automated MarkMonitor domain sync (''2 comments) [puppet] - ''https://gerrit.wikimedia.org/r/1055232 (owner: ''Ncmonitor)'
2024-08-07 16:08:10 <logmsgbot> !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host lvs2013.codfw.wmnet
2024-08-07 16:09:59 <wikibugs> 'ops-eqiad, ''SRE, ''DBA, ''DC-Ops: Degraded RAID on db1174 - https://phabricator.wikimedia.org/T371927#10048826 (''VRiley-WMF) a:''VRiley-WMF'
2024-08-07 16:11:12 <logmsgbot> !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs2013.codfw.wmnet
2024-08-07 16:12:06 <wikibugs> 'ops-eqiad, ''SRE, ''DBA, ''DC-Ops: Degraded RAID on db1174 - https://phabricator.wikimedia.org/T371927#10048831 (''VRiley-WMF) Warranty on server has expired. Located another SSD from Decommed servers. Swapped drive in slot 6 as per iDRAC error indicated.'
2024-08-07 16:13:21 <wikibugs> ('CR) ''Dzahn: [V:''+1 C:''+1] "at last a simple one again: https://puppet-compiler.wmflabs.org/output/1057928/3583/contint2002.wikimedia.org/index.html" [puppet] - ''https://gerrit.wikimedia.org/r/1057928 (https://phabricator.wikimedia.org/T370677) (owner: ''Dzahn)'
2024-08-07 16:13:21 <wikibugs> 'ops-eqiad, ''SRE, ''DBA, ''DC-Ops: Degraded RAID on db1174 - https://phabricator.wikimedia.org/T371927#10048834 (''Ladsgroup) let me depool it. Let me know when you want it shut off.'
2024-08-07 16:14:52 <logmsgbot> !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depool db1174', diff saved to https://phabricator.wikimedia.org/P67246 and previous config saved to /var/cache/conftool/dbconfig/20240807-161452-ladsgroup.json
2024-08-07 16:15:26 <wikibugs> 'ops-eqiad, ''SRE, ''DBA, ''DC-Ops: Degraded RAID on db1174 - https://phabricator.wikimedia.org/T371927#10048835 (''Ladsgroup) depooled.'
2024-08-07 16:15:35 <logmsgbot> !log andrew@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1038.eqiad.wmnet with OS bullseye
2024-08-07 16:15:48 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops, ''cloud-services-team (Hardware), ''Patch-For-Review: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#10048836 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host cloudcephosd1038.eqiad....'
2024-08-07 16:17:11 <wikibugs> ('CR) ''Scott French: [C:''+2] "This seems like a reasonable fix, but also suggests a subtle difference in the inherited configuration between mw-web and mw-debug." [deployment-charts] - ''https://gerrit.wikimedia.org/r/1060464 (owner: ''Ahmon Dancy)'
2024-08-07 16:18:13 <wikibugs> ('Merged) ''jenkins-bot: mw-web: train-dev: Supply placeholder for STATSD_EXPORTER_PROMETHEUS_SERVICE_HOST [deployment-charts] - ''https://gerrit.wikimedia.org/r/1060464 (owner: ''Ahmon Dancy)'
2024-08-07 16:20:28 <wikibugs> 'ops-eqiad, ''SRE, ''DBA, ''DC-Ops: Degraded RAID on db1174 - https://phabricator.wikimedia.org/T371927#10048839 (''Ladsgroup) I will do some checks before repooling'
2024-08-07 16:21:16 <wikibugs> ('PS1) ''BryanDavis: Revert "Drop writeapi flag from siteinfo API" [core] (wmf/1.43.0-wmf.17) - ''https://gerrit.wikimedia.org/r/1060468 (https://phabricator.wikimedia.org/T115414)'
2024-08-07 16:27:03 <brett> !log start pybal on lvs2013
2024-08-07 16:27:04 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2024-08-07 16:34:23 <jinxer-wm> FIRING: SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2024-08-07 16:34:53 <logmsgbot> !log milimetric@deploy1003 Started deploy [analytics/refinery@0d25645]: Syncing browser general script, and refinery-source 0.2.45 apparently
2024-08-07 16:37:41 <mutante> !log puppetserver1002 systemctl start dump_ip_reputation
2024-08-07 16:37:42 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2024-08-07 16:39:23 <jinxer-wm> RESOLVED: SystemdUnitFailed: dump_ip_reputation.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2024-08-07 16:42:36 <brett> !log stop pybal on lvs2014 for server reboot
2024-08-07 16:42:37 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2024-08-07 16:52:49 <wikibugs> 'ops-codfw, ''SRE, ''DC-Ops, ''fundraising-tech-ops: Q1:rack/setup/install civi2002, frpig2002, frpm2002 - https://phabricator.wikimedia.org/T369937#10048986 (''Jhancock.wm)'
2024-08-07 16:54:04 <brennen> jouncebot nowandnext
2024-08-07 16:54:05 <jouncebot> No deployments scheduled for the next 0 hour(s) and 5 minute(s)
2024-08-07 16:54:05 <jouncebot> In 0 hour(s) and 5 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240807T1700)
2024-08-07 16:56:23 <brennen> i'm going to roll out https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1060468 for a train blocker
2024-08-07 16:56:28 <wikibugs> 'ops-codfw, ''SRE, ''DC-Ops, ''fundraising-tech-ops: Q1:rack/setup/install civi2002, frpig2002, frpm2002 - https://phabricator.wikimedia.org/T369937#10049000 (''Jhancock.wm) a:''Jhancock.wm''Papaul @Papaul ready for your part civi2002 ETH1 <> FASW-C8A eth-0/0/37 ETH2 <> FASW-C8B eth-1/0/37 frpig200...'
2024-08-07 16:56:51 <wikibugs> ('CR) ''TrainBranchBot: [C:''+2] "Approved by brennen@deploy1003 using scap backport" [core] (wmf/1.43.0-wmf.17) - ''https://gerrit.wikimedia.org/r/1060468 (https://phabricator.wikimedia.org/T115414) (owner: ''BryanDavis)'
2024-08-07 16:58:44 <wikibugs> 'ops-codfw, ''SRE, ''DC-Ops, ''Machine-Learning-Team: Q1:rack/setup/install ml-serve20[09-11] - https://phabricator.wikimedia.org/T371920#10049008 (''Jhancock.wm)'
2024-08-07 17:00:05 <jouncebot> Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240807T1700)
2024-08-07 17:03:31 <wikibugs> ('PS2) ''DCausse: search: index stems for mul labels [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1060430 (https://phabricator.wikimedia.org/T371401)'
2024-08-07 17:03:31 <wikibugs> ('PS3) ''DCausse: search: use the stem field when searching mul labels [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1060433 (https://phabricator.wikimedia.org/T371401)'
2024-08-07 17:03:31 <wikibugs> ('PS2) ''DCausse: search: use mul fallback for manually-tuned search profiles [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1060449 (https://phabricator.wikimedia.org/T371401)'
2024-08-07 17:07:33 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1296.eqiad.wmnet with OS bullseye
2024-08-07 17:07:41 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops, ''serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10049048 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1296.eqiad.wmnet with OS bull...'
2024-08-07 17:08:23 <logmsgbot> !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host lvs2014.codfw.wmnet
2024-08-07 17:11:18 <logmsgbot> !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs2014.codfw.wmnet
2024-08-07 17:11:48 <wikibugs> ('CR) ''Dzahn: [V:''+1 C:''+2] ci: replace ferm::service with firewall::service in data_rsync [puppet] - ''https://gerrit.wikimedia.org/r/1057928 (https://phabricator.wikimedia.org/T370677) (owner: ''Dzahn)'
2024-08-07 17:14:50 <brett> !log start pybal on lvs2014
2024-08-07 17:14:51 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2024-08-07 17:15:32 <wikibugs> ('CR) ''Dzahn: [V:''+1 C:''+2] "noop - all that happens here is that a config file got renamed (underscore vs hyphen) - no change to actual firewall rules" [puppet] - ''https://gerrit.wikimedia.org/r/1057928 (https://phabricator.wikimedia.org/T370677) (owner: ''Dzahn)'
2024-08-07 17:16:06 <wikibugs> 'ops-codfw, ''SRE, ''DC-Ops, ''fundraising-tech-ops: Q1:rack/setup/install frlog2002 - https://phabricator.wikimedia.org/T369935#10049067 (''Jhancock.wm)'
2024-08-07 17:17:01 <brett> !log stop pybal on lvs1019 for server reboot
2024-08-07 17:17:02 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2024-08-07 17:17:58 <wikibugs> 'ops-codfw, ''SRE, ''DC-Ops, ''fundraising-tech-ops: Q1:rack/setup/install frlog2002 - https://phabricator.wikimedia.org/T369935#10049070 (''Jhancock.wm) a:''Jhancock.wm''Papaul @Papaul this one is ready for you. ETH1 <> FASW-C8A eth-0/0/36 ETH2 <> FASW-C8B eth-0/1/36'
2024-08-07 17:27:13 <wikibugs> ('Merged) ''jenkins-bot: Revert "Drop writeapi flag from siteinfo API" [core] (wmf/1.43.0-wmf.17) - ''https://gerrit.wikimedia.org/r/1060468 (https://phabricator.wikimedia.org/T115414) (owner: ''BryanDavis)'
2024-08-07 17:27:31 <logmsgbot> !log brennen@deploy1003 Started scap sync-world: Backport for [[gerrit:1060468|Revert "Drop writeapi flag from siteinfo API" (T115414 T294397 T371977)]]
2024-08-07 17:28:38 <stashbot> T115414: Remove the ability to disable the API with $wgEnableAPI - https://phabricator.wikimedia.org/T115414
2024-08-07 17:28:38 <stashbot> T294397: Drop writeapi MediaWiki right - https://phabricator.wikimedia.org/T294397
2024-08-07 17:28:38 <stashbot> T371977: Removal of writeapi from siteinfo output breaks all mwclient-based bots, including stashbot (Server Admin Log) - https://phabricator.wikimedia.org/T371977
2024-08-07 17:29:14 <logmsgbot> !log milimetric@deploy1003 Finished deploy [analytics/refinery@0d25645]: Syncing browser general script, and refinery-source 0.2.45 apparently (duration: 54m 21s)
2024-08-07 17:29:44 <logmsgbot> !log brennen@deploy1003 brennen, bd808: Backport for [[gerrit:1060468|Revert "Drop writeapi flag from siteinfo API" (T115414 T294397 T371977)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
2024-08-07 17:30:34 <logmsgbot> !log milimetric@deploy1003 Started deploy [analytics/refinery@0d25645] (thin): Syncing browser general script, and refinery-source 0.2.45 apparently
2024-08-07 17:31:08 <logmsgbot> !log brennen@deploy1003 brennen, bd808: Continuing with sync
2024-08-07 17:34:56 <logmsgbot> !log milimetric@deploy1003 Finished deploy [analytics/refinery@0d25645] (thin): Syncing browser general script, and refinery-source 0.2.45 apparently (duration: 04m 21s)
2024-08-07 17:35:37 <logmsgbot> !log brennen@deploy1003 Finished scap: Backport for [[gerrit:1060468|Revert "Drop writeapi flag from siteinfo API" (T115414 T294397 T371977)]] (duration: 08m 06s)
2024-08-07 17:35:41 <stashbot> T115414: Remove the ability to disable the API with $wgEnableAPI - https://phabricator.wikimedia.org/T115414
2024-08-07 17:35:42 <stashbot> T294397: Drop writeapi MediaWiki right - https://phabricator.wikimedia.org/T294397
2024-08-07 17:35:42 <stashbot> T371977: Removal of writeapi from siteinfo output breaks all mwclient-based bots, including stashbot (Server Admin Log) - https://phabricator.wikimedia.org/T371977
2024-08-07 17:35:55 <wikibugs> 'SRE, ''collaboration-services, ''Continuous-Integration-Infrastructure, ''Jenkins, ''Release-Engineering-Team (Seen): Upgrade CI Jenkins ssh key to ecdsa - https://phabricator.wikimedia.org/T177826#10049124 (''Dzahn) Turns out there is another jenkins SSH key here: https://phabricator.wikimedia.org/au...'
2024-08-07 17:36:03 <wikibugs> ('CR) ''Bking: [C:''+1] knative-serving: Switch components to use Calico Netpolicies [deployment-charts] - ''https://gerrit.wikimedia.org/r/1060452 (owner: ''Klausman)'
2024-08-07 17:40:07 <wikibugs> ('CR) ''Ssingh: [C:''+1] "I am not sure if this is supposed to go under @ or under a specific spop1024 record so going with this for now and we can see. Since it's " [dns] - ''https://gerrit.wikimedia.org/r/1059412 (https://phabricator.wikimedia.org/T370963) (owner: ''Dwisehaupt)'
2024-08-07 17:40:14 <wikibugs> ('CR) ''Ssingh: [C:''+2] Add yahoo-verification-key for Complaint Feedback Loop [dns] - ''https://gerrit.wikimedia.org/r/1059412 (https://phabricator.wikimedia.org/T370963) (owner: ''Dwisehaupt)'
2024-08-07 17:40:26 <wikibugs> ('PS2) ''Ssingh: Add yahoo-verification-key for Complaint Feedback Loop [dns] - ''https://gerrit.wikimedia.org/r/1059412 (https://phabricator.wikimedia.org/T370963) (owner: ''Dwisehaupt)'
2024-08-07 17:41:26 <jinxer-wm> FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
2024-08-07 17:41:29 <sukhe> !log running authdns-update for Yahoo CFL TXT record: T370963
2024-08-07 17:41:31 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2024-08-07 17:41:31 <stashbot> T370963: Add a TXT record to the Yahoo sending domain - https://phabricator.wikimedia.org/T370963
2024-08-07 17:44:44 <wikibugs> ('CR) ''Ssingh: "recheck" [dns] - ''https://gerrit.wikimedia.org/r/1059412 (https://phabricator.wikimedia.org/T370963) (owner: ''Dwisehaupt)'
2024-08-07 17:45:21 <wikibugs> ('PS1) ''Dzahn: ci: replace ferm::service with firewall::service for jenkinsagent [puppet] - ''https://gerrit.wikimedia.org/r/1060483 (https://phabricator.wikimedia.org/T370677)'
2024-08-07 17:49:56 <wikibugs> ('PS1) ''Ssingh: wikimedia.org: dummy change to check auto-review [dns] - ''https://gerrit.wikimedia.org/r/1060484'
2024-08-07 17:50:03 <wikibugs> 'ops-codfw, ''SRE, ''DC-Ops, ''fundraising-tech-ops: Q1:rack/setup/install frlog2002 - https://phabricator.wikimedia.org/T369935#10049142 (''Papaul) ` [edit interfaces interface-range disabled] - member ge-0/0/36; - member ge-1/0/36; [edit interfaces interface-range vlan-administration] member...'
2024-08-07 17:53:40 <wikibugs> ('CR) ''Dzahn: [V:''-1] "https://puppet-compiler.wmflabs.org/output/1060483/3584/contint1002.wikimedia.org/change.contint1002.wikimedia.org.err" [puppet] - ''https://gerrit.wikimedia.org/r/1060483 (https://phabricator.wikimedia.org/T370677) (owner: ''Dzahn)'
2024-08-07 17:54:17 <logmsgbot> !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1038.eqiad.wmnet with OS bullseye
2024-08-07 17:54:34 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops, ''cloud-services-team (Hardware), ''Patch-For-Review: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#10049148 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudcephosd1038.eq...'
2024-08-07 17:56:57 <wikibugs> ('PS2) ''Dzahn: ci: replace ferm::service with firewall::service for jenkinsagent [puppet] - ''https://gerrit.wikimedia.org/r/1060483 (https://phabricator.wikimedia.org/T370677)'
2024-08-07 17:57:35 <wikibugs> ('CR) ''Ssingh: "Nice, that worked. Abandoning." [dns] - ''https://gerrit.wikimedia.org/r/1060484 (owner: ''Ssingh)'
2024-08-07 17:58:29 <wikibugs> ('Abandoned) ''Ssingh: wikimedia.org: dummy change to check auto-review [dns] - ''https://gerrit.wikimedia.org/r/1060484 (owner: ''Ssingh)'
2024-08-07 18:00:04 <jouncebot> jnuche and brennen: Time to do the MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240807T1800).
2024-08-07 18:02:12 <wikibugs> ('PS24) ''CDobbins: varnish: Add restrictive CSP to upload.wikimedia.org and add tests [puppet] - ''https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618)'
2024-08-07 18:05:21 <wikibugs> ('PS1) ''BCornwall: dummy change to check auto-review [dns] - ''https://gerrit.wikimedia.org/r/1060485'
2024-08-07 18:06:12 <wikibugs> ('CR) ''Pppery: "Ideally some of these domains would point to more specific places rather than wikimedia.org, like wiktionary.app -> wiktionary.org instead" [puppet] - ''https://gerrit.wikimedia.org/r/1055231 (owner: ''Ncmonitor)'
2024-08-07 18:06:29 <wikibugs> ('CR) ''Ebernhardson: [C:''+1] search: index stems for mul labels [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1060430 (https://phabricator.wikimedia.org/T371401) (owner: ''DCausse)'
2024-08-07 18:06:50 <jinxer-wm> FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
2024-08-07 18:07:34 <wikibugs> ('PS3) ''Dzahn: ci: replace ferm::service with firewall::service for jenkinsagent [puppet] - ''https://gerrit.wikimedia.org/r/1060483 (https://phabricator.wikimedia.org/T370677)'
2024-08-07 18:09:22 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1296.eqiad.wmnet with reason: host reimage
2024-08-07 18:10:56 <wikibugs> ('PS4) ''Dzahn: ci: replace ferm::service with firewall::service for jenkinsagent [puppet] - ''https://gerrit.wikimedia.org/r/1060483 (https://phabricator.wikimedia.org/T370677)'
2024-08-07 18:11:58 <logmsgbot> !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1296.eqiad.wmnet with reason: host reimage
2024-08-07 18:12:05 <logmsgbot> !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1038.eqiad.wmnet with reason: host reimage
2024-08-07 18:12:33 <wikibugs> ('CR) ''Dzahn: "@hashar No more need to do the resolve part and no more need to join the array elements. It all just works now when passing an array strai" [puppet] - ''https://gerrit.wikimedia.org/r/1060483 (https://phabricator.wikimedia.org/T370677) (owner: ''Dzahn)'
2024-08-07 18:12:39 <wikibugs> ('CR) ''Dzahn: [V:''+1] ci: replace ferm::service with firewall::service for jenkinsagent [puppet] - ''https://gerrit.wikimedia.org/r/1060483 (https://phabricator.wikimedia.org/T370677) (owner: ''Dzahn)'
2024-08-07 18:13:10 <wikibugs> ('Abandoned) ''Ssingh: dummy change to check auto-review [dns] - ''https://gerrit.wikimedia.org/r/1060485 (owner: ''BCornwall)'
2024-08-07 18:14:17 <wikibugs> ('PS25) ''CDobbins: varnish: Add restrictive CSP to upload.wikimedia.org and add tests [puppet] - ''https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618)'
2024-08-07 18:14:40 <logmsgbot> !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1038.eqiad.wmnet with reason: host reimage
2024-08-07 18:15:20 <wikibugs> 'SRE, ''SRE-Access-Requests, ''Data-Engineering, ''Patch-For-Review: Requesting Kerberos access for xiaoxiao - https://phabricator.wikimedia.org/T369517#10049179 (''XiaoXiao-WMF) Hi! I have followed the email instruction and I have done this step on May 23rd, and now I log into the stat machine I still...'
2024-08-07 18:15:26 <wikibugs> 'SRE, ''SRE-Access-Requests, ''Data-Engineering, ''Patch-For-Review: Requesting Kerberos access for xiaoxiao - https://phabricator.wikimedia.org/T369517#10049184 (''XiaoXiao-WMF) ''Resolved''Open'
2024-08-07 18:17:02 <wikibugs> 'SRE, ''SRE-Access-Requests, ''Data-Engineering, ''Patch-For-Review: Requesting Kerberos access for xiaoxiao - https://phabricator.wikimedia.org/T369517#10049185 (''XiaoXiao-WMF) a:''Clement_Goubert''None'
2024-08-07 18:17:26 <logmsgbot> !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host lvs1019.eqiad.wmnet
2024-08-07 18:18:37 <wikibugs> ('PS26) ''CDobbins: varnish: Add restrictive CSP to upload.wikimedia.org and add tests [puppet] - ''https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618)'
2024-08-07 18:19:26 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T371949#10049204 (''Jclark-ctr) p:''Triage''Low a:''Jclark-ctr These can be ignored i am process of imaging these servers and are single power at this time'
2024-08-07 18:20:56 <logmsgbot> !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1019.eqiad.wmnet
2024-08-07 18:21:59 <wikibugs> ('PS27) ''CDobbins: varnish: Add restrictive CSP to upload.wikimedia.org and add tests [puppet] - ''https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618)'
2024-08-07 18:22:59 <logmsgbot> !log milimetric@deploy1003 Started deploy [analytics/refinery@fe20690]: Syncing browser general script hive version
2024-08-07 18:28:56 <logmsgbot> !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1296.eqiad.wmnet with OS bullseye
2024-08-07 18:29:09 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops, ''serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10049233 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1296.eqiad.wmnet with OS bullseye...'
2024-08-07 18:30:03 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1296.eqiad.wmnet with OS bullseye
2024-08-07 18:30:12 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops, ''serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10049237 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1296.eqiad.wmnet with OS bull...'
2024-08-07 18:32:05 <wikibugs> 'ops-eqiad, ''SRE, ''Data-Persistence, ''DC-Ops: db1238 bus critical errors - https://phabricator.wikimedia.org/T371342#10049244 (''Jclark-ctr) a:''VRiley-WMF'
2024-08-07 18:32:05 <logmsgbot> !log andrew@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1038.eqiad.wmnet with OS bullseye
2024-08-07 18:32:16 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops, ''cloud-services-team (Hardware), ''Patch-For-Review: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#10049245 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host cloudcephosd1038.eqiad....'
2024-08-07 18:32:29 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1296.eqiad.wmnet with reason: host reimage
2024-08-07 18:33:36 <brett> !log start pybal on lvs1019
2024-08-07 18:33:38 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2024-08-07 18:35:06 <logmsgbot> !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1296.eqiad.wmnet with reason: host reimage
2024-08-07 18:36:54 <wikibugs> ('CR) ''Scott French: [C:''+2] "I'll go ahead and approve / merge this now, as there is no change in sudo rights with this patch - only preparation for a change in entryp" [puppet] - ''https://gerrit.wikimedia.org/r/1059942 (https://phabricator.wikimedia.org/T371904) (owner: ''Ahmon Dancy)'
2024-08-07 18:39:05 <logmsgbot> !log milimetric@deploy1003 Finished deploy [analytics/refinery@fe20690]: Syncing browser general script hive version (duration: 16m 05s)
2024-08-07 18:40:26 <brett> !log stop pybal on lvs1018 for server reboot
2024-08-07 18:40:27 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2024-08-07 18:45:28 <logmsgbot> !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1296.eqiad.wmnet with OS bullseye
2024-08-07 18:45:37 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops, ''serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10049259 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1296.eqiad.wmnet with OS bullseye...'
2024-08-07 18:55:19 <wikibugs> ('CR) ''Andrew Bogott: [C:''+2] wmfsink: hook delete.end rather than delete.start [puppet] - ''https://gerrit.wikimedia.org/r/1060172 (https://phabricator.wikimedia.org/T371707) (owner: ''Andrew Bogott)'
2024-08-07 18:55:38 <wikibugs> ('PS2) ''Andrew Bogott: wmfsink: hook delete.end rather than delete.start [puppet] - ''https://gerrit.wikimedia.org/r/1060172 (https://phabricator.wikimedia.org/T371707)'
2024-08-07 18:55:38 <wikibugs> ('PS10) ''Andrew Bogott: dynamic proxy: Add an endpoint for scrubbing out nonexistent backends [puppet] - ''https://gerrit.wikimedia.org/r/1059958 (https://phabricator.wikimedia.org/T371707)'
2024-08-07 18:55:38 <wikibugs> ('PS12) ''Andrew Bogott: wmf_sink: replace targeted proxy cleanup with project-wide cleanup [puppet] - ''https://gerrit.wikimedia.org/r/1059959 (https://phabricator.wikimedia.org/T371707)'
2024-08-07 18:56:14 <wikibugs> ('PS1) ''Brennen Bearnes: Fix TypeError in PendingChanges by handling null subPage [extensions/FlaggedRevs] (wmf/1.43.0-wmf.17) - ''https://gerrit.wikimedia.org/r/1060489 (https://phabricator.wikimedia.org/T371986)'
2024-08-07 18:56:21 <wikibugs> ('PS2) ''Jforrester: Fix TypeError in PendingChanges by handling null subPage [extensions/FlaggedRevs] (wmf/1.43.0-wmf.17) - ''https://gerrit.wikimedia.org/r/1060489 (https://phabricator.wikimedia.org/T371986) (owner: ''Brennen Bearnes)'
2024-08-07 18:56:43 <logmsgbot> !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host lvs1018.eqiad.wmnet
2024-08-07 18:57:17 <brennen> James_F: jinx
2024-08-07 18:57:34 <James_F> Oops, sorry for the clash pick brennen. The perils of doing this from my phone at Wikimania. :-)
2024-08-07 18:57:50 <James_F> Thank you for looking after the train!
2024-08-07 18:58:06 <brennen> i shall deploy, you go enjoy wikimania. :)
2024-08-07 18:59:55 <logmsgbot> !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1018.eqiad.wmnet
2024-08-07 19:00:34 <wikibugs> ('CR) ''TrainBranchBot: [C:''+2] "Approved by brennen@deploy1003 using scap backport" [extensions/FlaggedRevs] (wmf/1.43.0-wmf.17) - ''https://gerrit.wikimedia.org/r/1060489 (https://phabricator.wikimedia.org/T371986) (owner: ''Brennen Bearnes)'
2024-08-07 19:00:39 <wikibugs> ('CR) ''CDobbins: varnish: Add restrictive CSP to upload.wikimedia.org and add tests (''3 comments) [puppet] - ''https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618) (owner: ''CDobbins)'
2024-08-07 19:00:49 <brett> !log start pybal on lvs1018
2024-08-07 19:00:50 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2024-08-07 19:04:25 <brett> !log stop pybal on lvs1017 for server reboot
2024-08-07 19:04:25 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2024-08-07 19:08:06 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.dns.netbox
2024-08-07 19:10:00 <wikibugs> ('Merged) ''jenkins-bot: Fix TypeError in PendingChanges by handling null subPage [extensions/FlaggedRevs] (wmf/1.43.0-wmf.17) - ''https://gerrit.wikimedia.org/r/1060489 (https://phabricator.wikimedia.org/T371986) (owner: ''Brennen Bearnes)'
2024-08-07 19:10:22 <logmsgbot> !log brennen@deploy1003 Started scap sync-world: Backport for [[gerrit:1060489|Fix TypeError in PendingChanges by handling null subPage (T371986)]]
2024-08-07 19:10:30 <stashbot> T371986: TypeError: Argument 1 passed to PendingChanges::parseParams() must be of the type string, null given - https://phabricator.wikimedia.org/T371986
2024-08-07 19:11:08 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt gerrit1004 - jclark@cumin1002"
2024-08-07 19:11:11 <logmsgbot> !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt gerrit1004 - jclark@cumin1002"
2024-08-07 19:11:11 <logmsgbot> !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
2024-08-07 19:11:30 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host gerrit1004.mgmt.eqiad.wmnet with reboot policy FORCED
2024-08-07 19:12:29 <logmsgbot> !log brennen@deploy1003 brennen: Backport for [[gerrit:1060489|Fix TypeError in PendingChanges by handling null subPage (T371986)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
2024-08-07 19:14:13 <logmsgbot> !log brennen@deploy1003 brennen: Continuing with sync
2024-08-07 19:18:46 <logmsgbot> !log brennen@deploy1003 Finished scap: Backport for [[gerrit:1060489|Fix TypeError in PendingChanges by handling null subPage (T371986)]] (duration: 08m 23s)
2024-08-07 19:18:53 <stashbot> T371986: TypeError: Argument 1 passed to PendingChanges::parseParams() must be of the type string, null given - https://phabricator.wikimedia.org/T371986
2024-08-07 19:29:09 <logmsgbot> !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host lvs1017.eqiad.wmnet
2024-08-07 19:32:25 <logmsgbot> !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1017.eqiad.wmnet
2024-08-07 19:33:04 <brett> !log start pybal on lvs1017
2024-08-07 19:33:05 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2024-08-07 19:37:42 <logmsgbot> !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host gerrit1004.mgmt.eqiad.wmnet with reboot policy FORCED
2024-08-07 19:38:50 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host gerrit1004.wikimedia.org with OS bookworm
2024-08-07 19:38:58 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops, ''serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10049405 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host gerrit1004.wikimedia.org with OS bookworm'
2024-08-07 19:39:27 <logmsgbot> !log ebernhardson@deploy1003 Started deploy [airflow-dags/search@049c09e]: workaround process_sparql_query oom issues
2024-08-07 19:39:48 <logmsgbot> !log ebernhardson@deploy1003 Finished deploy [airflow-dags/search@049c09e]: workaround process_sparql_query oom issues (duration: 00m 20s)
2024-08-07 19:39:52 <wikibugs> 'ops-eqiad, ''DC-Ops: Inbound interface errors - https://phabricator.wikimedia.org/T372001 (''phaultfinder) ''NEW'
2024-08-07 19:42:14 <wikibugs> ('CR) ''Andrew Bogott: "recheck" [puppet] - ''https://gerrit.wikimedia.org/r/1060172 (https://phabricator.wikimedia.org/T371707) (owner: ''Andrew Bogott)'
2024-08-07 19:43:53 <wikibugs> 'ops-eqiad, ''SRE, ''collaboration-services, ''DC-Ops: Q1:rack/setup/install gerrit1004 - https://phabricator.wikimedia.org/T369671#10049413 (''Jclark-ctr)'
2024-08-07 19:45:36 <logmsgbot> !log milimetric@deploy1003 Started deploy [airflow-dags/analytics@049c09e]: Deploying new Browser General job
2024-08-07 19:46:17 <logmsgbot> !log milimetric@deploy1003 Finished deploy [airflow-dags/analytics@049c09e]: Deploying new Browser General job (duration: 00m 41s)
2024-08-07 19:47:33 <logmsgbot> !log milimetric@deploy1003 Started deploy [airflow-dags/analytics@049c09e]: Deploying new Browser General job
2024-08-07 19:47:36 <logmsgbot> !log milimetric@deploy1003 Finished deploy [airflow-dags/analytics@049c09e]: Deploying new Browser General job (duration: 00m 02s)
2024-08-07 19:51:17 <logmsgbot> !log milimetric@deploy1003 Started deploy [airflow-dags/analytics@216348d]: (no justification provided)
2024-08-07 19:52:04 <logmsgbot> !log milimetric@deploy1003 Finished deploy [airflow-dags/analytics@216348d]: (no justification provided) (duration: 00m 47s)
2024-08-07 19:52:48 <logmsgbot> !log milimetric@deploy1003 Started deploy [airflow-dags/analytics@049c09e]: (no justification provided)
2024-08-07 19:53:47 <logmsgbot> !log milimetric@deploy1003 Finished deploy [airflow-dags/analytics@049c09e]: (no justification provided) (duration: 00m 59s)
2024-08-07 19:55:42 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on gerrit1004.wikimedia.org with reason: host reimage
2024-08-07 19:59:07 <logmsgbot> !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gerrit1004.wikimedia.org with reason: host reimage
2024-08-07 20:00:04 <jouncebot> RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240807T2000).
2024-08-07 20:00:04 <jouncebot> cjming: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
2024-08-07 20:00:20 <cjming> i will self-deploy!
2024-08-07 20:00:54 <wikibugs> ('CR) ''TrainBranchBot: [C:''+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1060445 (https://phabricator.wikimedia.org/T366234) (owner: ''Clare Ming)'
2024-08-07 20:01:35 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.dns.netbox
2024-08-07 20:01:38 <wikibugs> ('Merged) ''jenkins-bot: Fix labs config for Metrics Platform vars [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1060445 (https://phabricator.wikimedia.org/T366234) (owner: ''Clare Ming)'
2024-08-07 20:02:25 <cjming> I'll hang out for a little bit if anyone needs anything
2024-08-07 20:04:41 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt vrts1003 - jclark@cumin1002"
2024-08-07 20:04:45 <logmsgbot> !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt vrts1003 - jclark@cumin1002"
2024-08-07 20:04:45 <logmsgbot> !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
2024-08-07 20:08:59 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host vrts1003.mgmt.eqiad.wmnet with reboot policy FORCED
2024-08-07 20:10:25 <jinxer-wm> FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2024-08-07 20:11:27 <logmsgbot> !log milimetric@deploy1003 Started deploy [airflow-dags/analytics@049c09e]: (no justification provided)
2024-08-07 20:11:31 <logmsgbot> !log milimetric@deploy1003 Finished deploy [airflow-dags/analytics@049c09e]: (no justification provided) (duration: 00m 03s)
2024-08-07 20:15:57 <logmsgbot> !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host gerrit1004.wikimedia.org with OS bookworm
2024-08-07 20:16:05 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops, ''serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10049555 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host gerrit1004.wikimedia.org with OS bookworm execut...'
2024-08-07 20:17:28 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host gerrit1004.wikimedia.org with OS bookworm
2024-08-07 20:17:32 <wikibugs> 'ops-eqiad, ''SRE, ''collaboration-services, ''DC-Ops: Q1:rack/setup/install gerrit1004 - https://phabricator.wikimedia.org/T369671#10049557 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host gerrit1004.wikimedia.org with OS bookworm'
2024-08-07 20:19:45 <jinxer-wm> FIRING: Primary outbound port utilisation over 80% #page: Alert for device cloudsw1-d5-eqiad.mgmt.eqiad.wmnet - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
2024-08-07 20:19:45 <jinxer-wm> FIRING: Primary inbound port utilisation over 80% #page: Alert for device cloudsw1-f4-eqiad.mgmt.eqiad.wmnet - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
2024-08-07 20:20:08 <sukhe> hi
2024-08-07 20:20:33 <sukhe> !incidents
2024-08-07 20:20:33 <sirenbot> 4954 (UNACKED) Primary outbound port utilisation over 80% (paged) global noc (cloudsw1-d5-eqiad.mgmt.eqiad.wmnet)
2024-08-07 20:20:33 <sirenbot> 4955 (UNACKED) Primary inbound port utilisation over 80% (paged) global noc (cloudsw1-f4-eqiad.mgmt.eqiad.wmnet)
2024-08-07 20:20:44 <sukhe> !ack 4954
2024-08-07 20:20:44 <sirenbot> 4954 (ACKED) Primary outbound port utilisation over 80% (paged) global noc (cloudsw1-d5-eqiad.mgmt.eqiad.wmnet)
2024-08-07 20:20:53 <sukhe> !ack 4955
2024-08-07 20:20:53 <sirenbot> 4955 (ACKED) Primary inbound port utilisation over 80% (paged) global noc (cloudsw1-f4-eqiad.mgmt.eqiad.wmnet)
2024-08-07 20:21:02 <cjming> !log end of UTC late backport window
2024-08-07 20:21:03 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2024-08-07 20:21:04 <hashar> that is nice
2024-08-07 20:21:17 <hashar> that is the first time I notice sirenbot and it LOOKS SO RAD
2024-08-07 20:21:35 <sukhe> still looking
2024-08-07 20:23:28 <mutante> here. acked that from mobile. limited to cloud
2024-08-07 20:24:45 <jinxer-wm> RESOLVED: Primary outbound port utilisation over 80% #page: Device cloudsw1-d5-eqiad.mgmt.eqiad.wmnet recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
2024-08-07 20:24:45 <jinxer-wm> RESOLVED: Primary inbound port utilisation over 80% #page: Device cloudsw1-f4-eqiad.mgmt.eqiad.wmnet recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
2024-08-07 20:24:50 <sukhe> :P
2024-08-07 20:25:05 <mutante> and that was that...
2024-08-07 20:26:04 <sukhe> well that was a spike alright, looking at librenms
2024-08-07 20:29:28 <mutante> looking at the netstats section for that device.. i don't even see it?
2024-08-07 20:30:22 <sukhe> mutante: this https://librenms.wikimedia.org/graphs/to=1723062300/device=242/type=device_bits/from=1722975900/legend=no/
2024-08-07 20:31:16 <mutante> it's only the management switch..
2024-08-07 20:31:16 <sukhe> mutante: anything from the cloud folks?
2024-08-07 20:31:44 <mutante> 20:31 < andrewbogott> We have a bad switch so are migrating lots of things away from it. In theory that's not disruptive
2024-08-07 20:31:47 <mutante> :)
2024-08-07 20:31:50 <sukhe> :)
2024-08-07 20:31:57 <sukhe> thanks for following up
2024-08-07 20:31:58 <cdanis> mutante: the management switch isn't pushing 60Gbit
2024-08-07 20:32:19 <andrewbogott> the context is T371878
2024-08-07 20:32:19 <stashbot> T371878: [network,D5] reboot cloudsw-d5 - https://phabricator.wikimedia.org/T371878
2024-08-07 20:32:23 <andrewbogott> for what I'm doing
2024-08-07 20:32:30 <andrewbogott> but right now all I'm doing is very gradually pooling new ceph nodes
2024-08-07 20:32:41 <cdanis> that's the prod sw1-f4 being polled on its management IPs
2024-08-07 20:32:53 <andrewbogott> ok, so nothing to do with me it sounds like?
2024-08-07 20:33:21 <cdanis> is pooling the ceph nodes causing data to be resilvered?
2024-08-07 20:33:22 <mutante> well, "bad switch" and reboot of the exact device that just alerted
2024-08-07 20:33:38 <sukhe> I have to run for daycare pickup. I will be back later
2024-08-07 20:34:26 <mutante> it was "d5" and "f4"
2024-08-07 20:34:29 <andrewbogott> cdanis: yes, it rebalances whenever new drives are added.
2024-08-07 20:34:33 <cdanis> andrewbogott: something was nearly maxing out the 40G interconnect between cloudsw1-f4 and cloudsw1-d5 https://librenms.wikimedia.org/device/device=242/tab=port/port=25230/
2024-08-07 20:34:40 <cdanis> so I'm guessing that was Ceph
2024-08-07 20:35:15 <andrewbogott> possible although I'm not sure how we'd get to 40G
2024-08-07 20:35:24 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on gerrit1004.wikimedia.org with reason: host reimage
2024-08-07 20:35:29 <andrewbogott> there's only one new host coming online and it only has a 10G nic
2024-08-07 20:35:38 <andrewbogott> and usually cpu bottlenecks before network bandwidth
2024-08-07 20:35:44 <cdanis> well, *something* exceeded it -- if you look at the errors on the other side of the port, there were a lot of discards https://librenms.wikimedia.org/device/device=242/tab=port/port=25230/
2024-08-07 20:35:48 <andrewbogott> is it still happening?
2024-08-07 20:36:24 <cdanis> no, but you've had two large spikes of discards in the past 24h
2024-08-07 20:36:35 <andrewbogott> ok
2024-08-07 20:37:00 <andrewbogott> that could be ceph, maybe coupled with that badly-behaving switch
2024-08-07 20:37:07 <logmsgbot> !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host vrts1003.mgmt.eqiad.wmnet with reboot policy FORCED
2024-08-07 20:39:03 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host vrts1003.eqiad.wmnet with OS bookworm
2024-08-07 20:39:08 <wikibugs> 'ops-eqiad, ''SRE, ''collaboration-services, ''DC-Ops: Q1:rack/setup/install vrts1003 - https://phabricator.wikimedia.org/T369674#10049655 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host vrts1003.eqiad.wmnet with OS bookworm'
2024-08-07 20:39:14 <logmsgbot> !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gerrit1004.wikimedia.org with reason: host reimage
2024-08-07 20:40:18 <cdanis> so I have no hands-on experience with the system, but, I thought one of the drawbacks of Ceph was that the CRUSH algorithm often caused a lot of churn when fresh nodes/disks were added to the system?
2024-08-07 20:40:24 <cdanis> do you think that might be happening here andrewbogott
2024-08-07 20:41:07 <andrewbogott> It's definitely causing churn but only a reasonable amount (according to the mon: "recovery: 536 MiB/s, 134 objects/s")
2024-08-07 20:41:18 <cdanis> hm
2024-08-07 20:41:21 <andrewbogott> But if the switch malfunctions and discards 99% of our traffic then all bets are off
2024-08-07 20:42:16 <andrewbogott> cursed switch info is at https://phabricator.wikimedia.org/T371879
2024-08-07 20:42:22 <cdanis> *both* switches were saying "this link is 36Gbps+", but only one switch was saying "I'm discarding traffic because my output buffer is full" ... which is expected when you're saturating such a link
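[editor's note] The figures quoted above can be put on a common scale. The sketch below is only back-of-the-envelope arithmetic on the numbers mentioned in the channel (a 40G interconnect, an 80% utilisation alert, a reading of "36Gbps+", and a Ceph recovery rate of 536 MiB/s); the helper names are hypothetical, and it assumes the mon's recovery figure is an aggregate rate, which the log does not confirm.

```python
# Back-of-the-envelope comparison of the rates quoted in the channel.
# All helper names are illustrative; only the numbers come from the log.

LINK_CAPACITY_GBIT = 40.0    # "the 40G interconnect" between the two switches
ALERT_THRESHOLD = 0.80       # alert name: "port utilisation over 80%"
OBSERVED_GBIT = 36.0         # "this link is 36Gbps+" (lower bound)
CEPH_RECOVERY_MIB_S = 536.0  # mon output: "recovery: 536 MiB/s"


def mib_per_s_to_gbit_per_s(mib_per_s: float) -> float:
    """Convert MiB/s (2**20 bytes) to Gbit/s (10**9 bits)."""
    return mib_per_s * 2**20 * 8 / 1e9


def utilisation(rate_gbit: float, capacity_gbit: float) -> float:
    """Fraction of link capacity used by the given rate."""
    return rate_gbit / capacity_gbit


if __name__ == "__main__":
    recovery_gbit = mib_per_s_to_gbit_per_s(CEPH_RECOVERY_MIB_S)
    print(f"alert fires above {ALERT_THRESHOLD * LINK_CAPACITY_GBIT:.0f} Gbit/s")
    print(f"observed link utilisation: {utilisation(OBSERVED_GBIT, LINK_CAPACITY_GBIT):.0%}")
    print(f"reported Ceph recovery: {recovery_gbit:.2f} Gbit/s "
          f"({utilisation(recovery_gbit, LINK_CAPACITY_GBIT):.0%} of the link)")
```

On these numbers the reported recovery rate alone is roughly a tenth of the link, so it would not by itself account for a 36 Gbit/s reading; the log leaves open what made up the rest of the traffic.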
2024-08-07 20:42:30 <andrewbogott> ah, I see
2024-08-07 20:42:41 <andrewbogott> looks for a timestamp
2024-08-07 20:43:10 <mutante> sees reports like "
2024-08-07 20:43:10 <mutante> Ceph went haywire after a switch hiccup
2024-08-07 20:43:22 <cdanis> mutante: yeah I think Ceph caused a network saturation event
2024-08-07 20:43:37 <cdanis> distributed storage systems are often very good at that :)
2024-08-07 20:44:05 <mutante> cdanis: ack, thanks for that. at least it felt like it might be related to maintenance
2024-08-07 20:44:47 <mutante> as Andrew said, in combination with the switch issue
2024-08-07 20:45:39 <cdanis> if BFD is running over the links that are saturating, then, Ceph *is* the "switch issue"
2024-08-07 20:45:43 <andrewbogott> this is about the alert at 15:24 right?
2024-08-07 20:45:45 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T371949#10049671 (''phaultfinder)'
2024-08-07 20:45:46 <cdanis> is what I am saying, mutante
2024-08-07 20:46:11 <andrewbogott> I didn't repool the thing I am repooling until 20 minutes later than that
2024-08-07 20:46:31 <cdanis> hm
2024-08-07 20:46:37 <andrewbogott> But there has been some amount of ceph rebalancing ongoing for several days
2024-08-07 20:47:07 <mutante> andrewbogott: alert went out at 8:19 UTC
2024-08-07 20:47:20 <andrewbogott> (and of course adding new nodes isn't unusual, we have hundreds of disks in play and they were all added sometime)
2024-08-07 20:49:05 <mutante> cdanis: but there is an actual action that was taken by Cathal per the comment "things remain stable since the changes earlier on"
2024-08-07 20:49:16 <andrewbogott> mutante: do you mean 20:19 UTC?
2024-08-07 20:49:44 <mutante> andrewbogott: yes
2024-08-07 20:49:48 <andrewbogott> so 30 minutes ago
2024-08-07 20:49:54 <mutante> yes
2024-08-07 20:50:24 <andrewbogott> I don't think I was doing anything interesting then other than waiting for a previous drive to finish rebalancing which it had been doing for an hour+ at that point. But let me look in the logs some more
2024-08-07 20:52:49 <wikibugs> 'SRE, ''Infrastructure-Foundations, ''netops: cloudsw1-d5-eqiad instability Aug 6 2024 - https://phabricator.wikimedia.org/T371879#10049699 (''Dzahn) We got paged at 20:19 UTC for "primary outbound port utilisation over 80%" on both cloudsw1-d5 and cloudsw1-f4 today. Shortly after it resolved. But somethi...'
2024-08-07 20:53:11 <logmsgbot> !log milimetric@deploy1003 Started deploy [airflow-dags/analytics@4cf9922]: (no justification provided)
2024-08-07 20:53:25 <cdanis> andrewbogott: the real question is what you were doing at 20:06
2024-08-07 20:53:29 <cdanis> https://grafana.wikimedia.org/goto/aqyMilrIg?orgId=1
2024-08-07 20:53:50 <logmsgbot> !log milimetric@deploy1003 Finished deploy [airflow-dags/analytics@4cf9922]: (no justification provided) (duration: 00m 38s)
2024-08-07 20:53:54 <cdanis> which is when the cloudcephosd hosts themselves started reporting their network usage to be 120Gbps+
2024-08-07 20:54:01 <cdanis> gigabit/second
2024-08-07 20:54:10 <cdanis> which is almost definitely a problem for your switches
2024-08-07 20:54:57 <wikibugs> 'SRE, ''Infrastructure-Foundations, ''netops: cloudsw1-d5-eqiad instability Aug 6 2024 - https://phabricator.wikimedia.org/T371879#10049702 (''Dzahn) {F57154133}'
2024-08-07 20:55:06 <cdanis> have you been rebalancing these hosts 'gradually' since about 2024-08-06 16:12? because that's when the crazy spikes in cloudcephosd self-reported NIC usage begin https://grafana.wikimedia.org/goto/6Ay4ilrIg?orgId=1
2024-08-07 20:55:39 <mutante> there is also https://phabricator.wikimedia.org/T371869
2024-08-07 20:56:02 <cdanis> cross-switch link saturation would absolutely explain that as well, potentially
2024-08-07 20:56:16 <andrewbogott> yep, yesterday (my AM) was when we started evacuating things that use that switch so we can upgrade and reboot it.
2024-08-07 20:56:17 <cdanis> and, the thing that BFD does is it tells the control plane about neighbor links that are dropping packets
2024-08-07 20:56:40 <cdanis> and I don't think we have any QoS for it (or anywhere) atm
2024-08-07 20:56:43 <mutante> "we have increased the timeouts and changed the LACP mode from 'fast' to 'slow' keepalive messages and that seems to have stabilized the network "
2024-08-07 20:56:47 <cdanis> yeah...
2024-08-07 20:56:58 <logmsgbot> !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host gerrit1004.wikimedia.org with OS bookworm
2024-08-07 20:57:06 <wikibugs> 'ops-eqiad, ''SRE, ''collaboration-services, ''DC-Ops: Q1:rack/setup/install gerrit1004 - https://phabricator.wikimedia.org/T369671#10049708 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host gerrit1004.wikimedia.org with OS bookworm executed with errors: - ge...'
2024-08-07 20:57:07 <cdanis> I would guess you are wrecking the network with microbursts at the beginning of each rebalance
2024-08-07 20:57:18 <logmsgbot> !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on vrts1003.eqiad.wmnet with reason: host reimage
2024-08-07 20:57:23 <cdanis> anyway I'm sorry, I have to go
2024-08-07 20:57:28 <cdanis> daycare closes soon :)
2024-08-07 20:57:32 <andrewbogott> could be if the 'decide what to do' stage is somehow not throttled properly
2024-08-07 20:57:44 <andrewbogott> Best to consult topranks about all this during overlapping hours
2024-08-07 20:58:13 <cdanis> please feel free to cc me on the tasks as well, if you want :)
2024-08-07 20:58:16 <cdanis> anyway afk
2024-08-07 21:00:05 <jouncebot> Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240807T2100)
2024-08-07 21:02:50 <logmsgbot> !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on vrts1003.eqiad.wmnet with reason: host reimage
2024-08-07 21:09:18 <wikibugs> 'ops-eqiad, ''SRE, ''collaboration-services, ''DC-Ops: Q1:rack/setup/install gerrit1004 - https://phabricator.wikimedia.org/T369671#10049724 (''Dzahn) Even though the comment here says the cookbook failed.. I can see gerrit1004 is up on mgmt interface. I can also login as root on mgmt. Only thing that s...'
2024-08-07 21:10:25 <jinxer-wm> RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2024-08-07 21:19:51 <logmsgbot> !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host vrts1003.eqiad.wmnet with OS bookworm
2024-08-07 21:19:59 <wikibugs> 'ops-eqiad, ''SRE, ''collaboration-services, ''DC-Ops: Q1:rack/setup/install vrts1003 - https://phabricator.wikimedia.org/T369674#10049745 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host vrts1003.eqiad.wmnet with OS bookworm executed with errors: - vrts1003...'
2024-08-07 21:24:45 <wikibugs> 'SRE, ''LDAP-Access-Requests: Grant Access to wmf for Arthur taylor - https://phabricator.wikimedia.org/T371958#10049753 (''bd808) ''Open''Invalid T371888#10049750'
2024-08-07 21:29:07 <wikibugs> 'ops-codfw, ''SRE, ''DC-Ops, ''fundraising-tech-ops: Q1:rack/setup/install frdb200[45] - https://phabricator.wikimedia.org/T369920#10049761 (''Jhancock.wm)'
2024-08-07 21:30:42 <wikibugs> 'ops-codfw, ''SRE, ''DC-Ops, ''fundraising-tech-ops: Q1:rack/setup/install frdb200[45] - https://phabricator.wikimedia.org/T369920#10049766 (''Jhancock.wm) a:''Jhancock.wm''Papaul this one server is ready for @Papaul frdc2004 ETH1 <> FASW-C8A eth-0/0/20 ETH2 <> FASW-C8B eth-1/0/20'
2024-08-07 21:41:41 <jinxer-wm> FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
2024-08-07 21:41:41 <wikibugs> ('PS5) ''Dzahn: ci: replace ferm::service with firewall::service for jenkinsagent [puppet] - ''https://gerrit.wikimedia.org/r/1060483 (https://phabricator.wikimedia.org/T370677)'
2024-08-07 21:44:59 <wikibugs> ('PS1) ''Dzahn: gerrit: increase allowed requests from 300 to 600 for throttling [puppet] - ''https://gerrit.wikimedia.org/r/1060502 (https://phabricator.wikimedia.org/T365259)'
2024-08-07 21:46:34 <wikibugs> ('CR) ''Dzahn: [C:''+2] "Nothing gets actually dropped - it's just to observe the content of the created host sets." [puppet] - ''https://gerrit.wikimedia.org/r/1060502 (https://phabricator.wikimedia.org/T365259) (owner: ''Dzahn)'
2024-08-07 21:52:18 <wikibugs> 'ops-eqiad, ''SRE, ''collaboration-services, ''DC-Ops: Q1:rack/setup/install gerrit1004 - https://phabricator.wikimedia.org/T369671#10049806 (''Dzahn) This appears to be T371653. I reopened that ticket and left a comment. Meanwhile I manually changed the status for this host to "active" in netbox. So I t...'
2024-08-07 21:53:04 <wikibugs> 'ops-eqiad, ''SRE, ''collaboration-services, ''DC-Ops: Q1:rack/setup/install vrts1003 - https://phabricator.wikimedia.org/T369674#10049811 (''Dzahn) Issue above same as T369671#10049724 Manually changed the status to "active" in netbox.'
2024-08-07 22:07:05 <jinxer-wm> FIRING: KubernetesCalicoDown: ml-serve2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
2024-08-07 22:32:44 <wikibugs> ('PS1) ''Ahmon Dancy: scap.cfg.erb: Update release_repo_build_and_push_images_cmd [puppet] - ''https://gerrit.wikimedia.org/r/1060505 (https://phabricator.wikimedia.org/T371904)'
2024-08-07 22:35:24 <wikibugs> ('CR) ''CI reject: [V:''-1] scap.cfg.erb: Update release_repo_build_and_push_images_cmd [puppet] - ''https://gerrit.wikimedia.org/r/1060505 (https://phabricator.wikimedia.org/T371904) (owner: ''Ahmon Dancy)'
2024-08-07 22:36:23 <wikibugs> ('PS2) ''Ahmon Dancy: scap.cfg.erb: Update release_repo_build_and_push_images_cmd [puppet] - ''https://gerrit.wikimedia.org/r/1060505 (https://phabricator.wikimedia.org/T371904)'
2024-08-07 23:00:09 <wikibugs> 'ops-codfw, ''SRE, ''DC-Ops, ''fundraising-tech-ops: Q1:rack/setup/install frdb200[45] - https://phabricator.wikimedia.org/T369920#10049925 (''Papaul)'
2024-08-07 23:07:34 <wikibugs> 'ops-codfw, ''SRE, ''DC-Ops, ''fundraising-tech-ops: Q1:rack/setup/install frdb200[45] - https://phabricator.wikimedia.org/T369920#10049939 (''Papaul) ` papaul@fasw-c-codfw# show | compare [edit interfaces interface-range disabled] - member "ge-[0-1]/0/20"; [edit interfaces interface-range vlan-fundr...'
2024-08-07 23:38:45 <wikibugs> ('PS1) ''TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - ''https://gerrit.wikimedia.org/r/1060508'
2024-08-07 23:38:46 <wikibugs> ('CR) ''TrainBranchBot: [C:''+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - ''https://gerrit.wikimedia.org/r/1060508 (owner: ''TrainBranchBot)'
2024-08-07 23:45:30 <wikibugs> 'ops-eqiad, ''SRE, ''collaboration-services, ''DC-Ops: Q1:rack/setup/install vrts1003 - https://phabricator.wikimedia.org/T369674#10049961 (''Jclark-ctr)'
2024-08-07 23:46:32 <wikibugs> 'ops-eqiad, ''SRE, ''collaboration-services, ''DC-Ops: Q1:rack/setup/install vrts1003 - https://phabricator.wikimedia.org/T369674#10049962 (''Jclark-ctr) ''Open''Resolved a:''Jclark-ctr'

This page is generated from SQL logs; you can also download static txt files from here