[00:05:26] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:09:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-wikifunctions (k8s) 2.096s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:14:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-wikifunctions (k8s) 2.096s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:23:45] <jinxer-wm>	 (WidespreadPuppetFailure) firing: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[00:30:44] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1023541 (owner: TrainBranchBot)
[00:33:45] <jinxer-wm>	 (WidespreadPuppetFailure) resolved: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[00:40:26] <jinxer-wm>	 (SystemdUnitFailed) firing: (4) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:49:11] <wikibugs>	 SRE-swift-storage, Commons, MediaWiki-Engineering, MediaWiki-File-management: Uploads fail due to 401 error from swift on wednesdays - https://phabricator.wikimedia.org/T358830#9746962 (tstarling) Open→Resolved
[00:50:36] <wikibugs>	 SRE-swift-storage, Commons, MediaWiki-Engineering, MediaWiki-File-management: Uploads fail due to 401 error from swift on wednesdays - https://phabricator.wikimedia.org/T358830#9746966 (tstarling)
[00:52:55] <wikibugs>	 SRE-swift-storage, Commons, MediaWiki-Engineering, MediaWiki-File-management: Uploads fail due to 401 error from swift on wednesdays - https://phabricator.wikimedia.org/T358830#9746971 (tstarling)
[01:06:29] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T352010)', diff saved to https://phabricator.wikimedia.org/P61231 and previous config saved to /var/cache/conftool/dbconfig/20240426-010628-ladsgroup.json
[01:06:53] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[01:07:30] <jinxer-wm>	 (ProbeDown) firing: Service wdqs1018:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1018:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:12:30] <jinxer-wm>	 (ProbeDown) resolved: Service wdqs1018:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1018:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:12:48] <icinga-wm>	 PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv4: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[01:20:00] <icinga-wm>	 PROBLEM - Host cloudweb2002-dev is DOWN: PING CRITICAL - Packet loss = 100%
[01:21:36] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P61232 and previous config saved to /var/cache/conftool/dbconfig/20240426-012135-ladsgroup.json
[01:21:46] <icinga-wm>	 RECOVERY - Host cloudweb2002-dev is UP: PING OK - Packet loss = 0%, RTA = 30.31 ms
[01:25:10] <wikibugs>	 (PS1) Andrew Bogott: Revert "labtesthorizon: advance to 2024-04-25-225100-dev" [puppet] - https://gerrit.wikimedia.org/r/1024527
[01:26:15] <wikibugs>	 (CR) Andrew Bogott: [C:+2] Revert "labtesthorizon: advance to 2024-04-25-225100-dev" [puppet] - https://gerrit.wikimedia.org/r/1024527 (owner: Andrew Bogott)
[01:36:24] <icinga-wm>	 PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:36:43] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P61233 and previous config saved to /var/cache/conftool/dbconfig/20240426-013642-ladsgroup.json
[01:39:26] <icinga-wm>	 RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:46:28] <icinga-wm>	 PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.7 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:48:25] <jinxer-wm>	 (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-redis-exporter@6380.service on netbox2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:49:30] <icinga-wm>	 RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:51:50] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T352010)', diff saved to https://phabricator.wikimedia.org/P61234 and previous config saved to /var/cache/conftool/dbconfig/20240426-015149-ladsgroup.json
[01:51:52] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2174.codfw.wmnet with reason: Maintenance
[01:52:05] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2174.codfw.wmnet with reason: Maintenance
[01:52:12] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2174 (T352010)', diff saved to https://phabricator.wikimedia.org/P61235 and previous config saved to /var/cache/conftool/dbconfig/20240426-015212-ladsgroup.json
[01:52:15] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[01:52:30] <icinga-wm>	 PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:53:32] <icinga-wm>	 RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:03:56] <icinga-wm>	 PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 130 probes of 736 (alerts on 90) - https://atlas.ripe.net/measurements/59935539/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:06:34] <icinga-wm>	 PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:11:34] <icinga-wm>	 PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:13:56] <icinga-wm>	 RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 42 probes of 736 (alerts on 90) - https://atlas.ripe.net/measurements/59935539/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:14:38] <icinga-wm>	 RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:18:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-redis-exporter@6380.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:20:26] <jinxer-wm>	 (SystemdUnitFailed) firing: (5) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:21:16] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:21:38] <icinga-wm>	 PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:26:40] <icinga-wm>	 PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:34:46] <icinga-wm>	 RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:48:52] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:05:26] <jinxer-wm>	 (SystemdUnitFailed) firing: (5) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:16:27] <jinxer-wm>	 (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[03:24:33] <jinxer-wm>	 (KubernetesCalicoDown) firing: parse1002.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=parse1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[04:00:48] <wikibugs>	 (CR) Ryan Kemper: [C:+1] elasticsearch: Configure alerts for short-lived certs [puppet] - https://gerrit.wikimedia.org/r/1024481 (https://phabricator.wikimedia.org/T360439) (owner: Bking)
[04:21:28] <icinga-wm>	 PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:23:08] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[04:23:30] <icinga-wm>	 RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:25:26] <jinxer-wm>	 (SystemdUnitFailed) firing: (4) docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:27:30] <icinga-wm>	 PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:28:51] <wikibugs>	 (PS1) Andrew Bogott: codfw1dev networktests.yaml: whitespace fix [puppet] - https://gerrit.wikimedia.org/r/1024538
[04:29:40] <wikibugs>	 (CR) Andrew Bogott: [C:+2] codfw1dev networktests.yaml: whitespace fix [puppet] - https://gerrit.wikimedia.org/r/1024538 (owner: Andrew Bogott)
[04:31:30] <icinga-wm>	 PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:34:36] <icinga-wm>	 RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:42:34] <icinga-wm>	 PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:43:38] <icinga-wm>	 RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:47:38] <icinga-wm>	 PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:56:44] <icinga-wm>	 RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:58:34] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 115 probes of 738 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:03:32] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 34 probes of 738 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:15:12] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:15:26] <jinxer-wm>	 (SystemdUnitFailed) firing: (5) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:15:50] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 121, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:20:26] <jinxer-wm>	 (SystemdUnitFailed) firing: (5) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:21:16] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[05:22:50] <icinga-wm>	 PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:23:08] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[05:24:56] <icinga-wm>	 RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:37:56] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T352010)', diff saved to https://phabricator.wikimedia.org/P61236 and previous config saved to /var/cache/conftool/dbconfig/20240426-053756-ladsgroup.json
[05:38:13] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[05:38:25] <jinxer-wm>	 (SystemdUnitFailed) resolved: (2) wmf_auto_restart_prometheus-redis-exporter@6380.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:53:04] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P61237 and previous config saved to /var/cache/conftool/dbconfig/20240426-055303-ladsgroup.json
[06:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240426T0600)
[06:03:11] <wikibugs>	 (PS1) Muehlenhoff: Only enable auto vopsbot restart on active alert host [puppet] - https://gerrit.wikimedia.org/r/1024539
[06:04:21] <jinxer-wm>	 (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:06:49] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1024539 (owner: Muehlenhoff)
[06:08:11] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P61238 and previous config saved to /var/cache/conftool/dbconfig/20240426-060810-ladsgroup.json
[06:09:21] <jinxer-wm>	 (PoolcounterFullQueues) resolved: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:19:35] <wikibugs>	 (PS1) Muehlenhoff: Druid: overlord/coordinator: New options for using firewall::service [puppet] - https://gerrit.wikimedia.org/r/1024541
[06:20:02] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Only enable auto vopsbot restart on active alert host [puppet] - https://gerrit.wikimedia.org/r/1024539 (owner: Muehlenhoff)
[06:21:57] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1024541 (owner: Muehlenhoff)
[06:23:18] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T352010)', diff saved to https://phabricator.wikimedia.org/P61239 and previous config saved to /var/cache/conftool/dbconfig/20240426-062317-ladsgroup.json
[06:23:20] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance
[06:23:33] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance
[06:23:36] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[06:23:40] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2163 (T352010)', diff saved to https://phabricator.wikimedia.org/P61240 and previous config saved to /var/cache/conftool/dbconfig/20240426-062340-ladsgroup.json
[06:24:08] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[06:25:26] <jinxer-wm>	 (SystemdUnitFailed) firing: (4) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:28:51] <wikibugs>	 (CR) Gmodena: [C:+2] eventstreams: 23 April 2024 update of ruwiki redacted pages [deployment-charts] - https://gerrit.wikimedia.org/r/1023465 (https://phabricator.wikimedia.org/T354456) (owner: Htriedman)
[06:29:47] <wikibugs>	 (Merged) jenkins-bot: eventstreams: 23 April 2024 update of ruwiki redacted pages [deployment-charts] - https://gerrit.wikimedia.org/r/1023465 (https://phabricator.wikimedia.org/T354456) (owner: Htriedman)
[06:42:21] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2155 (re)pooling @ 5%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61241 and previous config saved to /var/cache/conftool/dbconfig/20240426-064220-arnaudb.json
[06:48:52] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:57:27] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2155 (re)pooling @ 10%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61242 and previous config saved to /var/cache/conftool/dbconfig/20240426-065726-arnaudb.json
[07:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240426T0700)
[07:01:12] <logmsgbot>	 !log gmodena@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams: apply
[07:01:23] <logmsgbot>	 !log gmodena@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply
[07:01:59] <logmsgbot>	 !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/eventstreams: apply
[07:02:33] <logmsgbot>	 !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply
[07:05:27] <logmsgbot>	 !log gmodena@deploy1002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply
[07:05:58] <logmsgbot>	 !log gmodena@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply
[07:08:16] <hashar>	 !log Restarting CI Jenkins
[07:08:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:32] <icinga-wm>	 PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:12:32] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2155 (re)pooling @ 25%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61243 and previous config saved to /var/cache/conftool/dbconfig/20240426-071231-arnaudb.json
[07:13:12] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:13:16] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 213, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:14:34] <icinga-wm>	 RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:16:27] <jinxer-wm>	 (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[07:18:47] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.ganeti.makevm for new host kubestagemaster2003.codfw.wmnet
[07:18:49] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.dns.netbox
[07:19:21] <wikibugs>	 ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T363551 (phaultfinder) NEW
[07:20:26] <jinxer-wm>	 (SystemdUnitFailed) firing: (5) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:21:04] <wikibugs>	 (PS1) JMeybohm: kubestagemaster2003: Add as insetup::serviceops [puppet] - https://gerrit.wikimedia.org/r/1024543 (https://phabricator.wikimedia.org/T363310)
[07:21:33] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM kubestagemaster2003.codfw.wmnet - jayme@cumin1002"
[07:22:16] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:22:39] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1024543 (https://phabricator.wikimedia.org/T363310) (owner: JMeybohm)
[07:24:08] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:24:29] <wikibugs>	 (CR) JMeybohm: [C:+2] kubestagemaster2003: Add as insetup::serviceops [puppet] - https://gerrit.wikimedia.org/r/1024543 (https://phabricator.wikimedia.org/T363310) (owner: JMeybohm)
[07:24:33] <jinxer-wm>	 (KubernetesCalicoDown) firing: parse1002.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=parse1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[07:27:37] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2155 (re)pooling @ 50%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61244 and previous config saved to /var/cache/conftool/dbconfig/20240426-072737-arnaudb.json
[07:28:11] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM kubestagemaster2003.codfw.wmnet - jayme@cumin1002"
[07:28:11] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:28:11] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.dns.wipe-cache kubestagemaster2003.codfw.wmnet on all recursors
[07:28:14] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) kubestagemaster2003.codfw.wmnet on all recursors
[07:28:42] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM kubestagemaster2003.codfw.wmnet - jayme@cumin1002"
[07:29:34] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM kubestagemaster2003.codfw.wmnet - jayme@cumin1002"
[07:30:11] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kubestagemaster2003.codfw.wmnet with OS bullseye
[07:30:29] <wikibugs>	 SRE, Infrastructure-Foundations, Prod-Kubernetes, serviceops, and 3 others: Site: codfw 1 VM request for staging-codfw kube-apiserver - https://phabricator.wikimedia.org/T363310#9747135 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestagemast...
[07:42:43] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2155 (re)pooling @ 75%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61245 and previous config saved to /var/cache/conftool/dbconfig/20240426-074243-arnaudb.json
[07:45:31] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestagemaster2003.codfw.wmnet with reason: host reimage
[07:48:44] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestagemaster2003.codfw.wmnet with reason: host reimage
[07:57:49] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2155 (re)pooling @ 100%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61246 and previous config saved to /var/cache/conftool/dbconfig/20240426-075748-arnaudb.json
[08:11:46] <wikibugs>	 (PS1) Muehlenhoff: debdeploy-restarts: Discard lsof stderr output [debs/debdeploy] - https://gerrit.wikimedia.org/r/1024589
[08:12:16] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:15:26] <jinxer-wm>	 (SystemdUnitFailed) firing: (4) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:18:23] <jayme>	 !log depooled mw2391.codfw.wmnet for etcd benchmark
[08:18:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:42] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestagemaster2003.codfw.wmnet with OS bullseye
[08:33:42] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host kubestagemaster2003.codfw.wmnet
[08:33:53] <wikibugs>	 SRE, Infrastructure-Foundations, Prod-Kubernetes, serviceops, and 3 others: Site: codfw 1 VM request for staging-codfw kube-apiserver - https://phabricator.wikimedia.org/T363310#9747162 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestagemaster20...
[08:34:01] <hashar>	 !log Restarted Gerrit replica
[08:34:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:46] <wikibugs>	 (CR) Slyngshede: debdeploy-restarts: Discard lsof stderr output (1 comment) [debs/debdeploy] - https://gerrit.wikimedia.org/r/1024589 (owner: Muehlenhoff)
[08:41:20] <icinga-wm>	 PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:41:46] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to new version
[08:42:24] <icinga-wm>	 RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:51:28] <icinga-wm>	 PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:53:28] <icinga-wm>	 RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:55:20] <wikibugs>	 (PS2) Muehlenhoff: debdeploy-restarts: Don't resolve user names in lsof [debs/debdeploy] - https://gerrit.wikimedia.org/r/1024589
[08:56:24] <icinga-wm>	 PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:56:29] <wikibugs>	 (CR) Muehlenhoff: debdeploy-restarts: Don't resolve user names in lsof (1 comment) [debs/debdeploy] - https://gerrit.wikimedia.org/r/1024589 (owner: Muehlenhoff)
[08:57:02] <hashar>	 !log Restarted Gerrit
[08:57:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:28] <icinga-wm>	 RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:03:27] <wikibugs>	 (CR) Slyngshede: [C:+1] "LGTM" [debs/debdeploy] - https://gerrit.wikimedia.org/r/1024589 (owner: Muehlenhoff)
[09:04:41] <wikibugs>	 (CR) Muehlenhoff: elasticsearch: Configure alerts for short-lived certs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1024481 (https://phabricator.wikimedia.org/T360439) (owner: Bking)
[09:06:40] <wikibugs>	 (Restored) DCausse: cirrus: Shift autocomplete traffic to codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/1024478 (https://phabricator.wikimedia.org/T363516) (owner: Ebernhardson)
[09:08:52] <wikibugs>	 (CR) Alexandros Kosiaris: [C:-1] "Couple of minor fixes suggested, otherwise LGTM." [deployment-charts] - https://gerrit.wikimedia.org/r/1016471 (https://phabricator.wikimedia.org/T357434) (owner: TChin)
[09:13:05] <wikibugs>	 (CR) Muehlenhoff: [V:+2 C:+2] debdeploy-restarts: Don't resolve user names in lsof [debs/debdeploy] - https://gerrit.wikimedia.org/r/1024589 (owner: Muehlenhoff)
[09:13:16] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:16:14] <wikibugs>	 (PS1) Muehlenhoff: Bump changelog for new debdeploy release [debs/debdeploy] - https://gerrit.wikimedia.org/r/1024607
[09:16:17] <wikibugs>	 SRE, CirrusSearch, Discovery-Search (Current work), Patch-For-Review: Many search suggestions missing when connecting to eqiad, but not when connecting to codfw - https://phabricator.wikimedia.org/T363516#9747222 (dcausse) p:Triage→Unbreak! This is still happening, raising to UBN
[09:19:40] <wikibugs>	 (CR) Marco Fossati: [C:+1] "Roger that, will do." [deployment-charts] - https://gerrit.wikimedia.org/r/1014660 (https://phabricator.wikimedia.org/T360531) (owner: Btullis)
[09:20:26] <jinxer-wm>	 (SystemdUnitFailed) firing: (4) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:22:01] <wikibugs>	 (CR) Jelto: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1021948 (owner: EoghanGaffney)
[09:22:27] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Bump changelog for new debdeploy release [debs/debdeploy] - https://gerrit.wikimedia.org/r/1024607 (owner: Muehlenhoff)
[09:25:08] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:25:26] <jinxer-wm>	 (SystemdUnitFailed) firing: (5) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:26:54] <wikibugs>	 (CR) DCausse: [C:+1] cirrus: Shift autocomplete traffic to codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/1024478 (https://phabricator.wikimedia.org/T363516) (owner: Ebernhardson)
[09:27:32] <icinga-wm>	 PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:30:34] <icinga-wm>	 RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:34:59] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by dcausse@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1024478 (https://phabricator.wikimedia.org/T363516) (owner: Ebernhardson)
[09:36:02] <wikibugs>	 (Merged) jenkins-bot: cirrus: Shift autocomplete traffic to codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/1024478 (https://phabricator.wikimedia.org/T363516) (owner: Ebernhardson)
[09:36:24] <logmsgbot>	 !log dcausse@deploy1002 Started scap: Backport for [[gerrit:1024478|cirrus: Shift autocomplete traffic to codfw (T363516)]]
[09:36:56] <stashbot>	 T363516: Many search suggestions missing when connecting to eqiad, but not when connecting to codfw - https://phabricator.wikimedia.org/T363516
[09:37:06] <wikibugs>	 (CR) Brouberol: Create the MPIC Kubernetes chart (9 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1021494 (https://phabricator.wikimedia.org/T361343) (owner: Santiago Faci)
[09:40:15] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubestagemaster2003.codfw.wmnet to plain
[09:40:58] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of kubestagemaster2003.codfw.wmnet to plain
[09:41:08] <logmsgbot>	 !log dcausse@deploy1002 dcausse and ebernhardson: Backport for [[gerrit:1024478|cirrus: Shift autocomplete traffic to codfw (T363516)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[09:41:30] <icinga-wm>	 PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:41:57] <stashbot>	 T363516: Many search suggestions missing when connecting to eqiad, but not when connecting to codfw - https://phabricator.wikimedia.org/T363516
[09:42:28] <logmsgbot>	 !log dcausse@deploy1002 dcausse and ebernhardson: Continuing with sync
[09:42:34] <icinga-wm>	 RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:47:32] <icinga-wm>	 PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:47:50] <jayme>	 !log repooled mw2391.codfw.wmnet
[09:47:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:10] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1009 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[09:48:44] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1491 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[09:49:15] <wikibugs>	 (PS1) Brouberol: global_config: add analytics mariadb/postgresql instances [puppet] - https://gerrit.wikimedia.org/r/1024610
[09:49:26] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1349 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[09:49:32] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1361 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[09:49:34] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1483 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[09:50:32] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1382 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[09:50:42] <logmsgbot>	 !log joal@deploy1002 Started deploy [airflow-dags/analytics@e57ae00]: Deploy of Analytics airflow dags for browser-metrics [airflow-dags/analytics@e57ae006]
[09:50:49] <wikibugs>	 (PS2) Brouberol: global_config: add analytics mariadb/postgresql instances [puppet] - https://gerrit.wikimedia.org/r/1024610 (https://phabricator.wikimedia.org/T331894)
[09:51:10] <logmsgbot>	 !log joal@deploy1002 Finished deploy [airflow-dags/analytics@e57ae00]: Deploy of Analytics airflow dags for browser-metrics [airflow-dags/analytics@e57ae006] (duration: 00m 27s)
[09:51:42] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1357 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[09:52:38] <icinga-wm>	 PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:53:40] <icinga-wm>	 RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:53:46] <wikibugs>	 (CR) CI reject: [V:-1] global_config: add analytics mariadb/postgresql instances [puppet] - https://gerrit.wikimedia.org/r/1024610 (https://phabricator.wikimedia.org/T331894) (owner: Brouberol)
[09:54:21] <logmsgbot>	 !log dcausse@deploy1002 Finished scap: Backport for [[gerrit:1024478|cirrus: Shift autocomplete traffic to codfw (T363516)]] (duration: 17m 57s)
[09:54:35] <wikibugs>	 (PS1) Btullis: Update the ownership of the aqs cassandra cluster [puppet] - https://gerrit.wikimedia.org/r/1024611 (https://phabricator.wikimedia.org/T361645)
[09:54:44] <stashbot>	 T363516: Many search suggestions missing when connecting to eqiad, but not when connecting to codfw - https://phabricator.wikimedia.org/T363516
[09:54:53] <wikibugs>	 ops-eqiad, SRE: ManagementSSHDown - https://phabricator.wikimedia.org/T363551#9747303 (hashar) →Duplicate dup:T363086
[09:55:17] <wikibugs>	 ops-eqiad, SRE: ManagementSSHDown parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T363086#9747295 (hashar) Resolved→Open scap does the docker pull on any of the k8s workers as defined by the `kubernetes-workers` group and parse1002 is in that group: ` deploy1002$ grep -R parse1002 /etc/ds...
[09:55:27] <wikibugs>	 ops-eqiad, SRE: ManagementSSHDown parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T363086#9747301 (hashar)
[09:56:03] <wikibugs>	 (PS2) Btullis: Update the ownership of the aqs cassandra cluster [puppet] - https://gerrit.wikimedia.org/r/1024611 (https://phabricator.wikimedia.org/T361645)
[09:56:26] <wikibugs>	 ops-eqiad, SRE: ManagementSSHDown parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T363086#9747304 (hashar) a:akosiaris→None Removing assignee that was automatically set by Phabricator when the task got marked as resolved.
[09:56:31] <wikibugs>	 (PS3) Brouberol: global_config: add analytics mariadb/postgresql instances [puppet] - https://gerrit.wikimedia.org/r/1024610 (https://phabricator.wikimedia.org/T361343)
[09:57:38] <wikibugs>	 (CR) Btullis: [V:+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2141/console" [puppet] - https://gerrit.wikimedia.org/r/1024611 (https://phabricator.wikimedia.org/T361645) (owner: Btullis)
[09:57:45] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.ganeti.makevm for new host kubestagemaster2003.codfw.wmnet
[09:57:46] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.dns.netbox
[09:58:43] <wikibugs>	 (PS3) Btullis: Update the ownership of the aqs cassandra cluster [puppet] - https://gerrit.wikimedia.org/r/1024611 (https://phabricator.wikimedia.org/T361645)
[09:58:47] <wikibugs>	 (PS4) Brouberol: global_config: add analytics mariadb/postgresql instances [puppet] - https://gerrit.wikimedia.org/r/1024610 (https://phabricator.wikimedia.org/T361343)
[09:59:44] <wikibugs>	 (CR) Brouberol: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2140/co" [puppet] - https://gerrit.wikimedia.org/r/1024610 (https://phabricator.wikimedia.org/T361343) (owner: Brouberol)
[09:59:59] <wikibugs>	 (CR) Btullis: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2142/co" [puppet] - https://gerrit.wikimedia.org/r/1024611 (https://phabricator.wikimedia.org/T361645) (owner: Btullis)
[10:00:26] <jinxer-wm>	 (SystemdUnitFailed) firing: (6) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:00:39] <logmsgbot>	 !log jayme@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[10:02:50] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.dns.wipe-cache kubestagemaster2003.codfw.wmnet on all recursors
[10:02:53] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) kubestagemaster2003.codfw.wmnet on all recursors
[10:03:19] <logmsgbot>	 !log jayme@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=93) for new host kubestagemaster2003.codfw.wmnet
[10:03:50] <wikibugs>	 SRE, CirrusSearch, Discovery-Search (Current work), Patch-For-Review: Many search suggestions missing when connecting to eqiad, but not when connecting to codfw - https://phabricator.wikimedia.org/T363516#9747331 (dcausse) p:Unbreak!→Medium completion traffic is now served from codfw whic...
[10:05:08] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:06:16] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubestagemaster2003.codfw.wmnet to plain
[10:07:03] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of kubestagemaster2003.codfw.wmnet to plain
[10:08:06] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kubestagemaster2003.codfw.wmnet with OS bullseye
[10:13:00] <wikibugs>	 (PS1) Brouberol: global_config: add elasticearch instances [puppet] - https://gerrit.wikimedia.org/r/1024613 (https://phabricator.wikimedia.org/T331894)
[10:13:16] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:13:51] <wikibugs>	 (PS2) Brouberol: global_config: add elasticearch instances [puppet] - https://gerrit.wikimedia.org/r/1024613 (https://phabricator.wikimedia.org/T331894)
[10:14:47] <wikibugs>	 (PS3) Brouberol: global_config: add elasticearch instances [puppet] - https://gerrit.wikimedia.org/r/1024613 (https://phabricator.wikimedia.org/T331894)
[10:15:26] <jinxer-wm>	 (SystemdUnitFailed) firing: (6) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:18:08] <wikibugs>	 (CR) Brouberol: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2143/co" [puppet] - https://gerrit.wikimedia.org/r/1024613 (https://phabricator.wikimedia.org/T331894) (owner: Brouberol)
[10:18:10] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1009 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[10:18:44] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1491 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[10:19:26] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1349 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[10:19:32] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1361 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[10:19:34] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1483 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[10:20:32] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1382 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[10:21:05] <wikibugs>	 (PS1) Joal: Absent all report-updater jobs [puppet] - https://gerrit.wikimedia.org/r/1024614 (https://phabricator.wikimedia.org/T307540)
[10:21:42] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1357 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[10:22:39] <wikibugs>	 (PS1) Muehlenhoff: Deprecate system::role for Collaboration services (batch two) [puppet] - https://gerrit.wikimedia.org/r/1024615
[10:26:31] <wikibugs>	 (CR) Brouberol: [C:+1] druid::broker: Switch to firewall::service for test_analytics [puppet] - https://gerrit.wikimedia.org/r/1024403 (owner: Muehlenhoff)
[10:26:48] <wikibugs>	 (CR) Brouberol: [C:+1] druid::broker: Switch public workers to firewall::service [puppet] - https://gerrit.wikimedia.org/r/1024409 (owner: Muehlenhoff)
[10:27:05] <wikibugs>	 (CR) Brouberol: [C:+1] druid::broker: Switch analytics workers to firewall::service [puppet] - https://gerrit.wikimedia.org/r/1024410 (owner: Muehlenhoff)
[10:27:52] <wikibugs>	 (CR) Brouberol: [C:+1] Druid: overlord/coordinator: New options for using firewall::service [puppet] - https://gerrit.wikimedia.org/r/1024541 (owner: Muehlenhoff)
[10:39:17] <logmsgbot>	 jelto@cumin1002 jelto: The backup on gitlab2002 is complete, ready to proceed with upgrade.
[10:43:19] <wikibugs>	 SRE, vm-requests: eqiad: 1 VMs requested for ceph cluster administration (cephadm) - https://phabricator.wikimedia.org/T363559 (BTullis) NEW p:Triage→High
[10:43:32] <wikibugs>	 SRE, vm-requests: eqiad: 1 VMs requested for ceph cluster administration (cephadm) - https://phabricator.wikimedia.org/T363559#9747453 (BTullis)
[10:46:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.resource-report
[10:46:13] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.resource-report (exit_code=0)
[10:47:09] <wikibugs>	 SRE, vm-requests: eqiad: 1 VMs requested for ceph cluster administration (cephadm) - https://phabricator.wikimedia.org/T363559#9747458 (MoritzMuehlenhoff) Looks good. Best to create it in group D
[10:48:52] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:54:16] <logmsgbot>	 !log jayme@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host kubestagemaster2003.codfw.wmnet with OS bullseye
[10:55:51] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.decommission for hosts kubestagemaster2003.codfw.wmnet
[10:59:42] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.dns.netbox
[11:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240426T0700)
[11:00:04] <jouncebot>	 eoghan, jelto, arnoldokoth, and mutante: Time to do the GitLab version upgrades deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240426T1100).
[11:00:56] <wikibugs>	 (CR) Btullis: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2144/co" [puppet] - https://gerrit.wikimedia.org/r/1024614 (https://phabricator.wikimedia.org/T307540) (owner: Joal)
[11:01:28] <wikibugs>	 SRE, LDAP-Access-Requests, WMF-NDA-Requests, Patch-For-Review: Grant Access to NDA for lina.farid - https://phabricator.wikimedia.org/T362959#9747504 (Lina_Farid_WMDE) Thank you all! I cannot access superset yet. Do I need additional permissions for that? When I try to log in to https://idp.w...
[11:01:56] <wikibugs>	 (CR) Btullis: [V:+1 C:+2] Absent all report-updater jobs [puppet] - https://gerrit.wikimedia.org/r/1024614 (https://phabricator.wikimedia.org/T307540) (owner: Joal)
[11:02:12] <wikibugs>	 (PS1) Muehlenhoff: Extend cloudnet-codfw1dev Cumin alias [puppet] - https://gerrit.wikimedia.org/r/1024620
[11:02:25] <wikibugs>	 (CR) Btullis: [V:+1 C:+2] "This looks good. We will be deploying this today to avoid the job execution on Sunday." [puppet] - https://gerrit.wikimedia.org/r/1024614 (https://phabricator.wikimedia.org/T307540) (owner: Joal)
[11:04:21] <icinga-wm>	 RECOVERY - BGP status on cr2-drmrs is OK: BGP OK - up: 110, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:05:53] <icinga-wm>	 PROBLEM - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 3498 bytes in 0.130 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring
[11:06:18] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to new version
[11:06:53] <icinga-wm>	 RECOVERY - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 104916 bytes in 0.606 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring
[11:07:00] <jinxer-wm>	 (ProbeDown) firing: (2) Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:09:14] <wikibugs>	 (PS1) Muehlenhoff: Harmonise analytics Cumin aliases [puppet] - https://gerrit.wikimedia.org/r/1024625
[11:10:07] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kubestagemaster2003.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jayme@cumin1002"
[11:10:41] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reboot-single for host gitlab2002.wikimedia.org
[11:10:58] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kubestagemaster2003.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jayme@cumin1002"
[11:10:58] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:10:59] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts kubestagemaster2003.codfw.wmnet
[11:11:12] <wikibugs>	 SRE, Infrastructure-Foundations, Prod-Kubernetes, serviceops, and 3 others: Site: codfw 1 VM request for staging-codfw kube-apiserver - https://phabricator.wikimedia.org/T363310#9747539 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jayme@cumin1002 for hosts: `kubestagemaster20...
[11:13:07] <icinga-wm>	 PROBLEM - Host gitlab.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[11:13:20] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.ganeti.makevm for new host kubestagemaster2003.codfw.wmnet
[11:13:21] <jelto>	 ^ expected due to host restart
[11:13:22] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.dns.netbox
[11:14:23] <icinga-wm>	 RECOVERY - Host gitlab.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 30.32 ms
[11:15:49] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM kubestagemaster2003.codfw.wmnet - jayme@cumin1002"
[11:16:27] <jinxer-wm>	 (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[11:16:38] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM kubestagemaster2003.codfw.wmnet - jayme@cumin1002"
[11:16:38] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:16:38] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.dns.wipe-cache kubestagemaster2003.codfw.wmnet on all recursors
[11:16:41] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) kubestagemaster2003.codfw.wmnet on all recursors
[11:17:00] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM kubestagemaster2003.codfw.wmnet - jayme@cumin1002"
[11:17:17] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab2002.wikimedia.org
[11:17:52] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM kubestagemaster2003.codfw.wmnet - jayme@cumin1002"
[11:18:17] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kubestagemaster2003.codfw.wmnet with OS bullseye
[11:18:26] <wikibugs>	 SRE, Infrastructure-Foundations, Prod-Kubernetes, serviceops, and 3 others: Site: codfw 1 VM request for staging-codfw kube-apiserver - https://phabricator.wikimedia.org/T363310#9747564 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestagemast...
[11:18:56] <jinxer-wm>	 (ProbeDown) resolved: (2) Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:19:44] <wikibugs>	 ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T363566 (phaultfinder) NEW
[11:20:26] <jinxer-wm>	 (SystemdUnitFailed) firing: (4) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:21:43] <wikibugs>	 (PS1) Muehlenhoff: arclamp: Enable profile::auto_restarts::service for Redis [puppet] - https://gerrit.wikimedia.org/r/1024630 (https://phabricator.wikimedia.org/T135991)
[11:24:33] <jinxer-wm>	 (KubernetesCalicoDown) firing: parse1002.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=parse1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[11:26:35] <icinga-wm>	 PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:26:53] <icinga-wm>	 RECOVERY - Host parse1002 is UP: PING WARNING - Packet loss = 60%, RTA = 30.27 ms
[11:27:37] <icinga-wm>	 PROBLEM - SSH on parse1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[11:27:43] <wikibugs>	 (PS1) Muehlenhoff: Extend access by two weeks [puppet] - https://gerrit.wikimedia.org/r/1024634
[11:28:31] <claime>	 !log Deactivating puppet for parse1002 - T363086
[11:28:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:50] <stashbot>	 T363086: ManagementSSHDown parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T363086
[11:28:55] <wikibugs>	 (PS1) Btullis: Add a basic role for ceph:cephsdm [puppet] - https://gerrit.wikimedia.org/r/1024635 (https://phabricator.wikimedia.org/T363558)
[11:29:14] <claime>	 !log Forcing puppet run on deploy server - T363086
[11:29:15] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Extend access by two weeks [puppet] - https://gerrit.wikimedia.org/r/1024634 (owner: Muehlenhoff)
[11:29:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:40] <wikibugs>	 (PS2) Btullis: Add a basic role for ceph:cephadm [puppet] - https://gerrit.wikimedia.org/r/1024635 (https://phabricator.wikimedia.org/T363558)
[11:29:41] <icinga-wm>	 RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:29:43] <wikibugs>	 (CR) Muehlenhoff: [V:+2 C:+2] Remove obsolete dummy certs [labs/private] - https://gerrit.wikimedia.org/r/1024421 (https://phabricator.wikimedia.org/T360439) (owner: Muehlenhoff)
[11:32:25] <wikibugs>	 (CR) Btullis: [C:+2] Add a basic role for ceph:cephadm [puppet] - https://gerrit.wikimedia.org/r/1024635 (https://phabricator.wikimedia.org/T363558) (owner: Btullis)
[11:32:30] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestagemaster2003.codfw.wmnet with reason: host reimage
[11:33:17] <icinga-wm>	 PROBLEM - Host parse1002 is DOWN: PING CRITICAL - Packet loss = 100%
[11:33:27] <claime>	 !log Forcing puppet run on O:alerting_host - T363086
[11:33:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:50] <stashbot>	 T363086: ManagementSSHDown parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T363086
[11:34:17] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T352010)', diff saved to https://phabricator.wikimedia.org/P61247 and previous config saved to /var/cache/conftool/dbconfig/20240426-113416-ladsgroup.json
[11:34:34] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[11:35:45] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestagemaster2003.codfw.wmnet with reason: host reimage
[11:35:56] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.ganeti.makevm for new host cephadm1001.eqiad.wmnet
[11:35:57] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.dns.netbox
[11:39:07] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM cephadm1001.eqiad.wmnet - btullis@cumin1002"
[11:42:09] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM cephadm1001.eqiad.wmnet - btullis@cumin1002"
[11:42:09] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:42:09] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.dns.wipe-cache cephadm1001.eqiad.wmnet on all recursors
[11:42:13] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cephadm1001.eqiad.wmnet on all recursors
[11:42:37] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM cephadm1001.eqiad.wmnet - btullis@cumin1002"
[11:43:29] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM cephadm1001.eqiad.wmnet - btullis@cumin1002"
[11:43:46] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host cephadm1001.eqiad.wmnet with OS bookworm
[11:43:57] <wikibugs>	 SRE, vm-requests: eqiad: 1 VMs requested for ceph cluster administration (cephadm) - https://phabricator.wikimedia.org/T363559#9747640 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host cephadm1001.eqiad.wmnet with OS bookworm
[11:45:00] <wikibugs>	 (PS5) Santiago Faci: Create the MPIC Kubernetes chart [deployment-charts] - https://gerrit.wikimedia.org/r/1021494 (https://phabricator.wikimedia.org/T361343)
[11:46:57] <wikibugs>	 (CR) Santiago Faci: "All changes have been made!" [deployment-charts] - https://gerrit.wikimedia.org/r/1021494 (https://phabricator.wikimedia.org/T361343) (owner: Santiago Faci)
[11:48:26] <wikibugs>	 (PS1) Muehlenhoff: miscweb: Enable profile::auto_restarts::service for rsync [puppet] - https://gerrit.wikimedia.org/r/1024640 (https://phabricator.wikimedia.org/T135991)
[11:49:24] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P61248 and previous config saved to /var/cache/conftool/dbconfig/20240426-114923-ladsgroup.json
[11:50:20] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestagemaster2003.codfw.wmnet with OS bullseye
[11:50:20] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host kubestagemaster2003.codfw.wmnet
[11:50:31] <wikibugs>	 SRE, Infrastructure-Foundations, Prod-Kubernetes, serviceops, and 3 others: Site: codfw 1 VM request for staging-codfw kube-apiserver - https://phabricator.wikimedia.org/T363310#9747660 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestagemaster20...
[11:53:41] <moritzm>	 !log uploaded debdeploy 0.0.99.14 to apt.wikimedia.org (for buster/bullseye/bookworm)
[11:53:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:53:46] <claime>	 !log Silencing all alerts matching parse1002.* for 4 days - T363086
[11:54:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:54:14] <stashbot>	 T363086: ManagementSSHDown parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T363086
[11:54:58] <wikibugs>	 ops-eqiad, SRE: ManagementSSHDown parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T363086#9747667 (Clement_Goubert) Downtimed, silence ID e5915daa-08f1-45f6-b805-fee5078d64da
[12:01:17] <wikibugs>	 (CR) Brouberol: [C:+1] "Looks good! Let's implement the helmfile containing both the mpic-next and mpic releases, and let's test this all out!" [deployment-charts] - https://gerrit.wikimedia.org/r/1021494 (https://phabricator.wikimedia.org/T361343) (owner: Santiago Faci)
[12:02:21] <wikibugs>	 (PS1) Muehlenhoff: cloudweb: Enable profile::auto_restarts::service for Envoy [puppet] - https://gerrit.wikimedia.org/r/1024647 (https://phabricator.wikimedia.org/T135991)
[12:04:33] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P61249 and previous config saved to /var/cache/conftool/dbconfig/20240426-120431-ladsgroup.json
[12:19:41] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T352010)', diff saved to https://phabricator.wikimedia.org/P61250 and previous config saved to /var/cache/conftool/dbconfig/20240426-121939-ladsgroup.json
[12:19:42] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2176.codfw.wmnet with reason: Maintenance
[12:19:45] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2176.codfw.wmnet with reason: Maintenance
[12:19:52] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2176 (T352010)', diff saved to https://phabricator.wikimedia.org/P61251 and previous config saved to /var/cache/conftool/dbconfig/20240426-121951-ladsgroup.json
[12:20:03] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[12:21:15] <wikibugs>	 (PS1) Elukey: Add overrides for Cassandra settings to restbase2021 [puppet] - https://gerrit.wikimedia.org/r/1024649 (https://phabricator.wikimedia.org/T352647)
[12:22:51] <wikibugs>	 (CR) Elukey: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2145/co" [puppet] - https://gerrit.wikimedia.org/r/1024649 (https://phabricator.wikimedia.org/T352647) (owner: Elukey)
[12:23:46] <wikibugs>	 (PS2) Elukey: Add overrides for Cassandra settings to restbase2021 [puppet] - https://gerrit.wikimedia.org/r/1024649 (https://phabricator.wikimedia.org/T352647)
[12:23:46] <wikibugs>	 (PS1) Elukey: role::cassandra_dev: clean-up after PKI TLS certs rollout [puppet] - https://gerrit.wikimedia.org/r/1024650 (https://phabricator.wikimedia.org/T352647)
[12:25:09] <wikibugs>	 (CR) Elukey: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2146/co" [puppet] - https://gerrit.wikimedia.org/r/1024650 (https://phabricator.wikimedia.org/T352647) (owner: Elukey)
[12:25:53] <wikibugs>	 (CR) Elukey: [C:+1] Disable boostrap mode on all k8s etcd clusters [puppet] - https://gerrit.wikimedia.org/r/1024395 (owner: JMeybohm)
[12:26:42] <logmsgbot>	 !log btullis@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cephadm1001.eqiad.wmnet with OS bookworm
[12:26:43] <logmsgbot>	 !log btullis@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host cephadm1001.eqiad.wmnet
[12:26:48] <wikibugs>	 SRE, vm-requests: eqiad: 1 VMs requested for ceph cluster administration (cephadm) - https://phabricator.wikimedia.org/T363559#9747726 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host cephadm1001.eqiad.wmnet with OS bookworm executed with errors: - cephadm10...
[12:27:04] <wikibugs>	 (PS1) Muehlenhoff: Deprecate system::role for Traffic services [puppet] - https://gerrit.wikimedia.org/r/1024651
[12:28:32] <wikibugs>	 (CR) Elukey: [C:+2] admin_ng: move Istio configs to mw-api-int-ro for ml-serve [deployment-charts] - https://gerrit.wikimedia.org/r/1021490 (https://phabricator.wikimedia.org/T353622) (owner: Elukey)
[12:28:44] <wikibugs>	 (CR) Elukey: [C:+2] kserve-inference: allow transparent proxy mode for revscoring isvcs [deployment-charts] - https://gerrit.wikimedia.org/r/1021981 (https://phabricator.wikimedia.org/T353622) (owner: Elukey)
[12:29:18] <wikibugs>	 (CR) Alexandros Kosiaris: [C:-1] "-1ing to signal that this isn't ready to go." [puppet] - https://gerrit.wikimedia.org/r/1024400 (https://phabricator.wikimedia.org/T363399) (owner: Alexandros Kosiaris)
[12:33:12] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[12:33:43] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[12:33:43] <wikibugs>	 (PS2) Elukey: ml-services: remove WIKI_URL from revscoring isvcs [deployment-charts] - https://gerrit.wikimedia.org/r/1023475 (https://phabricator.wikimedia.org/T353622)
[12:34:15] <wikibugs>	 (PS3) Elukey: ml-services: remove WIKI_URL from revscoring isvcs [deployment-charts] - https://gerrit.wikimedia.org/r/1023475 (https://phabricator.wikimedia.org/T353622)
[12:34:31] <wikibugs>	 (PS4) Elukey: ml-services: remove WIKI_URL from revscoring isvcs [deployment-charts] - https://gerrit.wikimedia.org/r/1023475 (https://phabricator.wikimedia.org/T353622)
[12:34:41] <wikibugs>	 (PS5) Elukey: ml-services: remove WIKI_URL from revscoring isvcs [deployment-charts] - https://gerrit.wikimedia.org/r/1023475 (https://phabricator.wikimedia.org/T353622)
[12:37:49] <wikibugs>	 (PS1) Elukey: admin_ng: fix MW API's Service Entry for ml-serve clusters [deployment-charts] - https://gerrit.wikimedia.org/r/1024653
[12:41:09] <wikibugs>	 SRE, collaboration-services, Wikimedia-Mailing-lists: Reimage physical lists hosts to have public IPs - https://phabricator.wikimedia.org/T363572 (eoghan) NEW
[12:41:33] <wikibugs>	 (CR) Elukey: [C:+2] admin_ng: fix MW API's Service Entry for ml-serve clusters [deployment-charts] - https://gerrit.wikimedia.org/r/1024653 (owner: Elukey)
[12:44:29] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[12:44:51] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[12:45:42] <wikibugs>	 (PS1) EoghanGaffney: mailman: Take ownership of lists hosts [puppet] - https://gerrit.wikimedia.org/r/1024655 (https://phabricator.wikimedia.org/T331706)
[12:46:13] <wikibugs>	 (PS2) EoghanGaffney: mailman: Change ownership of lists hosts to sre-collab [puppet] - https://gerrit.wikimedia.org/r/1024655 (https://phabricator.wikimedia.org/T331706)
[12:47:06] <wikibugs>	 (PS1) Muehlenhoff: sre.ganeti.makevm: Default to Puppet 7 for new VMs [cookbooks] - https://gerrit.wikimedia.org/r/1024656
[12:47:16] <icinga-wm>	 PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:48:20] <icinga-wm>	 RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:48:38] <wikibugs>	 (CR) Ilias Sarantopoulos: [C:+1] ml-services: remove WIKI_URL from revscoring isvcs [deployment-charts] - https://gerrit.wikimedia.org/r/1023475 (https://phabricator.wikimedia.org/T353622) (owner: Elukey)
[12:49:12] <wikibugs>	 (CR) Elukey: [C:+2] ml-services: remove WIKI_URL from revscoring isvcs [deployment-charts] - https://gerrit.wikimedia.org/r/1023475 (https://phabricator.wikimedia.org/T353622) (owner: Elukey)
[12:52:53] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[12:54:36] <wikibugs>	 (PS3) Awight: Revert temporary monitoring for scraper [puppet] - https://gerrit.wikimedia.org/r/1023152 (https://phabricator.wikimedia.org/T362904)
[12:55:10] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1024447 (https://phabricator.wikimedia.org/T363415) (owner: Dzahn)
[12:55:15] <wikibugs>	 (CR) Awight: [C:+1] "Job is complete, feel free to revert this config!" [puppet] - https://gerrit.wikimedia.org/r/1023152 (https://phabricator.wikimedia.org/T362904) (owner: Awight)
[12:56:40] <wikibugs>	 (PS1) Kormat: admin: (kormat) Switch to using set_proxy [puppet] - https://gerrit.wikimedia.org/r/1024657
[12:56:58] <wikibugs>	 (CR) LSobanski: [C:+1] mailman: Change ownership of lists hosts to sre-collab [puppet] - https://gerrit.wikimedia.org/r/1024655 (https://phabricator.wikimedia.org/T331706) (owner: EoghanGaffney)
[12:58:28] <wikibugs>	 (CR) Klausman: [C:+1] admin: (kormat) Switch to using set_proxy [puppet] - https://gerrit.wikimedia.org/r/1024657 (owner: Kormat)
[13:01:15] <wikibugs>	 (CR) JHathaway: [C:+1] sre.ganeti.makevm: Default to Puppet 7 for new VMs [cookbooks] - https://gerrit.wikimedia.org/r/1024656 (owner: Muehlenhoff)
[13:04:44] <wikibugs>	 SRE, LDAP-Access-Requests, WMF-NDA-Requests, Patch-For-Review: Grant Access to NDA for lina.farid - https://phabricator.wikimedia.org/T362959#9747863 (Dzahn) @Lina_Farid_WMDE Something still needs to be merged for that to work. It's in code review. It will be done soon though.
[13:05:32] <wikibugs>	 (PS1) Elukey: kserve-inference: improve transparent proxy settings for revscoring [deployment-charts] - https://gerrit.wikimedia.org/r/1024661 (https://phabricator.wikimedia.org/T353622)
[13:05:34] <wikibugs>	 (PS1) Elukey: ml-services: set host for wikidata's isvc in revscoring staging pods [deployment-charts] - https://gerrit.wikimedia.org/r/1024662 (https://phabricator.wikimedia.org/T353622)
[13:06:23] <wikibugs>	 ops-codfw, SRE, Infrastructure-Foundations: Broadcom NICs with recent firmware fail to reimage - https://phabricator.wikimedia.org/T363576 (cmooney) NEW p:Triage→Medium
[13:07:11] <wikibugs>	 ops-codfw, SRE, Infrastructure-Foundations: Broadcom NICs with recent firmware fail to reimage - https://phabricator.wikimedia.org/T363576#9747884 (cmooney)
[13:07:13] <wikibugs>	 ops-codfw, ops-eqiad, SRE, SRE-swift-storage, and 2 others: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9747885 (cmooney)
[13:07:14] <wikibugs>	 SRE, Infrastructure-Foundations: Broadcom BCM57412 10G NIC and Bullseye installer - https://phabricator.wikimedia.org/T286722#9747887 (cmooney)
[13:09:08] <wikibugs>	 (PS1) Muehlenhoff: aptrepo: Add new repository component and repo sync config for Node 20 [puppet] - https://gerrit.wikimedia.org/r/1024663 (https://phabricator.wikimedia.org/T362681)
[13:09:21] <wikibugs>	 ops-codfw, SRE, Infrastructure-Foundations: Broadcom NICs with recent firmware fail to reimage - https://phabricator.wikimedia.org/T363576#9747891 (cmooney)
[13:12:43] <wikibugs>	 (CR) Kormat: [C:+2] admin: (kormat) Switch to using set_proxy [puppet] - https://gerrit.wikimedia.org/r/1024657 (owner: Kormat)
[13:13:08] <wikibugs>	 (CR) Santiago Faci: [C:+2] "Thanks! Let's merge!" [deployment-charts] - https://gerrit.wikimedia.org/r/1021494 (https://phabricator.wikimedia.org/T361343) (owner: Santiago Faci)
[13:13:56] <wikibugs>	 (Merged) jenkins-bot: Create the MPIC Kubernetes chart [deployment-charts] - https://gerrit.wikimedia.org/r/1021494 (https://phabricator.wikimedia.org/T361343) (owner: Santiago Faci)
[13:14:26] <logmsgbot>	 !log eoghan@cumin1002 START - Cookbook sre.hosts.decommission for hosts lists2001.codfw.wmnet
[13:17:10] <icinga-wm>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 42 probes of 798 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[13:19:53] <wikibugs>	 (CR) Ilias Sarantopoulos: [C:+1] kserve-inference: improve transparent proxy settings for revscoring [deployment-charts] - https://gerrit.wikimedia.org/r/1024661 (https://phabricator.wikimedia.org/T353622) (owner: Elukey)
[13:21:10] <logmsgbot>	 !log eoghan@cumin1002 START - Cookbook sre.dns.netbox
[13:21:46] <wikibugs>	 (CR) Elukey: [C:+2] kserve-inference: improve transparent proxy settings for revscoring [deployment-charts] - https://gerrit.wikimedia.org/r/1024661 (https://phabricator.wikimedia.org/T353622) (owner: Elukey)
[13:22:09] <wikibugs>	 (CR) Elukey: [C:+2] ml-services: set host for wikidata's isvc in revscoring staging pods [deployment-charts] - https://gerrit.wikimedia.org/r/1024662 (https://phabricator.wikimedia.org/T353622) (owner: Elukey)
[13:22:10] <icinga-wm>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 27 probes of 798 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[13:23:37] <logmsgbot>	 !log eoghan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lists2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - eoghan@cumin1002"
[13:25:21] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[13:27:08] <wikibugs>	 (PS1) Muehlenhoff: Extend cloudbackup Cumin alias [puppet] - https://gerrit.wikimedia.org/r/1024686
[13:27:58] <logmsgbot>	 !log eoghan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lists2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - eoghan@cumin1002"
[13:27:59] <logmsgbot>	 !log eoghan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:27:59] <logmsgbot>	 !log eoghan@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts lists2001.codfw.wmnet
[13:28:05] <wikibugs>	 SRE, collaboration-services, Wikimedia-Mailing-lists: Reimage physical lists hosts to have public IPs - https://phabricator.wikimedia.org/T363572#9747919 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by eoghan@cumin1002 for hosts: `lists2001.codfw.wmnet` - lists2001.codfw.wmnet (**PA...
[13:28:06] <logmsgbot>	 !log akosiaris@cumin1002 conftool action : set/pooled=no; selector: name=elastic110[3-7]\.eqiad\.wmnet
[13:30:31] <wikibugs>	 (CR) Ayounsi: [C:+2] magru: update edgeuno transit IP [homer/public] - https://gerrit.wikimedia.org/r/1024516 (https://phabricator.wikimedia.org/T362421) (owner: Ayounsi)
[13:31:20] <wikibugs>	 (Merged) jenkins-bot: magru: update edgeuno transit IP [homer/public] - https://gerrit.wikimedia.org/r/1024516 (https://phabricator.wikimedia.org/T362421) (owner: Ayounsi)
[13:33:18] <wikibugs>	 ops-codfw, SRE, Infrastructure-Foundations: Broadcom NICs with recent firmware fail to reimage - https://phabricator.wikimedia.org/T363576#9747947 (cmooney)
[13:35:32] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 92 probes of 733 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[13:36:01] <wikibugs>	 (PS1) Muehlenhoff: Add cephadm globbing to partition selection [puppet] - https://gerrit.wikimedia.org/r/1024689 (https://phabricator.wikimedia.org/T363559)
[13:37:28] <wikibugs>	 (CR) Btullis: [C:+2] Add cephadm globbing to partition selection [puppet] - https://gerrit.wikimedia.org/r/1024689 (https://phabricator.wikimedia.org/T363559) (owner: Muehlenhoff)
[13:44:48] <wikibugs>	 (PS1) Muehlenhoff: netbox-standalone: Enable profile::auto_restarts::service for Redis [puppet] - https://gerrit.wikimedia.org/r/1024690 (https://phabricator.wikimedia.org/T135991)
[13:45:20] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host cephadm1001.eqiad.wmnet with OS bookworm
[13:45:30] <wikibugs>	 SRE, vm-requests, Patch-For-Review: eqiad: 1 VMs requested for ceph cluster administration (cephadm) - https://phabricator.wikimedia.org/T363559#9747965 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host cephadm1001.eqiad.wmnet with OS bookworm
[13:45:32] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 84 probes of 733 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[13:47:42] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubestagemaster2003.codfw.wmnet to plain
[13:48:14] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of kubestagemaster2003.codfw.wmnet to plain
[13:50:55] <wikibugs>	 (PS1) Elukey: role::sessionstore: upgrade the Cassandra's Truststore [puppet] - https://gerrit.wikimedia.org/r/1024691 (https://phabricator.wikimedia.org/T352647)
[13:52:32] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 102 probes of 733 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[13:52:58] <wikibugs>	 (CR) Elukey: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2147/co" [puppet] - https://gerrit.wikimedia.org/r/1024691 (https://phabricator.wikimedia.org/T352647) (owner: Elukey)
[13:56:34] <icinga-wm>	 PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:57:28] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cephadm1001.eqiad.wmnet with reason: host reimage
[14:00:36] <icinga-wm>	 RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:00:53] <wikibugs>	 (CR) Eevans: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1024649 (https://phabricator.wikimedia.org/T352647) (owner: Elukey)
[14:01:22] <wikibugs>	 (PS1) Muehlenhoff: kerberos: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1024694
[14:02:01] <wikibugs>	 ops-eqiad, SRE, Cassandra: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T362841#9748017 (Eevans) The first device is done rebuilding:  `lang=sh-session eevans@aqs1014:~$ sudo mdadm --detail /dev/md2 /dev/md2:            Version : 1.2      Creation Time : Tue Mar  9 14:18:06 2021...
[14:02:42] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.hosts.reboot-single for host aqs1014.eqiad.wmnet
[14:02:47] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cephadm1001.eqiad.wmnet with reason: host reimage
[14:03:05] <wikibugs>	 ops-eqiad, SRE, Cassandra: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9748019 (ops-monitoring-bot) Host rebooted by eevans@cumin1002 with reason: None
[14:04:41] <wikibugs>	 (PS1) Btullis: Use an LVM volume for /var/lib/ceph on cephosd nodes [puppet] - https://gerrit.wikimedia.org/r/1024695 (https://phabricator.wikimedia.org/T324660)
[14:05:46] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1024694 (owner: Muehlenhoff)
[14:08:52] <jinxer-wm>	 (ProbeDown) firing: (4) Service aqs1014-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:10:37] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aqs1014.eqiad.wmnet
[14:10:40] <icinga-wm>	 ACKNOWLEDGEMENT - MD RAID on aqs1014 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T363580 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[14:10:45] <wikibugs>	 ops-eqiad, SRE: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T363580 (ops-monitoring-bot) NEW
[14:13:52] <jinxer-wm>	 (ProbeDown) resolved: (4) Service aqs1014-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:15:52] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cephadm1001.eqiad.wmnet with OS bookworm
[14:16:01] <wikibugs>	 SRE, vm-requests, Patch-For-Review: eqiad: 1 VMs requested for ceph cluster administration (cephadm) - https://phabricator.wikimedia.org/T363559#9748045 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host cephadm1001.eqiad.wmnet with OS bookworm completed:...
[14:16:24] <wikibugs>	 (CR) Eevans: [C:+1] role::cassandra_dev: clean-up after PKI TLS certs rollout [puppet] - https://gerrit.wikimedia.org/r/1024650 (https://phabricator.wikimedia.org/T352647) (owner: Elukey)
[14:19:22] <wikibugs>	 (CR) Elukey: [V:+1 C:+2] role::cassandra_dev: clean-up after PKI TLS certs rollout [puppet] - https://gerrit.wikimedia.org/r/1024650 (https://phabricator.wikimedia.org/T352647) (owner: Elukey)
[14:19:25] <jinxer-wm>	 (SystemdUnitFailed) firing: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:21:51] <icinga-wm>	 PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:22:55] <icinga-wm>	 RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:30:31] <wikibugs>	 (CR) JHathaway: [C:+1] kerberos: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1024694 (owner: Muehlenhoff)
[14:34:54] <wikibugs>	 (CR) Elukey: [C:+2] Add overrides for Cassandra settings to restbase2021 [puppet] - https://gerrit.wikimedia.org/r/1024649 (https://phabricator.wikimedia.org/T352647) (owner: Elukey)
[14:38:44] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Cloud IPv6 subnets - https://phabricator.wikimedia.org/T187929#9748100 (cmooney) There are a few elements here to consider:  ######Existing cloud-hosts private IPv6 ranges  The existing cloud-hosts vlans, in the WMF production realm, have IPs from the wider WM...
[14:38:52] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:38:56] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase2021.codfw.wmnet: Move to PKI TLS certs - elukey@cumin1002
[14:40:41] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: ml-services: update revertrisk image to support all wikis [deployment-charts] - 10https://gerrit.wikimedia.org/r/1024699 (https://phabricator.wikimedia.org/T363203)
[14:41:23] <wikibugs>	 SRE, vm-requests, Patch-For-Review: eqiad: 1 VMs requested for ceph cluster administration (cephadm) - https://phabricator.wikimedia.org/T363559#9748105 (BTullis) Open→Resolved
[14:45:28] <wikibugs>	 (03PS2) 10Elukey: role::sessionstore: upgrade the Cassandra's Truststore [puppet] - 10https://gerrit.wikimedia.org/r/1024691 (https://phabricator.wikimedia.org/T352647)
[14:47:38] <wikibugs>	 SRE, collaboration-services, Wikimedia-Mailing-lists, Patch-For-Review: Migrate Mailman/lists to Bullseye/Bookworm - https://phabricator.wikimedia.org/T331706#9748109 (eoghan) a:jhathaway→eoghan
[14:48:14] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase2021.codfw.wmnet: Move to PKI TLS certs - elukey@cumin1002
[14:48:57] <icinga-wm>	 PROBLEM - cassandra-c SSL 10.192.16.155:7000 on restbase2021 is CRITICAL: SSL CRITICAL - failed to verify restbase2021-c against restbase2021-c.codfw.wmnet, cassandra, restbase2021.codfw.wmnet:Certificate restbase2021-c.codfw.wmnet valid until 2024-05-24 14:32:00 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[14:48:57] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.192.16.154:7000 on restbase2021 is CRITICAL: SSL CRITICAL - failed to verify restbase2021-b against restbase2021-b.codfw.wmnet, cassandra, restbase2021.codfw.wmnet:Certificate restbase2021-b.codfw.wmnet valid until 2024-05-24 14:32:00 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[14:48:57] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.192.16.153:7000 on restbase2021 is CRITICAL: SSL CRITICAL - failed to verify restbase2021-a against restbase2021-a.codfw.wmnet, cassandra, restbase2021.codfw.wmnet:Certificate restbase2021-a.codfw.wmnet valid until 2024-05-24 14:32:00 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[14:49:11] <wikibugs>	 (03PS1) 10Btullis: Start switching cephosd servers to cephadm management [puppet] - 10https://gerrit.wikimedia.org/r/1024702 (https://phabricator.wikimedia.org/T363558)
[14:49:25] <jinxer-wm>	 (SystemdUnitFailed) resolved: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:54:08] <wikibugs>	 (03PS2) 10Btullis: Start switching cephosd servers to cephadm management [puppet] - 10https://gerrit.wikimedia.org/r/1024702 (https://phabricator.wikimedia.org/T363558)
[14:55:15] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] wmnet: add CNAME records for commons-impact-analytics (k8s ingress) [dns] - 10https://gerrit.wikimedia.org/r/1023964 (https://phabricator.wikimedia.org/T361835) (owner: 10Scott French)
[14:55:28] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] service: add commons-impact-analytics AQS 2.0 service [puppet] - 10https://gerrit.wikimedia.org/r/1023961 (https://phabricator.wikimedia.org/T361835) (owner: 10Scott French)
[14:56:57] <wikibugs>	 (CR) AikoChou: ml-services: update revertrisk image to support all wikis (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1024699 (https://phabricator.wikimedia.org/T363203) (owner: Ilias Sarantopoulos)
[14:57:34] <wikibugs>	 (03PS1) 10Btullis: Add docker engine to the ceph::cephadm role [puppet] - 10https://gerrit.wikimedia.org/r/1024703 (https://phabricator.wikimedia.org/T363558)
[14:58:53] <wikibugs>	 (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2149/co" [puppet] - 10https://gerrit.wikimedia.org/r/1024702 (https://phabricator.wikimedia.org/T363558) (owner: 10Btullis)
[15:03:59] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:04:42] <wikibugs>	 (03CR) 10Elukey: "Hey Ben, I saw the code change passing by, got interested. If you don't mind me asking, was this use case discussed with Service Ops? IIRC" [puppet] - 10https://gerrit.wikimedia.org/r/1024703 (https://phabricator.wikimedia.org/T363558) (owner: 10Btullis)
[15:06:42] <wikibugs>	 (03PS3) 10EoghanGaffney: mailman: Change ownership of lists hosts to sre-collab and rename [puppet] - 10https://gerrit.wikimedia.org/r/1024655 (https://phabricator.wikimedia.org/T331706)
[15:07:34] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.decommission for hosts kubestagemaster2003.codfw.wmnet
[15:10:22] <wikibugs>	 (03CR) 10Btullis: [C:04-2] "Thanks elukey. No, I haven't discussed it yet. I'll do so now." [puppet] - 10https://gerrit.wikimedia.org/r/1024703 (https://phabricator.wikimedia.org/T363558) (owner: 10Btullis)
[15:11:01] <wikibugs>	 (03CR) 10Btullis: [V:03+1 C:04-2] "Setting to -2 while the use-case for docker is discussed." [puppet] - 10https://gerrit.wikimedia.org/r/1024702 (https://phabricator.wikimedia.org/T363558) (owner: 10Btullis)
[15:12:07] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.dns.netbox
[15:12:41] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 50 probes of 733 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[15:13:43] <wikibugs>	 (03PS2) 10Ilias Sarantopoulos: ml-services: update revertrisk image to support all wikis [deployment-charts] - 10https://gerrit.wikimedia.org/r/1024699 (https://phabricator.wikimedia.org/T363203)
[15:14:04] <wikibugs>	 (CR) Ilias Sarantopoulos: ml-services: update revertrisk image to support all wikis (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1024699 (https://phabricator.wikimedia.org/T363203) (owner: Ilias Sarantopoulos)
[15:14:50] <logmsgbot>	 !log bking@cumin2002 conftool action : set/weight=20:pooled=yes; selector: name=elastic1103\.eqiad\.wmnet
[15:16:11] <icinga-wm>	 PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.3 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:16:27] <jinxer-wm>	 (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[15:16:31] <wikibugs>	 (03CR) 10LSobanski: [C:03+1] mailman: Change ownership of lists hosts to sre-collab and rename [puppet] - 10https://gerrit.wikimedia.org/r/1024655 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney)
[15:17:13] <icinga-wm>	 RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:17:48] <wikibugs>	 (03CR) 10AikoChou: [C:03+1] ml-services: update revertrisk image to support all wikis [deployment-charts] - 10https://gerrit.wikimedia.org/r/1024699 (https://phabricator.wikimedia.org/T363203) (owner: 10Ilias Sarantopoulos)
[15:19:07] <logmsgbot>	 !log cmooney@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[15:19:14] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host lists2001
[15:20:26] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:20:26] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host lists2001
[15:21:02] <wikibugs>	 06SRE, 06serviceops: Container Image policy for non-k8s uses - https://phabricator.wikimedia.org/T357441#9748210 (10BTullis) I won't reopen this ticket, but I would like to draw your collective attention to this ticket, if I may: {T363558} The use-case is very similar to that discussed here, but the questi...
[15:22:51] <denisse>	 !log Downtiming the Prometheus PoP hosts as part of the cergen to CFSSL migration - T360414
[15:22:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:22:57] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kubestagemaster2003.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jayme@cumin1002"
[15:23:06] <stashbot>	 T360414: Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414
[15:24:36] <logmsgbot>	 !log denisse@cumin2002 START - Cookbook sre.hosts.downtime for 0:30:00 on prometheus6002.drmrs.wmnet,prometheus5002.eqsin.wmnet,prometheus3003.esams.wmnet,prometheus4002.ulsfo.wmnet with reason: Downtiming the Prometheus PoP hosts part of the cergen to CFSSL migration - T360414
[15:24:46] <denisse>	 !log Disabling Puppet on the Prometheus PoP hosts as part of the cergen to CFSSL migration - T360414
[15:24:55] <logmsgbot>	 !log denisse@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on prometheus6002.drmrs.wmnet,prometheus5002.eqsin.wmnet,prometheus3003.esams.wmnet,prometheus4002.ulsfo.wmnet with reason: Downtiming the Prometheus PoP hosts part of the cergen to CFSSL migration - T360414
[15:25:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:25:20] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: update revertrisk image to support all wikis [deployment-charts] - 10https://gerrit.wikimedia.org/r/1024699 (https://phabricator.wikimedia.org/T363203) (owner: 10Ilias Sarantopoulos)
[15:25:39] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kubestagemaster2003.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jayme@cumin1002"
[15:25:39] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:25:40] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts kubestagemaster2003.codfw.wmnet
[15:25:51] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Prod-Kubernetes, 06serviceops, and 3 others: Site: codfw 1 VM request for staging-codfw kube-apiserver - https://phabricator.wikimedia.org/T363310#9748260 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jayme@cumin1002 for hosts: `kubestagemaster20...
[15:25:57] <wikibugs>	 (03PS1) 10JHathaway: postfix: mx-{in,out} test data [labs/private] - 10https://gerrit.wikimedia.org/r/1024726
[15:26:15] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: update revertrisk image to support all wikis [deployment-charts] - 10https://gerrit.wikimedia.org/r/1024699 (https://phabricator.wikimedia.org/T363203) (owner: 10Ilias Sarantopoulos)
[15:26:57] <wikibugs>	 (03CR) 10Andrea Denisse: [V:03+1 C:03+2] prometheus: Ensure TLS certificates are provided by CFSSL [puppet] - 10https://gerrit.wikimedia.org/r/1023917 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse)
[15:27:13] <wikibugs>	 (CR) Andrea Denisse: [V:+1 C:+2] prometheus: Ensure TLS certificates are provided by CFSSL (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1023917 (https://phabricator.wikimedia.org/T360414) (owner: Andrea Denisse)
[15:27:24] <wikibugs>	 (03PS2) 10JHathaway: postfix: mx-{in,out} test data [labs/private] - 10https://gerrit.wikimedia.org/r/1024726 (https://phabricator.wikimedia.org/T325398)
[15:28:41] <wikibugs>	 (03CR) 10Eevans: [C:03+1] "LGTM; Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1024611 (https://phabricator.wikimedia.org/T361645) (owner: 10Btullis)
[15:28:47] <denisse>	 !log testing patch #1023917 on prometheus6002
[15:28:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:28:59] <denisse>	 !log testing patch #1023917 on prometheus6002 - T360414
[15:29:11] <icinga-wm>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 43 probes of 798 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[15:29:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:29:24] <stashbot>	 T360414: Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414
[15:31:41] <logmsgbot>	 !log bking@cumin2002 conftool action : set/weight=20:pooled=yes; selector: name=elastic1104\.eqiad\.wmnet
[15:31:50] <wikibugs>	 (03CR) 10EoghanGaffney: [C:03+2] mailman: Change ownership of lists hosts to sre-collab and rename [puppet] - 10https://gerrit.wikimedia.org/r/1024655 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney)
[15:33:39] <wikibugs>	 (03CR) 10Elukey: "I see that we already have https://phabricator.wikimedia.org/T357441 so it should be enough for the moment, please go ahead if you want :)" [puppet] - 10https://gerrit.wikimedia.org/r/1024702 (https://phabricator.wikimedia.org/T363558) (owner: 10Btullis)
[15:34:10] <icinga-wm>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 28 probes of 798 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[15:34:51] <logmsgbot>	 !log eoghan@cumin1002 START - Cookbook sre.dns.netbox
[15:36:13] <logmsgbot>	 !log eoghan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:36:25] <wikibugs>	 (03PS1) 10JHathaway: postfix: add some recommended hardening settings [puppet] - 10https://gerrit.wikimedia.org/r/1024729 (https://phabricator.wikimedia.org/T325398)
[15:36:44] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] postfix: mx-{in,out} test data [labs/private] - 10https://gerrit.wikimedia.org/r/1024726 (https://phabricator.wikimedia.org/T325398) (owner: 10JHathaway)
[15:36:50] <wikibugs>	 (03CR) 10JHathaway: [V:03+2 C:03+2] postfix: mx-{in,out} test data [labs/private] - 10https://gerrit.wikimedia.org/r/1024726 (https://phabricator.wikimedia.org/T325398) (owner: 10JHathaway)
[15:41:48] <wikibugs>	 (03PS1) 10Hnowlan: Enable async upload-by-URL on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024731 (https://phabricator.wikimedia.org/T295007)
[15:42:30] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Enable async upload-by-URL on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024731 (https://phabricator.wikimedia.org/T295007) (owner: 10Hnowlan)
[15:43:03] <logmsgbot>	 !log bking@cumin2002 conftool action : set/weight=20:pooled=yes; selector: name=elastic1105\.eqiad\.wmnet
[15:43:21] <wikibugs>	 (03PS2) 10Hnowlan: Enable async upload-by-URL on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024731 (https://phabricator.wikimedia.org/T295007)
[15:43:27] <wikibugs>	 (03CR) 10Ladsgroup: [C:04-1] "This is beta cluster, enable it everywhere." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024731 (https://phabricator.wikimedia.org/T295007) (owner: 10Hnowlan)
[15:44:03] <wikibugs>	 (03CR) 10Ladsgroup: "I'm not sure we even have testwiki in beta cluster, that's like the inception movie." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024731 (https://phabricator.wikimedia.org/T295007) (owner: 10Hnowlan)
[15:46:20] <denisse>	 !log Enabling Puppet on the Prometheus PoP hosts as part of the cergen to CFSSL migration - T360414
[15:46:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:46:44] <stashbot>	 T360414: Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414
[15:52:50] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+1] "To test what? How broken beta cluster is? lol" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024731 (https://phabricator.wikimedia.org/T295007) (owner: 10Hnowlan)
[15:53:32] <logmsgbot>	 !log eoghan@cumin1002 START - Cookbook sre.hosts.provision for host lists2001.mgmt.codfw.wmnet with reboot policy FORCED
[15:55:35] <vgutierrez>	 !log depool ncredir6001
[15:55:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:56:11] <logmsgbot>	 !log bking@cumin2002 conftool action : set/pooled=no; selector: name=elastic1103\.eqiad\.wmnet
[15:56:20] <logmsgbot>	 !log bking@cumin2002 conftool action : set/pooled=no; selector: name=elastic1104\.eqiad\.wmnet
[15:56:33] <wikibugs>	 (03PS3) 10Elukey: role::sessionstore: upgrade the Cassandra's Truststore [puppet] - 10https://gerrit.wikimedia.org/r/1024691 (https://phabricator.wikimedia.org/T352647)
[15:56:34] <wikibugs>	 (03PS1) 10Elukey: role::restbase::production: move Cassandra codfw instances to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1024736 (https://phabricator.wikimedia.org/T352647)
[15:56:35] <wikibugs>	 (03PS1) 10Elukey: role::restbase::production: move eqiad Cassandra instances to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1024737 (https://phabricator.wikimedia.org/T352647)
[15:58:19] <wikibugs>	 (03PS1) 10Elukey: role::restbase::production: cleanup after PKI migration [puppet] - 10https://gerrit.wikimedia.org/r/1024738 (https://phabricator.wikimedia.org/T352647)
[15:59:12] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Mail: Provision mx-out - https://phabricator.wikimedia.org/T325407#9748391 (10jhathaway)
[16:00:04] <wikibugs>	 (03PS1) 10JHathaway: postfix: mx_out role [puppet] - 10https://gerrit.wikimedia.org/r/1024739 (https://phabricator.wikimedia.org/T325398)
[16:00:06] <wikibugs>	 (03PS1) 10JHathaway: postfix: take mx_out boxes out of insetup [puppet] - 10https://gerrit.wikimedia.org/r/1024740 (https://phabricator.wikimedia.org/T325407)
[16:00:08] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Mail: Provision mx-out - https://phabricator.wikimedia.org/T325407#9748389 (10jhathaway)
[16:00:08] <wikibugs>	 (03PS1) 10JHathaway: postfix: mx-out hiera data [puppet] - 10https://gerrit.wikimedia.org/r/1024741 (https://phabricator.wikimedia.org/T325407)
[16:00:17] <wikibugs>	 (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1024736 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey)
[16:00:26] <wikibugs>	 (03CR) 10CI reject: [V:04-1] postfix: mx_out role [puppet] - 10https://gerrit.wikimedia.org/r/1024739 (https://phabricator.wikimedia.org/T325398) (owner: 10JHathaway)
[16:02:19] <wikibugs>	 (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1024737 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey)
[16:02:34] <wikibugs>	 (03PS2) 10JHathaway: postfix: mx_out role [puppet] - 10https://gerrit.wikimedia.org/r/1024739 (https://phabricator.wikimedia.org/T325398)
[16:02:35] <wikibugs>	 (03PS2) 10JHathaway: postfix: take mx_out boxes out of insetup [puppet] - 10https://gerrit.wikimedia.org/r/1024740 (https://phabricator.wikimedia.org/T325407)
[16:02:35] <wikibugs>	 (03PS2) 10JHathaway: postfix: mx-out hiera data [puppet] - 10https://gerrit.wikimedia.org/r/1024741 (https://phabricator.wikimedia.org/T325407)
[16:02:55] <wikibugs>	 (03CR) 10CI reject: [V:04-1] postfix: mx_out role [puppet] - 10https://gerrit.wikimedia.org/r/1024739 (https://phabricator.wikimedia.org/T325398) (owner: 10JHathaway)
[16:06:17] <vgutierrez>	 !log repool ncredir6001
[16:06:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:08:30] <wikibugs>	 (03PS2) 10Btullis: Use an LVM volume for /var/lib/ceph on cephosd nodes [puppet] - 10https://gerrit.wikimedia.org/r/1024695 (https://phabricator.wikimedia.org/T324660)
[16:08:30] <wikibugs>	 (03PS3) 10Btullis: Start switching cephosd servers to cephadm management [puppet] - 10https://gerrit.wikimedia.org/r/1024702 (https://phabricator.wikimedia.org/T363558)
[16:10:45] <wikibugs>	 (03PS1) 10Btullis: Update the DPE ceph cluster to reef [puppet] - 10https://gerrit.wikimedia.org/r/1024742 (https://phabricator.wikimedia.org/T362993)
[16:11:22] <wikibugs>	 (03PS2) 10Btullis: Update the DPE ceph cluster to reef [puppet] - 10https://gerrit.wikimedia.org/r/1024742 (https://phabricator.wikimedia.org/T362993)
[16:12:19] <wikibugs>	 (03PS1) 10Elukey: ml-services: fix wikidata host header for ml-staging's revscoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/1024743
[16:13:11] <wikibugs>	 (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2153/co" [puppet] - 10https://gerrit.wikimedia.org/r/1024742 (https://phabricator.wikimedia.org/T362993) (owner: 10Btullis)
[16:13:20] <wikibugs>	 (03CR) 10Elukey: [C:03+2] ml-services: fix wikidata host header for ml-staging's revscoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/1024743 (owner: 10Elukey)
[16:16:12] <wikibugs>	 (03CR) 10Btullis: [V:03+1 C:03+2] Update the ownership of the aqs cassandra cluster [puppet] - 10https://gerrit.wikimedia.org/r/1024611 (https://phabricator.wikimedia.org/T361645) (owner: 10Btullis)
[16:16:48] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[16:17:17] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[16:17:22] <icinga-wm>	 PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:17:40] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[16:18:06] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[16:18:37] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[16:18:42] <wikibugs>	 (03PS3) 10JHathaway: postfix: mx_out role [puppet] - 10https://gerrit.wikimedia.org/r/1024739 (https://phabricator.wikimedia.org/T325398)
[16:18:42] <wikibugs>	 (03PS3) 10JHathaway: postfix: take mx_out boxes out of insetup [puppet] - 10https://gerrit.wikimedia.org/r/1024740 (https://phabricator.wikimedia.org/T325407)
[16:18:42] <wikibugs>	 (03PS3) 10JHathaway: postfix: mx-out hiera data [puppet] - 10https://gerrit.wikimedia.org/r/1024741 (https://phabricator.wikimedia.org/T325407)
[16:18:57] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[16:20:06] <logmsgbot>	 !log eoghan@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host lists2001.mgmt.codfw.wmnet with reboot policy FORCED
[16:20:53] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[16:22:42] <logmsgbot>	 !log eoghan@cumin1002 START - Cookbook sre.hosts.reimage for host lists2001.wikimedia.org with OS bookworm
[16:22:54] <wikibugs>	 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Reimage physical lists hosts to have public IPs - https://phabricator.wikimedia.org/T363572#9748515 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eoghan@cumin1002 for host lists2001.wikimedia.org with OS bookworm
[16:23:28] <icinga-wm>	 RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:27:00] <wikibugs>	 (03PS1) 10Andrea Denisse: ssl: Delete dummy TLS key for the Prometheus hosts [labs/private] - 10https://gerrit.wikimedia.org/r/1024712 (https://phabricator.wikimedia.org/T360414)
[16:30:09] <denisse>	 !log Delete the unused Prometheus PoP TLS certificates in the private repository as part of the cergen to CFSSL migration - T360414
[16:30:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:30:39] <stashbot>	 T360414: Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414
[16:30:56] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+2] ssl: Delete dummy TLS key for the Prometheus hosts [labs/private] - 10https://gerrit.wikimedia.org/r/1024712 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse)
[16:30:58] <wikibugs>	 (03CR) 10Andrea Denisse: [V:03+2 C:03+2] ssl: Delete dummy TLS key for the Prometheus hosts [labs/private] - 10https://gerrit.wikimedia.org/r/1024712 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse)
[16:31:32] <icinga-wm>	 PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:32:43] <logmsgbot>	 !log bking@cumin2002 conftool action : set/pooled=no; selector: name=elastic1105\.eqiad\.wmnet
[16:32:59] <logmsgbot>	 !log bking@cumin2002 conftool action : set/weight=20:pooled=yes; selector: name=elastic1106\.eqiad\.wmnet
[16:35:36] <logmsgbot>	 !log bking@cumin2002 conftool action : set/pooled=no; selector: name=elastic1106\.eqiad\.wmnet
[16:35:48] <logmsgbot>	 !log bking@cumin2002 conftool action : set/weight=20:pooled=yes; selector: name=elastic1107\.eqiad\.wmnet
[16:38:42] <icinga-wm>	 RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:42:44] <icinga-wm>	 PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:44:36] <wikibugs>	 SRE, collaboration-services, Wikimedia-Mailing-lists: Reimage physical lists hosts to have public IPs - https://phabricator.wikimedia.org/T363572#9748606 (LSobanski) p:Triage→High
[16:44:48] <icinga-wm>	 RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:48:39] <wikibugs>	 (03PS1) 10Cory Massaro: Enable wasmedge resource limits in Wikifunctions production services. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1024767
[16:53:29] <wikibugs>	 (03CR) 10BCornwall: [C:03+2] admin: add Lina Farid to LDAP_only (nda) [puppet] - 10https://gerrit.wikimedia.org/r/1024449 (https://phabricator.wikimedia.org/T362959) (owner: 10BCornwall)
[16:57:52] <wikibugs>	 (03PS1) 10Btullis: Fix the cephosd server reimages [puppet] - 10https://gerrit.wikimedia.org/r/1024773 (https://phabricator.wikimedia.org/T362993)
[16:57:58] <logmsgbot>	 !log bking@cumin2002 conftool action : set/pooled=no; selector: name=elastic1107\.eqiad\.wmnet
[16:58:21] <wikibugs>	 (03PS2) 10Btullis: Fix the cephosd server reimages [puppet] - 10https://gerrit.wikimedia.org/r/1024773 (https://phabricator.wikimedia.org/T362993)
[17:00:03] <wikibugs>	 (03PS1) 10Cathal Mooney: Add new vlan sub-interfaces to eqiad LVS for racks E5-7 and F5-7 [puppet] - 10https://gerrit.wikimedia.org/r/1024776 (https://phabricator.wikimedia.org/T334230)
[17:00:41] <wikibugs>	 (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2154/console" [puppet] - 10https://gerrit.wikimedia.org/r/1024773 (https://phabricator.wikimedia.org/T362993) (owner: 10Btullis)
[17:00:54] <wikibugs>	 SRE, LDAP-Access-Requests, WMF-NDA-Requests, Patch-For-Review: Grant Access to NDA for lina.farid - https://phabricator.wikimedia.org/T362959#9748700 (BCornwall) In progress→Resolved Thanks for your patience, @Lina_Farid_WMDE. We've merged the code now and you should be able to access...
[17:01:07] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9748716 (10BTullis)
[17:02:11] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[17:02:12] <wikibugs>	 (03PS2) 10Cathal Mooney: Add new vlan sub-interfaces to eqiad LVS for racks E5-7 and F5-7 [puppet] - 10https://gerrit.wikimedia.org/r/1024776 (https://phabricator.wikimedia.org/T334230)
[17:04:14] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Looks good to me, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1024625 (owner: 10Muehlenhoff)
[17:05:04] <wikibugs>	 (03CR) 10Bking: [C:03+1] Add new vlan sub-interfaces to eqiad LVS for racks E5-7 and F5-7 [puppet] - 10https://gerrit.wikimedia.org/r/1024776 (https://phabricator.wikimedia.org/T334230) (owner: 10Cathal Mooney)
[17:05:25] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Add new vlan sub-interfaces to eqiad LVS for racks E5-7 and F5-7 [puppet] - 10https://gerrit.wikimedia.org/r/1024776 (https://phabricator.wikimedia.org/T334230) (owner: 10Cathal Mooney)
[17:14:54] <logmsgbot>	 !log eoghan@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lists2001.wikimedia.org with OS bookworm
[17:15:00] <wikibugs>	 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Reimage physical lists hosts to have public IPs - https://phabricator.wikimedia.org/T363572#9748784 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eoghan@cumin1002 for host lists2001.wikimedia.org with OS bookworm executed wi...
[17:15:25] <wikibugs>	 (03PS1) 10Cathal Mooney: Add new vlan names to LVS balancer config for eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1024782 (https://phabricator.wikimedia.org/T334230)
[17:16:01] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Add new vlan names to LVS balancer config for eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1024782 (https://phabricator.wikimedia.org/T334230) (owner: 10Cathal Mooney)
[17:27:37] <logmsgbot>	 !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=elastic110[3-7]\.eqiad\.wmnet
[17:44:40] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on mw1356:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1356 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[17:46:46] <wikibugs>	 (03CR) 10BCornwall: purged: add PKI cert handling (2 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1019866 (owner: 10CDobbins)
[17:47:03] <logmsgbot>	 !log dancy@deploy1002 Installing scap version "4.80.0" for 325 hosts
[17:47:49] <logmsgbot>	 !log dancy@deploy1002 Installation of scap version "4.80.0" completed for 325 hosts
[17:48:27] <logmsgbot>	 !log dancy@deploy1002 Started scap: Testing T325530
[17:48:48] <stashbot>	 T325530: scap: hide helmfile operations behind a progress bar - https://phabricator.wikimedia.org/T325530
[17:49:40] <jinxer-wm>	 (KubernetesRsyslogDown) resolved: rsyslog on mw1356:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1356 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[17:50:36] <wikibugs>	 06SRE, 06Traffic: Migrate purged away from cergen-issued certificate - https://phabricator.wikimedia.org/T360506#9748958 (10CDobbins) CR: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1019866  Description of changes:  * Add a feature flag `profile::cache::purged::use_pki` to control whether to use cfssl...
[17:51:56] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1355 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[17:56:16] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2021 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[17:57:42] <logmsgbot>	 !log dancy@deploy1002 Finished scap: Testing T325530 (duration: 09m 14s)
[17:57:59] <stashbot>	 T325530: scap: hide helmfile operations behind a progress bar - https://phabricator.wikimedia.org/T325530
[18:04:37] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] netbox-standalone: Enable profile::auto_restarts::service for Redis [puppet] - 10https://gerrit.wikimedia.org/r/1024690 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[18:08:40] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on mw1356:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1356 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[18:13:40] <jinxer-wm>	 (KubernetesRsyslogDown) resolved: rsyslog on mw1356:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1356 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[18:21:56] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1355 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[18:23:20] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T352010)', diff saved to https://phabricator.wikimedia.org/P61256 and previous config saved to /var/cache/conftool/dbconfig/20240426-182320-ladsgroup.json
[18:23:49] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[18:26:16] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2021 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[18:38:28] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P61257 and previous config saved to /var/cache/conftool/dbconfig/20240426-183827-ladsgroup.json
[18:39:15] <wikibugs>	 06SRE, 10observability, 13Patch-For-Review, 10SRE Observability (FY2023/2024-Q4): Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414#9749374 (10andrea.denisse)
[18:53:35] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P61258 and previous config saved to /var/cache/conftool/dbconfig/20240426-185335-ladsgroup.json
[19:02:18] <wikibugs>	 (03PS1) 10Jforrester: Fix for encoded characters in resource attribute [extensions/TimedMediaHandler] (wmf/1.43.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1024714 (https://phabricator.wikimedia.org/T363550)
[19:03:52] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:08:42] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T352010)', diff saved to https://phabricator.wikimedia.org/P61259 and previous config saved to /var/cache/conftool/dbconfig/20240426-190842-ladsgroup.json
[19:08:45] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance
[19:08:59] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance
[19:09:00] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance
[19:09:03] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance
[19:09:10] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2164 (T352010)', diff saved to https://phabricator.wikimedia.org/P61260 and previous config saved to /var/cache/conftool/dbconfig/20240426-190909-ladsgroup.json
[19:09:10] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[19:09:32] <mutante>	 !log LDAP - added linafaridwmde to groups wmde and nda (T362959)
[19:10:54] <wikibugs>	 (03PS1) 10Herron: istio_slos: add secondary recording rules [puppet] - 10https://gerrit.wikimedia.org/r/1024790 (https://phabricator.wikimedia.org/T359879)
[19:11:12] <wikibugs>	 06SRE, 10LDAP-Access-Requests, 06WMF-NDA-Requests: Grant Access to NDA for lina.farid - https://phabricator.wikimedia.org/T362959#9749633 (10Dzahn) 19:09 < mutante> !log LDAP - added linafaridwmde to groups wmde and nda (T362959)
[19:11:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:11:48] <stashbot>	 T362959: Grant Access to NDA for lina.farid - https://phabricator.wikimedia.org/T362959
[19:14:23] <wikibugs>	 (03CR) 10Herron: "its a bit crude but should be workable for the purpose of evaluating the updated rule metrics without history" [puppet] - 10https://gerrit.wikimedia.org/r/1024790 (https://phabricator.wikimedia.org/T359879) (owner: 10Herron)
[19:14:30] <wikibugs>	 (03PS2) 10Jforrester: wikifunctions: Enable wasmedge resource limits in evaluators [deployment-charts] - 10https://gerrit.wikimedia.org/r/1024767 (owner: 10Cory Massaro)
[19:16:27] <jinxer-wm>	 (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[19:20:26] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:43:17] <wikibugs>	 06SRE, 10observability, 10SRE Observability (FY2023/2024-Q4): Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414#9749681 (10andrea.denisse)
[20:01:17] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013#9749712 (10QChris)
[20:41:59] <wikibugs>	 (03PS1) 10Eevans: cassandra: add (faux) password for cassandra-devel user [labs/private] - 10https://gerrit.wikimedia.org/r/1024805 (https://phabricator.wikimedia.org/T355730)
[20:43:47] <wikibugs>	 (03PS9) 10Eevans: cassandra-dev: surrogate user for cqlsh (dev access) [puppet] - 10https://gerrit.wikimedia.org/r/1016899 (https://phabricator.wikimedia.org/T355730)
[20:44:19] <wikibugs>	 (03PS1) 10Andrea Denisse: wmnet: Add discovery entries for grafana and grafana-next [dns] - 10https://gerrit.wikimedia.org/r/1024806 (https://phabricator.wikimedia.org/T360414)
[20:44:41] <wikibugs>	 (03CR) 10Eevans: [V:03+2 C:03+2] cassandra: add (faux) password for cassandra-devel user [labs/private] - 10https://gerrit.wikimedia.org/r/1024805 (https://phabricator.wikimedia.org/T355730) (owner: 10Eevans)
[20:47:22] <wikibugs>	 (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1016899 (https://phabricator.wikimedia.org/T355730) (owner: 10Eevans)
[20:54:32] <wikibugs>	 (03PS2) 10Andrea Denisse: wmnet: Add discovery entries for grafana and grafana-next [dns] - 10https://gerrit.wikimedia.org/r/1024806 (https://phabricator.wikimedia.org/T360414)
[20:56:17] <wikibugs>	 (03PS1) 10Andrea Denisse: trafficserver: Add discovery entries for grafana and grafana-next [puppet] - 10https://gerrit.wikimedia.org/r/1024808 (https://phabricator.wikimedia.org/T360414)
[20:58:18] <wikibugs>	 (03PS10) 10Eevans: cassandra-dev: surrogate user for cqlsh (dev access) [puppet] - 10https://gerrit.wikimedia.org/r/1016899 (https://phabricator.wikimedia.org/T355730)
[20:59:12] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] "lgtm, would be like other misc services without geoip/LVS do it" [dns] - 10https://gerrit.wikimedia.org/r/1024806 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse)
[21:01:05] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host lists2001.wikimedia.org with OS bullseye
[21:01:11] <wikibugs>	 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Reimage physical lists hosts to have public IPs - https://phabricator.wikimedia.org/T363572#9749849 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host lists2001.wikimedia.org with OS bullseye
[21:03:15] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 37.27% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:04:01] <wikibugs>	 (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1016899 (https://phabricator.wikimedia.org/T355730) (owner: 10Eevans)
[21:13:15] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 39.17% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:14:08] <wikibugs>	 (03PS11) 10Eevans: cassandra-dev: surrogate user for cqlsh (dev access) [puppet] - 10https://gerrit.wikimedia.org/r/1016899 (https://phabricator.wikimedia.org/T355730)
[21:18:17] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lists2001.wikimedia.org with reason: host reimage
[21:19:28] <wikibugs>	 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Reimage physical lists hosts to have public IPs - https://phabricator.wikimedia.org/T363572#9749896 (10Dzahn) The attempt with bookworm started by Eoghan was stuck at the partitioning step in the Debian installer with "No root file system is defined...
[21:20:40] <wikibugs>	 (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1016899 (https://phabricator.wikimedia.org/T355730) (owner: 10Eevans)
[21:21:41] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lists2001.wikimedia.org with reason: host reimage
[21:24:57] <logmsgbot>	 !log amastilovic@deploy1002 Started deploy [airflow-dags/analytics@33b39d9]: (no justification provided)
[21:25:26] <logmsgbot>	 !log amastilovic@deploy1002 Finished deploy [airflow-dags/analytics@33b39d9]: (no justification provided) (duration: 00m 28s)
[21:27:43] <wikibugs>	 (03CR) 10Eevans: [C:03+2] cassandra-dev: surrogate user for cqlsh (dev access) [puppet] - 10https://gerrit.wikimedia.org/r/1016899 (https://phabricator.wikimedia.org/T355730) (owner: 10Eevans)
[21:30:16] <wikibugs>	 (03CR) 10Dzahn: "lgtm, let's deploy next week though, not Friday, and CCing Clement because it's deployment_server" [puppet] - 10https://gerrit.wikimedia.org/r/1024479 (https://phabricator.wikimedia.org/T363519) (owner: 10Ahmon Dancy)
[21:30:44] <wikibugs>	 (03CR) 10Ahmon Dancy: "Works for me. Thanks Daniel!" [puppet] - 10https://gerrit.wikimedia.org/r/1024479 (https://phabricator.wikimedia.org/T363519) (owner: 10Ahmon Dancy)
[21:37:30] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - dzahn@cumin2002"
[21:37:35] <wikibugs>	 (03CR) 10Dzahn: "I might have some nitpicks but I am happy if it gets merged like that and we can just follow-up. I would like it though if we do add a des" [puppet] - 10https://gerrit.wikimedia.org/r/1024615 (owner: 10Muehlenhoff)
[21:38:43] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - dzahn@cumin2002"
[21:38:45] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lists2001.wikimedia.org with OS bullseye
[21:38:57] <wikibugs>	 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Reimage physical lists hosts to have public IPs - https://phabricator.wikimedia.org/T363572#9749967 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host lists2001.wikimedia.org with OS bullseye completed: -...
[21:39:26] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] miscweb: Enable profile::auto_restarts::service for rsync [puppet] - 10https://gerrit.wikimedia.org/r/1024640 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[21:42:27] <wikibugs>	 (03CR) 10Dzahn: Automate quarterly Phabricator data for WMF QLS (1 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1024348 (https://phabricator.wikimedia.org/T362804) (owner: 10Aklapper)
[21:43:13] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host lists2001.wikimedia.org with OS bookworm
[21:43:25] <wikibugs>	 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Reimage physical lists hosts to have public IPs - https://phabricator.wikimedia.org/T363572#9749978 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host lists2001.wikimedia.org with OS bookworm
[21:51:15] <wikibugs>	 (03PS1) 10Eevans: cassandra-dev: ensure directory exists before adding files [puppet] - 10https://gerrit.wikimedia.org/r/1024811 (https://phabricator.wikimedia.org/T355730)
[21:53:08] <wikibugs>	 (03PS1) 10Dzahn: phabricator::logmail: parameterize sender adddress for stats mails [puppet] - 10https://gerrit.wikimedia.org/r/1024812
[21:53:35] <wikibugs>	 (03PS2) 10Dzahn: phabricator::logmail: parameterize sender adddress for stats mails [puppet] - 10https://gerrit.wikimedia.org/r/1024812
[21:54:36] <wikibugs>	 (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1024811 (https://phabricator.wikimedia.org/T355730) (owner: 10Eevans)
[21:56:45] <wikibugs>	 (03CR) 10Eevans: [C:03+2] cassandra-dev: ensure directory exists before adding files [puppet] - 10https://gerrit.wikimedia.org/r/1024811 (https://phabricator.wikimedia.org/T355730) (owner: 10Eevans)
[22:04:27] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lists2001.wikimedia.org with reason: host reimage
[22:04:54] <wikibugs>	 (03PS1) 10Kimberly Sarabia: Introduce sample overrides to web_ui_actions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024813 (https://phabricator.wikimedia.org/T361962)
[22:07:46] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lists2001.wikimedia.org with reason: host reimage
[22:18:29] <wikibugs>	 (03PS1) 10Ayounsi: magru: add momentum/novacore peer IPs/AS [homer/public] - 10https://gerrit.wikimedia.org/r/1024815 (https://phabricator.wikimedia.org/T362421)
[22:22:01] <wikibugs>	 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Reimage physical lists hosts to have public IPs - https://phabricator.wikimedia.org/T363572#9750051 (10Dzahn) Somehow it worked on the next attempt with bookworm as well. It must have been a fluke. Host is up now with bookworm, no config change to b...
[22:24:44] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lists2001.wikimedia.org with OS bookworm
[22:24:52] <wikibugs>	 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Reimage physical lists hosts to have public IPs - https://phabricator.wikimedia.org/T363572#9750055 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host lists2001.wikimedia.org with OS bookworm completed: -...
[22:25:26] <wikibugs>	 (03CR) 10Dzahn: [C:04-1] "Ok, yea, agreed, let's keep it at least until after the switch for now." [puppet] - 10https://gerrit.wikimedia.org/r/1020955 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn)
[22:27:28] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T352010)', diff saved to https://phabricator.wikimedia.org/P61261 and previous config saved to /var/cache/conftool/dbconfig/20240426-222728-ladsgroup.json
[22:27:38] <wikibugs>	 (03CR) 10Dzahn: [C:04-1] "wow, thanks for all the detail!" [puppet] - 10https://gerrit.wikimedia.org/r/1020344 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn)
[22:27:54] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[22:42:37] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P61262 and previous config saved to /var/cache/conftool/dbconfig/20240426-224235-ladsgroup.json
[22:44:21] <wikibugs>	 (03CR) 10Dzahn: Automate quarterly Phabricator data for WMF QLS (1 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1024348 (https://phabricator.wikimedia.org/T362804) (owner: 10Aklapper)
[22:44:51] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] magru: add momentum/novacore peer IPs/AS [homer/public] - 10https://gerrit.wikimedia.org/r/1024815 (https://phabricator.wikimedia.org/T362421) (owner: 10Ayounsi)
[22:50:59] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for YLiou_WMF (no server access) - https://phabricator.wikimedia.org/T363514#9750079 (10Dzahn) 05Open→03In progress p:05Triage→03Medium
[22:51:11] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for YLiou_WMF (no server access) - https://phabricator.wikimedia.org/T363514#9750081 (10Dzahn) a:03YLiou_WMF
[22:52:57] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for YLiou_WMF (no server access) - https://phabricator.wikimedia.org/T363514#9750084 (10Dzahn) a:05YLiou_WMF→03Miriam I think L3 might not be needed if this isn't shell access.  We do need the manager approval though please.
[22:55:15] <wikibugs>	 06SRE, 06serviceops, 13Patch-For-Review: upgrade deployment servers to bullseye / add bullseye support to puppet role - https://phabricator.wikimedia.org/T363415#9750086 (10Dzahn) 05Open→03In progress a:03Dzahn
[22:57:46] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P61263 and previous config saved to /var/cache/conftool/dbconfig/20240426-225744-ladsgroup.json
[23:00:31] <wikibugs>	 (03Merged) 10jenkins-bot: magru: add momentum/novacore peer IPs/AS [homer/public] - 10https://gerrit.wikimedia.org/r/1024815 (https://phabricator.wikimedia.org/T362421) (owner: 10Ayounsi)
[23:03:52] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:12:53] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T352010)', diff saved to https://phabricator.wikimedia.org/P61264 and previous config saved to /var/cache/conftool/dbconfig/20240426-231252-ladsgroup.json
[23:12:55] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2188.codfw.wmnet with reason: Maintenance
[23:13:09] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2188.codfw.wmnet with reason: Maintenance
[23:13:17] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[23:13:18] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2188 (T352010)', diff saved to https://phabricator.wikimedia.org/P61265 and previous config saved to /var/cache/conftool/dbconfig/20240426-231316-ladsgroup.json
[23:16:27] <jinxer-wm>	 (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[23:20:26] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:35:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 800.6ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[23:38:35] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1024747
[23:38:35] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1024747 (owner: 10TrainBranchBot)
[23:45:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 827.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[23:58:12] <wikibugs>	 (03PS1) 10Eevans: cassandra_dev: rename surrogate user [puppet] - 10https://gerrit.wikimedia.org/r/1024820 (https://phabricator.wikimedia.org/T355730)
[23:59:18] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1024747 (owner: 10TrainBranchBot)