[00:05:26] (SystemdUnitFailed) firing: (3) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:09:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-wikifunctions (k8s) 2.096s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:14:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-wikifunctions (k8s) 2.096s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:23:45] (WidespreadPuppetFailure) firing: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [00:30:44] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1023541 (owner: TrainBranchBot) [00:33:45] (WidespreadPuppetFailure) resolved: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [00:40:26] (SystemdUnitFailed) firing: (4) docker-reporter-base-images.service on build2001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:49:11] SRE-swift-storage, Commons, MediaWiki-Engineering, MediaWiki-File-management: Uploads fail due to 401 error from swift on wednesdays - https://phabricator.wikimedia.org/T358830#9746962 (tstarling) Open→Resolved [00:50:36] SRE-swift-storage, Commons, MediaWiki-Engineering, MediaWiki-File-management: Uploads fail due to 401 error from swift on wednesdays - https://phabricator.wikimedia.org/T358830#9746966 (tstarling) [00:52:55] SRE-swift-storage, Commons, MediaWiki-Engineering, MediaWiki-File-management: Uploads fail due to 401 error from swift on wednesdays - https://phabricator.wikimedia.org/T358830#9746971 (tstarling) [01:06:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T352010)', diff saved to https://phabricator.wikimedia.org/P61231 and previous config saved to /var/cache/conftool/dbconfig/20240426-010628-ladsgroup.json [01:06:53] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [01:07:30] (ProbeDown) firing: Service wdqs1018:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1018:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:12:30] (ProbeDown) resolved: Service wdqs1018:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1018:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:12:48] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv4: Active - HE 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [01:20:00] PROBLEM - Host cloudweb2002-dev is DOWN: PING CRITICAL - Packet loss = 100% [01:21:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P61232 and previous config saved to /var/cache/conftool/dbconfig/20240426-012135-ladsgroup.json [01:21:46] RECOVERY - Host cloudweb2002-dev is UP: PING OK - Packet loss = 0%, RTA = 30.31 ms [01:25:10] (PS1) Andrew Bogott: Revert "labtesthorizon: advance to 2024-04-25-225100-dev" [puppet] - https://gerrit.wikimedia.org/r/1024527 [01:26:15] (CR) Andrew Bogott: [C:+2] Revert "labtesthorizon: advance to 2024-04-25-225100-dev" [puppet] - https://gerrit.wikimedia.org/r/1024527 (owner: Andrew Bogott) [01:36:24] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:36:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P61233 and previous config saved to /var/cache/conftool/dbconfig/20240426-013642-ladsgroup.json [01:39:26] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:46:28] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.7 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:48:25] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-redis-exporter@6380.service on netbox2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:49:30] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:51:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T352010)', diff saved to https://phabricator.wikimedia.org/P61234 and previous config saved to /var/cache/conftool/dbconfig/20240426-015149-ladsgroup.json [01:51:52] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2174.codfw.wmnet with reason: Maintenance [01:52:05] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2174.codfw.wmnet with reason: Maintenance [01:52:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2174 (T352010)', diff saved to https://phabricator.wikimedia.org/P61235 and previous config saved to /var/cache/conftool/dbconfig/20240426-015212-ladsgroup.json [01:52:15] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [01:52:30] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:53:32] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:03:56] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 130 probes of 736 (alerts on 90) - https://atlas.ripe.net/measurements/59935539/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts 
https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [02:06:34] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:11:34] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:13:56] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 42 probes of 736 (alerts on 90) - https://atlas.ripe.net/measurements/59935539/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [02:14:38] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:18:25] (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-redis-exporter@6380.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:20:26] (SystemdUnitFailed) firing: (5) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:21:16] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [02:21:38] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from 
remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:26:40] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:34:46] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:48:52] (JobUnavailable) firing: Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:05:26] (SystemdUnitFailed) firing: (5) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:16:27] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [03:24:33] (KubernetesCalicoDown) firing: parse1002.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=parse1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [04:00:48] (CR) Ryan Kemper: [C:+1] elasticsearch: Configure alerts for short-lived certs [puppet] - https://gerrit.wikimedia.org/r/1024481 (https://phabricator.wikimedia.org/T360439) (owner: Bking) 
[04:21:28] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:23:08] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [04:23:30] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:25:26] (SystemdUnitFailed) firing: (4) docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:27:30] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:28:51] (PS1) Andrew Bogott: codfw1dev networktests.yaml: whitespace fix [puppet] - https://gerrit.wikimedia.org/r/1024538 [04:29:40] (CR) Andrew Bogott: [C:+2] codfw1dev networktests.yaml: whitespace fix [puppet] - https://gerrit.wikimedia.org/r/1024538 (owner: Andrew Bogott) [04:31:30] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:34:36] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:42:34] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:43:38] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:47:38] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:56:44] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:58:34] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 115 probes of 738 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [05:03:32] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 34 probes of 738 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [05:15:12] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:15:26] (SystemdUnitFailed) firing: (5) docker-reporter-base-images.service on build2001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:15:50] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 121, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:20:26] (SystemdUnitFailed) firing: (5) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:21:16] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [05:22:50] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:23:08] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [05:24:56] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:37:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T352010)', diff saved to https://phabricator.wikimedia.org/P61236 and previous config saved to /var/cache/conftool/dbconfig/20240426-053756-ladsgroup.json [05:38:13] T352010: Gradually drop old pagelinks columns - 
https://phabricator.wikimedia.org/T352010 [05:38:25] (SystemdUnitFailed) resolved: (2) wmf_auto_restart_prometheus-redis-exporter@6380.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:53:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P61237 and previous config saved to /var/cache/conftool/dbconfig/20240426-055303-ladsgroup.json [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240426T0600) [06:03:11] (PS1) Muehlenhoff: Only enable auto vopsbot restart on active alert host [puppet] - https://gerrit.wikimedia.org/r/1024539 [06:04:21] (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:06:49] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1024539 (owner: Muehlenhoff) [06:08:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P61238 and previous config saved to /var/cache/conftool/dbconfig/20240426-060810-ladsgroup.json [06:09:21] (PoolcounterFullQueues) resolved: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:19:35] (PS1) 
Muehlenhoff: Druid: overlord/coordinator: New options for using firewall::service [puppet] - https://gerrit.wikimedia.org/r/1024541 [06:20:02] (CR) Muehlenhoff: [C:+2] Only enable auto vopsbot restart on active alert host [puppet] - https://gerrit.wikimedia.org/r/1024539 (owner: Muehlenhoff) [06:21:57] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1024541 (owner: Muehlenhoff) [06:23:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T352010)', diff saved to https://phabricator.wikimedia.org/P61239 and previous config saved to /var/cache/conftool/dbconfig/20240426-062317-ladsgroup.json [06:23:20] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance [06:23:33] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance [06:23:36] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [06:23:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2163 (T352010)', diff saved to https://phabricator.wikimedia.org/P61240 and previous config saved to /var/cache/conftool/dbconfig/20240426-062340-ladsgroup.json [06:24:08] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [06:25:26] (SystemdUnitFailed) firing: (4) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:28:51] (CR) Gmodena: [C:+2] eventstreams: 23 April 2024 update of ruwiki redacted pages [deployment-charts] - 
https://gerrit.wikimedia.org/r/1023465 (https://phabricator.wikimedia.org/T354456) (owner: Htriedman) [06:29:47] (Merged) jenkins-bot: eventstreams: 23 April 2024 update of ruwiki redacted pages [deployment-charts] - https://gerrit.wikimedia.org/r/1023465 (https://phabricator.wikimedia.org/T354456) (owner: Htriedman) [06:42:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2155 (re)pooling @ 5%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61241 and previous config saved to /var/cache/conftool/dbconfig/20240426-064220-arnaudb.json [06:48:52] (JobUnavailable) firing: Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:57:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2155 (re)pooling @ 10%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61242 and previous config saved to /var/cache/conftool/dbconfig/20240426-065726-arnaudb.json [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240426T0700) [07:01:12] !log gmodena@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams: apply [07:01:23] !log gmodena@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [07:01:59] !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/eventstreams: apply [07:02:33] !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply [07:05:27] !log gmodena@deploy1002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply [07:05:58] !log gmodena@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply [07:08:16] !log Restarting CI Jenkins [07:08:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:32] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:12:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2155 (re)pooling @ 25%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61243 and previous config saved to /var/cache/conftool/dbconfig/20240426-071231-arnaudb.json [07:13:12] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:13:16] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 213, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:14:34] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:16:27] (RoutinatorRsyncErrors) 
firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [07:18:47] !log jayme@cumin1002 START - Cookbook sre.ganeti.makevm for new host kubestagemaster2003.codfw.wmnet [07:18:49] !log jayme@cumin1002 START - Cookbook sre.dns.netbox [07:19:21] ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T363551 (phaultfinder) NEW [07:20:26] (SystemdUnitFailed) firing: (5) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:21:04] (PS1) JMeybohm: kubestagemaster2003: Add as insetup::serviceops [puppet] - https://gerrit.wikimedia.org/r/1024543 (https://phabricator.wikimedia.org/T363310) [07:21:33] !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM kubestagemaster2003.codfw.wmnet - jayme@cumin1002" [07:22:16] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:22:39] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1024543 (https://phabricator.wikimedia.org/T363310) (owner: JMeybohm) [07:24:08] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:24:29] (CR) JMeybohm: [C:+2] kubestagemaster2003: Add as insetup::serviceops [puppet] - https://gerrit.wikimedia.org/r/1024543 
(https://phabricator.wikimedia.org/T363310) (owner: JMeybohm) [07:24:33] (KubernetesCalicoDown) firing: parse1002.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=parse1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [07:27:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2155 (re)pooling @ 50%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61244 and previous config saved to /var/cache/conftool/dbconfig/20240426-072737-arnaudb.json [07:28:11] !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM kubestagemaster2003.codfw.wmnet - jayme@cumin1002" [07:28:11] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:28:11] !log jayme@cumin1002 START - Cookbook sre.dns.wipe-cache kubestagemaster2003.codfw.wmnet on all recursors [07:28:14] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) kubestagemaster2003.codfw.wmnet on all recursors [07:28:42] !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM kubestagemaster2003.codfw.wmnet - jayme@cumin1002" [07:29:34] !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM kubestagemaster2003.codfw.wmnet - jayme@cumin1002" [07:30:11] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kubestagemaster2003.codfw.wmnet with OS bullseye [07:30:29] SRE, Infrastructure-Foundations, Prod-Kubernetes, serviceops, and 3 others: Site: codfw 1 VM request for staging-codfw kube-apiserver - https://phabricator.wikimedia.org/T363310#9747135 
(ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestagemast... [07:42:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2155 (re)pooling @ 75%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61245 and previous config saved to /var/cache/conftool/dbconfig/20240426-074243-arnaudb.json [07:45:31] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestagemaster2003.codfw.wmnet with reason: host reimage [07:48:44] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestagemaster2003.codfw.wmnet with reason: host reimage [07:57:49] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2155 (re)pooling @ 100%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61246 and previous config saved to /var/cache/conftool/dbconfig/20240426-075748-arnaudb.json [08:11:46] (PS1) Muehlenhoff: debdeploy-restarts: Discard lsof stderr output [debs/debdeploy] - https://gerrit.wikimedia.org/r/1024589 [08:12:16] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:15:26] (SystemdUnitFailed) firing: (4) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:18:23] !log depooled mw2391.codfw.wmnet for etcd benchmark [08:18:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:42] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestagemaster2003.codfw.wmnet with OS bullseye [08:33:42] !log jayme@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host kubestagemaster2003.codfw.wmnet [08:33:53] 
SRE, Infrastructure-Foundations, Prod-Kubernetes, serviceops, and 3 others: Site: codfw 1 VM request for staging-codfw kube-apiserver - https://phabricator.wikimedia.org/T363310#9747162 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestagemaster20... [08:34:01] !log Restarted Gerrit replica [08:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:46] (CR) Slyngshede: debdeploy-restarts: Discard lsof stderr output (1 comment) [debs/debdeploy] - https://gerrit.wikimedia.org/r/1024589 (owner: Muehlenhoff) [08:41:20] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:41:46] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to new version [08:42:24] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:51:28] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:53:28] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:55:20] (PS2) Muehlenhoff: debdeploy-restarts: Don't resolve user names in lsof [debs/debdeploy] - https://gerrit.wikimedia.org/r/1024589 [08:56:24] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 
with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:56:29] (03CR) 10Muehlenhoff: debdeploy-restarts: Don't resolve user names in lsof (031 comment) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/1024589 (owner: 10Muehlenhoff) [08:57:02] !log Restarted Gerrit [08:57:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:28] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:03:27] (03CR) 10Slyngshede: [C:03+1] "LGTM" [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/1024589 (owner: 10Muehlenhoff) [09:04:41] (03CR) 10Muehlenhoff: elasticsearch: Configure alerts for short-lived certs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1024481 (https://phabricator.wikimedia.org/T360439) (owner: 10Bking) [09:06:40] (03Restored) 10DCausse: cirrus: Shift autocomplete traffic to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024478 (https://phabricator.wikimedia.org/T363516) (owner: 10Ebernhardson) [09:08:52] (03CR) 10Alexandros Kosiaris: [C:04-1] "Couple of minor fixes suggested, otherwise LGTM." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1016471 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [09:13:05] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] debdeploy-restarts: Don't resolve user names in lsof [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/1024589 (owner: 10Muehlenhoff) [09:13:16] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:16:14] (03PS1) 10Muehlenhoff: Bump changelog for new debdeploy release [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/1024607 [09:16:17] 06SRE, 10CirrusSearch, 03Discovery-Search (Current work), 13Patch-For-Review: Many search suggestions missing when connecting to eqiad, but not when connecting to codfw - https://phabricator.wikimedia.org/T363516#9747222 (10dcausse) p:05Triage→03Unbreak! This is still happening, raising to UBN [09:19:40] (03CR) 10Marco Fossati: [C:03+1] "Roger that, will do." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1014660 (https://phabricator.wikimedia.org/T360531) (owner: 10Btullis) [09:20:26] (SystemdUnitFailed) firing: (4) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:22:01] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1021948 (owner: 10EoghanGaffney) [09:22:27] (03CR) 10Muehlenhoff: [C:03+2] Bump changelog for new debdeploy release [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/1024607 (owner: 10Muehlenhoff) [09:25:08] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:25:26] (SystemdUnitFailed) firing: (5) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:26:54] (03CR) 10DCausse: [C:03+1] cirrus: Shift autocomplete traffic to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024478 (https://phabricator.wikimedia.org/T363516) (owner: 10Ebernhardson) [09:27:32] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:30:34] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:34:59] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy1002 using 
scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024478 (https://phabricator.wikimedia.org/T363516) (owner: 10Ebernhardson) [09:36:02] (03Merged) 10jenkins-bot: cirrus: Shift autocomplete traffic to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024478 (https://phabricator.wikimedia.org/T363516) (owner: 10Ebernhardson) [09:36:24] !log dcausse@deploy1002 Started scap: Backport for [[gerrit:1024478|cirrus: Shift autocomplete traffic to codfw (T363516)]] [09:36:56] T363516: Many search suggestions missing when connecting to eqiad, but not when connecting to codfw - https://phabricator.wikimedia.org/T363516 [09:37:06] (03CR) 10Brouberol: Create the MPIC Kubernetes chart (039 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021494 (https://phabricator.wikimedia.org/T361343) (owner: 10Santiago Faci) [09:40:15] !log jayme@cumin1002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubestagemaster2003.codfw.wmnet to plain [09:40:58] !log jayme@cumin1002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of kubestagemaster2003.codfw.wmnet to plain [09:41:08] !log dcausse@deploy1002 dcausse and ebernhardson: Backport for [[gerrit:1024478|cirrus: Shift autocomplete traffic to codfw (T363516)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:41:30] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:41:57] T363516: Many search suggestions missing when connecting to eqiad, but not when connecting to codfw - https://phabricator.wikimedia.org/T363516 [09:42:28] !log dcausse@deploy1002 dcausse and ebernhardson: Continuing with sync [09:42:34] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:47:32] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:47:50] !log repooled mw2391.codfw.wmnet [09:47:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:10] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1009 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [09:48:44] PROBLEM - Check whether ferm is active by checking the default input chain on mw1491 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [09:49:15] (03PS1) 10Brouberol: global_config: add analytics mariadb/postgresql instances [puppet] - 10https://gerrit.wikimedia.org/r/1024610 [09:49:26] PROBLEM - Check whether ferm is active by checking the default input chain on mw1349 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [09:49:32] PROBLEM - Check whether ferm is active by checking the default input chain on mw1361 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [09:49:34] PROBLEM - Check whether ferm is active by checking the default input chain on mw1483 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [09:50:32] PROBLEM - Check whether ferm is active by checking the default input chain on mw1382 is CRITICAL: ERROR 
ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [09:50:42] !log joal@deploy1002 Started deploy [airflow-dags/analytics@e57ae00]: Deploy of Analytics airflow dags for browser-metrics [airflow-dags/analytics@e57ae006] [09:50:49] (03PS2) 10Brouberol: global_config: add analytics mariadb/postgresql instances [puppet] - 10https://gerrit.wikimedia.org/r/1024610 (https://phabricator.wikimedia.org/T331894) [09:51:10] !log joal@deploy1002 Finished deploy [airflow-dags/analytics@e57ae00]: Deploy of Analytics airflow dags for browser-metrics [airflow-dags/analytics@e57ae006] (duration: 00m 27s) [09:51:42] PROBLEM - Check whether ferm is active by checking the default input chain on mw1357 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [09:52:38] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:53:40] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:53:46] (03CR) 10CI reject: [V:04-1] global_config: add analytics mariadb/postgresql instances [puppet] - 10https://gerrit.wikimedia.org/r/1024610 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [09:54:21] !log dcausse@deploy1002 Finished scap: Backport for [[gerrit:1024478|cirrus: Shift autocomplete traffic to codfw (T363516)]] (duration: 17m 57s) [09:54:35] (03PS1) 10Btullis: Update the ownership of the aqs cassandra cluster [puppet] - 10https://gerrit.wikimedia.org/r/1024611 (https://phabricator.wikimedia.org/T361645) [09:54:44] T363516: Many search 
suggestions missing when connecting to eqiad, but not when connecting to codfw - https://phabricator.wikimedia.org/T363516 [09:54:53] 10ops-eqiad, 06SRE: ManagementSSHDown - https://phabricator.wikimedia.org/T363551#9747303 (10hashar) →14Duplicate dup:03T363086 [09:55:17] 10ops-eqiad, 06SRE: ManagementSSHDown parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T363086#9747295 (10hashar) 05Resolved→03Open scap does the docker pull on any of the k8s workers as defined by the `kubernetes-workers` group and parse1002 is in that group: ` deploy1002$ grep -R parse1002 /etc/ds... [09:55:27] 10ops-eqiad, 06SRE: ManagementSSHDown parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T363086#9747301 (10hashar) [09:56:03] (03PS2) 10Btullis: Update the ownership of the aqs cassandra cluster [puppet] - 10https://gerrit.wikimedia.org/r/1024611 (https://phabricator.wikimedia.org/T361645) [09:56:26] 10ops-eqiad, 06SRE: ManagementSSHDown parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T363086#9747304 (10hashar) a:05akosiaris→03None Removing assignee that was automatically set by Phabricator when the task got marked as resolved.
[09:56:31] (03PS3) 10Brouberol: global_config: add analytics mariadb/postgresql instances [puppet] - 10https://gerrit.wikimedia.org/r/1024610 (https://phabricator.wikimedia.org/T361343) [09:57:38] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2141/console" [puppet] - 10https://gerrit.wikimedia.org/r/1024611 (https://phabricator.wikimedia.org/T361645) (owner: 10Btullis) [09:57:45] !log jayme@cumin1002 START - Cookbook sre.ganeti.makevm for new host kubestagemaster2003.codfw.wmnet [09:57:46] !log jayme@cumin1002 START - Cookbook sre.dns.netbox [09:58:43] (03PS3) 10Btullis: Update the ownership of the aqs cassandra cluster [puppet] - 10https://gerrit.wikimedia.org/r/1024611 (https://phabricator.wikimedia.org/T361645) [09:58:47] (03PS4) 10Brouberol: global_config: add analytics mariadb/postgresql instances [puppet] - 10https://gerrit.wikimedia.org/r/1024610 (https://phabricator.wikimedia.org/T361343) [09:59:44] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2140/co" [puppet] - 10https://gerrit.wikimedia.org/r/1024610 (https://phabricator.wikimedia.org/T361343) (owner: 10Brouberol) [09:59:59] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2142/co" [puppet] - 10https://gerrit.wikimedia.org/r/1024611 (https://phabricator.wikimedia.org/T361645) (owner: 10Btullis) [10:00:26] (SystemdUnitFailed) firing: (6) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:00:39] !log jayme@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [10:02:50] !log 
jayme@cumin1002 START - Cookbook sre.dns.wipe-cache kubestagemaster2003.codfw.wmnet on all recursors [10:02:53] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) kubestagemaster2003.codfw.wmnet on all recursors [10:03:19] !log jayme@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=93) for new host kubestagemaster2003.codfw.wmnet [10:03:50] 06SRE, 10CirrusSearch, 03Discovery-Search (Current work), 13Patch-For-Review: Many search suggestions missing when connecting to eqiad, but not when connecting to codfw - https://phabricator.wikimedia.org/T363516#9747331 (10dcausse) p:05Unbreak!→03Medium completion traffic is now served from codfw whic... [10:05:08] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:06:16] !log jayme@cumin1002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubestagemaster2003.codfw.wmnet to plain [10:07:03] !log jayme@cumin1002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of kubestagemaster2003.codfw.wmnet to plain [10:08:06] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kubestagemaster2003.codfw.wmnet with OS bullseye [10:13:00] (03PS1) 10Brouberol: global_config: add elasticearch instances [puppet] - 10https://gerrit.wikimedia.org/r/1024613 (https://phabricator.wikimedia.org/T331894) [10:13:16] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:13:51] (03PS2) 10Brouberol: global_config: add elasticearch instances [puppet] - 10https://gerrit.wikimedia.org/r/1024613 (https://phabricator.wikimedia.org/T331894) [10:14:47] (03PS3) 10Brouberol: global_config: add elasticearch instances [puppet] - 
10https://gerrit.wikimedia.org/r/1024613 (https://phabricator.wikimedia.org/T331894) [10:15:26] (SystemdUnitFailed) firing: (6) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:18:08] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2143/co" [puppet] - 10https://gerrit.wikimedia.org/r/1024613 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [10:18:10] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1009 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:18:44] RECOVERY - Check whether ferm is active by checking the default input chain on mw1491 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:19:26] RECOVERY - Check whether ferm is active by checking the default input chain on mw1349 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:19:32] RECOVERY - Check whether ferm is active by checking the default input chain on mw1361 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:19:34] RECOVERY - Check whether ferm is active by checking the default input chain on mw1483 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:20:32] RECOVERY - Check whether ferm is active by checking the default input chain on mw1382 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:21:05] (03PS1) 10Joal: Absent all report-updater jobs [puppet] - 10https://gerrit.wikimedia.org/r/1024614 
(https://phabricator.wikimedia.org/T307540) [10:21:42] RECOVERY - Check whether ferm is active by checking the default input chain on mw1357 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:22:39] (03PS1) 10Muehlenhoff: Deprecate system::role for Collaboration services (batch two) [puppet] - 10https://gerrit.wikimedia.org/r/1024615 [10:26:31] (03CR) 10Brouberol: [C:03+1] druid::broker: Switch to firewall::service for test_analytics [puppet] - 10https://gerrit.wikimedia.org/r/1024403 (owner: 10Muehlenhoff) [10:26:48] (03CR) 10Brouberol: [C:03+1] druid::broker: Switch public workers to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1024409 (owner: 10Muehlenhoff) [10:27:05] (03CR) 10Brouberol: [C:03+1] druid::broker: Switch analytics workers to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1024410 (owner: 10Muehlenhoff) [10:27:52] (03CR) 10Brouberol: [C:03+1] Druid: overlord/coordinator: New options for using firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1024541 (owner: 10Muehlenhoff) [10:39:17] jelto@cumin1002 jelto: The backup on gitlab2002 is complete, ready to proceed with upgrade. [10:43:19] 06SRE, 10vm-requests: eqiad: 1 VMs requested for ceph cluster administration (cephadm) - https://phabricator.wikimedia.org/T363559 (10BTullis) 03NEW p:05Triage→03High [10:43:32] 06SRE, 10vm-requests: eqiad: 1 VMs requested for ceph cluster administration (cephadm) - https://phabricator.wikimedia.org/T363559#9747453 (10BTullis) [10:46:12] !log jmm@cumin2002 START - Cookbook sre.ganeti.resource-report [10:46:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.resource-report (exit_code=0) [10:47:09] 06SRE, 10vm-requests: eqiad: 1 VMs requested for ceph cluster administration (cephadm) - https://phabricator.wikimedia.org/T363559#9747458 (10MoritzMuehlenhoff) Looks good. 
Best to create it in group D [10:48:52] (JobUnavailable) firing: Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:54:16] !log jayme@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host kubestagemaster2003.codfw.wmnet with OS bullseye [10:55:51] !log jayme@cumin1002 START - Cookbook sre.hosts.decommission for hosts kubestagemaster2003.codfw.wmnet [10:59:42] !log jayme@cumin1002 START - Cookbook sre.dns.netbox [11:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240426T0700) [11:00:04] eoghan, jelto, arnoldokoth, and mutante: Time to do the GitLab version upgrades deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240426T1100). [11:00:56] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2144/co" [puppet] - 10https://gerrit.wikimedia.org/r/1024614 (https://phabricator.wikimedia.org/T307540) (owner: 10Joal) [11:01:28] 06SRE, 10LDAP-Access-Requests, 06WMF-NDA-Requests, 13Patch-For-Review: Grant Access to NDA for lina.farid - https://phabricator.wikimedia.org/T362959#9747504 (10Lina_Farid_WMDE) Thank you all! I cannot access superset yet. Do I need additional permissions for that? When I try to log in to https://idp.w...
[11:01:56] (03CR) 10Btullis: [V:03+1 C:03+2] Absent all report-updater jobs [puppet] - 10https://gerrit.wikimedia.org/r/1024614 (https://phabricator.wikimedia.org/T307540) (owner: 10Joal) [11:02:12] (03PS1) 10Muehlenhoff: Extend cloudnet-codfw1dev Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1024620 [11:02:25] (03CR) 10Btullis: [V:03+1 C:03+2] "This looks good. We will be deploying this today to avoid the job execution on Sunday." [puppet] - 10https://gerrit.wikimedia.org/r/1024614 (https://phabricator.wikimedia.org/T307540) (owner: 10Joal) [11:04:21] RECOVERY - BGP status on cr2-drmrs is OK: BGP OK - up: 110, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:05:53] PROBLEM - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 3498 bytes in 0.130 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [11:06:18] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to new version [11:06:53] RECOVERY - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 104916 bytes in 0.606 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [11:07:00] (ProbeDown) firing: (2) Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:09:14] (03PS1) 10Muehlenhoff: Harmonise analytics Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/1024625 [11:10:07] !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kubestagemaster2003.codfw.wmnet decommissioned, removing all IPs except the asset tag one
- jayme@cumin1002" [11:10:41] !log jelto@cumin1002 START - Cookbook sre.hosts.reboot-single for host gitlab2002.wikimedia.org [11:10:58] !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kubestagemaster2003.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jayme@cumin1002" [11:10:58] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:10:59] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts kubestagemaster2003.codfw.wmnet [11:11:12] 06SRE, 06Infrastructure-Foundations, 10Prod-Kubernetes, 06serviceops, and 3 others: Site: codfw 1 VM request for staging-codfw kube-apiserver - https://phabricator.wikimedia.org/T363310#9747539 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jayme@cumin1002 for hosts: `kubestagemaster20... [11:13:07] PROBLEM - Host gitlab.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [11:13:20] !log jayme@cumin1002 START - Cookbook sre.ganeti.makevm for new host kubestagemaster2003.codfw.wmnet [11:13:21] ^ expected due to host restart [11:13:22] !log jayme@cumin1002 START - Cookbook sre.dns.netbox [11:14:23] RECOVERY - Host gitlab.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 30.32 ms [11:15:49] !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM kubestagemaster2003.codfw.wmnet - jayme@cumin1002" [11:16:27] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [11:16:38] !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records 
for VM kubestagemaster2003.codfw.wmnet - jayme@cumin1002" [11:16:38] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:16:38] !log jayme@cumin1002 START - Cookbook sre.dns.wipe-cache kubestagemaster2003.codfw.wmnet on all recursors [11:16:41] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) kubestagemaster2003.codfw.wmnet on all recursors [11:17:00] !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM kubestagemaster2003.codfw.wmnet - jayme@cumin1002" [11:17:17] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab2002.wikimedia.org [11:17:52] !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM kubestagemaster2003.codfw.wmnet - jayme@cumin1002" [11:18:17] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kubestagemaster2003.codfw.wmnet with OS bullseye [11:18:26] 06SRE, 06Infrastructure-Foundations, 10Prod-Kubernetes, 06serviceops, and 3 others: Site: codfw 1 VM request for staging-codfw kube-apiserver - https://phabricator.wikimedia.org/T363310#9747564 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestagemast... 
[11:18:56] (ProbeDown) resolved: (2) Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:19:44] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T363566 (10phaultfinder) 03NEW [11:20:26] (SystemdUnitFailed) firing: (4) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:21:43] (03PS1) 10Muehlenhoff: arclamp: Enable profile::auto_restarts::service for Redis [puppet] - 10https://gerrit.wikimedia.org/r/1024630 (https://phabricator.wikimedia.org/T135991) [11:24:33] (KubernetesCalicoDown) firing: parse1002.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=parse1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:26:35] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:26:53] RECOVERY - Host parse1002 is UP: PING WARNING - Packet loss = 60%, RTA = 30.27 ms [11:27:37] PROBLEM - SSH on parse1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:27:43] (03PS1) 10Muehlenhoff: Extend access by two weeks [puppet] - 10https://gerrit.wikimedia.org/r/1024634 [11:28:31] !log Deactivating puppet for parse1002 - T363086 [11:28:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:50] 
T363086: ManagementSSHDown parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T363086 [11:28:55] (03PS1) 10Btullis: Add a basic role for ceph:cephsdm [puppet] - 10https://gerrit.wikimedia.org/r/1024635 (https://phabricator.wikimedia.org/T363558) [11:29:14] !log Forcing puppet run on deploy server - T363086 [11:29:15] (03CR) 10Muehlenhoff: [C:03+2] Extend access by two weeks [puppet] - 10https://gerrit.wikimedia.org/r/1024634 (owner: 10Muehlenhoff) [11:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:40] (03PS2) 10Btullis: Add a basic role for ceph:cephadm [puppet] - 10https://gerrit.wikimedia.org/r/1024635 (https://phabricator.wikimedia.org/T363558) [11:29:41] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:29:43] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Remove obsolete dummy certs [labs/private] - 10https://gerrit.wikimedia.org/r/1024421 (https://phabricator.wikimedia.org/T360439) (owner: 10Muehlenhoff) [11:32:25] (03CR) 10Btullis: [C:03+2] Add a basic role for ceph:cephadm [puppet] - 10https://gerrit.wikimedia.org/r/1024635 (https://phabricator.wikimedia.org/T363558) (owner: 10Btullis) [11:32:30] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestagemaster2003.codfw.wmnet with reason: host reimage [11:33:17] PROBLEM - Host parse1002 is DOWN: PING CRITICAL - Packet loss = 100% [11:33:27] !log Forcing puppet run on O:alerting_host - T363086 [11:33:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:50] T363086: ManagementSSHDown parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T363086 [11:34:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T352010)', diff saved to https://phabricator.wikimedia.org/P61247 and previous config saved to 
/var/cache/conftool/dbconfig/20240426-113416-ladsgroup.json [11:34:34] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [11:35:45] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestagemaster2003.codfw.wmnet with reason: host reimage [11:35:56] !log btullis@cumin1002 START - Cookbook sre.ganeti.makevm for new host cephadm1001.eqiad.wmnet [11:35:57] !log btullis@cumin1002 START - Cookbook sre.dns.netbox [11:39:07] !log btullis@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM cephadm1001.eqiad.wmnet - btullis@cumin1002" [11:42:09] !log btullis@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM cephadm1001.eqiad.wmnet - btullis@cumin1002" [11:42:09] !log btullis@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:42:09] !log btullis@cumin1002 START - Cookbook sre.dns.wipe-cache cephadm1001.eqiad.wmnet on all recursors [11:42:13] !log btullis@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cephadm1001.eqiad.wmnet on all recursors [11:42:37] !log btullis@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM cephadm1001.eqiad.wmnet - btullis@cumin1002" [11:43:29] !log btullis@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM cephadm1001.eqiad.wmnet - btullis@cumin1002" [11:43:46] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host cephadm1001.eqiad.wmnet with OS bookworm [11:43:57] 06SRE, 10vm-requests: eqiad: 1 VMs requested for ceph cluster administration (cephadm) - https://phabricator.wikimedia.org/T363559#9747640 (10ops-monitoring-bot) 
Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host cephadm1001.eqiad.wmnet with OS bookworm [11:45:00] (03PS5) 10Santiago Faci: Create the MPIC Kubernetes chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021494 (https://phabricator.wikimedia.org/T361343) [11:46:57] (03CR) 10Santiago Faci: "All changes have been made!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021494 (https://phabricator.wikimedia.org/T361343) (owner: 10Santiago Faci) [11:48:26] (03PS1) 10Muehlenhoff: miscweb: Enable profile::auto_restarts::service for rsync [puppet] - 10https://gerrit.wikimedia.org/r/1024640 (https://phabricator.wikimedia.org/T135991) [11:49:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P61248 and previous config saved to /var/cache/conftool/dbconfig/20240426-114923-ladsgroup.json [11:50:20] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestagemaster2003.codfw.wmnet with OS bullseye [11:50:20] !log jayme@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host kubestagemaster2003.codfw.wmnet [11:50:31] 06SRE, 06Infrastructure-Foundations, 10Prod-Kubernetes, 06serviceops, and 3 others: Site: codfw 1 VM request for staging-codfw kube-apiserver - https://phabricator.wikimedia.org/T363310#9747660 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestagemaster20... 
[11:53:41] !log uploaded debdeploy 0.0.99.14 to apt.wikimedia.org (for buster/bullseye/bookworm) [11:53:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:46] !log Silencing all alerts matching parse1002.* for 4 days - T363086 [11:54:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:14] T363086: ManagementSSHDown parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T363086 [11:54:58] 10ops-eqiad, 06SRE: ManagementSSHDown parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T363086#9747667 (10Clement_Goubert) Downtimed, silence ID e5915daa-08f1-45f6-b805-fee5078d64da [12:01:17] (03CR) 10Brouberol: [C:03+1] "Looks good! Let's implement the helmfile containing both the mpic-next and mpic releases, and let's test this all out!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021494 (https://phabricator.wikimedia.org/T361343) (owner: 10Santiago Faci) [12:02:21] (03PS1) 10Muehlenhoff: cloudweb: Enable profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1024647 (https://phabricator.wikimedia.org/T135991) [12:04:33] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P61249 and previous config saved to /var/cache/conftool/dbconfig/20240426-120431-ladsgroup.json [12:19:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T352010)', diff saved to https://phabricator.wikimedia.org/P61250 and previous config saved to /var/cache/conftool/dbconfig/20240426-121939-ladsgroup.json [12:19:42] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2176.codfw.wmnet with reason: Maintenance [12:19:45] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2176.codfw.wmnet with reason: Maintenance [12:19:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2176 
(T352010)', diff saved to https://phabricator.wikimedia.org/P61251 and previous config saved to /var/cache/conftool/dbconfig/20240426-121951-ladsgroup.json [12:20:03] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [12:21:15] (03PS1) 10Elukey: Add overrides for Cassandra settings to restbase2021 [puppet] - 10https://gerrit.wikimedia.org/r/1024649 (https://phabricator.wikimedia.org/T352647) [12:22:51] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2145/co" [puppet] - 10https://gerrit.wikimedia.org/r/1024649 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [12:23:46] (03PS2) 10Elukey: Add overrides for Cassandra settings to restbase2021 [puppet] - 10https://gerrit.wikimedia.org/r/1024649 (https://phabricator.wikimedia.org/T352647) [12:23:46] (03PS1) 10Elukey: role::cassandra_dev: clean-up after PKI TLS certs rollout [puppet] - 10https://gerrit.wikimedia.org/r/1024650 (https://phabricator.wikimedia.org/T352647) [12:25:09] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2146/co" [puppet] - 10https://gerrit.wikimedia.org/r/1024650 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [12:25:53] (03CR) 10Elukey: [C:03+1] Disable boostrap mode on all k8s etcd clusters [puppet] - 10https://gerrit.wikimedia.org/r/1024395 (owner: 10JMeybohm) [12:26:42] !log btullis@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cephadm1001.eqiad.wmnet with OS bookworm [12:26:43] !log btullis@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host cephadm1001.eqiad.wmnet [12:26:48] 06SRE, 10vm-requests: eqiad: 1 VMs requested for ceph cluster administration (cephadm) - https://phabricator.wikimedia.org/T363559#9747726 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host cephadm1001.eqiad.wmnet with OS bookworm executed with errors: - cephadm10... [12:27:04] (03PS1) 10Muehlenhoff: Deprecate system::role for Traffic services [puppet] - 10https://gerrit.wikimedia.org/r/1024651 [12:28:32] (03CR) 10Elukey: [C:03+2] admin_ng: move Istio configs to mw-api-int-ro for ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021490 (https://phabricator.wikimedia.org/T353622) (owner: 10Elukey) [12:28:44] (03CR) 10Elukey: [C:03+2] kserve-inference: allow transparent proxy mode for revscoring isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021981 (https://phabricator.wikimedia.org/T353622) (owner: 10Elukey) [12:29:18] (03CR) 10Alexandros Kosiaris: [C:04-1] "-1ing to signal that this isn't ready to go." [puppet] - 10https://gerrit.wikimedia.org/r/1024400 (https://phabricator.wikimedia.org/T363399) (owner: 10Alexandros Kosiaris) [12:33:12] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [12:33:43] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[12:33:43] (03PS2) 10Elukey: ml-services: remove WIKI_URL from revscoring isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023475 (https://phabricator.wikimedia.org/T353622) [12:34:15] (03PS3) 10Elukey: ml-services: remove WIKI_URL from revscoring isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023475 (https://phabricator.wikimedia.org/T353622) [12:34:31] (03PS4) 10Elukey: ml-services: remove WIKI_URL from revscoring isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023475 (https://phabricator.wikimedia.org/T353622) [12:34:41] (03PS5) 10Elukey: ml-services: remove WIKI_URL from revscoring isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023475 (https://phabricator.wikimedia.org/T353622) [12:37:49] (03PS1) 10Elukey: admin_ng: fix MW API's Service Entry for ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1024653 [12:41:09] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Reimage physical lists hosts to have public IPs - https://phabricator.wikimedia.org/T363572 (10eoghan) 03NEW [12:41:33] (03CR) 10Elukey: [C:03+2] admin_ng: fix MW API's Service Entry for ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1024653 (owner: 10Elukey) [12:44:29] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [12:44:51] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[12:45:42] (03PS1) 10EoghanGaffney: mailman: Take ownership of lists hosts [puppet] - 10https://gerrit.wikimedia.org/r/1024655 (https://phabricator.wikimedia.org/T331706) [12:46:13] (03PS2) 10EoghanGaffney: mailman: Change ownership of lists hosts to sre-collab [puppet] - 10https://gerrit.wikimedia.org/r/1024655 (https://phabricator.wikimedia.org/T331706) [12:47:06] (03PS1) 10Muehlenhoff: sre.ganeti.makevm: Default to Puppet 7 for new VMs [cookbooks] - 10https://gerrit.wikimedia.org/r/1024656 [12:47:16] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:48:20] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:48:38] (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: remove WIKI_URL from revscoring isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023475 (https://phabricator.wikimedia.org/T353622) (owner: 10Elukey) [12:49:12] (03CR) 10Elukey: [C:03+2] ml-services: remove WIKI_URL from revscoring isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023475 (https://phabricator.wikimedia.org/T353622) (owner: 10Elukey) [12:52:53] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [12:54:36] (03PS3) 10Awight: Revert temporary monitoring for scraper [puppet] - 10https://gerrit.wikimedia.org/r/1023152 (https://phabricator.wikimedia.org/T362904) [12:55:10] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1024447 (https://phabricator.wikimedia.org/T363415) (owner: 10Dzahn) [12:55:15] (03CR) 10Awight: [C:03+1] "Job is complete, feel free to revert this config!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1023152 (https://phabricator.wikimedia.org/T362904) (owner: 10Awight) [12:56:40] (03PS1) 10Kormat: admin: (kormat) Switch to using set_proxy [puppet] - 10https://gerrit.wikimedia.org/r/1024657 [12:56:58] (03CR) 10LSobanski: [C:03+1] mailman: Change ownership of lists hosts to sre-collab [puppet] - 10https://gerrit.wikimedia.org/r/1024655 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [12:58:28] (03CR) 10Klausman: [C:03+1] admin: (kormat) Switch to using set_proxy [puppet] - 10https://gerrit.wikimedia.org/r/1024657 (owner: 10Kormat) [13:01:15] (03CR) 10JHathaway: [C:03+1] sre.ganeti.makevm: Default to Puppet 7 for new VMs [cookbooks] - 10https://gerrit.wikimedia.org/r/1024656 (owner: 10Muehlenhoff) [13:04:44] 06SRE, 10LDAP-Access-Requests, 06WMF-NDA-Requests, 13Patch-For-Review: Grant Access to NDA for lina.farid - https://phabricator.wikimedia.org/T362959#9747863 (10Dzahn) @Lina_Farid_WMDE Something still needs to be merged for that to work. It's in code review. It will be done soon though. 
[13:05:32] (03PS1) 10Elukey: kserve-inference: improve transparent proxy settings for revscoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/1024661 (https://phabricator.wikimedia.org/T353622) [13:05:34] (03PS1) 10Elukey: ml-services: set host for wikidata's isvc in revscoring staging pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1024662 (https://phabricator.wikimedia.org/T353622) [13:06:23] 10ops-codfw, 06SRE, 06Infrastructure-Foundations: Broadcom NICs with recent firmware fail to reimage - https://phabricator.wikimedia.org/T363576 (10cmooney) 03NEW p:05Triage→03Medium [13:07:11] 10ops-codfw, 06SRE, 06Infrastructure-Foundations: Broadcom NICs with recent firmware fail to reimage - https://phabricator.wikimedia.org/T363576#9747884 (10cmooney) [13:07:13] 10ops-codfw, 10ops-eqiad, 06SRE, 10SRE-swift-storage, and 2 others: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9747885 (10cmooney) [13:07:14] 06SRE, 06Infrastructure-Foundations: Broadcom BCM57412 10G NIC and Bullseye installer - https://phabricator.wikimedia.org/T286722#9747887 (10cmooney) [13:09:08] (03PS1) 10Muehlenhoff: aptrepo: Add new repository component and repo sync config for Node 20 [puppet] - 10https://gerrit.wikimedia.org/r/1024663 (https://phabricator.wikimedia.org/T362681) [13:09:21] 10ops-codfw, 06SRE, 06Infrastructure-Foundations: Broadcom NICs with recent firmware fail to reimage - https://phabricator.wikimedia.org/T363576#9747891 (10cmooney) [13:12:43] (03CR) 10Kormat: [C:03+2] admin: (kormat) Switch to using set_proxy [puppet] - 10https://gerrit.wikimedia.org/r/1024657 (owner: 10Kormat) [13:13:08] (03CR) 10Santiago Faci: [C:03+2] "Thanks! Let's merge!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1021494 (https://phabricator.wikimedia.org/T361343) (owner: 10Santiago Faci) [13:13:56] (03Merged) 10jenkins-bot: Create the MPIC Kubernetes chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021494 (https://phabricator.wikimedia.org/T361343) (owner: 10Santiago Faci) [13:14:26] !log eoghan@cumin1002 START - Cookbook sre.hosts.decommission for hosts lists2001.codfw.wmnet [13:17:10] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 42 probes of 798 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:19:53] (03CR) 10Ilias Sarantopoulos: [C:03+1] kserve-inference: improve transparent proxy settings for revscoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/1024661 (https://phabricator.wikimedia.org/T353622) (owner: 10Elukey) [13:21:10] !log eoghan@cumin1002 START - Cookbook sre.dns.netbox [13:21:46] (03CR) 10Elukey: [C:03+2] kserve-inference: improve transparent proxy settings for revscoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/1024661 (https://phabricator.wikimedia.org/T353622) (owner: 10Elukey) [13:22:09] (03CR) 10Elukey: [C:03+2] ml-services: set host for wikidata's isvc in revscoring staging pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1024662 (https://phabricator.wikimedia.org/T353622) (owner: 10Elukey) [13:22:10] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 27 probes of 798 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:23:37] !log eoghan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lists2001.codfw.wmnet decommissioned, removing all IPs 
except the asset tag one - eoghan@cumin1002" [13:25:21] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [13:27:08] (03PS1) 10Muehlenhoff: Extend cloudbackup Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1024686 [13:27:58] !log eoghan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lists2001.codfw.wmnet decommissioned, removing all IPs except the asset tag one - eoghan@cumin1002" [13:27:59] !log eoghan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:27:59] !log eoghan@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts lists2001.codfw.wmnet [13:28:05] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Reimage physical lists hosts to have public IPs - https://phabricator.wikimedia.org/T363572#9747919 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by eoghan@cumin1002 for hosts: `lists2001.codfw.wmnet` - lists2001.codfw.wmnet (**PA... 
[13:28:06] !log akosiaris@cumin1002 conftool action : set/pooled=no; selector: name=elastic110[3-7]\.eqiad\.wmnet [13:30:31] (03CR) 10Ayounsi: [C:03+2] magru: update edgeuno transit IP [homer/public] - 10https://gerrit.wikimedia.org/r/1024516 (https://phabricator.wikimedia.org/T362421) (owner: 10Ayounsi) [13:31:20] (03Merged) 10jenkins-bot: magru: update edgeuno transit IP [homer/public] - 10https://gerrit.wikimedia.org/r/1024516 (https://phabricator.wikimedia.org/T362421) (owner: 10Ayounsi) [13:33:18] 10ops-codfw, 06SRE, 06Infrastructure-Foundations: Broadcom NICs with recent firmware fail to reimage - https://phabricator.wikimedia.org/T363576#9747947 (10cmooney) [13:35:32] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 92 probes of 733 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:36:01] (03PS1) 10Muehlenhoff: Add cephadm globbing to partition selection [puppet] - 10https://gerrit.wikimedia.org/r/1024689 (https://phabricator.wikimedia.org/T363559) [13:37:28] (03CR) 10Btullis: [C:03+2] Add cephadm globbing to partition selection [puppet] - 10https://gerrit.wikimedia.org/r/1024689 (https://phabricator.wikimedia.org/T363559) (owner: 10Muehlenhoff) [13:44:48] (03PS1) 10Muehlenhoff: netbox-standalone: Enable profile::auto_restarts::service for Redis [puppet] - 10https://gerrit.wikimedia.org/r/1024690 (https://phabricator.wikimedia.org/T135991) [13:45:20] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host cephadm1001.eqiad.wmnet with OS bookworm [13:45:30] 06SRE, 10vm-requests, 13Patch-For-Review: eqiad: 1 VMs requested for ceph cluster administration (cephadm) - https://phabricator.wikimedia.org/T363559#9747965 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host cephadm1001.eqiad.wmnet with OS bookworm [13:45:32] 
RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 84 probes of 733 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:47:42] !log jayme@cumin1002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubestagemaster2003.codfw.wmnet to plain [13:48:14] !log jayme@cumin1002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of kubestagemaster2003.codfw.wmnet to plain [13:50:55] (03PS1) 10Elukey: role::sessionstore: upgrade the Cassandra's Truststore [puppet] - 10https://gerrit.wikimedia.org/r/1024691 (https://phabricator.wikimedia.org/T352647) [13:52:32] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 102 probes of 733 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:52:58] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2147/co" [puppet] - 10https://gerrit.wikimedia.org/r/1024691 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [13:56:34] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:57:28] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cephadm1001.eqiad.wmnet with reason: host reimage [14:00:36] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:00:53] (03CR) 10Eevans: [C:03+1] 
"LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1024649 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [14:01:22] (03PS1) 10Muehlenhoff: kerberos: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1024694 [14:02:01] 10ops-eqiad, 06SRE, 10Cassandra: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T362841#9748017 (10Eevans) The first device is done rebuilding: `lang=sh-session eevans@aqs1014:~$ sudo mdadm --detail /dev/md2 /dev/md2: Version : 1.2 Creation Time : Tue Mar 9 14:18:06 2021... [14:02:42] !log eevans@cumin1002 START - Cookbook sre.hosts.reboot-single for host aqs1014.eqiad.wmnet [14:02:47] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cephadm1001.eqiad.wmnet with reason: host reimage [14:03:05] 10ops-eqiad, 06SRE, 10Cassandra: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9748019 (10ops-monitoring-bot) Host rebooted by eevans@cumin1002 with reason: None [14:04:41] (03PS1) 10Btullis: Use an LVM volume for /var/lib/ceph on cephosd nodes [puppet] - 10https://gerrit.wikimedia.org/r/1024695 (https://phabricator.wikimedia.org/T324660) [14:05:46] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1024694 (owner: 10Muehlenhoff) [14:08:52] (ProbeDown) firing: (4) Service aqs1014-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:10:37] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aqs1014.eqiad.wmnet [14:10:40] ACKNOWLEDGEMENT - MD RAID on aqs1014 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T363580 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [14:10:45] 10ops-eqiad, 06SRE: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T363580 (10ops-monitoring-bot) 03NEW [14:13:52] (ProbeDown) resolved: (4) Service aqs1014-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:15:52] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cephadm1001.eqiad.wmnet with OS bookworm [14:16:01] 06SRE, 10vm-requests, 13Patch-For-Review: eqiad: 1 VMs requested for ceph cluster administration (cephadm) - https://phabricator.wikimedia.org/T363559#9748045 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host cephadm1001.eqiad.wmnet with OS bookworm completed:... [14:16:24] (03CR) 10Eevans: [C:03+1] role::cassandra_dev: clean-up after PKI TLS certs rollout [puppet] - 10https://gerrit.wikimedia.org/r/1024650 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [14:19:22] (03CR) 10Elukey: [V:03+1 C:03+2] role::cassandra_dev: clean-up after PKI TLS certs rollout [puppet] - 10https://gerrit.wikimedia.org/r/1024650 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [14:19:25] (SystemdUnitFailed) firing: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:21:51] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:22:55] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 
103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:30:31] (03CR) 10JHathaway: [C:03+1] kerberos: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1024694 (owner: 10Muehlenhoff) [14:34:54] (03CR) 10Elukey: [C:03+2] Add overrides for Cassandra settings to restbase2021 [puppet] - 10https://gerrit.wikimedia.org/r/1024649 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [14:38:44] 06SRE, 06Infrastructure-Foundations, 10netops: Cloud IPv6 subnets - https://phabricator.wikimedia.org/T187929#9748100 (10cmooney) There are a few elements here to consider: ######Existing cloud-hosts private IPv6 ranges The existing cloud-hosts vlans, in the WMF production realm, have IPs from the wider WM... [14:38:52] (JobUnavailable) firing: (2) Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:56] !log elukey@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase2021.codfw.wmnet: Move to PKI TLS certs - elukey@cumin1002 [14:40:41] (03PS1) 10Ilias Sarantopoulos: ml-services: update revertrisk image to support all wikis [deployment-charts] - 10https://gerrit.wikimedia.org/r/1024699 (https://phabricator.wikimedia.org/T363203) [14:41:23] 06SRE, 10vm-requests, 13Patch-For-Review: eqiad: 1 VMs requested for ceph cluster administration (cephadm) - https://phabricator.wikimedia.org/T363559#9748105 (10BTullis) 05Open→03Resolved [14:45:28] (03PS2) 10Elukey: role::sessionstore: upgrade the Cassandra's Truststore [puppet] - 10https://gerrit.wikimedia.org/r/1024691 (https://phabricator.wikimedia.org/T352647) [14:47:38] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: Migrate Mailman/lists to 
Bullseye/Bookworm - https://phabricator.wikimedia.org/T331706#9748109 (10eoghan) a:05jhathaway→03eoghan [14:48:14] !log elukey@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase2021.codfw.wmnet: Move to PKI TLS certs - elukey@cumin1002 [14:48:57] PROBLEM - cassandra-c SSL 10.192.16.155:7000 on restbase2021 is CRITICAL: SSL CRITICAL - failed to verify restbase2021-c against restbase2021-c.codfw.wmnet, cassandra, restbase2021.codfw.wmnet:Certificate restbase2021-c.codfw.wmnet valid until 2024-05-24 14:32:00 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [14:48:57] PROBLEM - cassandra-b SSL 10.192.16.154:7000 on restbase2021 is CRITICAL: SSL CRITICAL - failed to verify restbase2021-b against restbase2021-b.codfw.wmnet, cassandra, restbase2021.codfw.wmnet:Certificate restbase2021-b.codfw.wmnet valid until 2024-05-24 14:32:00 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [14:48:57] PROBLEM - cassandra-a SSL 10.192.16.153:7000 on restbase2021 is CRITICAL: SSL CRITICAL - failed to verify restbase2021-a against restbase2021-a.codfw.wmnet, cassandra, restbase2021.codfw.wmnet:Certificate restbase2021-a.codfw.wmnet valid until 2024-05-24 14:32:00 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [14:49:11] (03PS1) 10Btullis: Start switching cephosd servers to cephadm management [puppet] - 10https://gerrit.wikimedia.org/r/1024702 (https://phabricator.wikimedia.org/T363558) [14:49:25] (SystemdUnitFailed) resolved: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:54:08] (03PS2) 10Btullis: Start switching cephosd servers to cephadm 
management [puppet] - 10https://gerrit.wikimedia.org/r/1024702 (https://phabricator.wikimedia.org/T363558) [14:55:15] (03CR) 10Hnowlan: [C:03+1] wmnet: add CNAME records for commons-impact-analytics (k8s ingress) [dns] - 10https://gerrit.wikimedia.org/r/1023964 (https://phabricator.wikimedia.org/T361835) (owner: 10Scott French) [14:55:28] (03CR) 10Hnowlan: [C:03+1] service: add commons-impact-analytics AQS 2.0 service [puppet] - 10https://gerrit.wikimedia.org/r/1023961 (https://phabricator.wikimedia.org/T361835) (owner: 10Scott French) [14:56:57] (03CR) 10AikoChou: ml-services: update revertrisk image to support all wikis (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1024699 (https://phabricator.wikimedia.org/T363203) (owner: 10Ilias Sarantopoulos) [14:57:34] (03PS1) 10Btullis: Add docker engine to the ceph::cephadm role [puppet] - 10https://gerrit.wikimedia.org/r/1024703 (https://phabricator.wikimedia.org/T363558) [14:58:53] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2149/co" [puppet] - 10https://gerrit.wikimedia.org/r/1024702 (https://phabricator.wikimedia.org/T363558) (owner: 10Btullis) [15:03:59] (JobUnavailable) firing: (2) Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:04:42] (03CR) 10Elukey: "Hey Ben, I saw the code change passing by, got interested. If you don't mind me asking, was this use case discussed with Service Ops? 
IIRC" [puppet] - 10https://gerrit.wikimedia.org/r/1024703 (https://phabricator.wikimedia.org/T363558) (owner: 10Btullis) [15:06:42] (03PS3) 10EoghanGaffney: mailman: Change ownership of lists hosts to sre-collab and rename [puppet] - 10https://gerrit.wikimedia.org/r/1024655 (https://phabricator.wikimedia.org/T331706) [15:07:34] !log jayme@cumin1002 START - Cookbook sre.hosts.decommission for hosts kubestagemaster2003.codfw.wmnet [15:10:22] (03CR) 10Btullis: [C:04-2] "Thanks elukey. No, I haven't discussed it yet. I'll do so now." [puppet] - 10https://gerrit.wikimedia.org/r/1024703 (https://phabricator.wikimedia.org/T363558) (owner: 10Btullis) [15:11:01] (03CR) 10Btullis: [V:03+1 C:04-2] "Setting to -2 while the use-case for docker is discussed." [puppet] - 10https://gerrit.wikimedia.org/r/1024702 (https://phabricator.wikimedia.org/T363558) (owner: 10Btullis) [15:12:07] !log jayme@cumin1002 START - Cookbook sre.dns.netbox [15:12:41] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 50 probes of 733 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:13:43] (03PS2) 10Ilias Sarantopoulos: ml-services: update revertrisk image to support all wikis [deployment-charts] - 10https://gerrit.wikimedia.org/r/1024699 (https://phabricator.wikimedia.org/T363203) [15:14:04] (03CR) 10Ilias Sarantopoulos: ml-services: update revertrisk image to support all wikis (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1024699 (https://phabricator.wikimedia.org/T363203) (owner: 10Ilias Sarantopoulos) [15:14:50] !log bking@cumin2002 conftool action : set/weight=20:pooled=yes; selector: name=elastic1103\.eqiad\.wmnet [15:16:11] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.3 with snmp version 2 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:16:27] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [15:16:31] (03CR) 10LSobanski: [C:03+1] mailman: Change ownership of lists hosts to sre-collab and rename [puppet] - 10https://gerrit.wikimedia.org/r/1024655 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [15:17:13] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:17:48] (03CR) 10AikoChou: [C:03+1] ml-services: update revertrisk image to support all wikis [deployment-charts] - 10https://gerrit.wikimedia.org/r/1024699 (https://phabricator.wikimedia.org/T363203) (owner: 10Ilias Sarantopoulos) [15:19:07] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [15:19:14] !log cmooney@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host lists2001 [15:20:26] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:20:26] !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host lists2001 [15:21:02] 06SRE, 06serviceops: Container Image policy for non-k8s uses - https://phabricator.wikimedia.org/T357441#9748210 (10BTullis) I won't reopen this ticket, but I would like to draw your collective attention to this ticket, if I may: {T363558} The use-case is very similar to that discussed here, but the questi... 
[15:22:51] !log Downtiming the Prometheus PoP hosts as part of the cergen to CFSSL migration - T360414 [15:22:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:57] !log jayme@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kubestagemaster2003.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jayme@cumin1002" [15:23:06] T360414: Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414 [15:24:36] !log denisse@cumin2002 START - Cookbook sre.hosts.downtime for 0:30:00 on prometheus6002.drmrs.wmnet,prometheus5002.eqsin.wmnet,prometheus3003.esams.wmnet,prometheus4002.ulsfo.wmnet with reason: Downtiming the Prometheus PoP hosts part of the cergen to CFSSL migration - T360414 [15:24:46] !log Disabling Puppet on the Prometheus PoP hosts as part of the cergen to CFSSL migration - T360414 [15:24:55] !log denisse@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on prometheus6002.drmrs.wmnet,prometheus5002.eqsin.wmnet,prometheus3003.esams.wmnet,prometheus4002.ulsfo.wmnet with reason: Downtiming the Prometheus PoP hosts part of the cergen to CFSSL migration - T360414 [15:25:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:20] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: update revertrisk image to support all wikis [deployment-charts] - 10https://gerrit.wikimedia.org/r/1024699 (https://phabricator.wikimedia.org/T363203) (owner: 10Ilias Sarantopoulos) [15:25:39] !log jayme@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kubestagemaster2003.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jayme@cumin1002" [15:25:39] !log jayme@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:25:40] !log jayme@cumin1002 END (PASS) - 
Cookbook sre.hosts.decommission (exit_code=0) for hosts kubestagemaster2003.codfw.wmnet [15:25:51] 06SRE, 06Infrastructure-Foundations, 10Prod-Kubernetes, 06serviceops, and 3 others: Site: codfw 1 VM request for staging-codfw kube-apiserver - https://phabricator.wikimedia.org/T363310#9748260 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jayme@cumin1002 for hosts: `kubestagemaster20... [15:25:57] (03PS1) 10JHathaway: postfix: mx-{in,out} test data [labs/private] - 10https://gerrit.wikimedia.org/r/1024726 [15:26:15] (03Merged) 10jenkins-bot: ml-services: update revertrisk image to support all wikis [deployment-charts] - 10https://gerrit.wikimedia.org/r/1024699 (https://phabricator.wikimedia.org/T363203) (owner: 10Ilias Sarantopoulos) [15:26:57] (03CR) 10Andrea Denisse: [V:03+1 C:03+2] prometheus: Ensure TLS certificates are provided by CFSSL [puppet] - 10https://gerrit.wikimedia.org/r/1023917 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [15:27:13] (03CR) 10Andrea Denisse: [V:03+1 C:03+2] prometheus: Ensure TLS certificates are provided by CFSSL (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1023917 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [15:27:24] (03PS2) 10JHathaway: postfix: mx-{in,out} test data [labs/private] - 10https://gerrit.wikimedia.org/r/1024726 (https://phabricator.wikimedia.org/T325398) [15:28:41] (03CR) 10Eevans: [C:03+1] "LGTM; Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1024611 (https://phabricator.wikimedia.org/T361645) (owner: 10Btullis) [15:28:47] !log testing patch #1023917 on prometheus6002 [15:28:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:59] !log testing patch #1023917 on prometheus6002 - T360414 [15:29:11] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 43 probes of 798 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:29:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:24] T360414: Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414 [15:31:41] !log bking@cumin2002 conftool action : set/weight=20:pooled=yes; selector: name=elastic1104\.eqiad\.wmnet [15:31:50] (03CR) 10EoghanGaffney: [C:03+2] mailman: Change ownership of lists hosts to sre-collab and rename [puppet] - 10https://gerrit.wikimedia.org/r/1024655 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [15:33:39] (03CR) 10Elukey: "I see that we already have https://phabricator.wikimedia.org/T357441 so it should be enough for the moment, please go ahead if you want :)" [puppet] - 10https://gerrit.wikimedia.org/r/1024702 (https://phabricator.wikimedia.org/T363558) (owner: 10Btullis) [15:34:10] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 28 probes of 798 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:34:51] !log eoghan@cumin1002 START - Cookbook sre.dns.netbox [15:36:13] !log eoghan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:36:25] (03PS1) 10JHathaway: postfix: add some recommended hardening settings [puppet] - 
10https://gerrit.wikimedia.org/r/1024729 (https://phabricator.wikimedia.org/T325398) [15:36:44] (03CR) 10JHathaway: [C:03+2] postfix: mx-{in,out} test data [labs/private] - 10https://gerrit.wikimedia.org/r/1024726 (https://phabricator.wikimedia.org/T325398) (owner: 10JHathaway) [15:36:50] (03CR) 10JHathaway: [V:03+2 C:03+2] postfix: mx-{in,out} test data [labs/private] - 10https://gerrit.wikimedia.org/r/1024726 (https://phabricator.wikimedia.org/T325398) (owner: 10JHathaway) [15:41:48] (03PS1) 10Hnowlan: Enable async upload-by-URL on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024731 (https://phabricator.wikimedia.org/T295007) [15:42:30] (03CR) 10CI reject: [V:04-1] Enable async upload-by-URL on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024731 (https://phabricator.wikimedia.org/T295007) (owner: 10Hnowlan) [15:43:03] !log bking@cumin2002 conftool action : set/weight=20:pooled=yes; selector: name=elastic1105\.eqiad\.wmnet [15:43:21] (03PS2) 10Hnowlan: Enable async upload-by-URL on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024731 (https://phabricator.wikimedia.org/T295007) [15:43:27] (03CR) 10Ladsgroup: [C:04-1] "This is beta cluster, enable it everywhere." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024731 (https://phabricator.wikimedia.org/T295007) (owner: 10Hnowlan) [15:44:03] (03CR) 10Ladsgroup: "I'm not sure we even have testwiki in beta cluster, that's like the inception movie." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024731 (https://phabricator.wikimedia.org/T295007) (owner: 10Hnowlan) [15:46:20] !log Enabling Puppet on the Prometheus PoP hosts as part of the cergen to CFSSL migration - T360414 [15:46:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:44] T360414: Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414 [15:52:50] (03CR) 10Ladsgroup: [C:03+1] "To test what? How broken beta cluster is? 
lol" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024731 (https://phabricator.wikimedia.org/T295007) (owner: 10Hnowlan) [15:53:32] !log eoghan@cumin1002 START - Cookbook sre.hosts.provision for host lists2001.mgmt.codfw.wmnet with reboot policy FORCED [15:55:35] !log depool ncredir6001 [15:55:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:11] !log bking@cumin2002 conftool action : set/pooled=no; selector: name=elastic1103\.eqiad\.wmnet [15:56:20] !log bking@cumin2002 conftool action : set/pooled=no; selector: name=elastic1104\.eqiad\.wmnet [15:56:33] (03PS3) 10Elukey: role::sessionstore: upgrade the Cassandra's Truststore [puppet] - 10https://gerrit.wikimedia.org/r/1024691 (https://phabricator.wikimedia.org/T352647) [15:56:34] (03PS1) 10Elukey: role::restbase::production: move Cassandra codfw instances to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1024736 (https://phabricator.wikimedia.org/T352647) [15:56:35] (03PS1) 10Elukey: role::restbase::production: move eqiad Cassandra instances to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1024737 (https://phabricator.wikimedia.org/T352647) [15:58:19] (03PS1) 10Elukey: role::restbase::production: cleanup after PKI migration [puppet] - 10https://gerrit.wikimedia.org/r/1024738 (https://phabricator.wikimedia.org/T352647) [15:59:12] 06SRE, 06Infrastructure-Foundations, 10Mail: Provision mx-out - https://phabricator.wikimedia.org/T325407#9748391 (10jhathaway) [16:00:04] (03PS1) 10JHathaway: postfix: mx_out role [puppet] - 10https://gerrit.wikimedia.org/r/1024739 (https://phabricator.wikimedia.org/T325398) [16:00:06] (03PS1) 10JHathaway: postfix: take mx_out boxes out of insetup [puppet] - 10https://gerrit.wikimedia.org/r/1024740 (https://phabricator.wikimedia.org/T325407) [16:00:08] 06SRE, 06Infrastructure-Foundations, 10Mail: Provision mx-out - https://phabricator.wikimedia.org/T325407#9748389 (10jhathaway) [16:00:08] (03PS1) 10JHathaway: postfix: mx-out hiera data 
[puppet] - 10https://gerrit.wikimedia.org/r/1024741 (https://phabricator.wikimedia.org/T325407) [16:00:17] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1024736 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [16:00:26] (03CR) 10CI reject: [V:04-1] postfix: mx_out role [puppet] - 10https://gerrit.wikimedia.org/r/1024739 (https://phabricator.wikimedia.org/T325398) (owner: 10JHathaway) [16:02:19] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1024737 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [16:02:34] (03PS2) 10JHathaway: postfix: mx_out role [puppet] - 10https://gerrit.wikimedia.org/r/1024739 (https://phabricator.wikimedia.org/T325398) [16:02:35] (03PS2) 10JHathaway: postfix: take mx_out boxes out of insetup [puppet] - 10https://gerrit.wikimedia.org/r/1024740 (https://phabricator.wikimedia.org/T325407) [16:02:35] (03PS2) 10JHathaway: postfix: mx-out hiera data [puppet] - 10https://gerrit.wikimedia.org/r/1024741 (https://phabricator.wikimedia.org/T325407) [16:02:55] (03CR) 10CI reject: [V:04-1] postfix: mx_out role [puppet] - 10https://gerrit.wikimedia.org/r/1024739 (https://phabricator.wikimedia.org/T325398) (owner: 10JHathaway) [16:06:17] !log repool ncredir6001 [16:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:30] (03PS2) 10Btullis: Use an LVM volume for /var/lib/ceph on cephosd nodes [puppet] - 10https://gerrit.wikimedia.org/r/1024695 (https://phabricator.wikimedia.org/T324660) [16:08:30] (03PS3) 10Btullis: Start switching cephosd servers to cephadm management [puppet] - 10https://gerrit.wikimedia.org/r/1024702 (https://phabricator.wikimedia.org/T363558) 
[16:10:45] (03PS1) 10Btullis: Update the DPE ceph cluster to reef [puppet] - 10https://gerrit.wikimedia.org/r/1024742 (https://phabricator.wikimedia.org/T362993) [16:11:22] (03PS2) 10Btullis: Update the DPE ceph cluster to reef [puppet] - 10https://gerrit.wikimedia.org/r/1024742 (https://phabricator.wikimedia.org/T362993) [16:12:19] (03PS1) 10Elukey: ml-services: fix wikidata host header for ml-staging's revscoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/1024743 [16:13:11] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2153/co" [puppet] - 10https://gerrit.wikimedia.org/r/1024742 (https://phabricator.wikimedia.org/T362993) (owner: 10Btullis) [16:13:20] (03CR) 10Elukey: [C:03+2] ml-services: fix wikidata host header for ml-staging's revscoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/1024743 (owner: 10Elukey) [16:16:12] (03CR) 10Btullis: [V:03+1 C:03+2] Update the ownership of the aqs cassandra cluster [puppet] - 10https://gerrit.wikimedia.org/r/1024611 (https://phabricator.wikimedia.org/T361645) (owner: 10Btullis) [16:16:48] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [16:17:17] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [16:17:22] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:17:40] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[16:18:06] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [16:18:37] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [16:18:42] (03PS3) 10JHathaway: postfix: mx_out role [puppet] - 10https://gerrit.wikimedia.org/r/1024739 (https://phabricator.wikimedia.org/T325398) [16:18:42] (03PS3) 10JHathaway: postfix: take mx_out boxes out of insetup [puppet] - 10https://gerrit.wikimedia.org/r/1024740 (https://phabricator.wikimedia.org/T325407) [16:18:42] (03PS3) 10JHathaway: postfix: mx-out hiera data [puppet] - 10https://gerrit.wikimedia.org/r/1024741 (https://phabricator.wikimedia.org/T325407) [16:18:57] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [16:20:06] !log eoghan@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host lists2001.mgmt.codfw.wmnet with reboot policy FORCED [16:20:53] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . 
[16:22:42] !log eoghan@cumin1002 START - Cookbook sre.hosts.reimage for host lists2001.wikimedia.org with OS bookworm [16:22:54] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Reimage physical lists hosts to have public IPs - https://phabricator.wikimedia.org/T363572#9748515 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eoghan@cumin1002 for host lists2001.wikimedia.org with OS bookworm [16:23:28] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:27:00] (03PS1) 10Andrea Denisse: ssl: Delete dummy TLS key for the Prometheus hosts [labs/private] - 10https://gerrit.wikimedia.org/r/1024712 (https://phabricator.wikimedia.org/T360414) [16:30:09] !log Delete the unused Prometheus PoP TLS certificates in the private repository as part of the cergen to CFSSL migration - T360414 [16:30:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:39] T360414: Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414 [16:30:56] (03CR) 10Andrea Denisse: [C:03+2] ssl: Delete dummy TLS key for the Prometheus hosts [labs/private] - 10https://gerrit.wikimedia.org/r/1024712 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [16:30:58] (03CR) 10Andrea Denisse: [V:03+2 C:03+2] ssl: Delete dummy TLS key for the Prometheus hosts [labs/private] - 10https://gerrit.wikimedia.org/r/1024712 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [16:31:32] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:32:43] !log bking@cumin2002 conftool action : set/pooled=no; selector: name=elastic1105\.eqiad\.wmnet 
[16:32:59] !log bking@cumin2002 conftool action : set/weight=20:pooled=yes; selector: name=elastic1106\.eqiad\.wmnet [16:35:36] !log bking@cumin2002 conftool action : set/pooled=no; selector: name=elastic1106\.eqiad\.wmnet [16:35:48] !log bking@cumin2002 conftool action : set/weight=20:pooled=yes; selector: name=elastic1107\.eqiad\.wmnet [16:38:42] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:42:44] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:44:36] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Reimage physical lists hosts to have public IPs - https://phabricator.wikimedia.org/T363572#9748606 (10LSobanski) p:05Triage→03High [16:44:48] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:48:39] (03PS1) 10Cory Massaro: Enable wasmedge resource limits in Wikifunctions production services. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1024767 [16:53:29] (03CR) 10BCornwall: [C:03+2] admin: add Lina Farid to LDAP_only (nda) [puppet] - 10https://gerrit.wikimedia.org/r/1024449 (https://phabricator.wikimedia.org/T362959) (owner: 10BCornwall) [16:57:52] (03PS1) 10Btullis: Fix the cephosd server reimages [puppet] - 10https://gerrit.wikimedia.org/r/1024773 (https://phabricator.wikimedia.org/T362993) [16:57:58] !log bking@cumin2002 conftool action : set/pooled=no; selector: name=elastic1107\.eqiad\.wmnet [16:58:21] (03PS2) 10Btullis: Fix the cephosd server reimages [puppet] - 10https://gerrit.wikimedia.org/r/1024773 (https://phabricator.wikimedia.org/T362993) [17:00:03] (03PS1) 10Cathal Mooney: Add new vlan sub-interfaces to eqiad LVS for racks E5-7 and F5-7 [puppet] - 10https://gerrit.wikimedia.org/r/1024776 (https://phabricator.wikimedia.org/T334230) [17:00:41] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2154/console" [puppet] - 10https://gerrit.wikimedia.org/r/1024773 (https://phabricator.wikimedia.org/T362993) (owner: 10Btullis) [17:00:54] 06SRE, 10LDAP-Access-Requests, 06WMF-NDA-Requests, 13Patch-For-Review: Grant Access to NDA for lina.farid - https://phabricator.wikimedia.org/T362959#9748700 (10BCornwall) 05In progress→03Resolved Thanks for your patience, @Lina_Farid_WMDE. We've merged the code now and you should be able to access... [17:01:07] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9748716 (10BTullis) [17:02:11] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[17:02:12] (03PS2) 10Cathal Mooney: Add new vlan sub-interfaces to eqiad LVS for racks E5-7 and F5-7 [puppet] - 10https://gerrit.wikimedia.org/r/1024776 (https://phabricator.wikimedia.org/T334230) [17:04:14] (03CR) 10Btullis: [C:03+1] "Looks good to me, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1024625 (owner: 10Muehlenhoff) [17:05:04] (03CR) 10Bking: [C:03+1] Add new vlan sub-interfaces to eqiad LVS for racks E5-7 and F5-7 [puppet] - 10https://gerrit.wikimedia.org/r/1024776 (https://phabricator.wikimedia.org/T334230) (owner: 10Cathal Mooney) [17:05:25] (03CR) 10Cathal Mooney: [C:03+2] Add new vlan sub-interfaces to eqiad LVS for racks E5-7 and F5-7 [puppet] - 10https://gerrit.wikimedia.org/r/1024776 (https://phabricator.wikimedia.org/T334230) (owner: 10Cathal Mooney) [17:14:54] !log eoghan@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lists2001.wikimedia.org with OS bookworm [17:15:00] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Reimage physical lists hosts to have public IPs - https://phabricator.wikimedia.org/T363572#9748784 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eoghan@cumin1002 for host lists2001.wikimedia.org with OS bookworm executed wi... 
[17:15:25] (03PS1) 10Cathal Mooney: Add new vlan names to LVS balancer config for eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1024782 (https://phabricator.wikimedia.org/T334230) [17:16:01] (03CR) 10Cathal Mooney: [C:03+2] Add new vlan names to LVS balancer config for eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1024782 (https://phabricator.wikimedia.org/T334230) (owner: 10Cathal Mooney) [17:27:37] !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=elastic110[3-7]\.eqiad\.wmnet [17:44:40] (KubernetesRsyslogDown) firing: rsyslog on mw1356:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1356 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [17:46:46] (03CR) 10BCornwall: purged: add PKI cert handling (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1019866 (owner: 10CDobbins) [17:47:03] !log dancy@deploy1002 Installing scap version "4.80.0" for 325 hosts [17:47:49] !log dancy@deploy1002 Installation of scap version "4.80.0" completed for 325 hosts [17:48:27] !log dancy@deploy1002 Started scap: Testing T325530 [17:48:48] T325530: scap: hide helmfile operations behind a progress bar - https://phabricator.wikimedia.org/T325530 [17:49:40] (KubernetesRsyslogDown) resolved: rsyslog on mw1356:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1356 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [17:50:36] 06SRE, 06Traffic: Migrate purged away from cergen-issued certificate - https://phabricator.wikimedia.org/T360506#9748958 (10CDobbins) CR: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1019866 Description of changes: * Add a feature flag `profile::cache::purged::use_pki` to control whether to use cfssl... 
[17:51:56] PROBLEM - Check whether ferm is active by checking the default input chain on mw1355 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [17:56:16] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2021 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [17:57:42] !log dancy@deploy1002 Finished scap: Testing T325530 (duration: 09m 14s) [17:57:59] T325530: scap: hide helmfile operations behind a progress bar - https://phabricator.wikimedia.org/T325530 [18:04:37] (03CR) 10Ayounsi: [C:03+1] netbox-standalone: Enable profile::auto_restarts::service for Redis [puppet] - 10https://gerrit.wikimedia.org/r/1024690 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [18:08:40] (KubernetesRsyslogDown) firing: rsyslog on mw1356:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1356 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:13:40] (KubernetesRsyslogDown) resolved: rsyslog on mw1356:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=mw1356 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:21:56] RECOVERY - Check whether ferm is active by checking the default input chain on mw1355 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [18:23:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T352010)', diff saved to https://phabricator.wikimedia.org/P61256 and previous config saved to /var/cache/conftool/dbconfig/20240426-182320-ladsgroup.json [18:23:49] 
T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [18:26:16] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2021 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [18:38:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P61257 and previous config saved to /var/cache/conftool/dbconfig/20240426-183827-ladsgroup.json [18:39:15] 06SRE, 10observability, 13Patch-For-Review, 10SRE Observability (FY2023/2024-Q4): Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414#9749374 (10andrea.denisse) [18:53:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P61258 and previous config saved to /var/cache/conftool/dbconfig/20240426-185335-ladsgroup.json [19:02:18] (03PS1) 10Jforrester: Fix for encoded characters in resource attribute [extensions/TimedMediaHandler] (wmf/1.43.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1024714 (https://phabricator.wikimedia.org/T363550) [19:03:52] (JobUnavailable) firing: Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:08:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T352010)', diff saved to https://phabricator.wikimedia.org/P61259 and previous config saved to /var/cache/conftool/dbconfig/20240426-190842-ladsgroup.json [19:08:45] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance [19:08:59] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 
db2164.codfw.wmnet with reason: Maintenance [19:09:00] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [19:09:03] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [19:09:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2164 (T352010)', diff saved to https://phabricator.wikimedia.org/P61260 and previous config saved to /var/cache/conftool/dbconfig/20240426-190909-ladsgroup.json [19:09:10] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [19:09:32] !log LDAP - added linafaridwmde to groups wmde and nda (T362959) [19:10:54] (03PS1) 10Herron: istio_slos: add secondary recording rules [puppet] - 10https://gerrit.wikimedia.org/r/1024790 (https://phabricator.wikimedia.org/T359879) [19:11:12] 06SRE, 10LDAP-Access-Requests, 06WMF-NDA-Requests: Grant Access to NDA for lina.farid - https://phabricator.wikimedia.org/T362959#9749633 (10Dzahn) 19:09 < mutante> !log LDAP - added linafaridwmde to groups wmde and nda (T362959) [19:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:48] T362959: Grant Access to NDA for lina.farid - https://phabricator.wikimedia.org/T362959 [19:14:23] (03CR) 10Herron: "its a bit crude but should be workable for the purpose of evaluating the updated rule metrics without history" [puppet] - 10https://gerrit.wikimedia.org/r/1024790 (https://phabricator.wikimedia.org/T359879) (owner: 10Herron) [19:14:30] (03PS2) 10Jforrester: wikifunctions: Enable wasmedge resource limits in evaluators [deployment-charts] - 10https://gerrit.wikimedia.org/r/1024767 (owner: 10Cory Massaro) [19:16:27] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - 
https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [19:20:26] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:43:17] 06SRE, 10observability, 10SRE Observability (FY2023/2024-Q4): Phase out cergen for Observability services - https://phabricator.wikimedia.org/T360414#9749681 (10andrea.denisse) [20:01:17] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013#9749712 (10QChris) [20:41:59] (03PS1) 10Eevans: cassandra: add (faux) password for cassandra-devel user [labs/private] - 10https://gerrit.wikimedia.org/r/1024805 (https://phabricator.wikimedia.org/T355730) [20:43:47] (03PS9) 10Eevans: cassandra-dev: surrogate user for cqlsh (dev access) [puppet] - 10https://gerrit.wikimedia.org/r/1016899 (https://phabricator.wikimedia.org/T355730) [20:44:19] (03PS1) 10Andrea Denisse: wmnet: Add discovery entries for grafana and grafana-next [dns] - 10https://gerrit.wikimedia.org/r/1024806 (https://phabricator.wikimedia.org/T360414) [20:44:41] (03CR) 10Eevans: [V:03+2 C:03+2] cassandra: add (faux) password for cassandra-devel user [labs/private] - 10https://gerrit.wikimedia.org/r/1024805 (https://phabricator.wikimedia.org/T355730) (owner: 10Eevans) [20:47:22] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1016899 (https://phabricator.wikimedia.org/T355730) (owner: 10Eevans) [20:54:32] (03PS2) 10Andrea Denisse: wmnet: Add discovery entries for grafana and grafana-next [dns] - 10https://gerrit.wikimedia.org/r/1024806 (https://phabricator.wikimedia.org/T360414) [20:56:17] (03PS1) 10Andrea Denisse: trafficserver: Add discovery entries for grafana and grafana-next [puppet] - 10https://gerrit.wikimedia.org/r/1024808 
(https://phabricator.wikimedia.org/T360414) [20:58:18] (03PS10) 10Eevans: cassandra-dev: surrogate user for cqlsh (dev access) [puppet] - 10https://gerrit.wikimedia.org/r/1016899 (https://phabricator.wikimedia.org/T355730) [20:59:12] (03CR) 10Dzahn: [C:03+1] "lgtm, would be like other misc services without geoip/LVS do it" [dns] - 10https://gerrit.wikimedia.org/r/1024806 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [21:01:05] !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host lists2001.wikimedia.org with OS bullseye [21:01:11] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Reimage physical lists hosts to have public IPs - https://phabricator.wikimedia.org/T363572#9749849 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host lists2001.wikimedia.org with OS bullseye [21:03:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 37.27% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:04:01] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1016899 (https://phabricator.wikimedia.org/T355730) (owner: 10Eevans) [21:13:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 39.17% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:14:08] (03PS11) 10Eevans: cassandra-dev: surrogate user for cqlsh (dev access) [puppet] - 10https://gerrit.wikimedia.org/r/1016899 (https://phabricator.wikimedia.org/T355730) [21:18:17] !log 
dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lists2001.wikimedia.org with reason: host reimage [21:19:28] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Reimage physical lists hosts to have public IPs - https://phabricator.wikimedia.org/T363572#9749896 (10Dzahn) The attempt with bookworm started by Eoghan was stuck at the partitioning step in the Debian installer with "No root file system is defined... [21:20:40] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1016899 (https://phabricator.wikimedia.org/T355730) (owner: 10Eevans) [21:21:41] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lists2001.wikimedia.org with reason: host reimage [21:24:57] !log amastilovic@deploy1002 Started deploy [airflow-dags/analytics@33b39d9]: (no justification provided) [21:25:26] !log amastilovic@deploy1002 Finished deploy [airflow-dags/analytics@33b39d9]: (no justification provided) (duration: 00m 28s) [21:27:43] (03CR) 10Eevans: [C:03+2] cassandra-dev: surrogate user for cqlsh (dev access) [puppet] - 10https://gerrit.wikimedia.org/r/1016899 (https://phabricator.wikimedia.org/T355730) (owner: 10Eevans) [21:30:16] (03CR) 10Dzahn: "lgtm, let's deploy next week though, not Friday, and CCing Clement because it's deployment_server" [puppet] - 10https://gerrit.wikimedia.org/r/1024479 (https://phabricator.wikimedia.org/T363519) (owner: 10Ahmon Dancy) [21:30:44] (03CR) 10Ahmon Dancy: "Works for me. Thanks Daniel!" [puppet] - 10https://gerrit.wikimedia.org/r/1024479 (https://phabricator.wikimedia.org/T363519) (owner: 10Ahmon Dancy) [21:37:30] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - dzahn@cumin2002" [21:37:35] (03CR) 10Dzahn: "I might have some nitpicks but I am happy if it gets merged like that and we can just follow-up. 
I would like it though if we do add a des" [puppet] - 10https://gerrit.wikimedia.org/r/1024615 (owner: 10Muehlenhoff) [21:38:43] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - dzahn@cumin2002" [21:38:45] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lists2001.wikimedia.org with OS bullseye [21:38:57] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Reimage physical lists hosts to have public IPs - https://phabricator.wikimedia.org/T363572#9749967 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host lists2001.wikimedia.org with OS bullseye completed: -... [21:39:26] (03CR) 10Dzahn: [C:03+1] miscweb: Enable profile::auto_restarts::service for rsync [puppet] - 10https://gerrit.wikimedia.org/r/1024640 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [21:42:27] (03CR) 10Dzahn: Automate quarterly Phabricator data for WMF QLS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1024348 (https://phabricator.wikimedia.org/T362804) (owner: 10Aklapper) [21:43:13] !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host lists2001.wikimedia.org with OS bookworm [21:43:25] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Reimage physical lists hosts to have public IPs - https://phabricator.wikimedia.org/T363572#9749978 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host lists2001.wikimedia.org with OS bookworm [21:51:15] (03PS1) 10Eevans: cassandra-dev: ensure directory exists before adding files [puppet] - 10https://gerrit.wikimedia.org/r/1024811 (https://phabricator.wikimedia.org/T355730) [21:53:08] (03PS1) 10Dzahn: phabricator::logmail: parameterize sender adddress for stats mails [puppet] - 10https://gerrit.wikimedia.org/r/1024812 [21:53:35] (03PS2) 10Dzahn: 
phabricator::logmail: parameterize sender adddress for stats mails [puppet] - 10https://gerrit.wikimedia.org/r/1024812 [21:54:36] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1024811 (https://phabricator.wikimedia.org/T355730) (owner: 10Eevans) [21:56:45] (03CR) 10Eevans: [C:03+2] cassandra-dev: ensure directory exists before adding files [puppet] - 10https://gerrit.wikimedia.org/r/1024811 (https://phabricator.wikimedia.org/T355730) (owner: 10Eevans) [22:04:27] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lists2001.wikimedia.org with reason: host reimage [22:04:54] (03PS1) 10Kimberly Sarabia: Introduce sample overrides to web_ui_actions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024813 (https://phabricator.wikimedia.org/T361962) [22:07:46] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lists2001.wikimedia.org with reason: host reimage [22:18:29] (03PS1) 10Ayounsi: magru: add momentum/novacore peer IPs/AS [homer/public] - 10https://gerrit.wikimedia.org/r/1024815 (https://phabricator.wikimedia.org/T362421) [22:22:01] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Reimage physical lists hosts to have public IPs - https://phabricator.wikimedia.org/T363572#9750051 (10Dzahn) Somehow it worked on the next attempt with bookworm as well. It must have been a fluke. Host is up now with bookworm, no config change to b... [22:24:44] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lists2001.wikimedia.org with OS bookworm [22:24:52] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Reimage physical lists hosts to have public IPs - https://phabricator.wikimedia.org/T363572#9750055 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host lists2001.wikimedia.org with OS bookworm completed: -... 
[22:25:26] (03CR) 10Dzahn: [C:04-1] "Ok, yea, agreed, let's keep it at least until after the switch for now." [puppet] - 10https://gerrit.wikimedia.org/r/1020955 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [22:27:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T352010)', diff saved to https://phabricator.wikimedia.org/P61261 and previous config saved to /var/cache/conftool/dbconfig/20240426-222728-ladsgroup.json [22:27:38] (03CR) 10Dzahn: [C:04-1] "wow, thanks for all the detail!" [puppet] - 10https://gerrit.wikimedia.org/r/1020344 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [22:27:54] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [22:42:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P61262 and previous config saved to /var/cache/conftool/dbconfig/20240426-224235-ladsgroup.json [22:44:21] (03CR) 10Dzahn: Automate quarterly Phabricator data for WMF QLS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1024348 (https://phabricator.wikimedia.org/T362804) (owner: 10Aklapper) [22:44:51] (03CR) 10Ayounsi: [C:03+2] magru: add momentum/novacore peer IPs/AS [homer/public] - 10https://gerrit.wikimedia.org/r/1024815 (https://phabricator.wikimedia.org/T362421) (owner: 10Ayounsi) [22:50:59] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for YLiou_WMF (no server access) - https://phabricator.wikimedia.org/T363514#9750079 (10Dzahn) 05Open→03In progress p:05Triage→03Medium [22:51:11] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for YLiou_WMF (no server access) - https://phabricator.wikimedia.org/T363514#9750081 (10Dzahn) a:03YLiou_WMF [22:52:57] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for YLiou_WMF (no server access) - 
https://phabricator.wikimedia.org/T363514#9750084 (10Dzahn) a:05YLiou_WMF→03Miriam I think L3 might not be needed if this isn't shell access. We do need the manager approval though please. [22:55:15] 06SRE, 06serviceops, 13Patch-For-Review: upgrade deployment servers to bullseye / add bullseye support to puppet role - https://phabricator.wikimedia.org/T363415#9750086 (10Dzahn) 05Open→03In progress a:03Dzahn [22:57:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P61263 and previous config saved to /var/cache/conftool/dbconfig/20240426-225744-ladsgroup.json [23:00:31] (03Merged) 10jenkins-bot: magru: add momentum/novacore peer IPs/AS [homer/public] - 10https://gerrit.wikimedia.org/r/1024815 (https://phabricator.wikimedia.org/T362421) (owner: 10Ayounsi) [23:03:52] (JobUnavailable) firing: Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:12:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T352010)', diff saved to https://phabricator.wikimedia.org/P61264 and previous config saved to /var/cache/conftool/dbconfig/20240426-231252-ladsgroup.json [23:12:55] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2188.codfw.wmnet with reason: Maintenance [23:13:09] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2188.codfw.wmnet with reason: Maintenance [23:13:17] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [23:13:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2188 (T352010)', diff saved to https://phabricator.wikimedia.org/P61265 and previous config saved to 
/var/cache/conftool/dbconfig/20240426-231316-ladsgroup.json [23:16:27] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [23:20:26] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:35:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 800.6ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:38:35] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1024747 [23:38:35] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1024747 (owner: 10TrainBranchBot) [23:45:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 827.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:58:12] (03PS1) 10Eevans: cassandra_dev: rename surrogate user [puppet] - 10https://gerrit.wikimedia.org/r/1024820 (https://phabricator.wikimedia.org/T355730) [23:59:18] (03Merged) 10jenkins-bot: Branch commit for 
wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1024747 (owner: 10TrainBranchBot)