[00:00:02] (Abandoned) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1179236 (owner: TrainBranchBot) [00:05:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:08:01] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1179262 [00:08:01] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1179262 (owner: TrainBranchBot) [00:12:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P81390 and previous config saved to /var/cache/conftool/dbconfig/20250817-001216-ladsgroup.json [00:15:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:15:54] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:20:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:25:54] RESOLVED: KubernetesAPILatency: High Kubernetes
API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:27:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P81391 and previous config saved to /var/cache/conftool/dbconfig/20250817-002723-ladsgroup.json [00:31:29] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1179262 (owner: TrainBranchBot) [00:33:37] FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:38:37] FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:42:33] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T400854)', diff saved to https://phabricator.wikimedia.org/P81392 and previous config saved to /var/cache/conftool/dbconfig/20250817-004231-ladsgroup.json [00:42:37] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [00:42:48] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2199.codfw.wmnet with reason: Maintenance [00:46:10] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2206.codfw.wmnet with reason: Maintenance [00:46:17] !log ladsgroup@cumin1002 dbctl commit
(dc=all): 'Depooling db2206 (T400854)', diff saved to https://phabricator.wikimedia.org/P81393 and previous config saved to /var/cache/conftool/dbconfig/20250817-004616-ladsgroup.json [00:49:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:50:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206 (T400854)', diff saved to https://phabricator.wikimedia.org/P81394 and previous config saved to /var/cache/conftool/dbconfig/20250817-005002-ladsgroup.json [00:50:07] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [00:59:32] FIRING: [2x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1011:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:00:40] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image [01:04:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:05:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P81395 and previous config saved to /var/cache/conftool/dbconfig/20250817-010510-ladsgroup.json [01:09:32] FIRING: JobUnavailable: 
Reduced availability for job pdu_pro4x in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:09:54] ops-codfw, SRE, DC-Ops: Alert for device ps1-c2-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402043#11091941 (phaultfinder) [01:11:23] FIRING: [2x] GnmiTargetDown: asw1-b3-magru is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [01:12:26] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 11m 46s) [01:15:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:15:54] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [01:20:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P81396 and previous config saved to /var/cache/conftool/dbconfig/20250817-012017-ladsgroup.json [01:35:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206 (T400854)', diff saved to https://phabricator.wikimedia.org/P81397 and previous config saved to /var/cache/conftool/dbconfig/20250817-013525-ladsgroup.json [01:35:30] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production -
https://phabricator.wikimedia.org/T400854 [01:35:30] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2210.codfw.wmnet with reason: Maintenance [01:35:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2210 (T400854)', diff saved to https://phabricator.wikimedia.org/P81398 and previous config saved to /var/cache/conftool/dbconfig/20250817-013537-ladsgroup.json [01:39:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210 (T400854)', diff saved to https://phabricator.wikimedia.org/P81399 and previous config saved to /var/cache/conftool/dbconfig/20250817-013922-ladsgroup.json [01:47:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:53:51] ops-codfw, DC-Ops: Alert for device ps1-b2-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402099 (phaultfinder) NEW [01:54:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P81400 and previous config saved to /var/cache/conftool/dbconfig/20250817-015430-ladsgroup.json [01:57:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:07:44] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 -
https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [02:09:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P81401 and previous config saved to /var/cache/conftool/dbconfig/20250817-020937-ladsgroup.json [02:17:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [02:20:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:24:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210 (T400854)', diff saved to https://phabricator.wikimedia.org/P81402 and previous config saved to /var/cache/conftool/dbconfig/20250817-022445-ladsgroup.json [02:24:50] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [02:25:01] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2219.codfw.wmnet with reason: Maintenance [02:25:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2219 (T400854)', diff saved to https://phabricator.wikimedia.org/P81403 and previous config saved to /var/cache/conftool/dbconfig/20250817-022508-ladsgroup.json [02:25:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST 
secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:28:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T400854)', diff saved to https://phabricator.wikimedia.org/P81404 and previous config saved to /var/cache/conftool/dbconfig/20250817-022851-ladsgroup.json [02:43:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P81405 and previous config saved to /var/cache/conftool/dbconfig/20250817-024359-ladsgroup.json [02:58:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:59:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P81406 and previous config saved to /var/cache/conftool/dbconfig/20250817-025906-ladsgroup.json [03:03:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [03:04:32] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [03:09:54] FIRING: 
KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [03:14:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T400854)', diff saved to https://phabricator.wikimedia.org/P81407 and previous config saved to /var/cache/conftool/dbconfig/20250817-031414-ladsgroup.json [03:14:18] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [03:14:29] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2236.codfw.wmnet with reason: Maintenance [03:14:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2236 (T400854)', diff saved to https://phabricator.wikimedia.org/P81408 and previous config saved to /var/cache/conftool/dbconfig/20250817-031436-ladsgroup.json [03:18:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2236 (T400854)', diff saved to https://phabricator.wikimedia.org/P81409 and previous config saved to /var/cache/conftool/dbconfig/20250817-031824-ladsgroup.json [03:29:32] FIRING: [3x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/3 (Core: lsw1-e4-codfw:ethernet-1/55 {#130117100037}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [03:29:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [03:33:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P81410 and previous config saved to /var/cache/conftool/dbconfig/20250817-033332-ladsgroup.json [03:48:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P81411 and previous config saved to /var/cache/conftool/dbconfig/20250817-034839-ladsgroup.json [04:03:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2236 (T400854)', diff saved to https://phabricator.wikimedia.org/P81412 and previous config saved to /var/cache/conftool/dbconfig/20250817-040347-ladsgroup.json [04:03:52] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [04:04:03] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2237.codfw.wmnet with reason: Maintenance [04:04:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2237 (T400854)', diff saved to https://phabricator.wikimedia.org/P81413 and previous config saved to /var/cache/conftool/dbconfig/20250817-040410-ladsgroup.json [04:07:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2237 (T400854)', diff saved to https://phabricator.wikimedia.org/P81414 and previous config saved to /var/cache/conftool/dbconfig/20250817-040755-ladsgroup.json [04:20:04] ops-codfw, SRE, DC-Ops: Alert for device ps1-c2-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402043#11091987 (phaultfinder) [04:23:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2237', diff
saved to https://phabricator.wikimedia.org/P81415 and previous config saved to /var/cache/conftool/dbconfig/20250817-042303-ladsgroup.json [04:37:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [04:38:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P81416 and previous config saved to /var/cache/conftool/dbconfig/20250817-043811-ladsgroup.json [04:39:33] FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:52:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [04:53:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2237 (T400854)', diff saved to https://phabricator.wikimedia.org/P81417 and previous config saved to /var/cache/conftool/dbconfig/20250817-045318-ladsgroup.json [04:53:25] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [04:53:35] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2239.codfw.wmnet with reason: Maintenance [04:57:15] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 6:00:00 on db2240.codfw.wmnet with reason: Maintenance [04:57:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2240 (T400854)', diff saved to https://phabricator.wikimedia.org/P81418 and previous config saved to /var/cache/conftool/dbconfig/20250817-045722-ladsgroup.json [04:59:32] FIRING: [2x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1011:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:01:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2240 (T400854)', diff saved to https://phabricator.wikimedia.org/P81419 and previous config saved to /var/cache/conftool/dbconfig/20250817-050107-ladsgroup.json [05:01:12] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [05:08:37] FIRING: [3x] JobUnavailable: Reduced availability for job pdu_pro4x in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:11:38] FIRING: [2x] GnmiTargetDown: asw1-b3-magru is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [05:16:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2240', diff saved to https://phabricator.wikimedia.org/P81420 and previous config saved to /var/cache/conftool/dbconfig/20250817-051615-ladsgroup.json [05:19:57] ops-codfw, SRE, DC-Ops: Alert for device ps1-c2-codfw.mgmt.codfw.wmnet - PDU sensor over limit -
https://phabricator.wikimedia.org/T402043#11092007 (phaultfinder) [05:31:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2240', diff saved to https://phabricator.wikimedia.org/P81421 and previous config saved to /var/cache/conftool/dbconfig/20250817-053122-ladsgroup.json [05:39:55] ops-codfw, SRE, DC-Ops: Alert for device ps1-c2-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402043#11092009 (phaultfinder) [05:46:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2240 (T400854)', diff saved to https://phabricator.wikimedia.org/P81422 and previous config saved to /var/cache/conftool/dbconfig/20250817-054629-ladsgroup.json [05:46:34] T400854: Add rc_source_name_timestamp index to recentchanges table in wmf production - https://phabricator.wikimedia.org/T400854 [05:54:53] ops-codfw, SRE, DC-Ops: Alert for device ps1-c2-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402043#11092014 (phaultfinder) [06:00:24] (PS1) EggRoll97: Add Oath log to bureaucrats [mediawiki-config] - https://gerrit.wikimedia.org/r/1179264 (https://phabricator.wikimedia.org/T401350) [06:01:41] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, August 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1179264 (https://phabricator.wikimedia.org/T401350) (owner: EggRoll97) [06:14:32] FIRING: [3x] JobUnavailable: Reduced availability for job pdu_pro4x in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:17:59] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 -
https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [06:38:37] FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:39:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:49:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250817T0700) [07:04:32] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [07:29:32] FIRING: [3x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/3 (Core: lsw1-e4-codfw:ethernet-1/55 {#130117100037}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [07:51:20] FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in codfw - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=59&orgId=1&from=now-6M&to=now&var-search_cluster=codfw - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [07:53:35] (PS5) Giuseppe Lavagetto: haproxy: allow having multiple requestctl scopes [puppet] - https://gerrit.wikimedia.org/r/1179247 (https://phabricator.wikimedia.org/T399058) [07:55:58] (PS6) Giuseppe Lavagetto: haproxy: allow having multiple requestctl scopes [puppet] - https://gerrit.wikimedia.org/r/1179247 (https://phabricator.wikimedia.org/T399058) [07:57:27] (CR) Giuseppe Lavagetto: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6595/co" [puppet] - https://gerrit.wikimedia.org/r/1179247 (https://phabricator.wikimedia.org/T399058) (owner: Giuseppe Lavagetto) [08:19:57] ops-codfw, SRE, DC-Ops: Alert for device ps1-c2-codfw.mgmt.codfw.wmnet - PDU
sensor over limit - https://phabricator.wikimedia.org/T402043#11092036 (phaultfinder) [08:26:20] FIRING: [2x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [08:28:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:33:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:59:32] FIRING: [2x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1011:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:03:28] ops-magru: Unresponsive management for dns7001.mgmt:22 - https://phabricator.wikimedia.org/T402105 (phaultfinder) NEW [09:03:29] ops-magru: Unresponsive management for cp7015.mgmt:22 - https://phabricator.wikimedia.org/T402107 (phaultfinder) NEW [09:03:30] ops-magru: Unresponsive management for cp7016.mgmt:22 - https://phabricator.wikimedia.org/T402106 (phaultfinder) NEW [09:03:31] ops-magru: Unresponsive management for cp7003.mgmt:22 - https://phabricator.wikimedia.org/T402109 (phaultfinder)
NEW [09:03:32] ops-magru: Unresponsive management for cp7014.mgmt:22 - https://phabricator.wikimedia.org/T402108 (phaultfinder) NEW [09:03:34] ops-magru: Unresponsive management for cp7010.mgmt:22 - https://phabricator.wikimedia.org/T402110 (phaultfinder) NEW [09:04:29] ops-magru: Unresponsive management for ganeti7002.mgmt:22 - https://phabricator.wikimedia.org/T402111 (phaultfinder) NEW [09:04:30] ops-magru: Unresponsive management for cp7001.mgmt:22 - https://phabricator.wikimedia.org/T402113 (phaultfinder) NEW [09:04:31] ops-magru: Unresponsive management for cp7007.mgmt:22 - https://phabricator.wikimedia.org/T402114 (phaultfinder) NEW [09:04:32] ops-magru: Unresponsive management for cp7011.mgmt:22 - https://phabricator.wikimedia.org/T402116 (phaultfinder) NEW [09:04:33] ops-magru: Unresponsive management for cp7012.mgmt:22 - https://phabricator.wikimedia.org/T402112 (phaultfinder) NEW [09:04:34] ops-magru: Unresponsive management for ganeti7004.mgmt:22 - https://phabricator.wikimedia.org/T402115 (phaultfinder) NEW [09:05:27] ops-magru: Unresponsive management for cp7013.mgmt:22 - https://phabricator.wikimedia.org/T402117 (phaultfinder) NEW [09:05:28] ops-magru: Unresponsive management for cp7002.mgmt:22 - https://phabricator.wikimedia.org/T402120 (phaultfinder) NEW [09:05:29] ops-magru: Unresponsive management for ganeti7001.mgmt:22 - https://phabricator.wikimedia.org/T402119 (phaultfinder) NEW [09:05:30] ops-magru: Unresponsive management for cp7004.mgmt:22 - https://phabricator.wikimedia.org/T402118 (phaultfinder) NEW [09:05:31] ops-magru: Unresponsive management for lvs7002.mgmt:22 - https://phabricator.wikimedia.org/T402123 (phaultfinder) NEW [09:05:32] ops-magru: Unresponsive management for ganeti7003.mgmt:22 - https://phabricator.wikimedia.org/T402122 (phaultfinder) NEW [09:05:35] ops-magru: Unresponsive management for dns7002.mgmt:22 -
https://phabricator.wikimedia.org/T402121 (phaultfinder) NEW [09:06:28] ops-magru: Unresponsive management for lvs7003.mgmt:22 - https://phabricator.wikimedia.org/T402124 (phaultfinder) NEW [09:06:29] ops-magru: Unresponsive management for cp7009.mgmt:22 - https://phabricator.wikimedia.org/T402125 (phaultfinder) NEW [09:06:30] ops-magru: Unresponsive management for cp7008.mgmt:22 - https://phabricator.wikimedia.org/T402126 (phaultfinder) NEW [09:06:31] ops-magru: Unresponsive management for lvs7001.mgmt:22 - https://phabricator.wikimedia.org/T402127 (phaultfinder) NEW [09:06:32] ops-magru: Unresponsive management for cp7005.mgmt:22 - https://phabricator.wikimedia.org/T402128 (phaultfinder) NEW [09:11:38] FIRING: [2x] GnmiTargetDown: asw1-b3-magru is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [09:27:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [09:32:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [09:33:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes -
https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:38:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:47:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [10:04:17] FIRING: [3x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:09:17] FIRING: [4x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:14:32] FIRING: JobUnavailable: Reduced availability for job pdu_pro4x in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:18:46] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate push-notifications.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [10:32:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [10:39:33] FIRING: SystemdUnitFailed: netbox_ganeti_magru02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:04:32] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [11:12:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:17:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:29:33] FIRING: [3x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/3 (Core: lsw1-e4-codfw:ethernet-1/55 {#130117100037}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [11:47:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [12:26:35] FIRING: [2x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [12:42:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [12:59:55] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es2049-es2057 - https://phabricator.wikimedia.org/T400195#11092401 (10Ladsgroup) >>! In T400195#11091192, @Jhancock.wm wrote: > @Marostegui these are ready! Thanks! Manuel is out. I will try to start getting them into production. 
[13:00:40] RECOVERY - Host mr1-magru is UP: PING OK - Packet loss = 0%, RTA = 111.00 ms [13:00:52] PROBLEM - Host mr1-magru.oob is DOWN: PING CRITICAL - Packet loss = 100% [13:02:42] RECOVERY - Host mr1-magru.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 200.69 ms [13:03:37] RESOLVED: JobUnavailable: Reduced availability for job pdu_pro4x in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:04:04] RECOVERY - Host mr1-magru IPv6 is UP: PING OK - Packet loss = 0%, RTA = 110.95 ms [13:04:48] RECOVERY - Host mr1-magru.oob is UP: PING OK - Packet loss = 0%, RTA = 131.03 ms [13:06:23] RESOLVED: [2x] GnmiTargetDown: asw1-b3-magru is unreachable through gNMI - https://wikitech.wikimedia.org/wiki/Network_telemetry#Troubleshooting - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic - https://alerts.wikimedia.org/?q=alertname%3DGnmiTargetDown [13:47:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:09:17] FIRING: [2x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1011:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:19:01] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate push-notifications.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [14:32:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:36:05] (03PS6) 10Giuseppe Lavagetto: varnish: refactor inclusion of requestctl rules [puppet] - 10https://gerrit.wikimedia.org/r/1175841 (https://phabricator.wikimedia.org/T399058) [14:36:05] (03PS7) 10Giuseppe Lavagetto: haproxy: allow having multiple requestctl scopes [puppet] - 10https://gerrit.wikimedia.org/r/1179247 (https://phabricator.wikimedia.org/T399058) [14:37:26] (03CR) 10Giuseppe Lavagetto: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6596/co" [puppet] - 10https://gerrit.wikimedia.org/r/1179247 (https://phabricator.wikimedia.org/T399058) (owner: 10Giuseppe Lavagetto) [14:39:33] FIRING: SystemdUnitFailed: netbox_ganeti_magru02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:51:56] (03PS7) 10Giuseppe Lavagetto: varnish: refactor inclusion of requestctl rules [puppet] - 10https://gerrit.wikimedia.org/r/1175841 (https://phabricator.wikimedia.org/T399058) [14:54:32] (03CR) 10Giuseppe Lavagetto: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6597/co" [puppet] - 10https://gerrit.wikimedia.org/r/1175841 (https://phabricator.wikimedia.org/T399058) (owner: 10Giuseppe Lavagetto) [15:01:14] 06SRE: Intermittent access issues to English Wikipedia on desktop/laptop - 
https://phabricator.wikimedia.org/T402142 (10Josve05a) 03NEW [15:04:32] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [15:08:20] (03PS8) 10Giuseppe Lavagetto: varnish: refactor inclusion of requestctl rules [puppet] - 10https://gerrit.wikimedia.org/r/1175841 (https://phabricator.wikimedia.org/T399058) [15:08:37] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:57] (03CR) 10Giuseppe Lavagetto: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6599/co" [puppet] - 10https://gerrit.wikimedia.org/r/1175841 (https://phabricator.wikimedia.org/T399058) (owner: 10Giuseppe Lavagetto) [15:11:52] (03PS9) 10Giuseppe Lavagetto: varnish: refactor inclusion of requestctl rules [puppet] - 10https://gerrit.wikimedia.org/r/1175841 (https://phabricator.wikimedia.org/T399058) [15:12:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [15:13:11] (03CR) 10Giuseppe Lavagetto: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6600/co" [puppet] - 10https://gerrit.wikimedia.org/r/1175841 (https://phabricator.wikimedia.org/T399058) (owner: 10Giuseppe Lavagetto) [15:14:22] 06SRE: Intermittent access issues to English Wikipedia on desktop/laptop - 
https://phabricator.wikimedia.org/T402142#11092445 (10Josve05a) Follow-up from ticket #2025081710002753 : - OS: Windows 11 - IP Address (IPv4): 103.217.230.77 - Browser: Chromium v126.0.6478.251 - Browser add-ons: uBlock Origin, Shazam,... [15:17:37] (03CR) 10Giuseppe Lavagetto: [V:03+1] "all varnishtests pass now" [puppet] - 10https://gerrit.wikimedia.org/r/1175841 (https://phabricator.wikimedia.org/T399058) (owner: 10Giuseppe Lavagetto) [15:17:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [15:23:37] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:29:33] FIRING: [3x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/3 (Core: lsw1-e4-codfw:ethernet-1/55 {#130117100037}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [15:30:52] 06SRE: Intermittent access issues to English Wikipedia on desktop/laptop - https://phabricator.wikimedia.org/T402142#11092446 (10Josve05a) See also e.g. 
https://www.reddit.com/r/wikipedia/comments/1mhbumf/wikipedia_hasnt_been_working_for_me_for_the_past/ [15:32:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [15:37:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [15:47:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [15:51:20] FIRING: [2x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [15:52:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.187s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:55:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - 
https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [15:57:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.187s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:00:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [16:32:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [17:10:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [17:15:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [17:30:00] 06SRE: Intermittent access issues to English Wikipedia on desktop/laptop - https://phabricator.wikimedia.org/T402142#11092492 (10Josve05a) Additional report from user (Ticket #2025081710003225): - Device: HP Pavilion x360 Convertible (64-bit) -- Issue also occurs on other desktops and laptops - Browsers tes... 
[17:36:39] 06SRE: Intermittent access issues to English Wikipedia on desktop/laptop - https://phabricator.wikimedia.org/T402142#11092497 (10Josve05a) [17:42:21] 06SRE, 06Traffic: Intermittent access issues to English Wikipedia on desktop/laptop - https://phabricator.wikimedia.org/T402142#11092500 (10Bugreporter) [17:47:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [18:09:32] FIRING: [2x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1011:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:19:02] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate push-notifications.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [18:32:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [18:37:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [18:39:33] FIRING: 
SystemdUnitFailed: netbox_ganeti_magru02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:04:33] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [19:27:25] (03PS1) 10Andrew Bogott: magnum capi_helm: use the cloudinfra chartmuseum repo for helm charts [puppet] - 10https://gerrit.wikimedia.org/r/1179278 (https://phabricator.wikimedia.org/T393782) [19:29:02] (03CR) 10Andrew Bogott: [C:03+2] magnum capi_helm: use the cloudinfra chartmuseum repo for helm charts [puppet] - 10https://gerrit.wikimedia.org/r/1179278 (https://phabricator.wikimedia.org/T393782) (owner: 10Andrew Bogott) [19:29:33] FIRING: [3x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/3 (Core: lsw1-e4-codfw:ethernet-1/55 {#130117100037}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [19:47:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [19:51:36] FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in codfw - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - 
https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=59&orgId=1&from=now-6M&to=now&var-search_cluster=codfw - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [20:32:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [20:41:06] 06SRE, 06Traffic: Intermittent access issues to English Wikipedia on desktop/laptop - https://phabricator.wikimedia.org/T402142#11092538 (10Aklapper) > Browser: Chromium v126.0.6478.251 I could imagine that such older versions (126 was in April 2024) are more likely to end up rate-limited. Does that also happ... [20:54:14] PROBLEM - Druid historical on an-druid1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [21:08:14] RECOVERY - Druid historical on an-druid1006 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [21:13:32] RECOVERY - Disk space on an-druid1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-druid1003&var-datasource=eqiad+prometheus/ops [21:17:40] RECOVERY - Disk space on an-druid1005 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-druid1005&var-datasource=eqiad+prometheus/ops [21:28:58] RECOVERY - Disk space on an-druid1007 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-druid1007&var-datasource=eqiad+prometheus/ops [21:31:46] 06SRE, 06Traffic: Intermittent access issues to English Wikipedia on desktop/laptop - https://phabricator.wikimedia.org/T402142#11092576 (10Josve05a) >>! In T402142#11092538, @Aklapper wrote: >> Browsers tested: Edge, AVG, Chrome (issue slightly less frequent on Chrome) > > Which exact versions? * Chrome 139... [21:47:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [21:48:35] 06SRE, 06Traffic: Intermittent access issues to English Wikipedia on desktop/laptop - https://phabricator.wikimedia.org/T402142#11092577 (10Josve05a) Ticket #2025073110012101: Used an old version of the browser Chromium (Chromium based). Now updated, and now everything seems to work for this person at least. 
[22:01:21] FIRING: [2x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [22:09:32] FIRING: [2x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1011:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:19:02] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate push-notifications.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [22:27:32] PROBLEM - Disk space on an-druid1002 is CRITICAL: DISK CRITICAL - free space: /srv 86612 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-druid1002&var-datasource=eqiad+prometheus/ops [22:39:33] FIRING: SystemdUnitFailed: netbox_ganeti_magru02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:47:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [22:50:22] PROBLEM - Disk space on an-druid1004 is CRITICAL: DISK CRITICAL - free space: /srv 101822 MB (3% inode=99%): 
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-druid1004&var-datasource=eqiad+prometheus/ops [22:57:42] PROBLEM - Disk space on an-druid1001 is CRITICAL: DISK CRITICAL - free space: /srv 100568 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-druid1001&var-datasource=eqiad+prometheus/ops [23:04:33] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [23:29:33] FIRING: [3x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/3 (Core: lsw1-e4-codfw:ethernet-1/55 {#130117100037}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [23:38:01] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1179289 [23:38:01] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1179289 (owner: 10TrainBranchBot) [23:47:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [23:51:37] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1179289 
(owner: TrainBranchBot)