[00:05:25] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:07:22] FIRING: [7x] CertAlmostExpired: gNMI TLS certificate for lsw1-c2-eqiad.mgmt.eqiad.wmnet is going to expire in 0s - https://wikitech.wikimedia.org/wiki/Network_monitoring#CertAlmostExpired - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [00:12:22] FIRING: [13x] CertAlmostExpired: gNMI TLS certificate for lsw1-c2-eqiad.mgmt.eqiad.wmnet is going to expire in 0s - https://wikitech.wikimedia.org/wiki/Network_monitoring#CertAlmostExpired - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [00:17:22] FIRING: [15x] CertAlmostExpired: gNMI TLS certificate for lsw1-c2-eqiad.mgmt.eqiad.wmnet is going to expire in 0s - https://wikitech.wikimedia.org/wiki/Network_monitoring#CertAlmostExpired - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [00:36:54] FIRING: [2x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [00:48:52] PROBLEM - MariaDB Replica Lag: pc1 on pc2021 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.44 seconds https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [01:12:10] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 
https://gerrit.wikimedia.org/r/1306075 [01:12:10] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1306075 (owner: TrainBranchBot) [01:20:19] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1306075 (owner: TrainBranchBot) [02:00:26] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image [02:00:49] RESOLVED: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [02:07:16] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 49s) [02:09:41] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:14:41] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:22:17] FIRING: [2x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1023:0 has a BGP session which is not in the 'established' state. 
- https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [02:23:31] FIRING: ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:28:31] FIRING: [2x] ProbeDown: Service gerrit2003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:29:41] FIRING: [5x] JobUnavailable: Reduced availability for job gerrit in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:30:44] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns5004 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [02:32:38] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1004 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [02:32:38] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1005 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [02:33:06] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns3003 is CRITICAL: Unable to fetch the SHA-1 HEAD from 
operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [02:33:06] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns3004 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [02:33:06] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns6002 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [02:33:06] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns6001 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [02:33:06] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns7001 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [02:33:06] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns5003 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [02:34:41] FIRING: [3x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:36:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from gerrit.discovery.wmnet in eqiad #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=text&var-origin=gerrit.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [02:38:02] 
PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2005 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [02:39:41] FIRING: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:41:46] FIRING: [4x] GerritHAProxyBackendUnavailable: Gerrit backend is unavilable for tcp-proxy (HAProxy) gerrit_ssh - https://wikitech.wikimedia.org/wiki/Gerrit/Operations#GerritHAProxyBackendUnavailable - grafana.wikimedia.org/d/459365f6-df37-48d6-8142-82b22c1875e7/gerrit-tcp-proxy?viewPanel=panel-15 - https://alerts.wikimedia.org/?q=alertname%3DGerritHAProxyBackendUnavailable [02:41:57] FIRING: [2x] GerritHAProxyServiceUnavailable: Gerrit tcp-proxy (HAProxy) service gerrit_ssh is DOWN in codfw - https://wikitech.wikimedia.org/wiki/Gerrit/Operations#GerritHAProxyServiceUnavailable - grafana.wikimedia.org/d/459365f6-df37-48d6-8142-82b22c1875e7/gerrit-tcp-proxy?viewPanel=panel-15 - https://alerts.wikimedia.org/?q=alertname%3DGerritHAProxyServiceUnavailable [02:43:04] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2004 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [02:43:04] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2006 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [02:43:04] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1006 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git 
https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [02:43:04] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns7001 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [02:43:04] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns4003 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [02:43:04] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns3004 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [02:43:05] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns4004 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [02:43:06] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns6002 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [02:43:06] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns3003 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [02:43:06] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns6001 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [02:43:07] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns5003 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [02:43:31] 
FIRING: [4x] ProbeDown: Service gerrit2003:29418 has failed probes (tcp_gerrit_ssh_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:43:46] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns7002 is CRITICAL: Unable to fetch the SHA-1 HEAD from operations/dns.git https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [02:45:44] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns5004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [02:46:46] RESOLVED: [4x] GerritHAProxyBackendUnavailable: Gerrit backend is unavilable for tcp-proxy (HAProxy) gerrit_ssh - https://wikitech.wikimedia.org/wiki/Gerrit/Operations#GerritHAProxyBackendUnavailable - grafana.wikimedia.org/d/459365f6-df37-48d6-8142-82b22c1875e7/gerrit-tcp-proxy?viewPanel=panel-15 - https://alerts.wikimedia.org/?q=alertname%3DGerritHAProxyBackendUnavailable [02:46:51] RESOLVED: [2x] GerritHAProxyServiceUnavailable: Gerrit tcp-proxy (HAProxy) service gerrit_ssh is DOWN in codfw - https://wikitech.wikimedia.org/wiki/Gerrit/Operations#GerritHAProxyServiceUnavailable - grafana.wikimedia.org/d/459365f6-df37-48d6-8142-82b22c1875e7/gerrit-tcp-proxy?viewPanel=panel-15 - https://alerts.wikimedia.org/?q=alertname%3DGerritHAProxyServiceUnavailable [02:47:02] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from gerrit.discovery.wmnet in eqiad #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=text&var-origin=gerrit.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [02:47:36] RECOVERY - check if authdns-update was run after a change was merged to 
operations/dns.git on dns1005 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [02:47:36] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [02:48:02] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2005 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [02:48:02] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [02:48:02] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2006 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [02:48:02] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1006 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [02:48:04] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns4003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [02:48:04] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns4004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [02:48:04] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns3004 is OK: Local zone files and operations/dns.git are in sync 
https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [02:48:04] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns3003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [02:48:04] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns6001 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [02:48:04] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns6002 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [02:48:05] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns7001 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [02:48:06] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns5003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [02:48:31] RESOLVED: [4x] ProbeDown: Service gerrit2003:29418 has failed probes (tcp_gerrit_ssh_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:48:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:48:46] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns7002 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run 
[02:49:41] RESOLVED: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:58:46] FIRING: GerritDiskSpaceExhaustionIncoming: Gerrit disk space runway on gerrit2003:/srv is too low - https://wikitech.wikimedia.org/wiki/Gerrit/Operations#TODO - https://grafana.wikimedia.org/d/rYdddlPWk/node-exporter-collaboration-services?var-job=node&var-nodename=gerrit2003&var-node=gerrit2003%3A9100&refresh=1m - https://alerts.wikimedia.org/?q=alertname%3DGerritDiskSpaceExhaustionIncoming [03:38:46] RESOLVED: GerritDiskSpaceExhaustionIncoming: Gerrit disk space runway on gerrit2003:/srv is too low - https://wikitech.wikimedia.org/wiki/Gerrit/Operations#TODO - https://grafana.wikimedia.org/d/rYdddlPWk/node-exporter-collaboration-services?var-job=node&var-nodename=gerrit2003&var-node=gerrit2003%3A9100&refresh=1m - https://alerts.wikimedia.org/?q=alertname%3DGerritDiskSpaceExhaustionIncoming [04:05:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:17:38] FIRING: [15x] CertAlmostExpired: gNMI TLS certificate for lsw1-c2-eqiad.mgmt.eqiad.wmnet is going to expire in 0s - https://wikitech.wikimedia.org/wiki/Network_monitoring#CertAlmostExpired - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [04:36:54] FIRING: [2x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - 
TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [05:08:39] FIRING: CoreBGPDown: Core BGP session down between cr2-eqord and cr2-codfw (208.80.153.193) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr2-eqord:9804&var-bgp_group=Confed_codfw&var-bgp_neighbor=cr2-codfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [05:09:41] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:13:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [06:17:52] RECOVERY - MariaDB Replica Lag: pc1 on pc2021 is OK: OK slave_sql_lag Replication lag: 37.43 seconds https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [06:22:17] FIRING: [2x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1023:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [06:48:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260628T0700) [07:44:41] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:et-0/1/4 (Transport: cr2-eqiad:et-1/1/5 (Lumen, 449169461) {#3909}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:48:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [08:05:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:17:37] FIRING: [15x] CertAlmostExpired: gNMI TLS certificate for lsw1-c2-eqiad.mgmt.eqiad.wmnet is going to expire in 0s - https://wikitech.wikimedia.org/wiki/Network_monitoring#CertAlmostExpired - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [08:29:41] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:36:54] FIRING: [2x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [08:39:33] (PS1) Simon04: Replace 
w.wiki links with their target, use Special:MyLanguage [puppet] - https://gerrit.wikimedia.org/r/1306094 [08:42:35] (CR) CI reject: [V:-1] Replace w.wiki links with their target, use Special:MyLanguage [puppet] - https://gerrit.wikimedia.org/r/1306094 (owner: Simon04) [08:59:14] SRE-OnFire, Citoid: Large increase in citoid latency starting on June 25/ ~ 21 UTC - present - https://phabricator.wikimedia.org/T430279#12062855 (Mvolz) >>! In T430279#12058709, @Joe wrote: > So: I don't think this is an incident with user-facing consequences. It lines up with an increase in requests an... [08:59:35] SRE-OnFire, Citoid: Large increase in citoid latency starting on June 25/ ~ 21 UTC - June 26 / ~13 UTC - https://phabricator.wikimedia.org/T430279#12062856 (Mvolz) [10:22:17] FIRING: [2x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1023:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [10:48:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:05:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:09:57] FIRING: ProbeDown: Service shellbox:4008 has failed probes (http_shellbox_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#shellbox:4008 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown 
[12:14:57] FIRING: ProbeDown: Service shellbox:4008 has failed probes (http_shellbox_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#shellbox:4008 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:15:17] !incidents [12:15:17] 8103 (UNACKED) ProbeDown sre (10.2.2.51 ip4 shellbox:4008 probes/service http_shellbox_ip4 eqiad) [12:15:17] 8102 (RESOLVED) ATSBackendErrorsHigh cache_text sre (gerrit.discovery.wmnet eqiad) [12:15:37] !ack [12:15:38] 8103 (ACKED) ProbeDown sre (10.2.2.51 ip4 shellbox:4008 probes/service http_shellbox_ip4 eqiad) [12:17:38] FIRING: [15x] CertAlmostExpired: gNMI TLS certificate for lsw1-c2-eqiad.mgmt.eqiad.wmnet is going to expire in 0s - https://wikitech.wikimedia.org/wiki/Network_monitoring#CertAlmostExpired - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [12:19:57] RESOLVED: ProbeDown: Service shellbox:4008 has failed probes (http_shellbox_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#shellbox:4008 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:19:57] RESOLVED: ProbeDown: Service shellbox:4008 has failed probes (http_shellbox_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#shellbox:4008 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:20:14] (PS1) Giuseppe Lavagetto: Bump up resources for shellbox-score [deployment-charts] - https://gerrit.wikimedia.org/r/1306107 [12:36:54] FIRING: [2x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - 
https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [12:37:23] FIRING: [15x] CertAlmostExpired: gNMI TLS certificate for lsw1-c2-eqiad.mgmt.eqiad.wmnet is going to expire in 0s - https://wikitech.wikimedia.org/wiki/Network_monitoring#CertAlmostExpired - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [12:43:39] FIRING: CoreBGPDown: Core BGP session down between cr2-magru and cr2-eqdfw (2a02:ec80:700:fe0b::1) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Confed_codfw&var-bgp_neighbor=cr2-eqdfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [12:48:39] RESOLVED: CoreBGPDown: Core BGP session down between cr2-magru and cr2-eqdfw (2a02:ec80:700:fe0b::1) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Confed_codfw&var-bgp_neighbor=cr2-eqdfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:03:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:22:17] FIRING: [2x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1023:0 has a BGP session which is not in the 'established' state. 
- https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [16:03:26] (PS1) Hashar: zuul: disable nodepool fallback [puppet] - https://gerrit.wikimedia.org/r/1306122 (https://phabricator.wikimedia.org/T424879) [16:05:44] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:09:41] FIRING: [3x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:14:41] RESOLVED: [3x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:21:15] (CR) Ladsgroup: "it would be great if we can split it to each sub-part of the system. I know it's annoying but this scares me 😭" [puppet] - https://gerrit.wikimedia.org/r/1305988 (https://phabricator.wikimedia.org/T372666) (owner: JHathaway)