[00:02:46] (03PS1) 10Dzahn: gerrit: have different motd banners on active/passive servers [puppet] - 10https://gerrit.wikimedia.org/r/1137840 (https://phabricator.wikimedia.org/T392212) [00:03:50] !log reedy@deploy1003 Synchronized wmf-config/InitialiseSettings-labs.php: Fix syntax (duration: 11m 02s) [00:03:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-1/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [00:04:59] (03CR) 10CI reject: [V:04-1] gerrit: have different motd banners on active/passive servers [puppet] - 10https://gerrit.wikimedia.org/r/1137840 (https://phabricator.wikimedia.org/T392212) (owner: 10Dzahn) [00:09:38] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1137841 [00:09:38] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1137841 (owner: 10TrainBranchBot) [00:10:30] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1137827/5325/phab1005.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1137827 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn) [00:11:31] (03PS1) 10Dzahn: gerrit: replace legacy fact with modern fact [puppet] - 10https://gerrit.wikimedia.org/r/1137842 [00:12:41] (03PS1) 10Reedy: InitialiseSettings-labs.php: Fix ReadingList config (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137843 [00:12:46] siigh [00:12:54] (03CR) 10Reedy: [C:03+2] InitialiseSettings-labs.php: Fix ReadingList config (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137843 (owner: 10Reedy) [00:13:42] (03Merged) 10jenkins-bot: InitialiseSettings-labs.php: Fix ReadingList config (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137843 (owner: 10Reedy) [00:23:40] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [00:25:28] !log reedy@deploy1003 Synchronized wmf-config/InitialiseSettings-labs.php: Fix syntax (duration: 10m 53s) [00:28:06] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1137841 (owner: 10TrainBranchBot) [01:03:40] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:08:35] RECOVERY - OpenSearch unassigned shard check - 9600 on cirrussearch2062 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [01:19:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [01:34:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [01:35:49] PROBLEM - Disk space on analytics1072 is CRITICAL: DISK CRITICAL - free space: / 2042 MB (3% inode=95%): /tmp 2042 MB (3% inode=95%): /var/tmp 2042 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=analytics1072&var-datasource=eqiad+prometheus/ops [01:36:29] FIRING: NodeTextfileStale: Stale textfile for elastic2098:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250422T0200) [02:35:26] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:00:04] Deploy window Automatic deployment of of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250422T0300) [03:19:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [03:23:41] FIRING: [2x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:24:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [03:27:33] PROBLEM - Disk space on an-worker1116 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/g 182941 MB (4% inode=99%): /var/lib/hadoop/data/l 184691 MB (4% inode=99%): /var/lib/hadoop/data/m 189877 MB (5% inode=99%): /var/lib/hadoop/data/b 217063 MB (5% inode=99%): /var/lib/hadoop/data/e 203391 MB (5% inode=99%): /var/lib/hadoop/data/c 182987 MB (4% inode=99%): /var/lib/hadoop/data/k 155588 MB (4% inode=99%): /var/lib/hadoop/data [03:27:33] 8 MB (4% inode=99%): /var/lib/hadoop/data/h 147964 MB (3% inode=99%): /var/lib/hadoop/data/f 201483 MB (5% inode=99%): /var/lib/hadoop/data/j 186436 MB (4% inode=99%): /var/lib/hadoop/data/d 157050 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1116&var-datasource=eqiad+prometheus/ops [03:31:09] PROBLEM - Disk space on an-worker1089 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/h 154811 MB (4% inode=99%): /var/lib/hadoop/data/k 193022 MB (5% inode=99%): /var/lib/hadoop/data/m 153718 MB (4% inode=99%): /var/lib/hadoop/data/f 149825 MB (3% inode=99%): /var/lib/hadoop/data/c 163463 MB (4% inode=99%): /var/lib/hadoop/data/l 194755 MB (5% inode=99%): /var/lib/hadoop/data/e 154991 MB (4% inode=99%): /var/lib/hadoop/data [03:31:09] 4 MB (5% inode=99%): /var/lib/hadoop/data/j 183961 MB (4% inode=99%): /var/lib/hadoop/data/i 156077 MB (4% inode=99%): /var/lib/hadoop/data/d 181099 MB (4% inode=99%): /var/lib/hadoop/data/b 167138 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1089&var-datasource=eqiad+prometheus/ops [03:35:49] PROBLEM - Disk space on analytics1072 is CRITICAL: DISK CRITICAL - free space: / 2034 MB (3% inode=95%): /tmp 2034 MB (3% inode=95%): /var/tmp 2034 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=analytics1072&var-datasource=eqiad+prometheus/ops [03:40:25] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:40:53] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:41:45] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:44:45] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53801 bytes in 9.876 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:44:47] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 4.365 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:45:17] PROBLEM - Disk space on analytics1070 is CRITICAL: DISK CRITICAL - free space: / 2125 MB (3% inode=95%): /tmp 2125 MB (3% inode=95%): /var/tmp 2125 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=analytics1070&var-datasource=eqiad+prometheus/ops [03:53:41] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:00:04] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250422T0400) [04:02:50] !log mwpresync@deploy1003 Pruned MediaWiki: 1.44.0-wmf.23 (duration: 02m 48s) [04:03:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-1/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [04:23:41] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [04:43:55] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:44:51] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:51:09] PROBLEM - Disk space on an-worker1089 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/h 155461 MB (4% inode=99%): /var/lib/hadoop/data/k 192562 MB (5% inode=99%): /var/lib/hadoop/data/m 154269 MB (4% inode=99%): /var/lib/hadoop/data/f 148624 MB (3% inode=99%): /var/lib/hadoop/data/c 163228 MB (4% inode=99%): /var/lib/hadoop/data/l 194356 MB (5% inode=99%): /var/lib/hadoop/data/e 158481 MB (4% inode=99%): /var/lib/hadoop/data [04:51:09] 5 MB (5% inode=99%): /var/lib/hadoop/data/j 184155 MB (4% inode=99%): /var/lib/hadoop/data/i 156439 MB (4% inode=99%): /var/lib/hadoop/data/d 181008 MB (4% inode=99%): /var/lib/hadoop/data/b 162472 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1089&var-datasource=eqiad+prometheus/ops [05:03:41] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:03:44] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [05:07:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:08:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [05:26:11] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:27:09] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:28:49] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [05:36:29] FIRING: NodeTextfileStale: Stale textfile for elastic2098:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250422T0600) [06:00:05] marostegui, Amir1, and federico3: May I have your attention please! Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250422T0600) [06:02:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:03:09] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:03:37] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 129, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:03:51] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:08:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:19:34] (03PS4) 10Ori: Remove temporary '-php8' and '-k8s' suffixes from ArcLamp pipeline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118208 (https://phabricator.wikimedia.org/T391516) [06:20:52] (03PS5) 10Ori: Remove temporary '-php8' and '-k8s' suffixes from ArcLamp pipeline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118208 (https://phabricator.wikimedia.org/T391516) [06:23:07] (03PS3) 10Ori: Turn down the PHP8- and Kubernetes-specific ArcLamp listeners [puppet] - 10https://gerrit.wikimedia.org/r/1118209 (https://phabricator.wikimedia.org/T391516) [06:23:15] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:24:09] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:30:51] PROBLEM - Restbase root url on restbase1041 is CRITICAL: connect to address 10.64.48.40 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase [06:35:26] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250422T0700) [07:23:41] FIRING: [2x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:40:26] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:53:41] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:02:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [08:07:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [08:23:41] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [08:42:13] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:43:09] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:03:41] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:36:29] FIRING: NodeTextfileStale: Stale textfile for elastic2098:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [09:48:41] FIRING: [3x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:51:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:56:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:08:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:35:26] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:40:26] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:53:41] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:21:46] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on cloudcephmon1004 - https://phabricator.wikimedia.org/T392424 (10ops-monitoring-bot) 03NEW [12:23:41] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [13:03:41] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:36:29] FIRING: NodeTextfileStale: Stale textfile for elastic2098:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:48:41] FIRING: [2x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:04:32] 06SRE, 06[Archived]Wikidata Dev Team, 10Prod-Kubernetes, 06Traffic, and 4 others: Frequent 500 Errors and Timeouts When Adding Statements to New Item or Lexeme-typed Properties - https://phabricator.wikimedia.org/T374230#10758759 (10Silvan_WMDE) The problem seems to occur specifically, when a newly created... [14:08:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:19:21] (03CR) 10AOkoth: [C:03+2] miscweb: os-reports: deploy os-reports to k8s-aux [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131955 (https://phabricator.wikimedia.org/T350794) (owner: 10Jelto) [14:19:35] (03PS7) 10Jelto: miscweb: os-reports: deploy os-reports to k8s-aux [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131955 (https://phabricator.wikimedia.org/T350794) [14:31:13] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:32:11] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:32:25] PROBLEM - Hadoop NodeManager on an-worker1198 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:32:35] PROBLEM - Hadoop NodeManager on an-worker1159 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:34:33] 10ops-eqiad, 06DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T392427 (10phaultfinder) 03NEW [14:35:26] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:39:01] PROBLEM - Hadoop NodeManager on an-worker1150 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:39:39] 10ops-eqiad, 06DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T392428 (10phaultfinder) 03NEW [14:42:13] PROBLEM - grafana-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [14:43:45] PROBLEM - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [14:46:01] PROBLEM - Hadoop NodeManager on an-worker1171 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:48:45] PROBLEM - SSH on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:48:47] (03CR) 10AOkoth: [V:03+2 C:03+2] miscweb: os-reports: deploy os-reports to k8s-aux [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131955 (https://phabricator.wikimedia.org/T350794) (owner: 10Jelto) [14:52:35] RECOVERY - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 562 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [14:52:35] RECOVERY - Hadoop NodeManager on an-worker1159 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:52:35] RECOVERY - SSH on grafana1002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:56:11] RECOVERY - grafana-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 554 bytes in 7.870 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [14:56:25] RECOVERY - Hadoop NodeManager on an-worker1198 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:59:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [15:00:01] RECOVERY - Hadoop NodeManager on an-worker1150 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:00:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:02:51] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:13:01] RECOVERY - Hadoop NodeManager on an-worker1171 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:35:48] FIRING: PuppetDisabled: Puppet disabled on elastic2098:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=elasticsearch&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [15:36:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:40:26] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:49:44] FIRING: [4x] RipeAtlasAnchorUnreachable: ipv4 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140314 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [15:53:41] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:00:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:02:51] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:23:41] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [17:03:41] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:21:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [17:26:03] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2023:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [17:26:08] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [17:26:19] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [17:26:58] FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2018:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [17:27:03] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2025:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [17:27:08] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2021:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [17:27:24] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2026:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [17:29:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [17:30:58] RESOLVED: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2023:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [17:31:08] FIRING: [8x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [17:31:19] RESOLVED: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [17:31:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [17:31:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2025:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [17:32:03] RESOLVED: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2018:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [17:32:19] RESOLVED: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2021:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [17:32:25] RESOLVED: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2026:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [17:35:58] RESOLVED: [8x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [17:36:29] FIRING: NodeTextfileStale: Stale textfile for elastic2098:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [17:48:41] FIRING: [2x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:51:29] (03PS1) 10D3r1ck01: SUL3: Remove unused CentralAuthSharedDomainPrefix config setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138096 [18:00:51] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:01:51] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:08:41] FIRING: [4x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:08:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:35:26] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:26:26] (03PS3) 10Bartosz Dziewoński: Clean up obsolete SUL3 settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135851 [19:27:27] (03PS4) 10Bartosz Dziewoński: Simplify CentralAuthEnableSul3 config setting value [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135851 [19:27:30] (03CR) 10Bartosz Dziewoński: Simplify CentralAuthEnableSul3 config setting value [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135851 (owner: 10Bartosz Dziewoński) [19:28:19] (03CR) 10Bartosz Dziewoński: [C:03+1] SUL3: Remove unused CentralAuthSharedDomainPrefix config setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138096 (owner: 10D3r1ck01) [19:29:00] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 23 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138096 (owner: 10D3r1ck01) [19:29:11] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 23 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135851 (owner: 10Bartosz Dziewoński) [19:35:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:35:48] FIRING: PuppetDisabled: Puppet disabled on elastic2098:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=elasticsearch&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [19:37:05] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:38:27] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 46, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:39:27] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 47, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:40:26] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:43:51] FIRING: [6x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [19:48:47] (03CR) 10D3r1ck01: [C:03+1] "Yes, thank you! We can deploy both at once 😊" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135851 (owner: 10Bartosz Dziewoński) [19:53:41] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:05:17] PROBLEM - Disk space on analytics1070 is CRITICAL: DISK CRITICAL - free space: / 2119 MB (3% inode=95%): /tmp 2119 MB (3% inode=95%): /var/tmp 2119 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=analytics1070&var-datasource=eqiad+prometheus/ops [20:23:41] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [20:35:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:37:05] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:58:38] 06SRE: Image not rendering properly on most projects - https://phabricator.wikimedia.org/T392435 (10Quiddity) 03NEW [21:03:41] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:04:11] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 114, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:04:33] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 208, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:08:56] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [21:16:07] 06SRE: Image not rendering properly on most projects - https://phabricator.wikimedia.org/T392435#10759088 (10Alien333) Can you still reproduce? Right now at all the wikis mentioned, the image is working fine for me. [21:25:21] PROBLEM - Disk space on an-worker1165 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/c 111420 MB (2% inode=99%): /var/lib/hadoop/data/i 233546 MB (6% inode=99%): /var/lib/hadoop/data/e 245200 MB (6% inode=99%): /var/lib/hadoop/data/j 243820 MB (6% inode=99%): /var/lib/hadoop/data/l 143284 MB (3% inode=99%): /var/lib/hadoop/data/h 244064 MB (6% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://gra [21:25:21] imedia.org/d/000000377/host-overview?var-server=an-worker1165&var-datasource=eqiad+prometheus/ops [21:30:58] 06SRE: Image not rendering properly on most projects - https://phabricator.wikimedia.org/T392435#10759111 (10Quiddity) Agreed. I cannot reproduce anymore. (But they were all definitely broken (showing the 'broken image' icon) for a few hours). I'll leave this task unresolved in case it needs follow-up from devs. [21:36:29] FIRING: NodeTextfileStale: Stale textfile for elastic2098:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [21:42:15] 06SRE: Image not rendering properly on most projects - https://phabricator.wikimedia.org/T392435#10759116 (10Dylsss) Still broken for me: ` curl 'https://upload.wikimedia.org/wikipedia/commons/thumb/0/05/Prof.Anna-Teresa_Tymieniecka.JPG/250px-Prof.Anna-Teresa_Tymieniecka.JPG' -H 'User-Agent: Mozilla/5.0 (Wind... [21:56:55] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:57:47] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:58:59] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:00:53] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 08 Jun 2025 10:16:06 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:03:37] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53799 bytes in 0.075 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:05:49] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 4.492 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:08:41] FIRING: [4x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:35:26] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:22:15] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:23:11] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:34:36] (03CR) 10Krinkle: [C:03+1] Remove temporary '-php8' and '-k8s' suffixes from ArcLamp pipeline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118208 (https://phabricator.wikimedia.org/T391516) (owner: 10Ori) [23:35:03] (03CR) 10Krinkle: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1118209 (https://phabricator.wikimedia.org/T391516) (owner: 10Ori) [23:35:48] FIRING: PuppetDisabled: Puppet disabled on elastic2098:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=elasticsearch&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [23:40:19] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1138123 [23:40:19] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1138123 (owner: 10TrainBranchBot) [23:40:26] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:53:38] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1138123 (owner: 10TrainBranchBot) [23:53:41] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:56:41] !log running cleanupBlocks.php on all wikis T389301 [23:56:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:56:45] T389301: Clean up duplicate block_target rows in production - https://phabricator.wikimedia.org/T389301