[00:00:33] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:00:42] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1050680 (owner: TrainBranchBot) [00:00:48] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [00:01:54] SRE, LDAP-Access-Requests: Grant Access to analytics-privatedata-users for - https://phabricator.wikimedia.org/T368566#9936498 (Dzahn) [00:02:56] SRE, SRE-Access-Requests: Grant Access to analytics-privatedata-users for - https://phabricator.wikimedia.org/T368566#9936509 (Dzahn) [00:03:08] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt for dbproxy1028,9 - jclark@cumin1002" [00:04:01] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt for dbproxy1028,9 - jclark@cumin1002" [00:04:01] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [00:04:15] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:04:43] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1040.eqiad.wmnet with OS bullseye [00:04:53] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q4:rack/setup/install dbproxy102[89] - https://phabricator.wikimedia.org/T365485#9936513 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002
for host cloudcephosd1040.eqiad.wmnet with OS bullseye [00:10:33] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:31:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:43:53] (PS1) Ncmonitor: DNSRepository: Automated MarkMonitor domain sync [dns] - https://gerrit.wikimedia.org/r/1050682 [00:46:09] PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [00:46:41] (Abandoned) BCornwall: DNSRepository: Automated MarkMonitor domain sync [dns] - https://gerrit.wikimedia.org/r/1050682 (owner: Ncmonitor) [00:48:02] (PS2) BCornwall: DNSRepository: Automated MarkMonitor domain sync [dns] - https://gerrit.wikimedia.org/r/1050484 (owner: Ncmonitor) [00:49:07] (CR) BCornwall: [V:+1] "Ib4bfa5dbae8a972a3d32e11fe5cfa8e6d8043fba is an example of running with https://gitlab.wikimedia.org/repos/sre/ncmonitor/-/merge_requests/" [dns] - https://gerrit.wikimedia.org/r/1050484 (owner: Ncmonitor) [01:04:15] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [01:10:09] RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args
org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [01:10:26] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [01:12:09] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [01:14:56] (PS1) Ncmonitor: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1050683 [01:14:59] (PS1) Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1050684 [01:15:20] (CR) CI reject: [V:-1] NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1050683 (owner: Ncmonitor) [01:24:59] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1040.eqiad.wmnet with OS bullseye [01:25:08] ops-eqiad, SRE, Data-Persistence, DC-Ops: Q4:rack/setup/install dbproxy102[89] - https://phabricator.wikimedia.org/T365485#9936597 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1040.eqiad.wmnet with OS bullseye executed with errors:...
[01:30:11] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 318.07 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [01:57:11] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:04:15] RESOLVED: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:09:15] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:14:15] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:20:33] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:24:23] (Abandoned) Jdlrobson: Enable action edit/submit and remaining special pages in dark mode (1.43.0-wmf.12) [mediawiki-config] - https://gerrit.wikimedia.org/r/1049975 (https://phabricator.wikimedia.org/T366524) (owner: Jdlrobson) [02:29:15] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable -
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:36:51] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-web_4450: Servers kubernetes1010.eqiad.wmnet, parse1011.eqiad.wmnet, mw1363.eqiad.wmnet, mw1433.eqiad.wmnet, mw1380.eqiad.wmnet, mw1462.eqiad.wmnet, kubernetes1025.eqiad.wmnet, mw1457.eqiad.wmnet, wikikube-worker1009.eqiad.wmnet, wikikube-worker1022.eqiad.wmnet, mw1475.eqiad.wmnet, kubernetes1062.eqiad.wmnet, kubernetes1022.eqiad.wmnet, kubernetes [02:36:51] ad.wmnet, mw1384.eqiad.wmnet, parse1013.eqiad.wmnet, mw1387.eqiad.wmnet, mw1470.eqiad.wmnet, mw1378.eqiad.wmnet, kubernetes1021.eqiad.wmnet, mw1390.eqiad.wmnet, mw1430.eqiad.wmnet, mw1459.eqiad.wmnet, mw1476.eqiad.wmnet, mw1480.eqiad.wmnet, mw1482.eqiad.wmnet, mw1484.eqiad.wmnet, kubernetes1016.eqiad.wmnet, mw1492.eqiad.wmnet, kubernetes1040.eqiad.wmnet, kubernetes1047.eqiad.wmnet, mw1486.eqiad.wmnet, mw1405.eqiad.wmnet, mw1435.eqiad.wmne [02:36:51] 4.eqiad.wmnet, parse1024.eqiad.wmnet, parse1010.eqiad.wmnet, mw1391.eqiad.wmnet, wikikube-worker1017.eqiad.wmnet, mw1389.eqiad.wmnet, mw1423.eqiad.wmnet, mw1494.eqiad.wmnet, wikikube-wo https://wikitech.wikimedia.org/wiki/PyBal [02:36:53] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-web_4450: Servers kubernetes1010.eqiad.wmnet, parse1011.eqiad.wmnet, parse1013.eqiad.wmnet, parse1014.eqiad.wmnet, parse1007.eqiad.wmnet, mw1367.eqiad.wmnet, mw1475.eqiad.wmnet, mw1374.eqiad.wmnet, mw1434.eqiad.wmnet, kubernetes1058.eqiad.wmnet, mw1439.eqiad.wmnet, parse1019.eqiad.wmnet, mw1432.eqiad.wmnet, wikikube-worker1029.eqiad.wmnet, mw1349. 
[02:36:53] net, mw1433.eqiad.wmnet, mw1479.eqiad.wmnet, kubernetes1023.eqiad.wmnet, mw1378.eqiad.wmnet, kubernetes1021.eqiad.wmnet, mw1430.eqiad.wmnet, mw1459.eqiad.wmnet, mw1388.eqiad.wmnet, mw1480.eqiad.wmnet, mw1482.eqiad.wmnet, parse1009.eqiad.wmnet, mw1449.eqiad.wmnet, mw1495.eqiad.wmnet, mw1399.eqiad.wmnet, mw1435.eqiad.wmnet, kubernetes1012.eqiad.wmnet, mw1488.eqiad.wmnet, parse1023.eqiad.wmnet, wikikube-worker1003.eqiad.wmnet, wikikube-worke [02:36:53] iad.wmnet, wikikube-worker1018.eqiad.wmnet, mw1357.eqiad.wmnet, kubernetes1026.eqiad.wmnet, parse1012.eqiad.wmnet, wikikube-worker1025.eqiad.wmnet, mw1496.eqiad.wmnet, kubernetes1060.eq https://wikitech.wikimedia.org/wiki/PyBal [02:36:58] FIRING: [14x] ProbeDown: Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:37:44] FIRING: VarnishUnavailable: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [02:37:44] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [02:38:51] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [02:38:57] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [02:39:15] FIRING: [18x] ProbeDown: Service appservers-https:443 has failed probes (http_appservers-https_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:39:15] FIRING: [2x] JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:40:33] FIRING: [18x] ProbeDown: Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:41:58] RESOLVED: [18x] ProbeDown: Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:42:44] RESOLVED: VarnishUnavailable: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [02:42:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [02:44:15] RESOLVED: [17x] ProbeDown: Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:59:15] FIRING: [2x] JobUnavailable: Reduced availability for job gnmi in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:06:25] RESOLVED: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:37:04] (Abandoned) BCornwall: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1050684 (owner: Ncmonitor) [03:37:12] (Abandoned) BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1050683 (owner: Ncmonitor) [04:05:33] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:05:53] PROBLEM - SSH on an-presto1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:09:15] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:10:43] RECOVERY - SSH on an-presto1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:19:15] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:24:15] FIRING: JobUnavailable:
Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:29:15] FIRING: [2x] ProbeDown: Service install1004:8080 has failed probes (http_squid_ip4) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:30:33] RESOLVED: [2x] ProbeDown: Service install1004:8080 has failed probes (http_squid_ip4) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:49:15] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:54:15] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:59:15] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:00:33] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable 
[05:04:15] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [05:19:15] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:25:33] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:31:25] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:14:15] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:15:33] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:24:15] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:25:33] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:39:15] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:40:33] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:44:15] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:44:55] ops-eqiad, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T368811 (phaultfinder) NEW [07:45:33] FIRING: JobUnavailable: Reduced availability
for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:59:15] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:00:33] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:35:04] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T368811#9936710 (phaultfinder) [09:04:15] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [09:31:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:44:15] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:45:33] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets -
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:55:33] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:59:15] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:09:15] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:10:33] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:32:46] FIRING: [4x] Traffic bill over quota: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota got worse - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [11:34:15] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:35:33] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:52:46] RESOLVED: [4x] Traffic bill over 
quota: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota got worse - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [12:09:15] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:10:33] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:00:33] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:04:15] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [13:05:33] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:31:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:35:33] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:39:15] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:15] FIRING: [2x] JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:43:03] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:46:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:49:15] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 416.20 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:59:15] FIRING: [2x] JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:59:16] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 50.20 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:09:55] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: 
POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 41891976 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:10:55] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 102664 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:15:33] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:19:15] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:43:03] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:46:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:59:15] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:00:33] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:04:15] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [17:06:27] SRE, SRE-OnFire, Incident Tooling, Sustainability (Incident Followup): Grant slightly broader access to Klaxon - https://phabricator.wikimedia.org/T343377#9937115 (Urbanecm) >>! In T343377#9931101, @MoritzMuehlenhoff wrote: > One thing that we could do is to > > - Write a script which parses th... [17:08:49] SRE, SRE-OnFire, Stewards-Onboarding-Tool, Incident Tooling, Sustainability (Incident Followup): Grant slightly broader access to Klaxon - https://phabricator.wikimedia.org/T343377#9937118 (Urbanecm) Putting on my radar :). [17:28:39] PROBLEM - Host mr1-magru.oob is DOWN: PING CRITICAL - Packet loss = 100% [17:35:09] PROBLEM - Host mr1-magru.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [18:03:05] (PS1) Urbanecm: stewards-onboarder: Add gitlab API to config [puppet] - https://gerrit.wikimedia.org/r/1050731 (https://phabricator.wikimedia.org/T368834) [18:04:34] (CR) Urbanecm: "Would someone help me with changing the patch so that the secret is looked up from private puppet (and with adding the secret itself to pr" [puppet] - https://gerrit.wikimedia.org/r/1050731 (https://phabricator.wikimedia.org/T368834) (owner: Urbanecm) [18:05:33] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:09:15] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable -
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:29:15] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:34:15] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:42:12] (CR) Dzahn: "Just keep this as a placeholder here, maybe add a comment "real secret in private repo" or so and set it to a fake string or nothing." [puppet] - https://gerrit.wikimedia.org/r/1050731 (https://phabricator.wikimedia.org/T368834) (owner: Urbanecm) [18:42:44] (CR) Dzahn: "I will take the secret from your home dir and make sure it gets in the right place." [puppet] - https://gerrit.wikimedia.org/r/1050731 (https://phabricator.wikimedia.org/T368834) (owner: Urbanecm) [18:46:18] (CR) Dzahn: "oh, wait, we need to move this file somewhere under hiera data though. hold on :)" [puppet] - https://gerrit.wikimedia.org/r/1050731 (https://phabricator.wikimedia.org/T368834) (owner: Urbanecm) [18:59:35] (CR) Dzahn: "see https://phabricator.wikimedia.org/T368834#9937194 you can now do a Hiera lookup to get the secret and have it in a variable." 
[puppet] - https://gerrit.wikimedia.org/r/1050731 (https://phabricator.wikimedia.org/T368834) (owner: Urbanecm) [19:02:05] RECOVERY - Host mr1-magru.oob is UP: PING OK - Packet loss = 0%, RTA = 133.95 ms [19:03:23] RECOVERY - Host mr1-magru.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 124.26 ms [19:07:16] (PS2) Urbanecm: stewards-onboarder: Add gitlab API to config [puppet] - https://gerrit.wikimedia.org/r/1050731 (https://phabricator.wikimedia.org/T368834) [19:07:56] (PS3) Urbanecm: stewards-onboarder: Add gitlab API to config [puppet] - https://gerrit.wikimedia.org/r/1050731 (https://phabricator.wikimedia.org/T368834) [19:08:01] (CR) Urbanecm: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1050731 (https://phabricator.wikimedia.org/T368834) (owner: Urbanecm) [19:11:57] PROBLEM - SSH on bast4005 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:12:57] RECOVERY - SSH on bast4005 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:13:06] (PS4) Urbanecm: stewards-onboarder: Add gitlab API to config [puppet] - https://gerrit.wikimedia.org/r/1050731 (https://phabricator.wikimedia.org/T368834) [19:13:10] (CR) Urbanecm: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1050731 (https://phabricator.wikimedia.org/T368834) (owner: Urbanecm) [19:15:23] (PS5) Urbanecm: stewards-onboarder: Add gitlab API to config [puppet] - https://gerrit.wikimedia.org/r/1050731 (https://phabricator.wikimedia.org/T368834) [19:15:30] (CR) Urbanecm: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1050731 (https://phabricator.wikimedia.org/T368834) (owner: Urbanecm) [19:18:19] (CR) Urbanecm: "Thanks for the quick response! Looks like I did it? At least output from PCC appears to be reasonable 😊." 
[puppet] - https://gerrit.wikimedia.org/r/1050731 (https://phabricator.wikimedia.org/T368834) (owner: Urbanecm) [19:46:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:59:15] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:04:15] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:14:15] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:15:33] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:30:33] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:39:15] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:44:15] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:49:15] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:55:33] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:59:15] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:04:15] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [21:26:37] PROBLEM - MariaDB Replica Lag: s4 on clouddb1015 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 363.17 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:33:37] RECOVERY - MariaDB Replica Lag: s4 on clouddb1015 is OK: OK slave_sql_lag Replication lag: 3.18 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:44:37] PROBLEM - MariaDB Replica Lag: s4 on clouddb1015 is CRITICAL: 
CRITICAL slave_sql_lag Replication lag: 368.98 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:50:33] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:54:15] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:56:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:58:35] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:00:33] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:14:15] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:19:37] RECOVERY - MariaDB Replica Lag: s4 on clouddb1015 is OK: OK slave_sql_lag Replication lag: 0.00 seconds 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:56:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:58:35] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:35:11] FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:38:14] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1050754 [23:38:14] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1050754 (owner: TrainBranchBot) [23:40:07] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:40:11] RESOLVED: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:40:21] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:40:41] PROBLEM - mailman archives 
on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:41:57] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 13 Aug 2024 12:55:14 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:42:31] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52196 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:43:13] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.175 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:49:15] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:50:33] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable