[00:06:14] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 62, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:06:54] RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:18:12] 10SRE-tools, 10Infrastructure-Foundations: decommission cookbook: add support for decom spreadsheet - https://phabricator.wikimedia.org/T244315 (10Pppery) Patch was merged. Can this be closed as resolved?
[00:19:16] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Traffic, and 2 others: Deprecate `base::service_unit` in puppet - https://phabricator.wikimedia.org/T194724 (10Pppery)
[00:19:36] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10User-Joe: puppetmaster hostcert and hostprivkey point to nonexistent files - https://phabricator.wikimedia.org/T179099 (10Pppery)
[00:19:55] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Continuous-Integration-Config: Add shell scripts CI validations - https://phabricator.wikimedia.org/T148494 (10Pppery)
[00:20:12] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-Needs-Improvement: Authoritative ports list - https://phabricator.wikimedia.org/T277146 (10Pppery)
[00:20:34] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO, 10Patch-Needs-Improvement: Kryo memcached transcoder broken in CAS 6.3/6.4 - https://phabricator.wikimedia.org/T273867 (10Pppery)
[00:20:50] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Performance Issue: Investigate mysterious_sysctl settings and figure out what to do with them - https://phabricator.wikimedia.org/T118812 (10Pppery)
[00:21:01] 10Puppet, 10Infrastructure-Foundations: Bashisms in various /bin/sh scripts - https://phabricator.wikimedia.org/T95064 (10Pppery)
[00:21:26] 10Puppet, 10Infrastructure-Foundations, 10Technical-Debt: "Setting templatedir is deprecated" warning issued on self-hosted puppetmaster - https://phabricator.wikimedia.org/T95158 (10Pppery)
[00:21:58] 10SRE, 10Traffic-Icebox: ATS ts_lua coredumps on config reload - https://phabricator.wikimedia.org/T248938 (10Pppery)
[00:22:23] 10SRE, 10Traffic-Icebox, 10Patch-Needs-Improvement, 10Performance-Team (Radar): Improve ATS backend connection reuse against origin servers - https://phabricator.wikimedia.org/T241145 (10Pppery)
[00:22:49] 10SRE, 10Traffic-Icebox: VCL: handling of uncacheable responses in wikimedia-common - https://phabricator.wikimedia.org/T180712 (10Pppery)
[00:23:06] 10SRE, 10Traffic-Icebox: Unconditional return(deliver) in vcl_hit - https://phabricator.wikimedia.org/T192368 (10Pppery)
[00:23:26] 10SRE, 10DBA, 10Traffic-Icebox: Framework to transfer files over the LAN - https://phabricator.wikimedia.org/T156462 (10Pppery)
[00:23:38] 10SRE, 10PyBal, 10Traffic-Icebox: Fully-redundant LVS clusters using Pybal per-service MED feature - https://phabricator.wikimedia.org/T165764 (10Pppery)
[00:23:40] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder)
[00:23:46] 10SRE, 10Traffic-Icebox, 10Performance-Team (Radar): Refactor pybal/LVS config for shared failover - https://phabricator.wikimedia.org/T165765 (10Pppery)
[00:24:17] 10SRE, 10CX-cxserver, 10Citoid, 10RESTBase, and 2 others: Decom legacy ex-parsoidcache cxserver, citoid, and restbase service hostnames - https://phabricator.wikimedia.org/T133001 (10Pppery)
[00:24:23] 10SRE, 10PyBal, 10Traffic-Icebox: Pybal IdleConnectionMonitor with TCP KeepAlive shows random fails if more than 100 servers are involved. - https://phabricator.wikimedia.org/T119372 (10Pppery)
[00:24:36] 10SRE, 10Traffic-Icebox: Improve Varnish XFF processing for trusted proxies - https://phabricator.wikimedia.org/T120121 (10Pppery)
[00:25:40] 10SRE, 10RESTBase-API, 10Traffic-Icebox: [feature request] Redirect root API path to docs page - https://phabricator.wikimedia.org/T125226 (10Pppery)
[00:25:55] 10SRE, 10Traffic-Icebox, 10observability: prometheus -> grafana stats for per-numa-node meminfo - https://phabricator.wikimedia.org/T175636 (10Pppery)
[00:32:30] (Traffic bill over quota) firing: Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota got better - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[00:52:30] (Traffic bill over quota) resolved: Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota got better - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[01:09:32] 10SRE, 10Patch-Needs-Improvement, 10Release-Engineering-Team (Radar): Requesting exec access to pods in 'ci' namespace staging kubernetes - https://phabricator.wikimedia.org/T290360 (10Pppery)
[01:11:47] 10SRE, 10SRE-swift-storage: 'swift' user/group IDs should be consistent across the fleet - https://phabricator.wikimedia.org/T123918 (10Pppery)
[01:15:27] 10SRE-swift-storage, 10MW-on-K8s, 10Shellbox, 10serviceops, 10MW-1.38-notes (1.38.0-wmf.21; 2022-02-07): Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 (10Pppery)
[01:24:19] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184 (10Pppery)
[01:24:43] 10SRE-tools, 10DNS, 10Infrastructure-Foundations, 10Traffic, 10Patch-Needs-Improvement: DNS repo: add Jenkins job to ensure there are no duplicates - https://phabricator.wikimedia.org/T155761 (10Pppery)
[01:38:34] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder)
[01:56:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:01:18] RECOVERY - Check systemd state on gitlab1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:01:30] RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:08:34] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder)
[02:21:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:23:22] PROBLEM - puppet last run on gitlab1004 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[02:23:46] PROBLEM - puppet last run on gitlab1003 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[02:33:34] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder)
[02:46:40] RECOVERY - puppet last run on gitlab1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[02:47:02] RECOVERY - puppet last run on gitlab1003 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[03:23:34] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder)
[03:38:34] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder)
[04:03:46] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:04:38] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:08:06] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8646 bytes in 0.217 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:08:33] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder)
[04:09:00] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49851 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:43:12] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 127 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230402T0700)
[08:28:35] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder)
[08:45:34] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 128 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[10:23:34] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder)
[10:28:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:33:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:41:04] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:42:50] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:49:06] (CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 252.4k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[10:59:06] (CirrusSearchJobQueueBacklogTooBig) resolved: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 205.8k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[11:07:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST events) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:11:52] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:12:08] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:12:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST events) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:15:32] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:17:24] RECOVERY - BFD status on cr2-eqsin is OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:17:38] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[11:21:18] RECOVERY - BFD status on cr4-ulsfo is OK: UP: 14 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:21:18] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[11:23:34] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder)
[12:03:20] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:03:32] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:05:12] RECOVERY - BFD status on cr2-eqsin is OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:05:24] RECOVERY - BFD status on cr4-ulsfo is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:18:02] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:18:16] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:19:50] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:20:04] RECOVERY - BFD status on cr4-ulsfo is OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:23:34] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder)
[12:33:26] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:35:16] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:46:56] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:48:46] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:19:24] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:21:14] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:22:18] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:23:34] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder)
[13:24:08] RECOVERY - BFD status on cr2-eqsin is OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:46:18] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 89, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:46:36] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:08:35] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder)
[14:09:46] Currently getting "Original error: upstream connect error or disconnect/reset before headers. reset reason: overflow" on enwp
[14:10:18] (ProbeDown) firing: (3) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:10:31] (eqiad)
[14:10:40] perryprog: looks like you’re not the only one
[14:10:45] It just paged
[14:10:53] * Emperor here
[14:10:54] always fun to beat the page :)
[14:11:03] seems already fixed for me
[14:11:06] same here
[14:11:09] I can get enwp fine though
[14:11:19] good job all, I'll take all the credit
[14:13:59] does look to have been a brief spike
[14:14:50] Is it over? Unfortunately I'm in library with phone only
[14:15:18] (ProbeDown) resolved: (4) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:16:14] (2slow)
[14:16:19] Amir1: I think so
[14:17:53] Cool
[14:18:05] http 50X errors seem to be back to normal and probes recovered/resolved
[14:24:47] getting weirdness again?
[14:25:17] yeah it's again. #page
[14:27:18] (ProbeDown) firing: (4) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:27:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (3) wdqs1004:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[14:28:05] Emperor:
[14:28:12] seems resolved again, same duration
[14:28:47] Emperor, jelto: could be flapping
[14:30:59] perryprog: there’s definately a 2nd drop on the red dashboards
[14:31:04] * perryprog nods
[14:31:38] was curling it too. Didn't have headers being printed but at one point I was getting no body response.
[14:31:53] Something can’t be right
[14:31:56] then it went to the upstream connect error page
[14:32:18] (ProbeDown) resolved: (5) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:32:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (13) wdqs1004:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[14:46:22] It'll be short-lived if it happens again :)
[14:47:46] but think about all the ad revenue we'll lose
[14:57:58] (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: (12) wdqs1004:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[15:23:34] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder)
[16:15:08] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:16:58] PROBLEM - Check unit status of httpbb_kubernetes_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:11:02] RECOVERY - Check unit status of httpbb_kubernetes_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:11:36] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:28:34] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder)
[18:23:34] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder)
[19:23:34] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder)
[19:28:04] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:34:45] (JobUnavailable) firing: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:39:45] (JobUnavailable) resolved: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:08:34] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder)
[20:23:34] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder)
[20:56:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:56:30] RECOVERY - Check systemd state on mwmaint2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:01:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:28:34] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder)
[22:23:33] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder)
[22:43:34] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder)
[22:58:34] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder)