[00:06:14] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 62, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:06:54] RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:18:12] 10SRE-tools, 10Infrastructure-Foundations: decommission cookbook: add support for decom spreadsheet - https://phabricator.wikimedia.org/T244315 (10Pppery) Patch was merged. Can this be closed as resolved?
[00:19:16] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Traffic, and 2 others: Deprecate `base::service_unit` in puppet - https://phabricator.wikimedia.org/T194724 (10Pppery)
[00:19:36] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10User-Joe: puppetmaster hostcert and hostprivkey point to nonexistent files - https://phabricator.wikimedia.org/T179099 (10Pppery)
[00:19:55] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Continuous-Integration-Config: Add shell scripts CI validations - https://phabricator.wikimedia.org/T148494 (10Pppery)
[00:20:12] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-Needs-Improvement: Authoritative ports list - https://phabricator.wikimedia.org/T277146 (10Pppery)
[00:20:34] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO, 10Patch-Needs-Improvement: Kryo memcached transcoder broken in CAS 6.3/6.4 - https://phabricator.wikimedia.org/T273867 (10Pppery)
[00:20:50] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Performance Issue: Investigate mysterious_sysctl settings and figure out what to do with them - https://phabricator.wikimedia.org/T118812 (10Pppery)
[00:21:01] 10Puppet, 10Infrastructure-Foundations: Bashisms in various /bin/sh scripts - https://phabricator.wikimedia.org/T95064 (10Pppery)
[00:21:26] 10Puppet, 10Infrastructure-Foundations, 10Technical-Debt: "Setting templatedir is deprecated" warning issued on self-hosted puppetmaster - https://phabricator.wikimedia.org/T95158 (10Pppery)
[00:21:58] 10SRE, 10Traffic-Icebox: ATS ts_lua coredumps on config reload - https://phabricator.wikimedia.org/T248938 (10Pppery)
[00:22:23] 10SRE, 10Traffic-Icebox, 10Patch-Needs-Improvement, 10Performance-Team (Radar): Improve ATS backend connection reuse against origin servers - https://phabricator.wikimedia.org/T241145 (10Pppery)
[00:22:49] 10SRE, 10Traffic-Icebox: VCL: handling of uncacheable responses in wikimedia-common - https://phabricator.wikimedia.org/T180712 (10Pppery)
[00:23:06] 10SRE, 10Traffic-Icebox: Unconditional return(deliver) in vcl_hit - https://phabricator.wikimedia.org/T192368 (10Pppery)
[00:23:26] 10SRE, 10DBA, 10Traffic-Icebox: Framework to transfer files over the LAN - https://phabricator.wikimedia.org/T156462 (10Pppery)
[00:23:38] 10SRE, 10PyBal, 10Traffic-Icebox: Fully-redundant LVS clusters using Pybal per-service MED feature - https://phabricator.wikimedia.org/T165764 (10Pppery)
[00:23:40] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder)
[00:23:46] 10SRE, 10Traffic-Icebox, 10Performance-Team (Radar): Refactor pybal/LVS config for shared failover - https://phabricator.wikimedia.org/T165765 (10Pppery)
[00:24:17] 10SRE, 10CX-cxserver, 10Citoid, 10RESTBase, and 2 others: Decom legacy ex-parsoidcache cxserver, citoid, and restbase service hostnames - https://phabricator.wikimedia.org/T133001 (10Pppery)
[00:24:23] 10SRE, 10PyBal, 10Traffic-Icebox: Pybal IdleConnectionMonitor with TCP KeepAlive shows random fails if more than 100 servers are involved. - https://phabricator.wikimedia.org/T119372 (10Pppery)
[00:24:36] 10SRE, 10Traffic-Icebox: Improve Varnish XFF processing for trusted proxies - https://phabricator.wikimedia.org/T120121 (10Pppery)
[00:25:40] 10SRE, 10RESTBase-API, 10Traffic-Icebox: [feature request] Redirect root API path to docs page - https://phabricator.wikimedia.org/T125226 (10Pppery)
[00:25:55] 10SRE, 10Traffic-Icebox, 10observability: prometheus -> grafana stats for per-numa-node meminfo - https://phabricator.wikimedia.org/T175636 (10Pppery)
[00:32:30] (Traffic bill over quota) firing: Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota got better - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[00:52:30] (Traffic bill over quota) resolved: Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota got better - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[01:09:32] 10SRE, 10Patch-Needs-Improvement, 10Release-Engineering-Team (Radar): Requesting exec access to pods in 'ci' namespace staging kubernetes - https://phabricator.wikimedia.org/T290360 (10Pppery)
[01:11:47] 10SRE, 10SRE-swift-storage: 'swift' user/group IDs should be consistent across the fleet - https://phabricator.wikimedia.org/T123918 (10Pppery)
[01:15:27] 10SRE-swift-storage, 10MW-on-K8s, 10Shellbox, 10serviceops, 10MW-1.38-notes (1.38.0-wmf.21; 2022-02-07): Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 (10Pppery)
[01:24:19] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184 (10Pppery)
[01:24:43] 10SRE-tools, 10DNS, 10Infrastructure-Foundations, 10Traffic, 10Patch-Needs-Improvement: DNS repo: add Jenkins job to ensure there are no duplicates - https://phabricator.wikimedia.org/T155761 (10Pppery)
[01:38:34] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder)
[01:56:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:01:18] RECOVERY - Check systemd state on gitlab1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:01:30] RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:08:34] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder)
[02:21:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:23:22] PROBLEM - puppet last run on gitlab1004 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[02:23:46] PROBLEM - puppet last run on gitlab1003 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[02:33:34] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder)
[02:46:40] RECOVERY - puppet last run on gitlab1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[02:47:02] RECOVERY - puppet last run on gitlab1003 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[03:23:34] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder)
[03:38:34] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder)
[04:03:46] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:04:38] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:08:06] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8646 bytes in 0.217 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:08:33] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder)
[04:09:00] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49851 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:43:12] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 127 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230402T0700)
[08:28:35] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder)
[08:45:34] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 128 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[10:23:34] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder)
[10:28:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:33:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:41:04] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:42:50] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:49:06] (CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 252.4k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[10:59:06] (CirrusSearchJobQueueBacklogTooBig) resolved: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 205.8k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[11:07:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST events) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:11:52] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:12:08] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:12:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST events) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:15:32] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:17:24] RECOVERY - BFD status on cr2-eqsin is OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:17:38] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[11:21:18] RECOVERY - BFD status on cr4-ulsfo is OK: UP: 14 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:21:18] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[11:23:34] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder)
[12:03:20] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:03:32] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:05:12] RECOVERY - BFD status on cr2-eqsin is OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:05:24] RECOVERY - BFD status on cr4-ulsfo is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:18:02] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:18:16] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:19:50] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:20:04] RECOVERY - BFD status on cr4-ulsfo is OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:23:34] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder)
[12:33:26] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:35:16] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:46:56] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:48:46] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:19:24] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:21:14] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:22:18] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:23:34] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder)
[13:24:08] RECOVERY - BFD status on cr2-eqsin is OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:46:18] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 89, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:46:36] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:08:35] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder)
[14:09:46] Currently getting "Original error: upstream connect error or disconnect/reset before headers. reset reason: overflow" on enwp
[14:10:18] (ProbeDown) firing: (3) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:10:31] (eqiad)
[14:10:40] perryprog: looks like you’re not the only one
[14:10:45] It just paged
[14:10:53] * Emperor here
[14:10:54] always fun to beat the page :)
[14:11:03] seems already fixed for me
[14:11:06] same here
[14:11:09] I can get enwp fine though
[14:11:19] good job all, I'll take all the credit
[14:13:59] does look to have been a brief spike
[14:14:50] Is it over? Unfortunately I'm in library with phone only
[14:15:18] (ProbeDown) resolved: (4) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:16:14] (2slow)
[14:16:19] Amir1: I think so
[14:17:53] Cool
[14:18:05] http 50X errors seem to be back to normal and probes recovered/resolved
[14:24:47] getting weirdness again?
[14:25:17] yeah it's again. #page
[14:27:18] (ProbeDown) firing: (4) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:27:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (3) wdqs1004:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[14:28:05] Emperor:
[14:28:12] seems resolved again, same duration
[14:28:47] Emperor, jelto: could be flapping
[14:30:59] perryprog: there’s definately a 2nd drop on the red dashboards
[14:31:04] * perryprog nods
[14:31:38] was curling it too. Didn't have headers being printed but at one point I was getting no body response.
[14:31:53] Something can’t be right
[14:31:56] then it went to the upstream connect error page
[14:32:18] (ProbeDown) resolved: (5) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:32:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (13) wdqs1004:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[14:46:22] It'll be short-lived if it happens again :)
[14:47:46] but think about all the ad revenue we'll lose
[14:57:58] (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: (12) wdqs1004:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[15:23:34] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder)
[16:15:08] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:16:58] PROBLEM - Check unit status of httpbb_kubernetes_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:11:02] RECOVERY - Check unit status of httpbb_kubernetes_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:11:36] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:28:34] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder)
[18:23:34] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder)
[19:23:34] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder)
[19:28:04] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:34:45] (JobUnavailable) firing: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:39:45] (JobUnavailable) resolved: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:08:34] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder)
[20:23:34] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder)
[20:56:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:56:30] RECOVERY - Check systemd state on mwmaint2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:01:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:28:34] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder)
[22:23:33] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder)
[22:43:34] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder)
[22:58:34] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder)