[00:06:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T342617)', diff saved to https://phabricator.wikimedia.org/P50545 and previous config saved to /var/cache/conftool/dbconfig/20230812-000602-ladsgroup.json [00:06:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2174.codfw.wmnet with reason: Maintenance [00:06:06] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [00:06:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2174.codfw.wmnet with reason: Maintenance [00:06:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2174 (T342617)', diff saved to https://phabricator.wikimedia.org/P50546 and previous config saved to /var/cache/conftool/dbconfig/20230812-000623-ladsgroup.json [00:19:21] 10SRE, 10Maps: Allow Wikimedia Maps usage on - https://phabricator.wikimedia.org/T344082 (10Gigermano) [00:22:13] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:23:33] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:24:43] 10SRE, 10Maps: Allow Wikimedia Maps usage on - https://phabricator.wikimedia.org/T344083 (10Gigermano) [00:25:47] (03CR) 10Ssingh: "LGTM! Verified the Netbox includes. Leaving the IPv6 PTR records verification to the CI but otherwise checked them :)" [dns] - 10https://gerrit.wikimedia.org/r/947830 (https://phabricator.wikimedia.org/T343214) (owner: 10Cathal Mooney) [00:25:50] (03CR) 10Ssingh: [C: 03+1] Reverse DNS includes for new ranges assigned to esams [dns] - 10https://gerrit.wikimedia.org/r/947830 (https://phabricator.wikimedia.org/T343214) (owner: 10Cathal Mooney) [00:26:05] 10SRE, 10Maps: Allow Wikimedia Maps usage on - https://phabricator.wikimedia.org/T344084 (10Gigermano) [00:26:34] 10SRE, 10Maps: Allow Wikimedia Maps usage on - https://phabricator.wikimedia.org/T344084 (10Gigermano) https://wikitech.wikimedia.org/wiki/Maps/External_usage [00:30:11] PROBLEM - Work requests waiting in Zuul Gearman server on contint2002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [00:32:30] (03PS1) 10Ssingh: learn.wiki: update A records [dns] - 10https://gerrit.wikimedia.org/r/948195 (https://phabricator.wikimedia.org/T344073) [00:38:40] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/947398 [00:38:46] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/947398 (owner: 10TrainBranchBot) [00:39:17] RECOVERY - Work requests waiting in Zuul Gearman server on contint2002 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [00:46:43] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:53:10] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/947398 (owner: 10TrainBranchBot) [01:13:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T342617)', diff saved to https://phabricator.wikimedia.org/P50547 and previous config saved to /var/cache/conftool/dbconfig/20230812-011330-ladsgroup.json [01:13:34] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [01:23:59] (03PS1) 10Mdaniels5757: add 'autopatrol' to Wikifunctions' functioneer group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948196 (https://phabricator.wikimedia.org/T344085) [01:25:39] (03PS3) 10Mdaniels5757: add autopatrolled group with autopatrol right for Wikifunctions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947495 (https://phabricator.wikimedia.org/T343946) [01:26:44] (03CR) 10Mdaniels5757: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948196 (https://phabricator.wikimedia.org/T344085) (owner: 10Mdaniels5757) [01:28:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P50548 and previous config saved to /var/cache/conftool/dbconfig/20230812-012836-ladsgroup.json [01:35:05] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Sat 19 Aug 2023 04:23:22 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:36:37] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:37:00] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [01:43:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P50549 and previous config saved to /var/cache/conftool/dbconfig/20230812-014342-ladsgroup.json [01:49:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T342617)', diff saved to https://phabricator.wikimedia.org/P50550 and previous config saved to /var/cache/conftool/dbconfig/20230812-014901-ladsgroup.json [01:49:05] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [01:58:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T342617)', diff saved to https://phabricator.wikimedia.org/P50551 and previous config saved to /var/cache/conftool/dbconfig/20230812-015849-ladsgroup.json [01:58:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1207.eqiad.wmnet with reason: Maintenance [01:58:53] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [01:59:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1207.eqiad.wmnet with reason: Maintenance [01:59:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1207 (T342617)', diff saved to https://phabricator.wikimedia.org/P50552 and previous config saved to /var/cache/conftool/dbconfig/20230812-015910-ladsgroup.json [02:04:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P50553 and previous config saved to /var/cache/conftool/dbconfig/20230812-020407-ladsgroup.json [02:11:35] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:14:55] PROBLEM - Check if pdns-recursor.service has been restarted after /etc/powerdns/recursor.conf was changed on doh6002 is CRITICAL: CRITICAL: Service pdns-recursor.service has not been restarted after /etc/powerdns/recursor.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Service_Restart_Check [02:19:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P50554 and previous config saved to /var/cache/conftool/dbconfig/20230812-021913-ladsgroup.json [02:25:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [02:30:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [02:30:35] PROBLEM - Check if pdns-recursor.service has been restarted after /etc/powerdns/recursor.conf was changed on doh4002 is CRITICAL: CRITICAL: Service pdns-recursor.service has not been restarted after /etc/powerdns/recursor.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Service_Restart_Check [02:31:35] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:34:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T342617)', diff saved to https://phabricator.wikimedia.org/P50555 and previous config saved to /var/cache/conftool/dbconfig/20230812-023419-ladsgroup.json [02:34:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2176.codfw.wmnet with reason: Maintenance [02:34:24] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [02:34:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2176.codfw.wmnet with reason: Maintenance [02:34:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2176 (T342617)', diff saved to https://phabricator.wikimedia.org/P50556 and previous config saved to /var/cache/conftool/dbconfig/20230812-023441-ladsgroup.json [02:47:57] PROBLEM - Check if pdns-recursor.service has been restarted after /etc/powerdns/recursor.conf was changed on doh3002 is CRITICAL: CRITICAL: Service pdns-recursor.service has not been restarted after /etc/powerdns/recursor.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Service_Restart_Check [03:02:59] PROBLEM - Check if pdns-recursor.service has been restarted after /etc/powerdns/recursor.conf was changed on doh5001 is CRITICAL: CRITICAL: Service pdns-recursor.service has not been restarted after /etc/powerdns/recursor.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Service_Restart_Check [03:17:39] PROBLEM - Check if pdns-recursor.service has been restarted after /etc/powerdns/recursor.conf was changed on doh3001 is CRITICAL: CRITICAL: Service pdns-recursor.service has not been restarted after /etc/powerdns/recursor.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Service_Restart_Check [03:36:09] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Sat 19 Aug 2023 04:23:22 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:37:39] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:52:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T342617)', diff saved to https://phabricator.wikimedia.org/P50557 and previous config saved to /var/cache/conftool/dbconfig/20230812-035205-ladsgroup.json [03:52:09] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [04:07:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P50558 and previous config saved to /var/cache/conftool/dbconfig/20230812-040711-ladsgroup.json [04:16:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T342617)', diff saved to https://phabricator.wikimedia.org/P50559 and previous config saved to /var/cache/conftool/dbconfig/20230812-041608-ladsgroup.json [04:16:13] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [04:22:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P50560 and previous config saved to /var/cache/conftool/dbconfig/20230812-042217-ladsgroup.json [04:31:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P50561 and previous config saved to /var/cache/conftool/dbconfig/20230812-043115-ladsgroup.json [04:37:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T342617)', diff saved to https://phabricator.wikimedia.org/P50562 and previous config saved to /var/cache/conftool/dbconfig/20230812-043724-ladsgroup.json [04:37:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1218.eqiad.wmnet with reason: Maintenance [04:37:28] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [04:37:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1218.eqiad.wmnet with reason: Maintenance [04:46:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P50563 and previous config saved to /var/cache/conftool/dbconfig/20230812-044621-ladsgroup.json [05:01:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T342617)', diff saved to https://phabricator.wikimedia.org/P50564 and previous config saved to /var/cache/conftool/dbconfig/20230812-050127-ladsgroup.json [05:01:42] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [05:19:30] (03CR) 10Anzx: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948196 (https://phabricator.wikimedia.org/T344085) (owner: 10Mdaniels5757) [05:37:00] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [05:53:53] (03CR) 10Anzx: add 'autopatrol' to Wikifunctions' functioneer group (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948196 (https://phabricator.wikimedia.org/T344085) (owner: 10Mdaniels5757) [05:56:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1219.eqiad.wmnet with reason: Maintenance [05:56:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1219.eqiad.wmnet with reason: Maintenance [05:56:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1219 (T342617)', diff saved to https://phabricator.wikimedia.org/P50565 and previous config saved to /var/cache/conftool/dbconfig/20230812-055651-ladsgroup.json [05:56:55] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [05:59:16] (03CR) 10Anzx: "adding Jdforrester as reviewer as per https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/946541/comments/38907160_7c178d16" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948196 (https://phabricator.wikimedia.org/T344085) (owner: 10Mdaniels5757) [06:09:03] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [06:09:09] (03CR) 10Anzx: "adding Jdforrester as reviewer as per https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/946541/comments/38907160_7c178d16" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947495 (https://phabricator.wikimedia.org/T343946) (owner: 10Mdaniels5757) [06:14:37] RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms [06:33:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:38:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:50:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:55:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230812T0700) [07:39:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T342617)', diff saved to https://phabricator.wikimedia.org/P50566 and previous config saved to /var/cache/conftool/dbconfig/20230812-073953-ladsgroup.json [07:39:57] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [07:54:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P50567 and previous config saved to /var/cache/conftool/dbconfig/20230812-075459-ladsgroup.json [08:10:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P50568 and previous config saved to /var/cache/conftool/dbconfig/20230812-081005-ladsgroup.json [08:11:00] (03CR) 10Sohom Datta: "Will schedule this for Monday afternoon (in all probability)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947883 (owner: 10Sohom Datta) [08:11:05] (03PS2) 10Sohom Datta: Enable EditInSequence on all wikisources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947883 [08:15:05] (03PS3) 10Sohom Datta: Enable EditInSequence on all wikisources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947883 [08:15:19] (03PS4) 10Sohom Datta: Enable EditInSequence on all wikisources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947883 [08:21:12] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Volunteer NDA for RhinosF1 - https://phabricator.wikimedia.org/T341272 (10RhinosF1) @KFrancis: signed [08:25:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T342617)', diff saved to https://phabricator.wikimedia.org/P50569 and previous config saved to /var/cache/conftool/dbconfig/20230812-082511-ladsgroup.json [08:25:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [08:25:15] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [08:25:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [08:34:26] (03CR) 10RhinosF1: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947495 (https://phabricator.wikimedia.org/T343946) (owner: 10Mdaniels5757) [09:37:00] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [09:47:17] PROBLEM - Host maps2009 is DOWN: PING CRITICAL - Packet loss = 100% [09:50:13] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:10:45] 10SRE, 10Maps: Allow Wikimedia Maps usage on - https://phabricator.wikimedia.org/T344082 (10Aklapper) [11:10:51] 10SRE, 10Maps: Allow Wikimedia Maps usage on - https://phabricator.wikimedia.org/T344083 (10Aklapper) [11:10:53] 10SRE, 10Maps: Allow Wikimedia Maps usage on - https://phabricator.wikimedia.org/T344084 (10Aklapper) [11:11:31] 10SRE, 10Maps: Allow Wikimedia Maps usage on - https://phabricator.wikimedia.org/T344082 (10Aklapper) 05Open→03Invalid @Gigermano: You provided zero data, thus closing. Please do not create empty tickets but provide required information. [11:24:05] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:31:35] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:43:31] (03PS1) 10Amire80: Add lucaswerkmeister.de to Planet [puppet] - 10https://gerrit.wikimedia.org/r/948203 [11:44:01] RECOVERY - Check if pdns-recursor.service has been restarted after /etc/powerdns/recursor.conf was changed on doh4002 is OK: OK: pdns-recursor.service was restarted after /etc/powerdns/recursor.conf was changed. https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Service_Restart_Check [11:44:01] RECOVERY - Check if pdns-recursor.service has been restarted after /etc/powerdns/recursor.conf was changed on doh3001 is OK: OK: pdns-recursor.service was restarted after /etc/powerdns/recursor.conf was changed. https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Service_Restart_Check [11:44:01] RECOVERY - Check if pdns-recursor.service has been restarted after /etc/powerdns/recursor.conf was changed on doh3002 is OK: OK: pdns-recursor.service was restarted after /etc/powerdns/recursor.conf was changed. https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Service_Restart_Check [11:44:01] RECOVERY - Check if pdns-recursor.service has been restarted after /etc/powerdns/recursor.conf was changed on doh6002 is OK: OK: pdns-recursor.service was restarted after /etc/powerdns/recursor.conf was changed. https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Service_Restart_Check [11:44:03] RECOVERY - Check if pdns-recursor.service has been restarted after /etc/powerdns/recursor.conf was changed on doh5001 is OK: OK: pdns-recursor.service was restarted after /etc/powerdns/recursor.conf was changed. https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Service_Restart_Check [11:46:02] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344096 (10phaultfinder) [11:46:04] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [11:46:06] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344097 (10phaultfinder) [11:46:08] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344095 (10phaultfinder) [11:46:10] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344099 (10phaultfinder) [11:46:12] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344100 (10phaultfinder) [11:46:14] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344098 (10phaultfinder) [11:51:02] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [11:51:04] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [11:51:06] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [12:05:13] (03CR) 10Cathal Mooney: [C: 03+2] Reverse DNS includes for new ranges assigned to esams [dns] - 10https://gerrit.wikimedia.org/r/947830 (https://phabricator.wikimedia.org/T343214) (owner: 10Cathal Mooney) [12:06:05] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:09:21] (03PS1) 10Cathal Mooney: Include reverse entries for new esams LVS IPv6 VIPs [dns] - 10https://gerrit.wikimedia.org/r/948205 (https://phabricator.wikimedia.org/T343214) [12:10:13] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:10:13] (03CR) 10CI reject: [V: 04-1] Include reverse entries for new esams LVS IPv6 VIPs [dns] - 10https://gerrit.wikimedia.org/r/948205 (https://phabricator.wikimedia.org/T343214) (owner: 10Cathal Mooney) [12:11:01] (03PS2) 10Cathal Mooney: Include reverse entries for new esams LVS IPv6 VIPs [dns] - 10https://gerrit.wikimedia.org/r/948205 (https://phabricator.wikimedia.org/T343214) [12:11:35] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 1.270 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:11:54] (03CR) 10CI reject: [V: 04-1] Include reverse entries for new esams LVS IPv6 VIPs [dns] - 10https://gerrit.wikimedia.org/r/948205 (https://phabricator.wikimedia.org/T343214) (owner: 10Cathal Mooney) [12:12:03] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 7.965 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:23:12] (03PS3) 10Cathal Mooney: Include reverse entries for new esams LVS IPv6 VIPs [dns] - 10https://gerrit.wikimedia.org/r/948205 (https://phabricator.wikimedia.org/T343214) [12:24:18] (03CR) 10CI reject: [V: 04-1] Include reverse entries for new esams LVS IPv6 VIPs [dns] - 10https://gerrit.wikimedia.org/r/948205 (https://phabricator.wikimedia.org/T343214) (owner: 10Cathal Mooney) [12:37:21] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:38:41] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.284 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:49:49] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:50:59] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:51:09] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:55:41] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:03:23] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50420 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:08:05] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:09:29] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:10:49] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:13:45] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 7.417 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:14:07] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 8.378 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:18:21] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:18:39] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:34:35] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:37:00] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:42:43] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:44:05] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:45:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:50:03] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:54:35] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:35] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:13:11] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 9.210 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:13:25] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 2.733 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:16:35] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:19:05] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:23:41] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:24:01] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:24:21] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:33:13] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:34:39] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:38:03] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50420 bytes in 0.101 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:39:15] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.278 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:44:03] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:44:23] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:53:17] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 9.334 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:53:31] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 2.424 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:59:05] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:03:39] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:05:39] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:05:59] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:07:05] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:10:07] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:11:39] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.268 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:12:05] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 5.769 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:16:25] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:16:43] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:25:21] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:28:15] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:28:27] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 5.825 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:28:45] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 3.838 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:33:05] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:34:29] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 5.362 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:37:05] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:41:37] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:16] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [15:46:18] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [15:46:20] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [15:46:22] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [15:51:17] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [15:51:19] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [15:51:21] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [15:54:11] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:54:31] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:55:33] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:56:57] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:03:13] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.835 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:03:33] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 2.572 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:07:53] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:08:11] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:12:21] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 7.440 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:12:41] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 8.826 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:20:13] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:21:27] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:27:43] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 8.193 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:28:49] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.280 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:40:40] (03CR) 10Mdaniels5757: "Marking resolved (I think)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948196 (https://phabricator.wikimedia.org/T344085) (owner: 10Mdaniels5757) [17:03:07] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:07:43] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:37:00] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:01:03] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:14:47] (03PS1) 10Cathal Mooney: Remove dns300x from network device NTP servers [homer/public] - 10https://gerrit.wikimedia.org/r/948209 (https://phabricator.wikimedia.org/T329219) [18:15:56] (03CR) 10Cathal Mooney: [C: 03+2] Remove dns300x from network device NTP servers [homer/public] - 10https://gerrit.wikimedia.org/r/948209 (https://phabricator.wikimedia.org/T329219) (owner: 10Cathal Mooney) [18:16:29] (03Merged) 10jenkins-bot: Remove dns300x from network device NTP servers [homer/public] - 10https://gerrit.wikimedia.org/r/948209 (https://phabricator.wikimedia.org/T329219) (owner: 10Cathal Mooney) [18:17:35] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:42:18] (KubernetesAPILatency) firing: High Kubernetes API latency (POST nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:46:25] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:47:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:24:01] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:31:39] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:32:28] (03CR) 10Ssingh: "Good catch, thanks!" [homer/public] - 10https://gerrit.wikimedia.org/r/948209 (https://phabricator.wikimedia.org/T329219) (owner: 10Cathal Mooney) [19:35:55] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [19:50:03] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:51:02] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [19:51:04] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [19:51:06] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [19:51:08] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [19:54:39] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:56:03] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [19:56:05] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [19:56:07] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [20:00:21] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 6 day(s) (Sat 19 Aug 2023 04:23:22 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:01:55] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:02:46] (03PS1) 10Cathal Mooney: Add / update network definitions to include new esams ranges [puppet] - 10https://gerrit.wikimedia.org/r/948216 (https://phabricator.wikimedia.org/T343214) [20:05:19] (03CR) 10CI reject: [V: 04-1] Add / update network definitions to include new esams ranges [puppet] - 10https://gerrit.wikimedia.org/r/948216 (https://phabricator.wikimedia.org/T343214) (owner: 10Cathal Mooney) [20:09:31] (03PS2) 10Cathal Mooney: Add / update network definitions to include new esams ranges [puppet] - 10https://gerrit.wikimedia.org/r/948216 (https://phabricator.wikimedia.org/T343214) [20:12:36] (03PS3) 10Cathal Mooney: Add / update network definitions to include new esams ranges [puppet] - 10https://gerrit.wikimedia.org/r/948216 (https://phabricator.wikimedia.org/T343214) [20:16:03] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:20:39] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:36:01] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:43:43] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:57:51] RECOVERY - BFD status on cr2-eqiad is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:59:07] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:00:21] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:03:45] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:16:03] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:23:37] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:25:07] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:29:41] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:34:09] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:37:00] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [21:41:39] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:44:16] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T344110 (10phaultfinder) [22:42:01] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:58:39] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:13:47] PROBLEM - Work requests waiting in Zuul Gearman server on contint2002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [23:35:03] RECOVERY - Work requests waiting in Zuul Gearman server on contint2002 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [23:41:03] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:45:41] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:51:17] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [23:51:19] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [23:51:21] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [23:51:23] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [23:56:18] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [23:56:20] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder) [23:56:22] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder)