[00:06:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T342617)', diff saved to https://phabricator.wikimedia.org/P50545 and previous config saved to /var/cache/conftool/dbconfig/20230812-000602-ladsgroup.json
[00:06:04] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2174.codfw.wmnet with reason: Maintenance
[00:06:06] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[00:06:17] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2174.codfw.wmnet with reason: Maintenance
[00:06:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2174 (T342617)', diff saved to https://phabricator.wikimedia.org/P50546 and previous config saved to /var/cache/conftool/dbconfig/20230812-000623-ladsgroup.json
[00:19:21] <wikibugs>	 10SRE, 10Maps: Allow Wikimedia Maps usage on <domain> - https://phabricator.wikimedia.org/T344082 (10Gigermano)
[00:22:13] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:23:33] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:24:43] <wikibugs>	 10SRE, 10Maps: Allow Wikimedia Maps usage on <domain> - https://phabricator.wikimedia.org/T344083 (10Gigermano)
[00:25:47] <wikibugs>	 (03CR) 10Ssingh: "LGTM! Verified the Netbox includes. Leaving the IPv6 PTR records verification to the CI but otherwise checked them :)" [dns] - 10https://gerrit.wikimedia.org/r/947830 (https://phabricator.wikimedia.org/T343214) (owner: 10Cathal Mooney)
[00:25:50] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] Reverse DNS includes for new ranges assigned to esams [dns] - 10https://gerrit.wikimedia.org/r/947830 (https://phabricator.wikimedia.org/T343214) (owner: 10Cathal Mooney)
[00:26:05] <wikibugs>	 10SRE, 10Maps: Allow Wikimedia Maps usage on <domain> - https://phabricator.wikimedia.org/T344084 (10Gigermano)
[00:26:34] <wikibugs>	 10SRE, 10Maps: Allow Wikimedia Maps usage on <domain> - https://phabricator.wikimedia.org/T344084 (10Gigermano) https://wikitech.wikimedia.org/wiki/Maps/External_usage
[00:30:11] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint2002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[00:32:30] <wikibugs>	 (03PS1) 10Ssingh: learn.wiki: update A records [dns] - 10https://gerrit.wikimedia.org/r/948195 (https://phabricator.wikimedia.org/T344073)
[00:38:40] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/947398
[00:38:46] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/947398 (owner: 10TrainBranchBot)
[00:39:17] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint2002 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[00:46:43] <icinga-wm>	 PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:53:10] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/947398 (owner: 10TrainBranchBot)
[01:13:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T342617)', diff saved to https://phabricator.wikimedia.org/P50547 and previous config saved to /var/cache/conftool/dbconfig/20230812-011330-ladsgroup.json
[01:13:34] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[01:23:59] <wikibugs>	 (03PS1) 10Mdaniels5757: add 'autopatrol' to Wikifunctions' functioneer group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948196 (https://phabricator.wikimedia.org/T344085)
[01:25:39] <wikibugs>	 (03PS3) 10Mdaniels5757: add autopatrolled group with autopatrol right for Wikifunctions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947495 (https://phabricator.wikimedia.org/T343946)
[01:26:44] <wikibugs>	 (03CR) 10Mdaniels5757: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948196 (https://phabricator.wikimedia.org/T344085) (owner: 10Mdaniels5757)
[01:28:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P50548 and previous config saved to /var/cache/conftool/dbconfig/20230812-012836-ladsgroup.json
[01:35:05] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Sat 19 Aug 2023 04:23:22 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:36:37] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:37:00] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[01:43:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P50549 and previous config saved to /var/cache/conftool/dbconfig/20230812-014342-ladsgroup.json
[01:49:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T342617)', diff saved to https://phabricator.wikimedia.org/P50550 and previous config saved to /var/cache/conftool/dbconfig/20230812-014901-ladsgroup.json
[01:49:05] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[01:58:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T342617)', diff saved to https://phabricator.wikimedia.org/P50551 and previous config saved to /var/cache/conftool/dbconfig/20230812-015849-ladsgroup.json
[01:58:51] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1207.eqiad.wmnet with reason: Maintenance
[01:58:53] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[01:59:04] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1207.eqiad.wmnet with reason: Maintenance
[01:59:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1207 (T342617)', diff saved to https://phabricator.wikimedia.org/P50552 and previous config saved to /var/cache/conftool/dbconfig/20230812-015910-ladsgroup.json
[02:04:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P50553 and previous config saved to /var/cache/conftool/dbconfig/20230812-020407-ladsgroup.json
[02:11:35] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:14:55] <icinga-wm>	 PROBLEM - Check if pdns-recursor.service has been restarted after /etc/powerdns/recursor.conf was changed on doh6002 is CRITICAL: CRITICAL: Service pdns-recursor.service has not been restarted after /etc/powerdns/recursor.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Service_Restart_Check
[02:19:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P50554 and previous config saved to /var/cache/conftool/dbconfig/20230812-021913-ladsgroup.json
[02:25:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:30:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:30:35] <icinga-wm>	 PROBLEM - Check if pdns-recursor.service has been restarted after /etc/powerdns/recursor.conf was changed on doh4002 is CRITICAL: CRITICAL: Service pdns-recursor.service has not been restarted after /etc/powerdns/recursor.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Service_Restart_Check
[02:31:35] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:34:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T342617)', diff saved to https://phabricator.wikimedia.org/P50555 and previous config saved to /var/cache/conftool/dbconfig/20230812-023419-ladsgroup.json
[02:34:22] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2176.codfw.wmnet with reason: Maintenance
[02:34:24] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[02:34:35] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2176.codfw.wmnet with reason: Maintenance
[02:34:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2176 (T342617)', diff saved to https://phabricator.wikimedia.org/P50556 and previous config saved to /var/cache/conftool/dbconfig/20230812-023441-ladsgroup.json
[02:47:57] <icinga-wm>	 PROBLEM - Check if pdns-recursor.service has been restarted after /etc/powerdns/recursor.conf was changed on doh3002 is CRITICAL: CRITICAL: Service pdns-recursor.service has not been restarted after /etc/powerdns/recursor.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Service_Restart_Check
[03:02:59] <icinga-wm>	 PROBLEM - Check if pdns-recursor.service has been restarted after /etc/powerdns/recursor.conf was changed on doh5001 is CRITICAL: CRITICAL: Service pdns-recursor.service has not been restarted after /etc/powerdns/recursor.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Service_Restart_Check
[03:17:39] <icinga-wm>	 PROBLEM - Check if pdns-recursor.service has been restarted after /etc/powerdns/recursor.conf was changed on doh3001 is CRITICAL: CRITICAL: Service pdns-recursor.service has not been restarted after /etc/powerdns/recursor.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Service_Restart_Check
[03:36:09] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Sat 19 Aug 2023 04:23:22 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:37:39] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:52:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T342617)', diff saved to https://phabricator.wikimedia.org/P50557 and previous config saved to /var/cache/conftool/dbconfig/20230812-035205-ladsgroup.json
[03:52:09] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[04:07:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P50558 and previous config saved to /var/cache/conftool/dbconfig/20230812-040711-ladsgroup.json
[04:16:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T342617)', diff saved to https://phabricator.wikimedia.org/P50559 and previous config saved to /var/cache/conftool/dbconfig/20230812-041608-ladsgroup.json
[04:16:13] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[04:22:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P50560 and previous config saved to /var/cache/conftool/dbconfig/20230812-042217-ladsgroup.json
[04:31:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P50561 and previous config saved to /var/cache/conftool/dbconfig/20230812-043115-ladsgroup.json
[04:37:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T342617)', diff saved to https://phabricator.wikimedia.org/P50562 and previous config saved to /var/cache/conftool/dbconfig/20230812-043724-ladsgroup.json
[04:37:26] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1218.eqiad.wmnet with reason: Maintenance
[04:37:28] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[04:37:39] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1218.eqiad.wmnet with reason: Maintenance
[04:46:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P50563 and previous config saved to /var/cache/conftool/dbconfig/20230812-044621-ladsgroup.json
[05:01:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T342617)', diff saved to https://phabricator.wikimedia.org/P50564 and previous config saved to /var/cache/conftool/dbconfig/20230812-050127-ladsgroup.json
[05:01:42] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[05:19:30] <wikibugs>	 (03CR) 10Anzx: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948196 (https://phabricator.wikimedia.org/T344085) (owner: 10Mdaniels5757)
[05:37:00] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[05:53:53] <wikibugs>	 (03CR) 10Anzx: add 'autopatrol' to Wikifunctions' functioneer group (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948196 (https://phabricator.wikimedia.org/T344085) (owner: 10Mdaniels5757)
[05:56:32] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1219.eqiad.wmnet with reason: Maintenance
[05:56:46] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1219.eqiad.wmnet with reason: Maintenance
[05:56:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1219 (T342617)', diff saved to https://phabricator.wikimedia.org/P50565 and previous config saved to /var/cache/conftool/dbconfig/20230812-055651-ladsgroup.json
[05:56:55] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[05:59:16] <wikibugs>	 (03CR) 10Anzx: "adding Jdforrester as reviewer as per https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/946541/comments/38907160_7c178d16" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948196 (https://phabricator.wikimedia.org/T344085) (owner: 10Mdaniels5757)
[06:09:03] <icinga-wm>	 PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[06:09:09] <wikibugs>	 (03CR) 10Anzx: "adding Jdforrester as reviewer as per https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/946541/comments/38907160_7c178d16" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947495 (https://phabricator.wikimedia.org/T343946) (owner: 10Mdaniels5757)
[06:14:37] <icinga-wm>	 RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms
[06:33:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:38:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:50:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:55:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230812T0700)
[07:39:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T342617)', diff saved to https://phabricator.wikimedia.org/P50566 and previous config saved to /var/cache/conftool/dbconfig/20230812-073953-ladsgroup.json
[07:39:57] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[07:54:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P50567 and previous config saved to /var/cache/conftool/dbconfig/20230812-075459-ladsgroup.json
[08:10:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P50568 and previous config saved to /var/cache/conftool/dbconfig/20230812-081005-ladsgroup.json
[08:11:00] <wikibugs>	 (03CR) 10Sohom Datta: "Will schedule this for Monday afternoon (in all probability)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947883 (owner: 10Sohom Datta)
[08:11:05] <wikibugs>	 (03PS2) 10Sohom Datta: Enable EditInSequence on all wikisources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947883
[08:15:05] <wikibugs>	 (03PS3) 10Sohom Datta: Enable EditInSequence on all wikisources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947883
[08:15:19] <wikibugs>	 (03PS4) 10Sohom Datta: Enable EditInSequence on all wikisources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947883
[08:21:12] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Volunteer NDA for RhinosF1 - https://phabricator.wikimedia.org/T341272 (10RhinosF1) @KFrancis: signed
[08:25:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T342617)', diff saved to https://phabricator.wikimedia.org/P50569 and previous config saved to /var/cache/conftool/dbconfig/20230812-082511-ladsgroup.json
[08:25:13] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[08:25:15] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[08:25:26] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[08:34:26] <wikibugs>	 (03CR) 10RhinosF1: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947495 (https://phabricator.wikimedia.org/T343946) (owner: 10Mdaniels5757)
[09:37:00] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[09:47:17] <icinga-wm>	 PROBLEM - Host maps2009 is DOWN: PING CRITICAL - Packet loss = 100%
[09:50:13] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:10:45] <wikibugs>	 10SRE, 10Maps: Allow Wikimedia Maps usage on <domain> - https://phabricator.wikimedia.org/T344082 (10Aklapper)
[11:10:51] <wikibugs>	 10SRE, 10Maps: Allow Wikimedia Maps usage on <domain> - https://phabricator.wikimedia.org/T344083 (10Aklapper)
[11:10:53] <wikibugs>	 10SRE, 10Maps: Allow Wikimedia Maps usage on <domain> - https://phabricator.wikimedia.org/T344084 (10Aklapper)
[11:11:31] <wikibugs>	 10SRE, 10Maps: Allow Wikimedia Maps usage on <domain> - https://phabricator.wikimedia.org/T344082 (10Aklapper) 05Open→03Invalid @Gigermano: You provided zero data, thus closing. Please do not create empty tickets but provide required information.
[11:24:05] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:31:35] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:43:31] <wikibugs>	 (03PS1) 10Amire80: Add lucaswerkmeister.de to Planet [puppet] - 10https://gerrit.wikimedia.org/r/948203
[11:44:01] <icinga-wm>	 RECOVERY - Check if pdns-recursor.service has been restarted after /etc/powerdns/recursor.conf was changed on doh4002 is OK: OK: pdns-recursor.service was restarted after /etc/powerdns/recursor.conf was changed. https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Service_Restart_Check
[11:44:01] <icinga-wm>	 RECOVERY - Check if pdns-recursor.service has been restarted after /etc/powerdns/recursor.conf was changed on doh3001 is OK: OK: pdns-recursor.service was restarted after /etc/powerdns/recursor.conf was changed. https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Service_Restart_Check
[11:44:01] <icinga-wm>	 RECOVERY - Check if pdns-recursor.service has been restarted after /etc/powerdns/recursor.conf was changed on doh3002 is OK: OK: pdns-recursor.service was restarted after /etc/powerdns/recursor.conf was changed. https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Service_Restart_Check
[11:44:01] <icinga-wm>	 RECOVERY - Check if pdns-recursor.service has been restarted after /etc/powerdns/recursor.conf was changed on doh6002 is OK: OK: pdns-recursor.service was restarted after /etc/powerdns/recursor.conf was changed. https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Service_Restart_Check
[11:44:03] <icinga-wm>	 RECOVERY - Check if pdns-recursor.service has been restarted after /etc/powerdns/recursor.conf was changed on doh5001 is OK: OK: pdns-recursor.service was restarted after /etc/powerdns/recursor.conf was changed. https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Service_Restart_Check
[11:46:02] <wikibugs>	 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344096 (10phaultfinder)
[11:46:04] <wikibugs>	 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder)
[11:46:06] <wikibugs>	 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344097 (10phaultfinder)
[11:46:08] <wikibugs>	 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344095 (10phaultfinder)
[11:46:10] <wikibugs>	 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344099 (10phaultfinder)
[11:46:12] <wikibugs>	 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344100 (10phaultfinder)
[11:46:14] <wikibugs>	 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344098 (10phaultfinder)
[11:51:02] <wikibugs>	 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder)
[11:51:04] <wikibugs>	 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder)
[11:51:06] <wikibugs>	 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder)
[12:05:13] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Reverse DNS includes for new ranges assigned to esams [dns] - 10https://gerrit.wikimedia.org/r/947830 (https://phabricator.wikimedia.org/T343214) (owner: 10Cathal Mooney)
[12:06:05] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:09:21] <wikibugs>	 (03PS1) 10Cathal Mooney: Include reverse entries for new esams LVS IPv6 VIPs [dns] - 10https://gerrit.wikimedia.org/r/948205 (https://phabricator.wikimedia.org/T343214)
[12:10:13] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:10:13] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Include reverse entries for new esams LVS IPv6 VIPs [dns] - 10https://gerrit.wikimedia.org/r/948205 (https://phabricator.wikimedia.org/T343214) (owner: 10Cathal Mooney)
[12:11:01] <wikibugs>	 (03PS2) 10Cathal Mooney: Include reverse entries for new esams LVS IPv6 VIPs [dns] - 10https://gerrit.wikimedia.org/r/948205 (https://phabricator.wikimedia.org/T343214)
[12:11:35] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 1.270 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:11:54] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Include reverse entries for new esams LVS IPv6 VIPs [dns] - 10https://gerrit.wikimedia.org/r/948205 (https://phabricator.wikimedia.org/T343214) (owner: 10Cathal Mooney)
[12:12:03] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 7.965 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:23:12] <wikibugs>	 (03PS3) 10Cathal Mooney: Include reverse entries for new esams LVS IPv6 VIPs [dns] - 10https://gerrit.wikimedia.org/r/948205 (https://phabricator.wikimedia.org/T343214)
[12:24:18] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Include reverse entries for new esams LVS IPv6 VIPs [dns] - 10https://gerrit.wikimedia.org/r/948205 (https://phabricator.wikimedia.org/T343214) (owner: 10Cathal Mooney)
[12:37:21] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:38:41] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.284 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:49:49] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:50:59] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:51:09] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:55:41] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:03:23] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50420 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:08:05] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:09:29] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:10:49] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:13:45] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 7.417 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:14:07] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 8.378 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:18:21] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:18:39] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:34:35] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:37:00] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[13:42:43] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:44:05] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:45:07] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:50:03] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:54:35] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:11:35] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:13:11] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 9.210 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:13:25] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 2.733 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:16:35] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:19:05] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:23:41] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:24:01] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:24:21] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:33:13] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:34:39] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:38:03] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50420 bytes in 0.101 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:39:15] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.278 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:44:03] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:44:23] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:53:17] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 9.334 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:53:31] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 2.424 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:59:05] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:03:39] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:05:39] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:05:59] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:07:05] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:10:07] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:11:39] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.268 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:12:05] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 5.769 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:16:25] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:16:43] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:25:21] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:28:15] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:28:27] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 5.825 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:28:45] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 3.838 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:33:05] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:34:29] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 5.362 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:37:05] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:41:37] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:46:16] <wikibugs>	 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder)
[15:46:18] <wikibugs>	 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder)
[15:46:20] <wikibugs>	 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder)
[15:46:22] <wikibugs>	 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder)
[15:51:17] <wikibugs>	 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder)
[15:51:19] <wikibugs>	 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder)
[15:51:21] <wikibugs>	 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder)
[15:54:11] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:54:31] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:55:33] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:56:57] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:03:13] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.835 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:03:33] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 2.572 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:07:53] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:08:11] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:12:21] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 7.440 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:12:41] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 8.826 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:20:13] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:21:27] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:27:43] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 8.193 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:28:49] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.280 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:40:40] <wikibugs>	 (03CR) 10Mdaniels5757: "Marking resolved (I think)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948196 (https://phabricator.wikimedia.org/T344085) (owner: 10Mdaniels5757)
[17:03:07] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:07:43] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:37:00] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[18:01:03] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:14:47] <wikibugs>	 (03PS1) 10Cathal Mooney: Remove dns300x from network device NTP servers [homer/public] - 10https://gerrit.wikimedia.org/r/948209 (https://phabricator.wikimedia.org/T329219)
[18:15:56] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Remove dns300x from network device NTP servers [homer/public] - 10https://gerrit.wikimedia.org/r/948209 (https://phabricator.wikimedia.org/T329219) (owner: 10Cathal Mooney)
[18:16:29] <wikibugs>	 (03Merged) 10jenkins-bot: Remove dns300x from network device NTP servers [homer/public] - 10https://gerrit.wikimedia.org/r/948209 (https://phabricator.wikimedia.org/T329219) (owner: 10Cathal Mooney)
[18:17:35] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:42:18] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:46:25] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[18:47:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[19:24:01] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:31:39] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:32:28] <wikibugs>	 (03CR) 10Ssingh: "Good catch, thanks!" [homer/public] - 10https://gerrit.wikimedia.org/r/948209 (https://phabricator.wikimedia.org/T329219) (owner: 10Cathal Mooney)
[19:35:55] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[19:50:03] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:51:02] <wikibugs>	 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder)
[19:51:04] <wikibugs>	 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder)
[19:51:06] <wikibugs>	 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder)
[19:51:08] <wikibugs>	 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder)
[19:54:39] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:56:03] <wikibugs>	 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder)
[19:56:05] <wikibugs>	 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder)
[19:56:07] <wikibugs>	 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder)
[20:00:21] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 6 day(s) (Sat 19 Aug 2023 04:23:22 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:01:55] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:02:46] <wikibugs>	 (03PS1) 10Cathal Mooney: Add / update network definitions to include new esams ranges [puppet] - 10https://gerrit.wikimedia.org/r/948216 (https://phabricator.wikimedia.org/T343214)
[20:05:19] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add / update network definitions to include new esams ranges [puppet] - 10https://gerrit.wikimedia.org/r/948216 (https://phabricator.wikimedia.org/T343214) (owner: 10Cathal Mooney)
[20:09:31] <wikibugs>	 (03PS2) 10Cathal Mooney: Add / update network definitions to include new esams ranges [puppet] - 10https://gerrit.wikimedia.org/r/948216 (https://phabricator.wikimedia.org/T343214)
[20:12:36] <wikibugs>	 (03PS3) 10Cathal Mooney: Add / update network definitions to include new esams ranges [puppet] - 10https://gerrit.wikimedia.org/r/948216 (https://phabricator.wikimedia.org/T343214)
[20:16:03] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:20:39] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:36:01] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:43:43] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:57:51] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:59:07] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:00:21] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:03:45] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:16:03] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:23:37] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:25:07] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:29:41] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:34:09] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:37:00] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[21:41:39] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:44:16] <wikibugs>	 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T344110 (10phaultfinder)
[22:42:01] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:58:39] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:13:47] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint2002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[23:35:03] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint2002 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[23:41:03] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:45:41] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:51:17] <wikibugs>	 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder)
[23:51:19] <wikibugs>	 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder)
[23:51:21] <wikibugs>	 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder)
[23:51:23] <wikibugs>	 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder)
[23:56:18] <wikibugs>	 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder)
[23:56:20] <wikibugs>	 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder)
[23:56:22] <wikibugs>	 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T344101 (10phaultfinder)