[00:00:06] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:58] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:09:02] <icinga-wm>	 RECOVERY - SSH on mw1312.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:09:28] <icinga-wm>	 PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[00:11:22] <icinga-wm>	 RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[00:22:36] <icinga-wm>	 RECOVERY - SSH on mw1319.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:41:29] <wikibugs>	 (CR) BCornwall: [C: +1] "Looks good to me!" [puppet] - https://gerrit.wikimedia.org/r/829319 (owner: Zabe)
[00:45:36] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:51:30] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:13:16] <icinga-wm>	 PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:15:18] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:19:14] <wikibugs>	 (CR) Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service (4 comments) [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[01:19:16] <icinga-wm>	 RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms
[01:21:14] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:38:45] <jinxer-wm>	 (JobUnavailable) firing: (9) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:39:06] <wikibugs>	 (PS1) Andrew Bogott: Openstack: manifests for glance, nova, keystone, placement version Y [puppet] - https://gerrit.wikimedia.org/r/851168 (https://phabricator.wikimedia.org/T305828)
[01:39:08] <wikibugs>	 (PS1) Andrew Bogott: Openstack: Add manifests for Neutron version Yoga [puppet] - https://gerrit.wikimedia.org/r/851169 (https://phabricator.wikimedia.org/T305828)
[01:39:10] <wikibugs>	 (PS1) Andrew Bogott: Openstack: Add manifests for Trove version Yoga [puppet] - https://gerrit.wikimedia.org/r/851170 (https://phabricator.wikimedia.org/T305828)
[01:39:12] <wikibugs>	 (PS1) Andrew Bogott: Openstack: Add manifests for Heat and Magnum version Yoga [puppet] - https://gerrit.wikimedia.org/r/851171 (https://phabricator.wikimedia.org/T305828)
[01:39:14] <wikibugs>	 (PS1) Andrew Bogott: Openstack: Add manifests for Cinder version Yoga [puppet] - https://gerrit.wikimedia.org/r/851172 (https://phabricator.wikimedia.org/T305828)
[01:39:16] <wikibugs>	 (PS1) Andrew Bogott: Openstack: Add manifests for Barbican version Yoga [puppet] - https://gerrit.wikimedia.org/r/851173 (https://phabricator.wikimedia.org/T305828)
[01:39:18] <wikibugs>	 (PS1) Andrew Bogott: codfw1dev openstack -> version 'yoga' [puppet] - https://gerrit.wikimedia.org/r/851174 (https://phabricator.wikimedia.org/T305828)
[01:45:04] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:47:31] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[01:48:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:51:00] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:52:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[02:00:04] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221101T0200)
[02:07:32] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.40.0-wmf.8 [core] (wmf/1.40.0-wmf.8) - https://gerrit.wikimedia.org/r/851030 (https://phabricator.wikimedia.org/T320513)
[02:07:34] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.40.0-wmf.8 [core] (wmf/1.40.0-wmf.8) - https://gerrit.wikimedia.org/r/851030 (https://phabricator.wikimedia.org/T320513) (owner: TrainBranchBot)
[02:07:57] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[02:08:45] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:08:46] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[02:08:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[02:09:28] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[02:22:49] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.40.0-wmf.8 [core] (wmf/1.40.0-wmf.8) - https://gerrit.wikimedia.org/r/851030 (https://phabricator.wikimedia.org/T320513) (owner: TrainBranchBot)
[02:29:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[02:30:22] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[02:30:24] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[02:30:38] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:31:02] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[02:36:34] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:54:13] <jinxer-wm>	 (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[03:00:04] <jouncebot>	 Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221101T0300)
[03:00:22] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:01:18] <wikibugs>	 (PS1) TrainBranchBot: testwikis wikis to 1.40.0-wmf.8 [mediawiki-config] - https://gerrit.wikimedia.org/r/851179 (https://phabricator.wikimedia.org/T320513)
[03:01:20] <wikibugs>	 (CR) TrainBranchBot: [C: +2] testwikis wikis to 1.40.0-wmf.8 [mediawiki-config] - https://gerrit.wikimedia.org/r/851179 (https://phabricator.wikimedia.org/T320513) (owner: TrainBranchBot)
[03:02:04] <wikibugs>	 (Merged) jenkins-bot: testwikis wikis to 1.40.0-wmf.8 [mediawiki-config] - https://gerrit.wikimedia.org/r/851179 (https://phabricator.wikimedia.org/T320513) (owner: TrainBranchBot)
[03:02:32] <logmsgbot>	 !log mwpresync@deploy1002 Started scap: testwikis wikis to 1.40.0-wmf.8  refs T320513
[03:02:41] <stashbot>	 T320513: 1.40.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T320513
[03:03:58] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[03:06:16] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:06:38] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[03:07:34] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[03:07:36] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[03:08:32] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[03:30:12] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:34:14] <jinxer-wm>	 (KubernetesAPILatency) firing: (9) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[03:36:08] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:36:28] <logmsgbot>	 !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.40.0-wmf.8  refs T320513 (duration: 33m 56s)
[03:36:34] <stashbot>	 T320513: 1.40.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T320513
[03:38:53] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[03:42:24] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: train-presync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:45:25] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[03:45:26] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[03:45:48] <icinga-wm>	 PROBLEM - SSH on mw1332.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:51:50] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[04:04:51] <wikibugs>	 (PS1) DLynch: Bump sampling rate to 0.2 for various editing schemas on a/b test wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/851182 (https://phabricator.wikimedia.org/T321734)
[04:06:00] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[04:33:48] <icinga-wm>	 PROBLEM - SSH on mw1307.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:45:28] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:51:26] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:05:32] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:07:46] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), No backups: 1 (dispatch-be1001), Fresh: 123 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:15:12] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:21:08] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:52:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[06:00:05] <jouncebot>	 kormat, marostegui, and Amir1: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221101T0600).
[06:09:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:30:20] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:35:40] <icinga-wm>	 RECOVERY - SSH on mw1307.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:36:18] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:48:40] <icinga-wm>	 RECOVERY - SSH on mw1332.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:54:13] <jinxer-wm>	 (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[07:00:05] <jouncebot>	 Amir1 and Urbanecm: May I have your attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221101T0700)
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:04:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[07:34:14] <jinxer-wm>	 (KubernetesAPILatency) firing: (9) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:45:24] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:46:48] <wikibugs>	 (PS2) Slyngshede: data.yaml: Move user mfossati from restricted to deployment [puppet] - https://gerrit.wikimedia.org/r/850409 (https://phabricator.wikimedia.org/T321772)
[07:48:13] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting deployment group membership for mfossati - https://phabricator.wikimedia.org/T321772 (SLyngshede-WMF) @MarkTraceur Will you approve, so we can move Marco to deployment?
[07:51:20] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:00:29] <wikibugs>	 (PS1) Cathal Mooney: Remove user faidon from Juniper access [homer/public] - https://gerrit.wikimedia.org/r/851590 (https://phabricator.wikimedia.org/T322101)
[08:05:32] <wikibugs>	 (PS3) Slyngshede: C:idm::deployment of IDM. [puppet] - https://gerrit.wikimedia.org/r/851064 (https://phabricator.wikimedia.org/T320428)
[08:08:46] <wikibugs>	 (CR) Slyngshede: C:idm::deployment of IDM. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/851064 (https://phabricator.wikimedia.org/T320428) (owner: Slyngshede)
[08:13:01] <wikibugs>	 (CR) Cathal Mooney: [C: +2] Remove user faidon from Juniper access [homer/public] - https://gerrit.wikimedia.org/r/851590 (https://phabricator.wikimedia.org/T322101) (owner: Cathal Mooney)
[08:13:40] <wikibugs>	 (Merged) jenkins-bot: Remove user faidon from Juniper access [homer/public] - https://gerrit.wikimedia.org/r/851590 (https://phabricator.wikimedia.org/T322101) (owner: Cathal Mooney)
[08:15:08] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:17:40] <wikibugs>	 (PS1) Muehlenhoff: Remove Faidon from Icinga config [puppet] - https://gerrit.wikimedia.org/r/851594
[08:19:47] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove Faidon from Icinga config [puppet] - https://gerrit.wikimedia.org/r/851594 (owner: Muehlenhoff)
[08:21:02] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:22:49] <wikibugs>	 (PS1) Muehlenhoff: Remove access for Faidon [puppet] - https://gerrit.wikimedia.org/r/851595 (https://phabricator.wikimedia.org/T322101)
[08:25:48] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove access for Faidon [puppet] - https://gerrit.wikimedia.org/r/851595 (https://phabricator.wikimedia.org/T322101) (owner: Muehlenhoff)
[08:26:58] <icinga-wm>	 PROBLEM - Disk space on cp5007 is CRITICAL: DISK CRITICAL - free space: / 342 MB (3% inode=77%): /tmp 342 MB (3% inode=77%): /var/tmp 342 MB (3% inode=77%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=cp5007&var-datasource=eqsin+prometheus/ops
[08:27:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Faidon Liambotis out of all services on: 802 hosts
[08:28:18] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Faidon Liambotis out of all services on: 802 hosts
[08:28:25] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Faidon Liambotis out of all services on: 1203 hosts
[08:28:55] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Faidon Liambotis out of all services on: 1203 hosts
[08:30:26] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:30:43] <wikibugs>	 (CR) David Caro: [V: +1] p::toolforge:harbor: use distro docker for bullseye (1 comment) [puppet] - https://gerrit.wikimedia.org/r/848356 (https://phabricator.wikimedia.org/T316541) (owner: David Caro)
[08:30:48] <wikibugs>	 (CR) David Caro: [V: +1 C: +2] p::toolforge:harbor: use distro docker for bullseye [puppet] - https://gerrit.wikimedia.org/r/848356 (https://phabricator.wikimedia.org/T316541) (owner: David Caro)
[08:30:55] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, netops: lvs500[1-3] are unable to establish BGP sessions with cr3-eqsin.wikimedia.org - https://phabricator.wikimedia.org/T321545 (fgiunchedi)
[08:31:11] <wikibugs>	 SRE, Traffic: PyBalBGPUnstable didn't report T321545 - https://phabricator.wikimedia.org/T321547 (fgiunchedi) Open→Declined Ok! Declining for now; feel free to reopen as needed
[08:32:30] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] hieradata: don't monitor /run/docker on alerting_host [puppet] - https://gerrit.wikimedia.org/r/850993 (https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi)
[08:32:46] <moritzm>	 !log draining ganeti1028 for eventual reimage T311687
[08:32:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:52] <godog>	 dcausse: merged your change too
[08:33:25] <godog>	 nope, sorry, I meant dcaro which he's not here
[08:33:28] <godog>	 :shrug:
[08:33:30] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:34:18] <stashbot>	 T311687: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687
[08:43:19] <wikibugs>	 (PS3) Filippo Giunchedi: smokeping: add ensure parameter, set to present [puppet] - https://gerrit.wikimedia.org/r/850154 (https://phabricator.wikimedia.org/T169860)
[08:43:21] <wikibugs>	 (PS3) Filippo Giunchedi: profile: absent smokeping [puppet] - https://gerrit.wikimedia.org/r/850155 (https://phabricator.wikimedia.org/T169860)
[08:43:23] <wikibugs>	 (PS3) Filippo Giunchedi: smokeping: remove module and profile [puppet] - https://gerrit.wikimedia.org/r/850156 (https://phabricator.wikimedia.org/T169860)
[08:43:25] <wikibugs>	 (PS3) Filippo Giunchedi: smokeping: remove ancillary data [puppet] - https://gerrit.wikimedia.org/r/850157 (https://phabricator.wikimedia.org/T169860)
[08:43:27] <wikibugs>	 (CR) Filippo Giunchedi: smokeping: add ensure parameter, set to present (1 comment) [puppet] - https://gerrit.wikimedia.org/r/850154 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi)
[08:45:28] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:49:56] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:53:43] <wikibugs>	 (PS1) Filippo Giunchedi: safer dnsmasq restart/reload [puppet] - https://gerrit.wikimedia.org/r/851597
[08:53:45] <wikibugs>	 (PS1) Filippo Giunchedi: pontoon: limit thanos retention [puppet] - https://gerrit.wikimedia.org/r/851598
[08:59:05] <wikibugs>	 (CR) Jgiannelos: "Regarding access to logs I am not sure if this group grants the right access." [puppet] - https://gerrit.wikimedia.org/r/850160 (https://phabricator.wikimedia.org/T321722) (owner: Awight)
[08:59:58] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] pontoon: limit thanos retention [puppet] - https://gerrit.wikimedia.org/r/851598 (owner: Filippo Giunchedi)
[09:00:03] <wikibugs>	 (PS2) Filippo Giunchedi: pontoon: limit thanos retention [puppet] - https://gerrit.wikimedia.org/r/851598
[09:00:30] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:04:26] <wikibugs>	 (CR) JMeybohm: [C: +1] safer dnsmasq restart/reload [puppet] - https://gerrit.wikimedia.org/r/851597 (owner: Filippo Giunchedi)
[09:06:28] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:09:46] <icinga-wm>	 RECOVERY - Disk space on alert1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=alert1001&var-datasource=eqiad+prometheus/ops
[09:11:28] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] safer dnsmasq restart/reload [puppet] - https://gerrit.wikimedia.org/r/851597 (owner: Filippo Giunchedi)
[09:13:23] <wikibugs>	 (CR) Jelto: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/850635 (https://phabricator.wikimedia.org/T321629) (owner: Ahmon Dancy)
[09:22:30] <wikibugs>	 (CR) Awight: Invite some of WMDE Tech Wishes team to poke around maps instances (1 comment) [puppet] - https://gerrit.wikimedia.org/r/850160 (https://phabricator.wikimedia.org/T321722) (owner: Awight)
[09:26:07] <wikibugs>	 (CR) Jgiannelos: "It's probably worth the effort to send the postgres logs to logstash instead of manually ssh-ing." [puppet] - https://gerrit.wikimedia.org/r/850160 (https://phabricator.wikimedia.org/T321722) (owner: Awight)
[09:28:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:29:49] <wikibugs>	 (CR) Jelto: [V: +1] "PCC SUCCESS (NOOP 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37875/console" [puppet] - https://gerrit.wikimedia.org/r/850593 (owner: Dzahn)
[09:30:16] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:30:33] <wikibugs>	 (CR) Awight: Invite some of WMDE Tech Wishes team to poke around maps instances (1 comment) [puppet] - https://gerrit.wikimedia.org/r/850160 (https://phabricator.wikimedia.org/T321722) (owner: Awight)
[09:32:28] <wikibugs>	 (CR) Jelto: [V: +1 C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/850593 (owner: Dzahn)
[09:33:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:34:26] <wikibugs>	 (CR) Jelto: [C: +2] "lgtm, thanks for noticing that!" [puppet] - https://gerrit.wikimedia.org/r/850541 (owner: Dzahn)
[09:34:53] <wikibugs>	 (PS2) Jelto: devtools: set profile::gitlab::runner::registration_token: private [puppet] - https://gerrit.wikimedia.org/r/850541 (owner: Dzahn)
[09:36:12] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:37:50] <icinga-wm>	 PROBLEM - SSH on mw1307.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:40:33] <wikibugs>	 (PS3) David Caro: wmcs.toolforge.grid: get also the job logs [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/841930
[09:43:49] <moritzm>	 !log imported quickstack  20161026-1+deb12u1 to apt.wikimedia.org/bookworm-wikimedia T321783
[09:43:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:43:58] <stashbot>	 T321783: Setup an initial bookworm host with Puppetdb 7 - https://phabricator.wikimedia.org/T321783
[09:46:05] <wikibugs>	 (PS3) Slyngshede: Bitu IDM, initial checkin [software/bitu] - https://gerrit.wikimedia.org/r/850465 (https://phabricator.wikimedia.org/T319410)
[09:46:14] <wikibugs>	 (CR) Phuedx: [C: -1] "Thanks for submitting this patch!" [mediawiki-config] - https://gerrit.wikimedia.org/r/851128 (https://phabricator.wikimedia.org/T309602) (owner: Clare Ming)
[09:47:29] <wikibugs>	 SRE, Infrastructure-Foundations: Create an initial IDM/LDAP image for tests and CI - https://phabricator.wikimedia.org/T320430 (SLyngshede-WMF) The Docker image has also been included in the Bitu repo and can be built using docker-compose.
[09:47:38] <wikibugs>	 SRE, Infrastructure-Foundations: Create an initial IDM/LDAP image for tests and CI - https://phabricator.wikimedia.org/T320430 (SLyngshede-WMF) Open→Resolved
[09:47:40] <wikibugs>	 SRE, Infrastructure-Foundations: IDM milestone 1 "Initial development work" - https://phabricator.wikimedia.org/T319407 (SLyngshede-WMF)
[09:51:23] <wikibugs>	 (CR) Jelto: doc: add parameters for gitlab_runner and contint hosts, allow them (1 comment) [puppet] - https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: Dzahn)
[09:52:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[09:53:02] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[09:55:06] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[10:00:12] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:02:28] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (MoritzMuehlenhoff)
[10:06:10] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:09:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:13:39] <wikibugs>	 (CR) JMeybohm: [C: +2] Make Kubernetes version configurable [puppet] - https://gerrit.wikimedia.org/r/850449 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[10:19:41] <wikibugs>	 (PS1) Urbanecm: Deploy Growth features to 100% users at all wikis but dewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/851604 (https://phabricator.wikimedia.org/T320876)
[10:29:25] <wikibugs>	 (PS1) Filippo Giunchedi: prometheus: use default for ignored_devices [puppet] - https://gerrit.wikimedia.org/r/851605 (https://phabricator.wikimedia.org/T321783)
[10:33:01] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/851605 (https://phabricator.wikimedia.org/T321783) (owner: Filippo Giunchedi)
[10:34:18] <wikibugs>	 (CR) David Caro: "LGTM, just a type hint issue maybe, feel free to ignore the nits" [software/puppet-compiler] - https://gerrit.wikimedia.org/r/746947 (owner: Jbond)
[10:37:49] <icinga-wm>	 RECOVERY - SSH on mw1307.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:39:11] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1099.eqiad.wmnet with reason: Maintenance
[10:39:22] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance
[10:39:24] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1099.eqiad.wmnet with reason: Maintenance
[10:39:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3318 (T318605)', diff saved to https://phabricator.wikimedia.org/P37285 and previous config saved to /var/cache/conftool/dbconfig/20221101-103934-ladsgroup.json
[10:39:35] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance
[10:39:46] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[10:40:06] <wikibugs>	 (CR) Nikerabbit: [V: +2] Localisation updates from https://translatewiki.net. [software/mailman-templates] - https://gerrit.wikimedia.org/r/851065 (owner: L10n-bot)
[10:40:09] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37876/console" [puppet] - https://gerrit.wikimedia.org/r/851605 (https://phabricator.wikimedia.org/T321783) (owner: Filippo Giunchedi)
[10:41:09] <wikibugs>	 (CR) David Caro: [C: +1] "LGTM, as arturo says, some users might be relying on this for something, checked on the toolforge sge nodes logs to see if there's anythin" [puppet] - https://gerrit.wikimedia.org/r/850633 (owner: Majavah)
[10:41:34] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[10:41:47] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[10:41:53] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2117.codfw.wmnet with reason: Maintenance
[10:41:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3316 (T318955)', diff saved to https://phabricator.wikimedia.org/P37286 and previous config saved to /var/cache/conftool/dbconfig/20221101-104154-ladsgroup.json
[10:42:06] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2117.codfw.wmnet with reason: Maintenance
[10:42:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2117 (T318955)', diff saved to https://phabricator.wikimedia.org/P37287 and previous config saved to /var/cache/conftool/dbconfig/20221101-104215-ladsgroup.json
[10:42:28] <stashbot>	 T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[10:48:34] <moritzm>	 !log updating libdatetime-timezone-perl from latest Debian SUA update
[10:48:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:50:44] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1 C: +2] prometheus: use default for ignored_devices [puppet] - https://gerrit.wikimedia.org/r/851605 (https://phabricator.wikimedia.org/T321783) (owner: Filippo Giunchedi)
[10:54:13] <jinxer-wm>	 (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[10:55:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T318955)', diff saved to https://phabricator.wikimedia.org/P37288 and previous config saved to /var/cache/conftool/dbconfig/20221101-105534-ladsgroup.json
[10:55:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T318955)', diff saved to https://phabricator.wikimedia.org/P37289 and previous config saved to /var/cache/conftool/dbconfig/20221101-105557-ladsgroup.json
[10:56:52] <stashbot>	 T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[10:59:40] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[10:59:53] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[10:59:57] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2097.codfw.wmnet with reason: Maintenance
[10:59:59] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[11:00:10] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2097.codfw.wmnet with reason: Maintenance
[11:00:13] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[11:00:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T318950)', diff saved to https://phabricator.wikimedia.org/P37290 and previous config saved to /var/cache/conftool/dbconfig/20221101-110019-ladsgroup.json
[11:00:25] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance
[11:00:38] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance
[11:00:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2104 (T318950)', diff saved to https://phabricator.wikimedia.org/P37291 and previous config saved to /var/cache/conftool/dbconfig/20221101-110045-ladsgroup.json
[11:01:07] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[11:02:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T318950)', diff saved to https://phabricator.wikimedia.org/P37292 and previous config saved to /var/cache/conftool/dbconfig/20221101-110232-ladsgroup.json
[11:03:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T318950)', diff saved to https://phabricator.wikimedia.org/P37293 and previous config saved to /var/cache/conftool/dbconfig/20221101-110311-ladsgroup.json
[11:04:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[11:05:11] <wikibugs>	 (PS1) Muehlenhoff: Cleanup obsolete binary packages after bookworm dist-upgrade [puppet] - https://gerrit.wikimedia.org/r/851607
[11:06:51] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:08:47] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:10:00] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetdb-test2001.codfw.wmnet
[11:10:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P37294 and previous config saved to /var/cache/conftool/dbconfig/20221101-111042-ladsgroup.json
[11:11:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P37295 and previous config saved to /var/cache/conftool/dbconfig/20221101-111106-ladsgroup.json
[11:11:47] <wikibugs>	 (PS2) Muehlenhoff: thanos: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/850469 (https://phabricator.wikimedia.org/T308013)
[11:14:14] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: Maintenance
[11:14:28] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: Maintenance
[11:17:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P37296 and previous config saved to /var/cache/conftool/dbconfig/20221101-111739-ladsgroup.json
[11:17:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318 (T318605)', diff saved to https://phabricator.wikimedia.org/P37297 and previous config saved to /var/cache/conftool/dbconfig/20221101-111753-ladsgroup.json
[11:17:58] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[11:18:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P37298 and previous config saved to /var/cache/conftool/dbconfig/20221101-111819-ladsgroup.json
[11:19:15] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host puppetdb-test2001.codfw.wmnet
[11:21:42] <wikibugs>	 (CR) Muehlenhoff: [C: +2] thanos: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/850469 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[11:25:14] <wikibugs>	 (PS2) Muehlenhoff: dumps: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/850471 (https://phabricator.wikimedia.org/T308013)
[11:25:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P37299 and previous config saved to /var/cache/conftool/dbconfig/20221101-112549-ladsgroup.json
[11:26:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P37300 and previous config saved to /var/cache/conftool/dbconfig/20221101-112612-ladsgroup.json
[11:27:45] <wikibugs>	 SRE, Infrastructure-Foundations, Patch-For-Review: Initial IDM puppetisation - https://phabricator.wikimedia.org/T320428 (SLyngshede-WMF) Open→In progress p:Triage→Low a:SLyngshede-WMF
[11:27:47] <wikibugs>	 SRE, Infrastructure-Foundations: IDM milestone 1 "Initial development work" - https://phabricator.wikimedia.org/T319407 (SLyngshede-WMF)
[11:28:33] <wikibugs>	 (CR) Muehlenhoff: [C: +2] dumps: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/850471 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[11:29:41] <wikibugs>	 (PS2) Muehlenhoff: installserver: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/850473 (https://phabricator.wikimedia.org/T308013)
[11:30:05] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:32:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P37301 and previous config saved to /var/cache/conftool/dbconfig/20221101-113248-ladsgroup.json
[11:32:59] <wikibugs>	 (PS1) Hnowlan: Generate thumbor.key via prod entrypoint script [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/851608 (https://phabricator.wikimedia.org/T233196)
[11:33:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318', diff saved to https://phabricator.wikimedia.org/P37302 and previous config saved to /var/cache/conftool/dbconfig/20221101-113301-ladsgroup.json
[11:33:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P37303 and previous config saved to /var/cache/conftool/dbconfig/20221101-113327-ladsgroup.json
[11:34:06] <wikibugs>	 (CR) CI reject: [V: -1] Generate thumbor.key via prod entrypoint script [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/851608 (https://phabricator.wikimedia.org/T233196) (owner: Hnowlan)
[11:34:40] <wikibugs>	 (PS1) Hnowlan: thumbor: don't manage thumbor.key within Helm [deployment-charts] - https://gerrit.wikimedia.org/r/851609 (https://phabricator.wikimedia.org/T233196)
[11:35:49] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:36:05] <wikibugs>	 (PS2) Hnowlan: Generate thumbor.key via prod entrypoint script [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/851608 (https://phabricator.wikimedia.org/T233196)
[11:37:13] <wikibugs>	 (CR) Muehlenhoff: [C: +2] installserver: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/850473 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[11:38:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (9) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:40:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T318955)', diff saved to https://phabricator.wikimedia.org/P37304 and previous config saved to /var/cache/conftool/dbconfig/20221101-114057-ladsgroup.json
[11:40:59] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2124.codfw.wmnet with reason: Maintenance
[11:41:03] <stashbot>	 T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[11:41:09] <wikibugs>	 (Abandoned) Muehlenhoff: Add a stub base file for bookworm [puppet] - https://gerrit.wikimedia.org/r/850487 (owner: Muehlenhoff)
[11:41:12] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2124.codfw.wmnet with reason: Maintenance
[11:41:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2124 (T318955)', diff saved to https://phabricator.wikimedia.org/P37305 and previous config saved to /var/cache/conftool/dbconfig/20221101-114121-ladsgroup.json
[11:41:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T318955)', diff saved to https://phabricator.wikimedia.org/P37306 and previous config saved to /var/cache/conftool/dbconfig/20221101-114123-ladsgroup.json
[11:41:25] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[11:41:38] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[11:41:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T318955)', diff saved to https://phabricator.wikimedia.org/P37307 and previous config saved to /var/cache/conftool/dbconfig/20221101-114145-ladsgroup.json
[11:45:27] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:45:28] <wikibugs>	 (CR) Hnowlan: [C: +2] Use the PDF cropbox for rendering [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/805476 (https://phabricator.wikimedia.org/T167420) (owner: TheDJ)
[11:47:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[11:47:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T318950)', diff saved to https://phabricator.wikimedia.org/P37308 and previous config saved to /var/cache/conftool/dbconfig/20221101-114755-ladsgroup.json
[11:47:57] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance
[11:48:11] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance
[11:48:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318', diff saved to https://phabricator.wikimedia.org/P37309 and previous config saved to /var/cache/conftool/dbconfig/20221101-114811-ladsgroup.json
[11:48:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T318950)', diff saved to https://phabricator.wikimedia.org/P37310 and previous config saved to /var/cache/conftool/dbconfig/20221101-114820-ladsgroup.json
[11:48:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T318950)', diff saved to https://phabricator.wikimedia.org/P37311 and previous config saved to /var/cache/conftool/dbconfig/20221101-114835-ladsgroup.json
[11:48:37] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2125.codfw.wmnet with reason: Maintenance
[11:48:51] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2125.codfw.wmnet with reason: Maintenance
[11:48:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2125 (T318950)', diff saved to https://phabricator.wikimedia.org/P37312 and previous config saved to /var/cache/conftool/dbconfig/20221101-114858-ladsgroup.json
[11:49:00] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[11:49:23] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance
[11:49:36] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance
[11:49:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2152 (T318605)', diff saved to https://phabricator.wikimedia.org/P37313 and previous config saved to /var/cache/conftool/dbconfig/20221101-114943-ladsgroup.json
[11:51:17] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:51:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T318950)', diff saved to https://phabricator.wikimedia.org/P37314 and previous config saved to /var/cache/conftool/dbconfig/20221101-115122-ladsgroup.json
[11:52:42] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[11:53:31] <wikibugs>	 (Merged) jenkins-bot: Use the PDF cropbox for rendering [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/805476 (https://phabricator.wikimedia.org/T167420) (owner: TheDJ)
[11:54:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T318955)', diff saved to https://phabricator.wikimedia.org/P37315 and previous config saved to /var/cache/conftool/dbconfig/20221101-115426-ladsgroup.json
[11:56:01] <stashbot>	 T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[11:56:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T318955)', diff saved to https://phabricator.wikimedia.org/P37316 and previous config saved to /var/cache/conftool/dbconfig/20221101-115638-ladsgroup.json
[11:57:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[11:57:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T318950)', diff saved to https://phabricator.wikimedia.org/P37317 and previous config saved to /var/cache/conftool/dbconfig/20221101-115734-ladsgroup.json
[11:57:40] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[12:01:13] <wikibugs>	 (CR) Kosta Harlan: Deploy Growth features to 100% users at all wikis but dewiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/851604 (https://phabricator.wikimedia.org/T320876) (owner: Urbanecm)
[12:03:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318 (T318605)', diff saved to https://phabricator.wikimedia.org/P37318 and previous config saved to /var/cache/conftool/dbconfig/20221101-120318-ladsgroup.json
[12:03:20] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[12:03:30] <wikibugs>	 (PS2) Urbanecm: Deploy GrowthExperiments to 100% users at all wikis but dewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/851604 (https://phabricator.wikimedia.org/T320876)
[12:03:33] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[12:03:35] <wikibugs>	 (CR) Urbanecm: Deploy GrowthExperiments to 100% users at all wikis but dewiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/851604 (https://phabricator.wikimedia.org/T320876) (owner: Urbanecm)
[12:03:41] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[12:03:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3318 (T318605)', diff saved to https://phabricator.wikimedia.org/P37319 and previous config saved to /var/cache/conftool/dbconfig/20221101-120341-ladsgroup.json
[12:06:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P37320 and previous config saved to /var/cache/conftool/dbconfig/20221101-120630-ladsgroup.json
[12:09:27] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:09:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P37321 and previous config saved to /var/cache/conftool/dbconfig/20221101-120934-ladsgroup.json
[12:11:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P37322 and previous config saved to /var/cache/conftool/dbconfig/20221101-121147-ladsgroup.json
[12:11:55] <wikibugs>	 (CR) Roman Stolar: [C: +1] "LGTM" [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/841183 (https://phabricator.wikimedia.org/T318406) (owner: Vlad.shapik)
[12:12:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P37323 and previous config saved to /var/cache/conftool/dbconfig/20221101-121242-ladsgroup.json
[12:20:49] <urbanecm>	 jouncebot: now
[12:20:49] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 39 minute(s)
[12:20:53] * urbanecm stashing at mwdebug1001
[12:21:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P37324 and previous config saved to /var/cache/conftool/dbconfig/20221101-122138-ladsgroup.json
[12:21:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host an-presto1006.eqiad.wmnet
[12:23:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T318605)', diff saved to https://phabricator.wikimedia.org/P37325 and previous config saved to /var/cache/conftool/dbconfig/20221101-122329-ladsgroup.json
[12:23:34] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[12:24:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P37326 and previous config saved to /var/cache/conftool/dbconfig/20221101-122442-ladsgroup.json
[12:26:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P37327 and previous config saved to /var/cache/conftool/dbconfig/20221101-122654-ladsgroup.json
[12:27:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P37328 and previous config saved to /var/cache/conftool/dbconfig/20221101-122750-ladsgroup.json
[12:28:44] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-presto1006.eqiad.wmnet
[12:30:54] <wikibugs>	 (PS4) Slyngshede: Bitu IDM, initial checkin [software/bitu] - https://gerrit.wikimedia.org/r/850465 (https://phabricator.wikimedia.org/T319410)
[12:32:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[12:36:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T318950)', diff saved to https://phabricator.wikimedia.org/P37329 and previous config saved to /var/cache/conftool/dbconfig/20221101-123646-ladsgroup.json
[12:36:49] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2126.codfw.wmnet with reason: Maintenance
[12:36:52] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[12:37:02] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2126.codfw.wmnet with reason: Maintenance
[12:37:04] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2095.codfw.wmnet with reason: Maintenance
[12:37:07] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2095.codfw.wmnet with reason: Maintenance
[12:37:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2126 (T318950)', diff saved to https://phabricator.wikimedia.org/P37330 and previous config saved to /var/cache/conftool/dbconfig/20221101-123714-ladsgroup.json
[12:38:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P37331 and previous config saved to /var/cache/conftool/dbconfig/20221101-123839-ladsgroup.json
[12:39:02] <wikibugs>	 (CR) Vlad.shapik: [C: +1] "LGTM" [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/850453 (owner: Hnowlan)
[12:39:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T318950)', diff saved to https://phabricator.wikimedia.org/P37332 and previous config saved to /var/cache/conftool/dbconfig/20221101-123936-ladsgroup.json
[12:39:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T318955)', diff saved to https://phabricator.wikimedia.org/P37333 and previous config saved to /var/cache/conftool/dbconfig/20221101-123949-ladsgroup.json
[12:39:51] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2129.codfw.wmnet with reason: Maintenance
[12:40:05] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2129.codfw.wmnet with reason: Maintenance
[12:40:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2129 (T318955)', diff saved to https://phabricator.wikimedia.org/P37334 and previous config saved to /var/cache/conftool/dbconfig/20221101-124012-ladsgroup.json
[12:41:06] <stashbot>	 T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[12:42:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T318955)', diff saved to https://phabricator.wikimedia.org/P37335 and previous config saved to /var/cache/conftool/dbconfig/20221101-124202-ladsgroup.json
[12:42:04] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[12:42:18] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[12:42:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T318955)', diff saved to https://phabricator.wikimedia.org/P37336 and previous config saved to /var/cache/conftool/dbconfig/20221101-124225-ladsgroup.json
[12:42:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318 (T318605)', diff saved to https://phabricator.wikimedia.org/P37337 and previous config saved to /var/cache/conftool/dbconfig/20221101-124253-ladsgroup.json
[12:42:59] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[12:43:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T318950)', diff saved to https://phabricator.wikimedia.org/P37338 and previous config saved to /var/cache/conftool/dbconfig/20221101-124301-ladsgroup.json
[12:43:04] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[12:43:07] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[12:43:18] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[12:43:23] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[12:43:26] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[12:43:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T318950)', diff saved to https://phabricator.wikimedia.org/P37339 and previous config saved to /var/cache/conftool/dbconfig/20221101-124334-ladsgroup.json
[12:45:27] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:45:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T318950)', diff saved to https://phabricator.wikimedia.org/P37340 and previous config saved to /var/cache/conftool/dbconfig/20221101-124548-ladsgroup.json
[12:47:41] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] Cleanup obsolete binary packages after bookworm dist-upgrade [puppet] - 10https://gerrit.wikimedia.org/r/851607 (owner: 10Muehlenhoff)
[12:48:46] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host an-presto1008.eqiad.wmnet
[12:49:15] <wikibugs>	 (03CR) 10Kosta Harlan: [C: 03+1] Deploy GrowthExperiments to 100% users at all wikis but dewiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851604 (https://phabricator.wikimedia.org/T320876) (owner: 10Urbanecm)
[12:50:33] <wikibugs>	 (03CR) 10Kosta Harlan: [C: 03+1] Deploy GrowthExperiments to 100% users at all wikis but dewiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851604 (https://phabricator.wikimedia.org/T320876) (owner: 10Urbanecm)
[12:51:17] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:51:37] <wikibugs>	 (03PS1) 10Stang: viwiki: Increase autoconfirmed edit count to 10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851618 (https://phabricator.wikimedia.org/T322105)
[12:52:07] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] viwiki: Increase autoconfirmed edit count to 10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851618 (https://phabricator.wikimedia.org/T322105) (owner: 10Stang)
[12:52:34] <wikibugs>	 (03PS2) 10Stang: viwiki: Increase autoconfirmed edit count to 10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851618 (https://phabricator.wikimedia.org/T322105)
[12:53:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T318955)', diff saved to https://phabricator.wikimedia.org/P37341 and previous config saved to /var/cache/conftool/dbconfig/20221101-125331-ladsgroup.json
[12:53:37] <stashbot>	 T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[12:53:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P37342 and previous config saved to /var/cache/conftool/dbconfig/20221101-125348-ladsgroup.json
[12:54:05] <wikibugs>	 (03PS1) 10Filippo Giunchedi: dispatch: add ipython for 'dispatch server shell' [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/851619 (https://phabricator.wikimedia.org/T313229)
[12:54:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P37343 and previous config saved to /var/cache/conftool/dbconfig/20221101-125443-ladsgroup.json
[12:55:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T318955)', diff saved to https://phabricator.wikimedia.org/P37344 and previous config saved to /var/cache/conftool/dbconfig/20221101-125516-ladsgroup.json
[12:55:27] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-presto1008.eqiad.wmnet
[12:56:10] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host an-presto1010.eqiad.wmnet
[12:57:30] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Cleanup obsolete binary packages after bookworm dist-upgrade [puppet] - 10https://gerrit.wikimedia.org/r/851607 (owner: 10Muehlenhoff)
[12:57:55] <wikibugs>	 (03PS1) 10Filippo Giunchedi: dispatch: run wrapper with interactive/tty support [puppet] - 10https://gerrit.wikimedia.org/r/851620 (https://phabricator.wikimedia.org/T313229)
[12:58:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318', diff saved to https://phabricator.wikimedia.org/P37345 and previous config saved to /var/cache/conftool/dbconfig/20221101-125801-ladsgroup.json
[12:58:59] <wikibugs>	 (03PS1) 10JMeybohm: Move kubelet configuration to file [puppet] - 10https://gerrit.wikimedia.org/r/851621
[12:59:46] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Move kubelet configuration to file [puppet] - 10https://gerrit.wikimedia.org/r/851621 (owner: 10JMeybohm)
[13:00:04] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, and awight: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221101T1300).
[13:00:05] <jouncebot>	 koi: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:05] <jouncebot>	 Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221101T1300)
[13:00:14] <urbanecm>	 i can deploy today
[13:00:16] <koi>	 o/
[13:00:16] <urbanecm>	 hi koi!
[13:00:45] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851618 (https://phabricator.wikimedia.org/T322105) (owner: 10Stang)
[13:00:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P37346 and previous config saved to /var/cache/conftool/dbconfig/20221101-130056-ladsgroup.json
[13:01:07] <wikibugs>	 (03PS1) 10Stang: zhwikivoyage: Add wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851622 (https://phabricator.wikimedia.org/T322133)
[13:02:00] <koi>	 Hi urbanecm, I added another patch for this window ^
[13:02:03] <urbanecm>	 sure, noted
[13:02:08] <wikibugs>	 (03Merged) 10jenkins-bot: viwiki: Increase autoconfirmed edit count to 10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851618 (https://phabricator.wikimedia.org/T322105) (owner: 10Stang)
[13:02:50] <koi>	 I think the first one is not test-able, so maybe sync directly?
[13:02:57] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:851618|viwiki: Increase autoconfirmed edit count to 10 (T322105)]]
[13:03:06] <urbanecm>	 koi: you can test it via Special:Userrights
[13:03:11] <stashbot>	 T322105: Change the minimum requirements of autoconfirmed users to 10 edits and 4 days old on viwiki - https://phabricator.wikimedia.org/T322105
[13:03:35] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm and stang: Backport for [[gerrit:851618|viwiki: Increase autoconfirmed edit count to 10 (T322105)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[13:03:48] <koi>	 aha, sure about this?
[13:03:48] <urbanecm>	 koi: at https://vi.wikipedia.org/wiki/%C4%90%E1%BA%B7c_bi%E1%BB%87t:Quy%E1%BB%81n_th%C3%A0nh_vi%C3%AAn/Martin_Urbanec, you see "Implicit member of: Autoconfirmed users"
[13:03:57] <urbanecm>	 (w/o mwdebug1001 at least)
[13:04:01] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-presto1010.eqiad.wmnet
[13:04:15] <urbanecm>	 i actually have 80 edits, so it'll stay with mwdebug1001
[13:04:24] <urbanecm>	 but if you have between 0 and 10 edits, you can test that way
[13:04:32] <urbanecm>	 koi: let me know how it goes
[13:04:46] <koi>	 oh got it, let me randomly select one to test
[13:05:33] <urbanecm>	 sure
[13:05:52] <wikibugs>	 (03PS2) 10Urbanecm: zhwikivoyage: Add wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851622 (https://phabricator.wikimedia.org/T322133) (owner: 10Stang)
[13:05:54] <wikibugs>	 (03PS1) 10Btullis: Bump the version of Datahub to v0.9.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/851624 (https://phabricator.wikimedia.org/T321907)
[13:05:56] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] zhwikivoyage: Add wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851622 (https://phabricator.wikimedia.org/T322133) (owner: 10Stang)
[13:06:11] <koi>	 urbanecm: I checked a user with no edits on viwiki, and noticed they are not in the autoconfirmed group on mwdebug1001, so LGTM
[13:06:20] <urbanecm>	 yep, lgtm too, syncing!
[13:06:32] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host an-presto1012.eqiad.wmnet
[13:07:34] <wikibugs>	 (03Merged) 10jenkins-bot: zhwikivoyage: Add wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851622 (https://phabricator.wikimedia.org/T322133) (owner: 10Stang)
[13:08:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P37347 and previous config saved to /var/cache/conftool/dbconfig/20221101-130839-ladsgroup.json
[13:08:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T318605)', diff saved to https://phabricator.wikimedia.org/P37348 and previous config saved to /var/cache/conftool/dbconfig/20221101-130856-ladsgroup.json
[13:08:58] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance
[13:09:11] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance
[13:09:13] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[13:09:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2154 (T318605)', diff saved to https://phabricator.wikimedia.org/P37349 and previous config saved to /var/cache/conftool/dbconfig/20221101-130919-ladsgroup.json
[13:09:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P37350 and previous config saved to /var/cache/conftool/dbconfig/20221101-130952-ladsgroup.json
[13:10:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[13:10:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P37351 and previous config saved to /var/cache/conftool/dbconfig/20221101-131026-ladsgroup.json
[13:12:05] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] Add digicert-2022 to available unified set [puppet] - 10https://gerrit.wikimedia.org/r/850286 (https://phabricator.wikimedia.org/T313328) (owner: 10BBlack)
[13:13:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318', diff saved to https://phabricator.wikimedia.org/P37352 and previous config saved to /var/cache/conftool/dbconfig/20221101-131309-ladsgroup.json
[13:13:32] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:851618|viwiki: Increase autoconfirmed edit count to 10 (T322105)]] (duration: 10m 35s)
[13:13:39] <urbanecm>	 this took a while
[13:13:42] <urbanecm>	 but it's live now koi 
[13:13:44] <stashbot>	 T322105: Change the minimum requirements of autoconfirmed users to 10 edits and 4 days old on viwiki - https://phabricator.wikimedia.org/T322105
[13:13:53] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851622 (https://phabricator.wikimedia.org/T322133) (owner: 10Stang)
[13:13:58] <urbanecm>	 doing the other one now
[13:14:15] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:851622|zhwikivoyage: Add wordmark (T322133)]]
[13:14:21] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[13:14:22] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[13:14:22] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-presto1012.eqiad.wmnet
[13:14:29] <stashbot>	 T322133: Add wordmark to zhwikivoyage - https://phabricator.wikimedia.org/T322133
[13:14:41] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm and stang: Backport for [[gerrit:851622|zhwikivoyage: Add wordmark (T322133)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet
[13:14:49] <urbanecm>	 koi: second patch's at mwdebug1001 now
[13:14:54] <urbanecm>	 can you test please?
[13:15:00] <koi>	 looking
[13:15:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[13:16:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P37353 and previous config saved to /var/cache/conftool/dbconfig/20221101-131605-ladsgroup.json
[13:16:50] <koi>	 urbanecm: tested under vector-2022 and mobile, both zh-hans and its variant work as expected, so LGTM
[13:16:54] <urbanecm>	 excellent, syncing
[13:17:30] <wikibugs>	 (03CR) 10Urbanecm: Deploy GrowthExperiments to 100% users at all wikis but dewiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851604 (https://phabricator.wikimedia.org/T320876) (owner: 10Urbanecm)
[13:17:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ganeti1028.eqiad.wmnet with reason: Remove from cluster for eventual reimage
[13:18:07] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ganeti1028.eqiad.wmnet with reason: Remove from cluster for eventual reimage
[13:19:16] <wikibugs>	 (03PS1) 10Urbanecm: [GrowthExperiments] Remove wmgGEFeaturesMayBeAvailableToNewcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851626
[13:19:58] <wikibugs>	 (03PS2) 10Clare Ming: Add MP stream for VisualEditorFeatureUse instrument [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851128 (https://phabricator.wikimedia.org/T309602)
[13:20:21] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[13:20:52] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:851622|zhwikivoyage: Add wordmark (T322133)]] (duration: 06m 36s)
[13:20:56] <urbanecm>	 koi: and, live!
[13:21:11] <koi>	 thanks!
[13:21:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[13:21:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[13:21:24] <wikibugs>	 (03CR) 10Clare Ming: Add MP stream for VisualEditorFeatureUse instrument (035 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851128 (https://phabricator.wikimedia.org/T309602) (owner: 10Clare Ming)
[13:21:32] <urbanecm>	 and also purged the two URIs
[13:21:34] <urbanecm>	 koi: anything else?
[13:21:36] <urbanecm>	 (or anyone else)
[13:21:40] <koi>	 nope
[13:21:58] <urbanecm>	 !log UTC afternoon B&C window done
[13:22:01] <urbanecm>	 closing the window then :)
[13:22:09] <stashbot>	 T322133: Add wordmark to zhwikivoyage - https://phabricator.wikimedia.org/T322133
[13:22:16] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[13:22:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:45] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[13:23:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P37354 and previous config saved to /var/cache/conftool/dbconfig/20221101-132348-ladsgroup.json
[13:25:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T318950)', diff saved to https://phabricator.wikimedia.org/P37355 and previous config saved to /var/cache/conftool/dbconfig/20221101-132500-ladsgroup.json
[13:25:02] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2138.codfw.wmnet with reason: Maintenance
[13:25:06] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[13:25:16] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2138.codfw.wmnet with reason: Maintenance
[13:25:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3312 (T318950)', diff saved to https://phabricator.wikimedia.org/P37356 and previous config saved to /var/cache/conftool/dbconfig/20221101-132523-ladsgroup.json
[13:25:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P37357 and previous config saved to /var/cache/conftool/dbconfig/20221101-132537-ladsgroup.json
[13:25:39] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[13:27:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T318950)', diff saved to https://phabricator.wikimedia.org/P37358 and previous config saved to /var/cache/conftool/dbconfig/20221101-132745-ladsgroup.json
[13:28:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318 (T318605)', diff saved to https://phabricator.wikimedia.org/P37359 and previous config saved to /var/cache/conftool/dbconfig/20221101-132817-ladsgroup.json
[13:28:19] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1104.eqiad.wmnet with reason: Maintenance
[13:28:22] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[13:28:32] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1104.eqiad.wmnet with reason: Maintenance
[13:28:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1104 (T318605)', diff saved to https://phabricator.wikimedia.org/P37360 and previous config saved to /var/cache/conftool/dbconfig/20221101-132841-ladsgroup.json
[13:30:15] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:31:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T318950)', diff saved to https://phabricator.wikimedia.org/P37361 and previous config saved to /var/cache/conftool/dbconfig/20221101-133113-ladsgroup.json
[13:31:16] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[13:31:18] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[13:31:19] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[13:31:22] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[13:31:24] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[13:31:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T318950)', diff saved to https://phabricator.wikimedia.org/P37362 and previous config saved to /var/cache/conftool/dbconfig/20221101-133132-ladsgroup.json
[13:32:12] <kindrobot>	 /b 10
[13:32:21] <kindrobot>	 Oops :)
[13:33:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T318950)', diff saved to https://phabricator.wikimedia.org/P37363 and previous config saved to /var/cache/conftool/dbconfig/20221101-133346-ladsgroup.json
[13:35:59] <wikibugs>	 10SRE, 10Infrastructure-Foundations: IDM milestone 1 "Initial development work" - https://phabricator.wikimedia.org/T319407 (10SLyngshede-WMF)
[13:36:03] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:36:26] <wikibugs>	 (03PS1) 10Zabe: "reverse-proxy-staging.php" -> "reverse-staging-labs.php" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851630
[13:36:44] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Initial Django project setup - https://phabricator.wikimedia.org/T319410 (10SLyngshede-WMF) 05Open→03In progress p:05Triage→03Low
[13:37:17] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[13:37:46] <wikibugs>	 (03CR) 10Jforrester: [C: 04-1] "This isn't deploy-safe. You need to make the new copy as one commit, switch the use in a second commit, and remove the old copy in a third" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851630 (owner: 10Zabe)
[13:38:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T318955)', diff saved to https://phabricator.wikimedia.org/P37364 and previous config saved to /var/cache/conftool/dbconfig/20221101-133857-ladsgroup.json
[13:39:01] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2141.codfw.wmnet with reason: Maintenance
[13:39:14] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2141.codfw.wmnet with reason: Maintenance
[13:39:28] <wikibugs>	 (03PS1) 10Zabe: Copy reverse-proxy-staging.php to reverse-proxy-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851631
[13:39:31] <stashbot>	 T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[13:39:58] <wikibugs>	 (03PS2) 10Zabe: "reverse-proxy-staging.php" -> "reverse-staging-labs.php" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851630
[13:40:06] <wikibugs>	 (03PS1) 10Filippo Giunchedi: wikimedia.org: add dispatch.w.o [dns] - 10https://gerrit.wikimedia.org/r/851632 (https://phabricator.wikimedia.org/T313229)
[13:40:08] <wikibugs>	 (03PS3) 10Zabe: "reverse-proxy-staging.php" -> "reverse-staging-labs.php" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851630
[13:40:15] <wikibugs>	 (03CR) 10Zabe: "reverse-proxy-staging.php" -> "reverse-staging-labs.php" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851630 (owner: 10Zabe)
[13:40:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T318955)', diff saved to https://phabricator.wikimedia.org/P37365 and previous config saved to /var/cache/conftool/dbconfig/20221101-134045-ladsgroup.json
[13:40:47] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1131.eqiad.wmnet with reason: Maintenance
[13:40:51] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4037.ulsfo.wmnet
[13:41:00] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1131.eqiad.wmnet with reason: Maintenance
[13:41:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T318955)', diff saved to https://phabricator.wikimedia.org/P37366 and previous config saved to /var/cache/conftool/dbconfig/20221101-134108-ladsgroup.json
[13:41:09] <wikibugs>	 (03CR) 10Kosta Harlan: [C: 03+1] [GrowthExperiments] Remove wmgGEFeaturesMayBeAvailableToNewcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851626 (owner: 10Urbanecm)
[13:41:12] <wikibugs>	 (03PS1) 10Zabe: Delete "reverse-proxy-staging.php" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851633
[13:41:57] <godog>	 seeking a reviewer for an easy one: https://gerrit.wikimedia.org/r/c/operations/dns/+/851632
[13:42:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[13:42:27] <zabe>	 urbanecm, still around?
[13:42:29] <urbanecm>	 zabe: fyi, AFAIK scap now handles multi-file changes just fine (there's no technical need to split it into three patches)
[13:42:33] <urbanecm>	 heh, i was just writing you
[13:42:36] <urbanecm>	 yup
[13:42:46] <zabe>	 ok
[13:42:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P37367 and previous config saved to /var/cache/conftool/dbconfig/20221101-134252-ladsgroup.json
[13:42:53] <zabe>	 should I squash it back into one?
[13:42:57] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] "Thanks, looks good! When I was checking this for Wikidough, I realized that we can remove this safely and hence the unless install_from_co" [puppet] - 10https://gerrit.wikimedia.org/r/851148 (owner: 10Andrew Bogott)
[13:43:03] <urbanecm>	 up to you. i can handle both variants :)
[13:43:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T318605)', diff saved to https://phabricator.wikimedia.org/P37368 and previous config saved to /var/cache/conftool/dbconfig/20221101-134302-ladsgroup.json
[13:43:11] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[13:43:13] <icinga-wm>	 PROBLEM - MediaWiki EtcdConfig up-to-date on parse2012 is CRITICAL: etcd last index (1663589) is outdated compared to the master one (1663595) https://wikitech.wikimedia.org/wiki/Etcd
[13:43:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T318955)', diff saved to https://phabricator.wikimedia.org/P37369 and previous config saved to /var/cache/conftool/dbconfig/20221101-134318-ladsgroup.json
[13:43:26] <urbanecm>	 ftr, i asked Tyler to remove the "single-file" guidance from https://wikitech.wikimedia.org/wiki/Backport_windows
[13:43:49] <wikibugs>	 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T319126 (10phaultfinder)
[13:44:53] <zabe>	 tbh, I don't really care, you can also just merge all three at the same time and then sync through?
[13:45:06] <urbanecm>	 yeah
[13:45:09] <icinga-wm>	 RECOVERY - MediaWiki EtcdConfig up-to-date on parse2012 is OK: etcd last index (1663601) matches the master one (1663601) https://wikitech.wikimedia.org/wiki/Etcd
[13:45:26] <urbanecm>	 not really sure what that file's about, so I'd prefer a +1 first if possible
[13:45:33] <urbanecm>	 (if you asked if i was around for deployment)
[13:45:56] <urbanecm>	 wait, seems to be just a rename
[13:46:06] <zabe>	 yes, also it's kinda beta only
[13:46:54] <urbanecm>	 yeah
[13:46:55] <urbanecm>	 let's do it
[13:47:05] <wikibugs>	 (CR) Urbanecm: [C: +2] Copy reverse-proxy-staging.php to reverse-proxy-labs.php [mediawiki-config] - https://gerrit.wikimedia.org/r/851631 (owner: Zabe)
[13:47:07] <wikibugs>	 (CR) Urbanecm: [C: +2] "reverse-proxy-staging.php" -> "reverse-staging-labs.php" [mediawiki-config] - https://gerrit.wikimedia.org/r/851630 (owner: Zabe)
[13:47:09] <wikibugs>	 (CR) Urbanecm: [C: +2] Delete "reverse-proxy-staging.php" [mediawiki-config] - https://gerrit.wikimedia.org/r/851633 (owner: Zabe)
[13:47:13] <wikibugs>	 (CR) Ssingh: [C: +1] wikimedia.org: add dispatch.w.o [dns] - https://gerrit.wikimedia.org/r/851632 (https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi)
[13:47:27] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] "Thank you!" [dns] - https://gerrit.wikimedia.org/r/851632 (https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi)
[13:47:31] <wikibugs>	 (PS2) Filippo Giunchedi: wikimedia.org: add dispatch.w.o [dns] - https://gerrit.wikimedia.org/r/851632 (https://phabricator.wikimedia.org/T313229)
[13:48:06] <wikibugs>	 (CR) Urbanecm: [C: +2] "reverse-proxy-staging.php" -> "reverse-staging-labs.php" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/851630 (owner: Zabe)
[13:48:40] <wikibugs>	 (Merged) jenkins-bot: Copy reverse-proxy-staging.php to reverse-proxy-labs.php [mediawiki-config] - https://gerrit.wikimedia.org/r/851631 (owner: Zabe)
[13:48:42] <wikibugs>	 (Merged) jenkins-bot: "reverse-proxy-staging.php" -> "reverse-staging-labs.php" [mediawiki-config] - https://gerrit.wikimedia.org/r/851630 (owner: Zabe)
[13:48:44] <wikibugs>	 (Merged) jenkins-bot: Delete "reverse-proxy-staging.php" [mediawiki-config] - https://gerrit.wikimedia.org/r/851633 (owner: Zabe)
[13:48:50] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/851631 (owner: Zabe)
[13:48:52] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/851630 (owner: Zabe)
[13:48:54] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/851633 (owner: Zabe)
[13:48:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P37370 and previous config saved to /var/cache/conftool/dbconfig/20221101-134854-ladsgroup.json
[13:49:13] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:851631|Copy reverse-proxy-staging.php to reverse-proxy-labs.php]], [[gerrit:851630|"reverse-proxy-staging.php" -> "reverse-staging-labs.php"]], [[gerrit:851633|Delete "reverse-proxy-staging.php"]]
[13:49:32] <urbanecm>	 zabe: double-checking: it's ok to skip mwdebug, right? i don't see anything to test here
[13:49:36] <zabe>	 yes
[13:49:41] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm and zabe: Backport for [[gerrit:851631|Copy reverse-proxy-staging.php to reverse-proxy-labs.php]], [[gerrit:851630|"reverse-proxy-staging.php" -> "reverse-staging-labs.php"]], [[gerrit:851633|Delete "reverse-proxy-staging.php"]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet
[13:49:46] <urbanecm>	 syncing
[13:49:49] <wikibugs>	 (CR) Btullis: [C: +2] Bump the version of Datahub to v0.9.0 [deployment-charts] - https://gerrit.wikimedia.org/r/851624 (https://phabricator.wikimedia.org/T321907) (owner: Btullis)
[13:50:28] <wikibugs>	 (PS2) Urbanecm: [GrowthExperiments] Remove wmgGEFeaturesMayBeAvailableToNewcomers [mediawiki-config] - https://gerrit.wikimedia.org/r/851626
[13:50:33] <wikibugs>	 (CR) Urbanecm: [C: +2] [GrowthExperiments] Remove wmgGEFeaturesMayBeAvailableToNewcomers [mediawiki-config] - https://gerrit.wikimedia.org/r/851626 (owner: Urbanecm)
[13:50:38] <urbanecm>	 sneaking ^^ out as well
[13:50:46] <wikibugs>	 (PS1) Zabe: scap: Add reverse-staging-labs.php to beta-only files [puppet] - https://gerrit.wikimedia.org/r/851634
[13:50:55] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2158.codfw.wmnet with reason: Maintenance
[13:51:08] <wikibugs>	 (CR) Urbanecm: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/851634 (owner: Zabe)
[13:51:09] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2158.codfw.wmnet with reason: Maintenance
[13:51:10] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on db2095.codfw.wmnet with reason: Maintenance
[13:51:13] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2095.codfw.wmnet with reason: Maintenance
[13:51:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2158 (T318955)', diff saved to https://phabricator.wikimedia.org/P37371 and previous config saved to /var/cache/conftool/dbconfig/20221101-135120-ladsgroup.json
[13:51:35] <wikibugs>	 (Merged) jenkins-bot: [GrowthExperiments] Remove wmgGEFeaturesMayBeAvailableToNewcomers [mediawiki-config] - https://gerrit.wikimedia.org/r/851626 (owner: Urbanecm)
[13:52:09] <wikibugs>	 (PS2) Zabe: scap: Add reverse-staging-labs.php to beta-only files [puppet] - https://gerrit.wikimedia.org/r/851634
[13:52:22] <wikibugs>	 (PS3) Zabe: scap: Add reverse-proxy-labs.php to beta-only files [puppet] - https://gerrit.wikimedia.org/r/851634
[13:52:35] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[13:53:07] <wikibugs>	 (CR) Urbanecm: [C: +1] "...meh, filename. lgtm now 😄" [puppet] - https://gerrit.wikimedia.org/r/851634 (owner: Zabe)
[13:53:34] <wikibugs>	 (Merged) jenkins-bot: Bump the version of Datahub to v0.9.0 [deployment-charts] - https://gerrit.wikimedia.org/r/851624 (https://phabricator.wikimedia.org/T321907) (owner: Btullis)
[13:53:37] <stashbot>	 T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[13:53:38] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[13:53:39] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[13:53:43] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:851631|Copy reverse-proxy-staging.php to reverse-proxy-labs.php]], [[gerrit:851630|"reverse-proxy-staging.php" -> "reverse-staging-labs.php"]], [[gerrit:851633|Delete "reverse-proxy-staging.php"]] (duration: 04m 30s)
[13:53:50] <urbanecm>	 zabe: it's live now
[13:53:56] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/851626 (owner: Urbanecm)
[13:54:08] <zabe>	 thanks :)
[13:54:17] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:851626|[GrowthExperiments] Remove wmgGEFeaturesMayBeAvailableToNewcomers]]
[13:54:32] <urbanecm>	 no problem
[13:54:33] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[13:54:47] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[13:54:53] <moritzm>	 !log installing exim4 security updates on buster
[13:55:10] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4037.ulsfo.wmnet
[13:55:56] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[13:56:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:56:11] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4038.ulsfo.wmnet
[13:56:25] <wikibugs>	 (PS1) Ottomata: Enable rc0.mediawiki.page_change stream on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/851636 (https://phabricator.wikimedia.org/T311129)
[13:56:54] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[13:57:32] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1028.eqiad.wmnet with OS bullseye
[13:57:37] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1028.eqiad.wmnet with OS bullseye
[13:57:47] <moritzm>	 !log draining ganeti1016 for eventual reimage T311687
[13:58:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P37372 and previous config saved to /var/cache/conftool/dbconfig/20221101-135800-ladsgroup.json
[13:58:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P37373 and previous config saved to /var/cache/conftool/dbconfig/20221101-135811-ladsgroup.json
[13:58:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P37374 and previous config saved to /var/cache/conftool/dbconfig/20221101-135827-ladsgroup.json
[13:58:49] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:851626|[GrowthExperiments] Remove wmgGEFeaturesMayBeAvailableToNewcomers]] (duration: 04m 32s)
[13:58:55] * urbanecm done
[13:59:42] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[13:59:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:59:53] <stashbot>	 T311687: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687
[14:00:33] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[14:00:34] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[14:00:57] <wikibugs>	 (CR) Ottomata: [C: +2] Enable rc0.mediawiki.page_change stream on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/851636 (https://phabricator.wikimedia.org/T311129) (owner: Ottomata)
[14:01:35] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[14:02:28] <wikibugs>	 (Merged) jenkins-bot: Enable rc0.mediawiki.page_change stream on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/851636 (https://phabricator.wikimedia.org/T311129) (owner: Ottomata)
[14:04:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P37375 and previous config saved to /var/cache/conftool/dbconfig/20221101-140402-ladsgroup.json
[14:04:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T318955)', diff saved to https://phabricator.wikimedia.org/P37376 and previous config saved to /var/cache/conftool/dbconfig/20221101-140430-ladsgroup.json
[14:04:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104 (T318605)', diff saved to https://phabricator.wikimedia.org/P37377 and previous config saved to /var/cache/conftool/dbconfig/20221101-140439-ladsgroup.json
[14:05:42] <wikibugs>	 (PS1) Ottomata: rc0.mediawiki.page_change stream - use eventgate-analytics-external [mediawiki-config] - https://gerrit.wikimedia.org/r/851637 (https://phabricator.wikimedia.org/T311129)
[14:06:03] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4038.ulsfo.wmnet
[14:06:33] <logmsgbot>	 !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/datahub: apply on main
[14:06:37] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[14:06:41] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4039.ulsfo.wmnet
[14:06:45] <logmsgbot>	 !log otto@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Enable rc0.mediawiki.page_change stream on testwiki - T311129 (duration: 03m 30s)
[14:07:31] <logmsgbot>	 !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main
[14:07:39] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[14:07:40] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[14:08:09] <logmsgbot>	 !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/datahub: apply on main
[14:08:10] <wikibugs>	 (CR) Ottomata: [C: +2] rc0.mediawiki.page_change stream - use eventgate-analytics-external [mediawiki-config] - https://gerrit.wikimedia.org/r/851637 (https://phabricator.wikimedia.org/T311129) (owner: Ottomata)
[14:08:59] <wikibugs>	 (Merged) jenkins-bot: rc0.mediawiki.page_change stream - use eventgate-analytics-external [mediawiki-config] - https://gerrit.wikimedia.org/r/851637 (https://phabricator.wikimedia.org/T311129) (owner: Ottomata)
[14:09:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:09:09] <logmsgbot>	 !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main
[14:10:08] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1028.eqiad.wmnet with reason: host reimage
[14:10:11] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ganeti1028.eqiad.wmnet with reason: host reimage
[14:11:31] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[14:12:45] <wikibugs>	 (CR) Dmaza: rewrite.py: changes for Phonos deployment (1 comment) [puppet] - https://gerrit.wikimedia.org/r/831955 (https://phabricator.wikimedia.org/T317417) (owner: MusikAnimal)
[14:13:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T318950)', diff saved to https://phabricator.wikimedia.org/P37378 and previous config saved to /var/cache/conftool/dbconfig/20221101-141308-ladsgroup.json
[14:13:10] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2148.codfw.wmnet with reason: Maintenance
[14:13:13] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2148.codfw.wmnet with reason: Maintenance
[14:13:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P37379 and previous config saved to /var/cache/conftool/dbconfig/20221101-141321-ladsgroup.json
[14:13:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2148 (T318950)', diff saved to https://phabricator.wikimedia.org/P37380 and previous config saved to /var/cache/conftool/dbconfig/20221101-141322-ladsgroup.json
[14:13:24] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[14:13:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P37381 and previous config saved to /var/cache/conftool/dbconfig/20221101-141335-ladsgroup.json
[14:14:23] <logmsgbot>	 !log otto@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Use eventgate-analytics-external for rc0.mediawiki.page_change stream - T311129 (duration: 03m 42s)
[14:15:05] <stashbot>	 T311129: [Shared Event Platform] Produce new mediawiki.page-change stream from MediaWiki EventBus - https://phabricator.wikimedia.org/T311129
[14:15:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T318950)', diff saved to https://phabricator.wikimedia.org/P37382 and previous config saved to /var/cache/conftool/dbconfig/20221101-141544-ladsgroup.json
[14:16:39] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[14:16:47] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:17:42] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[14:17:43] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[14:18:08] <wikibugs>	 (PS1) Muehlenhoff: Add a few more package cleanups for bullseye->bookworm ABI changes [puppet] - https://gerrit.wikimedia.org/r/851638
[14:18:10] <wikibugs>	 (PS1) Muehlenhoff: Remove LDAP access for cdunn [puppet] - https://gerrit.wikimedia.org/r/851639
[14:18:25] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4039.ulsfo.wmnet
[14:18:34] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[14:18:52] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4040.ulsfo.wmnet
[14:19:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T318950)', diff saved to https://phabricator.wikimedia.org/P37383 and previous config saved to /var/cache/conftool/dbconfig/20221101-141913-ladsgroup.json
[14:19:16] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[14:19:18] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[14:19:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1162 (T318950)', diff saved to https://phabricator.wikimedia.org/P37384 and previous config saved to /var/cache/conftool/dbconfig/20221101-141924-ladsgroup.json
[14:19:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P37385 and previous config saved to /var/cache/conftool/dbconfig/20221101-141936-ladsgroup.json
[14:19:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104', diff saved to https://phabricator.wikimedia.org/P37386 and previous config saved to /var/cache/conftool/dbconfig/20221101-141945-ladsgroup.json
[14:21:08] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[14:21:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T318950)', diff saved to https://phabricator.wikimedia.org/P37387 and previous config saved to /var/cache/conftool/dbconfig/20221101-142136-ladsgroup.json
[14:23:26] <wikibugs>	 (PS2) Muehlenhoff: Remove LDAP access for cdunn [puppet] - https://gerrit.wikimedia.org/r/851639
[14:25:31] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove LDAP access for cdunn [puppet] - https://gerrit.wikimedia.org/r/851639 (owner: Muehlenhoff)
[14:28:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T318605)', diff saved to https://phabricator.wikimedia.org/P37388 and previous config saved to /var/cache/conftool/dbconfig/20221101-142832-ladsgroup.json
[14:28:34] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2161.codfw.wmnet with reason: Maintenance
[14:28:37] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[14:28:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T318955)', diff saved to https://phabricator.wikimedia.org/P37389 and previous config saved to /var/cache/conftool/dbconfig/20221101-142842-ladsgroup.json
[14:28:44] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[14:28:47] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2161.codfw.wmnet with reason: Maintenance
[14:28:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2161 (T318605)', diff saved to https://phabricator.wikimedia.org/P37390 and previous config saved to /var/cache/conftool/dbconfig/20221101-142854-ladsgroup.json
[14:28:58] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[14:29:00] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4040.ulsfo.wmnet
[14:29:58] <stashbot>	 T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[14:30:25] <wikibugs>	 (PS1) Muehlenhoff: Remove access for ejoseph [puppet] - https://gerrit.wikimedia.org/r/851641
[14:30:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P37391 and previous config saved to /var/cache/conftool/dbconfig/20221101-143051-ladsgroup.json
[14:32:23] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1028.eqiad.wmnet with OS bullseye
[14:32:27] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1028.eqiad.wmnet with OS bullseye completed: - ganeti1028 (**WARN**)   - Downtimed on...
[14:34:32] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging EJoseph out of all services on: 1202 hosts
[14:34:42] <wikibugs>	 (CR) Andrew Bogott: [C: +2] pdns-recursor: remove delegation-only config setting [puppet] - https://gerrit.wikimedia.org/r/851148 (owner: Andrew Bogott)
[14:34:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P37392 and previous config saved to /var/cache/conftool/dbconfig/20221101-143445-ladsgroup.json
[14:34:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104', diff saved to https://phabricator.wikimedia.org/P37393 and previous config saved to /var/cache/conftool/dbconfig/20221101-143453-ladsgroup.json
[14:34:56] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging EJoseph out of all services on: 1202 hosts
[14:35:04] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging EJoseph out of all services on: 803 hosts
[14:35:20] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging EJoseph out of all services on: 803 hosts
[14:36:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P37394 and previous config saved to /var/cache/conftool/dbconfig/20221101-143645-ladsgroup.json
[14:37:27] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4041.ulsfo.wmnet
[14:40:14] <wikibugs>	 (CR) Ahmon Dancy: [C: +1] scap: Add reverse-proxy-labs.php to beta-only files [puppet] - https://gerrit.wikimedia.org/r/851634 (owner: Zabe)
[14:40:15] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[14:40:28] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[14:40:30] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[14:40:45] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[14:40:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T318955)', diff saved to https://phabricator.wikimedia.org/P37395 and previous config saved to /var/cache/conftool/dbconfig/20221101-144053-ladsgroup.json
[14:41:21] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:43:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T318955)', diff saved to https://phabricator.wikimedia.org/P37396 and previous config saved to /var/cache/conftool/dbconfig/20221101-144302-ladsgroup.json
[14:43:08] <stashbot>	 T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[14:43:13] <wikibugs>	 (PS1) Muehlenhoff: Record extended MOU for Robert West [puppet] - https://gerrit.wikimedia.org/r/851644
[14:43:39] <icinga-wm>	 PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:45:15] <wikibugs>	 (PS1) Filippo Giunchedi: dispatch: configure http header auth provider [puppet] - https://gerrit.wikimedia.org/r/851645 (https://phabricator.wikimedia.org/T313229)
[14:46:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P37397 and previous config saved to /var/cache/conftool/dbconfig/20221101-144559-ladsgroup.json
[14:46:34] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] dispatch: run wrapper with interactive/tty support [puppet] - https://gerrit.wikimedia.org/r/851620 (https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi)
[14:47:49] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] Add a few more package cleanups for bullseye->bookworm ABI changes [puppet] - https://gerrit.wikimedia.org/r/851638 (owner: Muehlenhoff)
[14:48:01] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4041.ulsfo.wmnet
[14:48:15] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] dispatch: configure http header auth provider [puppet] - https://gerrit.wikimedia.org/r/851645 (https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi)
[14:48:20] <wikibugs>	 (PS2) Filippo Giunchedi: dispatch: configure http header auth provider [puppet] - https://gerrit.wikimedia.org/r/851645 (https://phabricator.wikimedia.org/T313229)
[14:48:35] <wikibugs>	 (CR) Filippo Giunchedi: [V: +2 C: +2] dispatch: add ipython for 'dispatch server shell' [docker-images/production-images] - https://gerrit.wikimedia.org/r/851619 (https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi)
[14:48:42] <icinga-wm>	 RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 15.12 ms
[14:49:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T318955)', diff saved to https://phabricator.wikimedia.org/P37398 and previous config saved to /var/cache/conftool/dbconfig/20221101-144954-ladsgroup.json
[14:49:56] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2169.codfw.wmnet with reason: Maintenance
[14:50:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104 (T318605)', diff saved to https://phabricator.wikimedia.org/P37399 and previous config saved to /var/cache/conftool/dbconfig/20221101-145004-ladsgroup.json
[14:50:06] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1111.eqiad.wmnet with reason: Maintenance
[14:50:10] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2169.codfw.wmnet with reason: Maintenance
[14:50:19] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1111.eqiad.wmnet with reason: Maintenance
[14:50:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3316 (T318955)', diff saved to https://phabricator.wikimedia.org/P37400 and previous config saved to /var/cache/conftool/dbconfig/20221101-145019-ladsgroup.json
[14:50:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1111 (T318605)', diff saved to https://phabricator.wikimedia.org/P37401 and previous config saved to /var/cache/conftool/dbconfig/20221101-145026-ladsgroup.json
[14:51:10] <stashbot>	 T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[14:51:44] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4042.ulsfo.wmnet
[14:51:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P37402 and previous config saved to /var/cache/conftool/dbconfig/20221101-145152-ladsgroup.json
[14:52:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[14:52:25] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[14:55:14] <icinga-wm>	 PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:58:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P37403 and previous config saved to /var/cache/conftool/dbconfig/20221101-145813-ladsgroup.json
[14:58:47] <wikibugs>	 (CR) Phuedx: [C: +1] Add MP stream for VisualEditorFeatureUse instrument [mediawiki-config] - https://gerrit.wikimedia.org/r/851128 (https://phabricator.wikimedia.org/T309602) (owner: Clare Ming)
[14:58:59] <jinxer-wm>	 (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:59:10] <logmsgbot>	 !log dancy@deploy1002 Started scap: testwikis wikis to 1.40.0-wmf.8  refs T320513
[14:59:25] <stashbot>	 T320513: 1.40.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T320513
[15:01:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T318950)', diff saved to https://phabricator.wikimedia.org/P37404 and previous config saved to /var/cache/conftool/dbconfig/20221101-150107-ladsgroup.json
[15:01:09] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2170.codfw.wmnet with reason: Maintenance
[15:01:12] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2170.codfw.wmnet with reason: Maintenance
[15:01:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3312 (T318950)', diff saved to https://phabricator.wikimedia.org/P37405 and previous config saved to /var/cache/conftool/dbconfig/20221101-150122-ladsgroup.json
[15:01:23] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[15:02:40] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4042.ulsfo.wmnet
[15:02:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T318605)', diff saved to https://phabricator.wikimedia.org/P37406 and previous config saved to /var/cache/conftool/dbconfig/20221101-150255-ladsgroup.json
[15:03:00] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[15:03:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T318950)', diff saved to https://phabricator.wikimedia.org/P37407 and previous config saved to /var/cache/conftool/dbconfig/20221101-150345-ladsgroup.json
[15:04:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[15:04:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T318955)', diff saved to https://phabricator.wikimedia.org/P37408 and previous config saved to /var/cache/conftool/dbconfig/20221101-150415-ladsgroup.json
[15:04:16] <logmsgbot>	 !log dancy@deploy1002 Finished scap: testwikis wikis to 1.40.0-wmf.8  refs T320513 (duration: 05m 05s)
[15:04:24] <stashbot>	 T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[15:04:29] <stashbot>	 T320513: 1.40.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T320513
[15:04:46] <wikibugs>	 (CR) Phuedx: [C: +1] Add MP stream for VisualEditorFeatureUse instrument (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/851128 (https://phabricator.wikimedia.org/T309602) (owner: Clare Ming)
[15:04:50] <wikibugs>	 (PS1) David Caro: wmcs: add cookbook to add/remove a user to/from a project [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/851650
[15:06:05] <logmsgbot>	 !log dancy@deploy1002 Pruned MediaWiki: 1.40.0-wmf.6 (duration: 01m 47s)
[15:07:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T318950)', diff saved to https://phabricator.wikimedia.org/P37409 and previous config saved to /var/cache/conftool/dbconfig/20221101-150659-ladsgroup.json
[15:07:01] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[15:07:04] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[15:07:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T318950)', diff saved to https://phabricator.wikimedia.org/P37410 and previous config saved to /var/cache/conftool/dbconfig/20221101-150711-ladsgroup.json
[15:07:12] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[15:08:08] <wikibugs>	 (CR) CI reject: [V: -1] wmcs: add cookbook to add/remove a user to/from a project [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/851650 (owner: David Caro)
[15:09:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T318950)', diff saved to https://phabricator.wikimedia.org/P37411 and previous config saved to /var/cache/conftool/dbconfig/20221101-150922-ladsgroup.json
[15:09:25] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[15:12:18] <icinga-wm>	 RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.70 ms
[15:13:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P37412 and previous config saved to /var/cache/conftool/dbconfig/20221101-151320-ladsgroup.json
[15:13:42] <wikibugs>	 (PS2) David Caro: wmcs: add cookbook to add/remove a user to/from a project [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/851650
[15:13:52] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[15:13:53] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[15:14:24] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[15:15:21] <wikibugs>	 (PS2) JMeybohm: Move kubelet configuration to file [puppet] - https://gerrit.wikimedia.org/r/851621
[15:16:01] <wikibugs>	 (CR) CI reject: [V: -1] Move kubelet configuration to file [puppet] - https://gerrit.wikimedia.org/r/851621 (owner: JMeybohm)
[15:16:13] <wikibugs>	 (PS3) Clare Ming: Add MP stream for VisualEditorFeatureUse instrument [mediawiki-config] - https://gerrit.wikimedia.org/r/851128 (https://phabricator.wikimedia.org/T309602)
[15:18:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P37413 and previous config saved to /var/cache/conftool/dbconfig/20221101-151803-ladsgroup.json
[15:18:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P37414 and previous config saved to /var/cache/conftool/dbconfig/20221101-151853-ladsgroup.json
[15:19:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P37415 and previous config saved to /var/cache/conftool/dbconfig/20221101-151923-ladsgroup.json
[15:21:20] <wikibugs>	 (PS1) Clare Ming: Update Edit Attempt Step sampling rate to 1 for group 0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/851652 (https://phabricator.wikimedia.org/T312016)
[15:23:13] <wikibugs>	 (PS3) JMeybohm: Move kubelet configuration to file [puppet] - https://gerrit.wikimedia.org/r/851621
[15:23:21] <wikibugs>	 (PS1) Ssingh: esitest: add config file dependency on /run/esitest creation [puppet] - https://gerrit.wikimedia.org/r/851654
[15:24:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P37416 and previous config saved to /var/cache/conftool/dbconfig/20221101-152430-ladsgroup.json
[15:26:01] <wikibugs>	 (CR) Ssingh: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37878/console" [puppet] - https://gerrit.wikimedia.org/r/851654 (owner: Ssingh)
[15:28:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T318955)', diff saved to https://phabricator.wikimedia.org/P37417 and previous config saved to /var/cache/conftool/dbconfig/20221101-152827-ladsgroup.json
[15:28:30] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[15:28:33] <stashbot>	 T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[15:28:43] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[15:28:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T318955)', diff saved to https://phabricator.wikimedia.org/P37418 and previous config saved to /var/cache/conftool/dbconfig/20221101-152850-ladsgroup.json
[15:30:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111 (T318605)', diff saved to https://phabricator.wikimedia.org/P37419 and previous config saved to /var/cache/conftool/dbconfig/20221101-153049-ladsgroup.json
[15:30:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T318955)', diff saved to https://phabricator.wikimedia.org/P37420 and previous config saved to /var/cache/conftool/dbconfig/20221101-153059-ladsgroup.json
[15:31:23] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[15:33:11] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for USERNAME and kinit credentials - https://phabricator.wikimedia.org/T322145 (Hghani)
[15:33:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P37421 and previous config saved to /var/cache/conftool/dbconfig/20221101-153311-ladsgroup.json
[15:33:14] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Hibashaath - https://phabricator.wikimedia.org/T322146 (HShaath-WMF)
[15:33:38] <wikibugs>	 (PS4) JMeybohm: Move kubelet configuration to file [puppet] - https://gerrit.wikimedia.org/r/851621
[15:33:53] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Ilooremeta - https://phabricator.wikimedia.org/T322147 (ILooremeta-WMF)
[15:34:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P37422 and previous config saved to /var/cache/conftool/dbconfig/20221101-153400-ladsgroup.json
[15:34:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P37423 and previous config saved to /var/cache/conftool/dbconfig/20221101-153430-ladsgroup.json
[15:34:40] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Hghani and kinit credentials - https://phabricator.wikimedia.org/T322145 (Hghani)
[15:34:45] <wikibugs>	 (CR) BBlack: [C: +1] esitest: add config file dependency on /run/esitest creation [puppet] - https://gerrit.wikimedia.org/r/851654 (owner: Ssingh)
[15:35:37] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Ilooremeta - https://phabricator.wikimedia.org/T322147 (KCVelaga_WMF)
[15:37:25] <wikibugs>	 (CR) Phuedx: [C: +1] Update Edit Attempt Step sampling rate to 1 for group 0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/851652 (https://phabricator.wikimedia.org/T312016) (owner: Clare Ming)
[15:38:13] <wikibugs>	 (PS5) JMeybohm: Move kubelet configuration to file [puppet] - https://gerrit.wikimedia.org/r/851621
[15:39:14] <jinxer-wm>	 (KubernetesAPILatency) firing: (9) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:39:16] <wikibugs>	 (CR) David Caro: [C: +2] "I've duplicated the hiera values on those projects with the correct values, will merge this now" [puppet] - https://gerrit.wikimedia.org/r/849483 (owner: David Caro)
[15:39:21] <wikibugs>	 (PS4) David Caro: p::wmcs:nfs: Fix typo in the hiera lookup [puppet] - https://gerrit.wikimedia.org/r/849483
[15:39:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P37424 and previous config saved to /var/cache/conftool/dbconfig/20221101-153938-ladsgroup.json
[15:41:35] <wikibugs>	 (CR) Ssingh: [V: +1 C: +2] esitest: add config file dependency on /run/esitest creation [puppet] - https://gerrit.wikimedia.org/r/851654 (owner: Ssingh)
[15:42:24] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4043.ulsfo.wmnet
[15:44:04] <wikibugs>	 (CR) Hnowlan: [C: +2] Provide additional tests to cover errors caused by wrong engine commands [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/841183 (https://phabricator.wikimedia.org/T318406) (owner: Vlad.shapik)
[15:45:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111', diff saved to https://phabricator.wikimedia.org/P37425 and previous config saved to /var/cache/conftool/dbconfig/20221101-154557-ladsgroup.json
[15:46:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P37426 and previous config saved to /var/cache/conftool/dbconfig/20221101-154607-ladsgroup.json
[15:47:13] <wikibugs>	 (CR) JMeybohm: [V: +1] "PCC SUCCESS (DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37879/console" [puppet] - https://gerrit.wikimedia.org/r/851621 (owner: JMeybohm)
[15:47:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[15:47:31] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[15:48:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T318605)', diff saved to https://phabricator.wikimedia.org/P37427 and previous config saved to /var/cache/conftool/dbconfig/20221101-154819-ladsgroup.json
[15:48:22] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2162.codfw.wmnet with reason: Maintenance
[15:48:34] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[15:48:35] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2162.codfw.wmnet with reason: Maintenance
[15:48:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2162 (T318605)', diff saved to https://phabricator.wikimedia.org/P37428 and previous config saved to /var/cache/conftool/dbconfig/20221101-154844-ladsgroup.json
[15:49:00] <wikibugs>	 (Merged) jenkins-bot: Provide additional tests to cover errors caused by wrong engine commands [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/841183 (https://phabricator.wikimedia.org/T318406) (owner: Vlad.shapik)
[15:49:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T318950)', diff saved to https://phabricator.wikimedia.org/P37429 and previous config saved to /var/cache/conftool/dbconfig/20221101-154907-ladsgroup.json
[15:49:09] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2175.codfw.wmnet with reason: Maintenance
[15:49:11] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2175.codfw.wmnet with reason: Maintenance
[15:49:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2175 (T318950)', diff saved to https://phabricator.wikimedia.org/P37430 and previous config saved to /var/cache/conftool/dbconfig/20221101-154919-ladsgroup.json
[15:49:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T318955)', diff saved to https://phabricator.wikimedia.org/P37431 and previous config saved to /var/cache/conftool/dbconfig/20221101-154938-ladsgroup.json
[15:49:40] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2171.codfw.wmnet with reason: Maintenance
[15:49:51] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[15:49:54] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2171.codfw.wmnet with reason: Maintenance
[15:50:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3316 (T318955)', diff saved to https://phabricator.wikimedia.org/P37432 and previous config saved to /var/cache/conftool/dbconfig/20221101-155002-ladsgroup.json
[15:51:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T318950)', diff saved to https://phabricator.wikimedia.org/P37433 and previous config saved to /var/cache/conftool/dbconfig/20221101-155142-ladsgroup.json
[15:51:43] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Record extended MOU for Robert West [puppet] - https://gerrit.wikimedia.org/r/851644 (owner: Muehlenhoff)
[15:51:44] <stashbot>	 T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[15:51:54] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4043.ulsfo.wmnet
[15:52:31] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[15:53:31] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add a few more package cleanups for bullseye->bookworm ABI changes [puppet] - https://gerrit.wikimedia.org/r/851638 (owner: Muehlenhoff)
[15:54:18] <wikibugs>	 (PS1) Andrew Bogott: rsyncd.pp: use gid 'nogroup' rather than 'nobody' [puppet] - https://gerrit.wikimedia.org/r/851661 (https://phabricator.wikimedia.org/T322149)
[15:54:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T318950)', diff saved to https://phabricator.wikimedia.org/P37434 and previous config saved to /var/cache/conftool/dbconfig/20221101-155446-ladsgroup.json
[15:54:48] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance
[15:54:51] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance
[15:54:54] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[15:54:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T318950)', diff saved to https://phabricator.wikimedia.org/P37435 and previous config saved to /var/cache/conftool/dbconfig/20221101-155458-ladsgroup.json
[15:55:32] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[15:55:44] <wikibugs>	 (CR) Andrew Bogott: "I'm not sure that this fully resolves the attached bug but I think it's correct regardless." [puppet] - https://gerrit.wikimedia.org/r/851661 (https://phabricator.wikimedia.org/T322149) (owner: Andrew Bogott)
[15:57:28] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[15:57:31] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[16:00:05] <jouncebot>	 jbond and rzl: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221101T1600).
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:00:48] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4044.ulsfo.wmnet
[16:01:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111', diff saved to https://phabricator.wikimedia.org/P37436 and previous config saved to /var/cache/conftool/dbconfig/20221101-160106-ladsgroup.json
[16:01:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P37437 and previous config saved to /var/cache/conftool/dbconfig/20221101-160116-ladsgroup.json
[16:03:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T318950)', diff saved to https://phabricator.wikimedia.org/P37438 and previous config saved to /var/cache/conftool/dbconfig/20221101-160308-ladsgroup.json
[16:03:13] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[16:03:44] <wikibugs>	 (PS1) Muehlenhoff: Additional MOU extensions [puppet] - https://gerrit.wikimedia.org/r/851665
[16:03:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T318955)', diff saved to https://phabricator.wikimedia.org/P37439 and previous config saved to /var/cache/conftool/dbconfig/20221101-160344-ladsgroup.json
[16:04:50] <stashbot>	 T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[16:06:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P37440 and previous config saved to /var/cache/conftool/dbconfig/20221101-160649-ladsgroup.json
[16:10:55] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4044.ulsfo.wmnet
[16:11:58] <icinga-wm>	 PROBLEM - Host labstore1007.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:13:43] <wikibugs>	 (PS1) JHathaway: aux-k8s: VIP for kubernetes api server [dns] - https://gerrit.wikimedia.org/r/851668
[16:15:38] <wikibugs>	 (PS1) BCornwall: prometheus: Handle inactive trafficserver service [puppet] - https://gerrit.wikimedia.org/r/851669 (https://phabricator.wikimedia.org/T292815)
[16:16:04] <wikibugs>	 (CR) JHathaway: [C: +2] aux-k8s: VIP for kubernetes api server [dns] - https://gerrit.wikimedia.org/r/851668 (owner: JHathaway)
[16:16:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111 (T318605)', diff saved to https://phabricator.wikimedia.org/P37441 and previous config saved to /var/cache/conftool/dbconfig/20221101-161614-ladsgroup.json
[16:16:16] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1114.eqiad.wmnet with reason: Maintenance
[16:16:20] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[16:16:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T318955)', diff saved to https://phabricator.wikimedia.org/P37442 and previous config saved to /var/cache/conftool/dbconfig/20221101-161625-ladsgroup.json
[16:16:27] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[16:16:29] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1114.eqiad.wmnet with reason: Maintenance
[16:16:32] <stashbot>	 T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[16:16:37] <wikibugs>	 (CR) Alexandros Kosiaris: [C: -1] admin: add mw on kubernetes namespaces (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/850095 (https://phabricator.wikimedia.org/T321786) (owner: Clément Goubert)
[16:16:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1114 (T318605)', diff saved to https://phabricator.wikimedia.org/P37443 and previous config saved to /var/cache/conftool/dbconfig/20221101-161636-ladsgroup.json
[16:16:40] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[16:16:44] <wikibugs>	 (CR) CI reject: [V: -1] prometheus: Handle inactive trafficserver service [puppet] - https://gerrit.wikimedia.org/r/851669 (https://phabricator.wikimedia.org/T292815) (owner: BCornwall)
[16:16:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T318955)', diff saved to https://phabricator.wikimedia.org/P37444 and previous config saved to /var/cache/conftool/dbconfig/20221101-161648-ladsgroup.json
[16:16:54] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Additional MOU extensions [puppet] - https://gerrit.wikimedia.org/r/851665 (owner: Muehlenhoff)
[16:18:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P37445 and previous config saved to /var/cache/conftool/dbconfig/20221101-161816-ladsgroup.json
[16:18:28] <wikibugs>	 (PS2) BCornwall: prometheus: Handle inactive trafficserver service [puppet] - https://gerrit.wikimedia.org/r/851669 (https://phabricator.wikimedia.org/T292815)
[16:18:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P37446 and previous config saved to /var/cache/conftool/dbconfig/20221101-161851-ladsgroup.json
[16:19:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T318955)', diff saved to https://phabricator.wikimedia.org/P37447 and previous config saved to /var/cache/conftool/dbconfig/20221101-161859-ladsgroup.json
[16:19:22] <logmsgbot>	 !log jhathaway@cumin1001 START - Cookbook sre.dns.netbox
[16:21:34] <logmsgbot>	 !log jhathaway@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:21:46] <wikibugs>	 (PS1) Filippo Giunchedi: dispatch: move frontend to its own module [puppet] - https://gerrit.wikimedia.org/r/851672 (https://phabricator.wikimedia.org/T313229)
[16:21:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P37448 and previous config saved to /var/cache/conftool/dbconfig/20221101-162158-ladsgroup.json
[16:22:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T318605)', diff saved to https://phabricator.wikimedia.org/P37449 and previous config saved to /var/cache/conftool/dbconfig/20221101-162206-ladsgroup.json
[16:22:20] <wikibugs>	 (PS1) Ottomata: Declare rc0.mediawiki.page_content_change stream [mediawiki-config] - https://gerrit.wikimedia.org/r/851673 (https://phabricator.wikimedia.org/T307959)
[16:22:20] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[16:24:10] <wikibugs>	 (CR) CI reject: [V: -1] dispatch: move frontend to its own module [puppet] - https://gerrit.wikimedia.org/r/851672 (https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi)
[16:25:33] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.dns.netbox
[16:26:00] <wikibugs>	 (PS2) Filippo Giunchedi: dispatch: move frontend to its own module [puppet] - https://gerrit.wikimedia.org/r/851672 (https://phabricator.wikimedia.org/T313229)
[16:27:11] <wikibugs>	 SRE, ops-eqiad, decommission-hardware, Patch-For-Review, cloud-services-team (Kanban): decommission labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T319217 (Jclark-ctr)
[16:27:42] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:28:20] <wikibugs>	 SRE, Infrastructure-Foundations, Patch-For-Review, cloud-services-team (Kanban): Replace labstore100[67] with clouddumps100[12] - https://phabricator.wikimedia.org/T309346 (Jclark-ctr)
[16:29:48] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37880/console" [puppet] - https://gerrit.wikimedia.org/r/851672 (https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi)
[16:31:03] <wikibugs>	 SRE, ops-eqiad: msw-c5-eqiad offline - https://phabricator.wikimedia.org/T321311 (Jclark-ctr) Open→Resolved
[16:33:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P37450 and previous config saved to /var/cache/conftool/dbconfig/20221101-163324-ladsgroup.json
[16:33:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P37451 and previous config saved to /var/cache/conftool/dbconfig/20221101-163358-ladsgroup.json
[16:34:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P37452 and previous config saved to /var/cache/conftool/dbconfig/20221101-163407-ladsgroup.json
[16:34:19] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for KMorgan - https://phabricator.wikimedia.org/T322154 (KMorgan-WMF)
[16:37:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T318950)', diff saved to https://phabricator.wikimedia.org/P37453 and previous config saved to /var/cache/conftool/dbconfig/20221101-163706-ladsgroup.json
[16:37:12] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[16:37:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P37454 and previous config saved to /var/cache/conftool/dbconfig/20221101-163713-ladsgroup.json
[16:38:16] <wikibugs>	 (PS1) JHathaway: aux-k8s: add LVS config for API server [puppet] - https://gerrit.wikimedia.org/r/851676 (https://phabricator.wikimedia.org/T321137)
[16:41:10] <wikibugs>	 SRE, Growth-Team, Notifications, serviceops, Wikimedia-production-error: Failed to fetch API response from {wiki}. Error code {code} - https://phabricator.wikimedia.org/T321409 (LSobanski)
[16:41:25] <logmsgbot>	 !log jhathaway@cumin1001 START - Cookbook sre.dns.netbox
[16:42:36] <logmsgbot>	 !log jhathaway@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:42:39] <elukey>	 .12
[16:42:43] <elukey>	 err :)
[16:44:18] <wikibugs>	 SRE, Infrastructure-Foundations: service::docker with 'latest' version behaves poorly if the host runs out of disk space - https://phabricator.wikimedia.org/T321851 (LSobanski)
[16:45:44] <wikibugs>	 (PS1) Ssingh: esitest: add explicit require on /run/esitest [puppet] - https://gerrit.wikimedia.org/r/851677
[16:46:48] <wikibugs>	 SRE, Infrastructure-Foundations, Goal: Meta task "Revamp user authentication" - https://phabricator.wikimedia.org/T116747 (LSobanski)
[16:47:17] <wikibugs>	 (CR) Ssingh: [C: +2] esitest: add explicit require on /run/esitest [puppet] - https://gerrit.wikimedia.org/r/851677 (owner: Ssingh)
[16:48:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T318950)', diff saved to https://phabricator.wikimedia.org/P37455 and previous config saved to /var/cache/conftool/dbconfig/20221101-164832-ladsgroup.json
[16:48:34] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1188.eqiad.wmnet with reason: Maintenance
[16:48:37] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1188.eqiad.wmnet with reason: Maintenance
[16:48:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1188 (T318950)', diff saved to https://phabricator.wikimedia.org/P37456 and previous config saved to /var/cache/conftool/dbconfig/20221101-164845-ladsgroup.json
[16:48:47] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4045.ulsfo.wmnet
[16:49:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T318955)', diff saved to https://phabricator.wikimedia.org/P37457 and previous config saved to /var/cache/conftool/dbconfig/20221101-164907-ladsgroup.json
[16:49:09] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2180.codfw.wmnet with reason: Maintenance
[16:49:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P37458 and previous config saved to /var/cache/conftool/dbconfig/20221101-164914-ladsgroup.json
[16:49:23] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2180.codfw.wmnet with reason: Maintenance
[16:49:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2180 (T318955)', diff saved to https://phabricator.wikimedia.org/P37459 and previous config saved to /var/cache/conftool/dbconfig/20221101-164930-ladsgroup.json
[16:49:37] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[16:50:30] <stashbot>	 T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[16:50:42] <wikibugs>	 (CR) JHathaway: "kindly review" [puppet] - https://gerrit.wikimedia.org/r/851676 (https://phabricator.wikimedia.org/T321137) (owner: JHathaway)
[16:50:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T318955)', diff saved to https://phabricator.wikimedia.org/P37460 and previous config saved to /var/cache/conftool/dbconfig/20221101-165042-ladsgroup.json
[16:51:00] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/851676 (https://phabricator.wikimedia.org/T321137) (owner: JHathaway)
[16:51:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T318950)', diff saved to https://phabricator.wikimedia.org/P37461 and previous config saved to /var/cache/conftool/dbconfig/20221101-165100-ladsgroup.json
[16:52:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P37462 and previous config saved to /var/cache/conftool/dbconfig/20221101-165221-ladsgroup.json
[16:53:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114 (T318605)', diff saved to https://phabricator.wikimedia.org/P37463 and previous config saved to /var/cache/conftool/dbconfig/20221101-165323-ladsgroup.json
[16:53:50] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[16:58:26] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4045.ulsfo.wmnet
[17:04:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T318955)', diff saved to https://phabricator.wikimedia.org/P37464 and previous config saved to /var/cache/conftool/dbconfig/20221101-170424-ladsgroup.json
[17:04:26] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1187.eqiad.wmnet with reason: Maintenance
[17:04:40] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1187.eqiad.wmnet with reason: Maintenance
[17:04:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1187 (T318955)', diff saved to https://phabricator.wikimedia.org/P37465 and previous config saved to /var/cache/conftool/dbconfig/20221101-170447-ladsgroup.json
[17:05:00] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4046.ulsfo.wmnet
[17:05:39] <stashbot>	 T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[17:05:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P37466 and previous config saved to /var/cache/conftool/dbconfig/20221101-170550-ladsgroup.json
[17:06:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P37467 and previous config saved to /var/cache/conftool/dbconfig/20221101-170608-ladsgroup.json
[17:06:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T318955)', diff saved to https://phabricator.wikimedia.org/P37468 and previous config saved to /var/cache/conftool/dbconfig/20221101-170656-ladsgroup.json
[17:07:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T318605)', diff saved to https://phabricator.wikimedia.org/P37469 and previous config saved to /var/cache/conftool/dbconfig/20221101-170730-ladsgroup.json
[17:07:32] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance
[17:07:46] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance
[17:07:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2163 (T318605)', diff saved to https://phabricator.wikimedia.org/P37470 and previous config saved to /var/cache/conftool/dbconfig/20221101-170752-ladsgroup.json
[17:08:00] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[17:08:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114', diff saved to https://phabricator.wikimedia.org/P37471 and previous config saved to /var/cache/conftool/dbconfig/20221101-170832-ladsgroup.json
[17:12:54] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:14:08] <wikibugs>	 (PS1) David Caro: webservice: add toolforge-* link for it [puppet] - https://gerrit.wikimedia.org/r/851685
[17:14:37] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4046.ulsfo.wmnet
[17:14:44] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4047.ulsfo.wmnet
[17:14:44] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48974 bytes in 0.061 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:19:16] <icinga-wm>	 PROBLEM - Check systemd state on cp5007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:20:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P37472 and previous config saved to /var/cache/conftool/dbconfig/20221101-172058-ladsgroup.json
[17:21:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P37473 and previous config saved to /var/cache/conftool/dbconfig/20221101-172116-ladsgroup.json
[17:22:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P37474 and previous config saved to /var/cache/conftool/dbconfig/20221101-172204-ladsgroup.json
[17:23:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114', diff saved to https://phabricator.wikimedia.org/P37475 and previous config saved to /var/cache/conftool/dbconfig/20221101-172341-ladsgroup.json
[17:24:32] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4047.ulsfo.wmnet
[17:24:39] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4048.ulsfo.wmnet
[17:26:24] <wikibugs>	 (CR) Majavah: [C: -1] "I'd prefer to do this in the Debian packaging and not here." [puppet] - https://gerrit.wikimedia.org/r/851685 (owner: David Caro)
[17:31:18] <wikibugs>	 (CR) David Caro: webservice: add toolforge-* link for it (2 comments) [puppet] - https://gerrit.wikimedia.org/r/851685 (owner: David Caro)
[17:33:04] <icinga-wm>	 RECOVERY - Check systemd state on cp5007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:34:09] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4048.ulsfo.wmnet
[17:34:16] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4049.ulsfo.wmnet
[17:35:27] <wikibugs>	 (PS3) BBlack: Switch drmrs, eqsin, esams to digicert-2022 [puppet] - https://gerrit.wikimedia.org/r/850287 (https://phabricator.wikimedia.org/T313328)
[17:35:44] <wikibugs>	 (PS1) Ssingh: haproxy: use systemd::tmpfile to create /run/haproxy (and esitest) [puppet] - https://gerrit.wikimedia.org/r/851689
[17:36:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T318955)', diff saved to https://phabricator.wikimedia.org/P37476 and previous config saved to /var/cache/conftool/dbconfig/20221101-173607-ladsgroup.json
[17:36:15] <wikibugs>	 (PS1) Btullis: Add a postgresql database for the airflow development [puppet] - https://gerrit.wikimedia.org/r/851690 (https://phabricator.wikimedia.org/T319440)
[17:36:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T318950)', diff saved to https://phabricator.wikimedia.org/P37477 and previous config saved to /var/cache/conftool/dbconfig/20221101-173624-ladsgroup.json
[17:36:26] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1197.eqiad.wmnet with reason: Maintenance
[17:36:28] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1197.eqiad.wmnet with reason: Maintenance
[17:36:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1197 (T318950)', diff saved to https://phabricator.wikimedia.org/P37478 and previous config saved to /var/cache/conftool/dbconfig/20221101-173636-ladsgroup.json
[17:36:41] <stashbot>	 T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[17:36:46] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[17:37:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P37479 and previous config saved to /var/cache/conftool/dbconfig/20221101-173712-ladsgroup.json
[17:37:50] <wikibugs>	 (CR) Btullis: [C: +2] Add a postgresql database for the airflow development [puppet] - https://gerrit.wikimedia.org/r/851690 (https://phabricator.wikimedia.org/T319440) (owner: Btullis)
[17:38:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114 (T318605)', diff saved to https://phabricator.wikimedia.org/P37480 and previous config saved to /var/cache/conftool/dbconfig/20221101-173848-ladsgroup.json
[17:38:50] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1116.eqiad.wmnet with reason: Maintenance
[17:39:03] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1116.eqiad.wmnet with reason: Maintenance
[17:40:17] <wikibugs>	 (CR) Jelto: [C: +2] Allow rsync to doc.discovery.wmnet from trusted runner containers [puppet] - https://gerrit.wikimedia.org/r/850635 (https://phabricator.wikimedia.org/T321629) (owner: Ahmon Dancy)
[17:40:33] <dancy>	 Thanks Jelto!
[17:40:58] <icinga-wm>	 PROBLEM - Check systemd state on cp5007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:41:20] <sukhe>	 huh!
[17:41:28] <sukhe>	 this is new
[17:41:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T318605)', diff saved to https://phabricator.wikimedia.org/P37481 and previous config saved to /var/cache/conftool/dbconfig/20221101-174129-ladsgroup.json
[17:41:32] <wikibugs>	 (PS2) David Caro: webservice: add toolforge-* link for it [puppet] - https://gerrit.wikimedia.org/r/851685
[17:41:34] <wikibugs>	 (CR) David Caro: webservice: add toolforge-* link for it (1 comment) [puppet] - https://gerrit.wikimedia.org/r/851685 (owner: David Caro)
[17:42:26] <wikibugs>	 (CR) David Caro: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37886/console" [puppet] - https://gerrit.wikimedia.org/r/851685 (owner: David Caro)
[17:44:04] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4049.ulsfo.wmnet
[17:44:10] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4050.ulsfo.wmnet
[17:44:52] <wikibugs>	 (PS1) Filippo Giunchedi: dispatch: refactor/simplify db profile [puppet] - https://gerrit.wikimedia.org/r/851693 (https://phabricator.wikimedia.org/T313229)
[17:47:36] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37887/console" [puppet] - https://gerrit.wikimedia.org/r/851693 (https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi)
[17:48:52] <icinga-wm>	 RECOVERY - Check systemd state on cp5007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:51:13] <wikibugs>	 (CR) CDanis: [C: +1] haproxy: use systemd::tmpfile to create /run/haproxy (and esitest) [puppet] - https://gerrit.wikimedia.org/r/851689 (owner: Ssingh)
[17:52:14] <wikibugs>	 (PS6) BCornwall: varnish: Conditionally set WMF-Last-Access cookie [puppet] - https://gerrit.wikimedia.org/r/849184 (https://phabricator.wikimedia.org/T262996)
[17:52:16] <wikibugs>	 (PS2) Ssingh: haproxy: use systemd::tmpfile to create /run/haproxy (and esitest) [puppet] - https://gerrit.wikimedia.org/r/851689
[17:52:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T318955)', diff saved to https://phabricator.wikimedia.org/P37482 and previous config saved to /var/cache/conftool/dbconfig/20221101-175221-ladsgroup.json
[17:52:24] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1201.eqiad.wmnet with reason: Maintenance
[17:52:37] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1201.eqiad.wmnet with reason: Maintenance
[17:52:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1201 (T318955)', diff saved to https://phabricator.wikimedia.org/P37483 and previous config saved to /var/cache/conftool/dbconfig/20221101-175244-ladsgroup.json
[17:53:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:53:40] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4050.ulsfo.wmnet
[17:53:46] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4051.ulsfo.wmnet
[17:53:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T318955)', diff saved to https://phabricator.wikimedia.org/P37484 and previous config saved to /var/cache/conftool/dbconfig/20221101-175353-ladsgroup.json
[17:54:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P37485 and previous config saved to /var/cache/conftool/dbconfig/20221101-175405-ladsgroup.json
[17:54:12] <wikibugs>	 (CR) Ssingh: [C: +2] haproxy: use systemd::tmpfile to create /run/haproxy (and esitest) [puppet] - https://gerrit.wikimedia.org/r/851689 (owner: Ssingh)
[17:56:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P37486 and previous config saved to /var/cache/conftool/dbconfig/20221101-175639-ladsgroup.json
[17:57:53] <wikibugs>	 SRE-swift-storage, Community-Tech, MediaWiki-extensions-Phonos, MW-1.40-notes (1.40.0-wmf.6; 2022-10-17): Establish Phonos production storage requirements - https://phabricator.wikimedia.org/T320675 (MusikAnimal)
[17:58:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:00:04] <jouncebot>	 jeena and jnuche: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Utc-7+Utc-0 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221101T1800).
[18:03:53] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4051.ulsfo.wmnet
[18:04:00] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4052.ulsfo.wmnet
[18:07:07] <logmsgbot>	 !log jhuneidi@deploy1002 Started scap: testwikis wikis to 1.40.0-wmf.8  refs T320513
[18:07:13] <stashbot>	 T320513: 1.40.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T320513
[18:08:14] <icinga-wm>	 PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.0104 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[18:09:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:09:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P37487 and previous config saved to /var/cache/conftool/dbconfig/20221101-180902-ladsgroup.json
[18:09:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P37488 and previous config saved to /var/cache/conftool/dbconfig/20221101-180913-ladsgroup.json
[18:09:23] <wikibugs>	 (PS1) BCornwall: DO NOT MERGE: Testing the test suite [puppet] - https://gerrit.wikimedia.org/r/851696
[18:11:25] <logmsgbot>	 !log jhuneidi@deploy1002 Finished scap: testwikis wikis to 1.40.0-wmf.8  refs T320513 (duration: 04m 18s)
[18:11:43] <wikibugs>	 (PS2) BCornwall: DO NOT MERGE: Testing the test suite [puppet] - https://gerrit.wikimedia.org/r/851696
[18:11:45] <wikibugs>	 (PS7) BCornwall: varnish: Conditionally set WMF-Last-Access cookie [puppet] - https://gerrit.wikimedia.org/r/849184 (https://phabricator.wikimedia.org/T262996)
[18:11:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P37489 and previous config saved to /var/cache/conftool/dbconfig/20221101-181148-ladsgroup.json
[18:12:46] <wikibugs>	 (PS1) TrainBranchBot: group0 wikis to 1.40.0-wmf.8 [mediawiki-config] - https://gerrit.wikimedia.org/r/851697 (https://phabricator.wikimedia.org/T320513)
[18:12:48] <wikibugs>	 (CR) TrainBranchBot: [C: +2] group0 wikis to 1.40.0-wmf.8 [mediawiki-config] - https://gerrit.wikimedia.org/r/851697 (https://phabricator.wikimedia.org/T320513) (owner: TrainBranchBot)
[18:12:48] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1126.eqiad.wmnet with reason: Maintenance
[18:13:02] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1126.eqiad.wmnet with reason: Maintenance
[18:13:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1126 (T318605)', diff saved to https://phabricator.wikimedia.org/P37490 and previous config saved to /var/cache/conftool/dbconfig/20221101-181310-ladsgroup.json
[18:13:28] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4052.ulsfo.wmnet
[18:14:04] <wikibugs>	 (Merged) jenkins-bot: group0 wikis to 1.40.0-wmf.8 [mediawiki-config] - https://gerrit.wikimedia.org/r/851697 (https://phabricator.wikimedia.org/T320513) (owner: TrainBranchBot)
[18:14:16] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[18:18:14] <logmsgbot>	 !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.40.0-wmf.8  refs T320513
[18:18:20] <stashbot>	 T320513: 1.40.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T320513
[18:20:08] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), No backups: 1 (dispatch-be1001), Fresh: 123 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[18:21:21] <wikibugs>	 (PS1) Ssingh: esitest: do not require on obsoleted file resource [puppet] - https://gerrit.wikimedia.org/r/851698
[18:21:32] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[18:22:13] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp4052.ulsfo.wmnet,service=ats-tls
[18:22:14] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp4052.ulsfo.wmnet,service=ats-be
[18:22:14] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp4052.ulsfo.wmnet,service=varnish-fe
[18:22:23] <wikibugs>	 (CR) Ssingh: [C: +2] esitest: do not require on obsoleted file resource [puppet] - https://gerrit.wikimedia.org/r/851698 (owner: Ssingh)
[18:22:27] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[18:22:28] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[18:22:52] <wikibugs>	 (PS2) JHathaway: aux-k8s: add LVS config for API server [puppet] - https://gerrit.wikimedia.org/r/851676 (https://phabricator.wikimedia.org/T321137)
[18:23:09] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/851676 (https://phabricator.wikimedia.org/T321137) (owner: JHathaway)
[18:23:27] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[18:24:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P37491 and previous config saved to /var/cache/conftool/dbconfig/20221101-182412-ladsgroup.json
[18:24:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T318950)', diff saved to https://phabricator.wikimedia.org/P37492 and previous config saved to /var/cache/conftool/dbconfig/20221101-182421-ladsgroup.json
[18:24:23] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[18:24:26] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[18:25:48] <stashbot>	 T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[18:26:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T318605)', diff saved to https://phabricator.wikimedia.org/P37493 and previous config saved to /var/cache/conftool/dbconfig/20221101-182655-ladsgroup.json
[18:26:58] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance
[18:27:04] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[18:27:11] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance
[18:27:12] <wikibugs>	 (PS1) Ssingh: esitest: do not require on obsoleted file resource (attempt 2) [puppet] - https://gerrit.wikimedia.org/r/851699
[18:27:12] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance
[18:27:26] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance
[18:27:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2164 (T318605)', diff saved to https://phabricator.wikimedia.org/P37494 and previous config saved to /var/cache/conftool/dbconfig/20221101-182734-ladsgroup.json
[18:27:42] <icinga-wm>	 PROBLEM - SSH on mw1330.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:27:48] <wikibugs>	 SRE, ops-eqiad, decommission-hardware: decommission frauth1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T314517 (Jgreen)
[18:27:57] <wikibugs>	 (CR) Ssingh: [C: +2] esitest: do not require on obsoleted file resource (attempt 2) [puppet] - https://gerrit.wikimedia.org/r/851699 (owner: Ssingh)
[18:28:41] <wikibugs>	 SRE, ops-eqiad, decommission-hardware: decommission frlog1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T315924 (Jgreen)
[18:29:05] <wikibugs>	 (PS5) BCornwall: prometheus: Add ats header/body size total metrics [puppet] - https://gerrit.wikimedia.org/r/845688 (https://phabricator.wikimedia.org/T284304)
[18:29:07] <wikibugs>	 (PS3) BCornwall: DO NOT MERGE: Testing the test suite [puppet] - https://gerrit.wikimedia.org/r/851696
[18:29:09] <wikibugs>	 (PS8) BCornwall: varnish: Conditionally set WMF-Last-Access cookie [puppet] - https://gerrit.wikimedia.org/r/849184 (https://phabricator.wikimedia.org/T262996)
[18:32:14] <icinga-wm>	 RECOVERY - Disk space on cp5007 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=cp5007&var-datasource=eqsin+prometheus/ops
[18:39:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T318955)', diff saved to https://phabricator.wikimedia.org/P37495 and previous config saved to /var/cache/conftool/dbconfig/20221101-183920-ladsgroup.json
[18:39:22] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
[18:39:26] <stashbot>	 T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[18:39:36] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
[18:45:14] <icinga-wm>	 PROBLEM - SSH on mw1307.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:45:25] <wikibugs>	 (PS9) BCornwall: varnish: Conditionally set WMF-Last-Access cookie [puppet] - https://gerrit.wikimedia.org/r/849184 (https://phabricator.wikimedia.org/T262996)
[18:45:33] <wikibugs>	 (Abandoned) BCornwall: DO NOT MERGE: Testing the test suite [puppet] - https://gerrit.wikimedia.org/r/851696 (owner: BCornwall)
[18:47:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126 (T318605)', diff saved to https://phabricator.wikimedia.org/P37496 and previous config saved to /var/cache/conftool/dbconfig/20221101-184758-ladsgroup.json
[18:48:15] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[18:56:54] <icinga-wm>	 RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.005446 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[18:59:13] <jinxer-wm>	 (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[19:01:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T318605)', diff saved to https://phabricator.wikimedia.org/P37497 and previous config saved to /var/cache/conftool/dbconfig/20221101-190132-ladsgroup.json
[19:01:40] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[19:03:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126', diff saved to https://phabricator.wikimedia.org/P37498 and previous config saved to /var/cache/conftool/dbconfig/20221101-190307-ladsgroup.json
[19:04:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[19:04:38] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:08:20] <wikibugs>	 (CR) Ottomata: [C: +2] Declare rc0.mediawiki.page_content_change stream [mediawiki-config] - https://gerrit.wikimedia.org/r/851673 (https://phabricator.wikimedia.org/T307959) (owner: Ottomata)
[19:09:03] <wikibugs>	 (Merged) jenkins-bot: Declare rc0.mediawiki.page_content_change stream [mediawiki-config] - https://gerrit.wikimedia.org/r/851673 (https://phabricator.wikimedia.org/T307959) (owner: Ottomata)
[19:09:15] <ottomata>	 ahh jeena i just merged a mediawiki-config change, but i just noticed that the train window is now.  it is a total no-op
[19:09:59] <jeena>	 I've already deployed today's train an hour ago actually
[19:10:31] <jeena>	 Do you need it backported?
[19:10:39] <ottomata>	 oh okay!  no i can deploy it
[19:10:46] <jeena>	 Okay :)
[19:10:51] <ottomata>	 it's just pre-declaring a new stream so gmodena can work on it later
[19:11:05] <ottomata>	 i also will deploy another that enables some stuff on group0 wikis too, if you don't mind.
[19:11:14] <ottomata>	 thank you!
[19:11:43] <jeena>	 No problem, that should be fine 👍
[19:11:45] <ottomata>	 ty
[19:11:54] <jeena>	 Thanks for checking in!
[19:14:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[19:14:33] <wikibugs>	 (PS1) Ottomata: Enable rc0.mediawiki.page_change on group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/851705 (https://phabricator.wikimedia.org/T311129)
[19:15:30] <logmsgbot>	 !log otto@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Declare rc0.mediawiki.page_content_change stream - T307959 T308017 (duration: 03m 42s)
[19:15:37] <stashbot>	 T308017: Design Schema for page state and page state with content (enriched) streams - https://phabricator.wikimedia.org/T308017
[19:15:38] <stashbot>	 T307959: [Shared Event Platform] Design and Implement POC Flink Service to Combine Existing Streams, Enrich and Output to New Topic - https://phabricator.wikimedia.org/T307959
[19:15:55] <wikibugs>	 (CR) BBlack: [C: +1] "Looks right, I think 😊" [puppet] - https://gerrit.wikimedia.org/r/851676 (https://phabricator.wikimedia.org/T321137) (owner: JHathaway)
[19:16:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P37499 and previous config saved to /var/cache/conftool/dbconfig/20221101-191639-ladsgroup.json
[19:17:56] <wikibugs>	 (CR) Ottomata: [C: +2] Enable rc0.mediawiki.page_change on group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/851705 (https://phabricator.wikimedia.org/T311129) (owner: Ottomata)
[19:18:05] <wikibugs>	 (CR) JHathaway: [C: +2] aux-k8s: add LVS config for API server [puppet] - https://gerrit.wikimedia.org/r/851676 (https://phabricator.wikimedia.org/T321137) (owner: JHathaway)
[19:18:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126', diff saved to https://phabricator.wikimedia.org/P37500 and previous config saved to /var/cache/conftool/dbconfig/20221101-191815-ladsgroup.json
[19:18:23] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[19:18:24] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[19:19:12] <wikibugs>	 (Merged) jenkins-bot: Enable rc0.mediawiki.page_change on group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/851705 (https://phabricator.wikimedia.org/T311129) (owner: Ottomata)
[19:19:23] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[19:20:59] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Discovery-Search (Current work): Q1:rerack elastic10[53-67] - https://phabricator.wikimedia.org/T322082 (RKemper)
[19:21:10] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[19:24:27] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[19:25:26] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[19:25:27] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[19:26:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[19:31:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P37501 and previous config saved to /var/cache/conftool/dbconfig/20221101-193148-ladsgroup.json
[19:33:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126 (T318605)', diff saved to https://phabricator.wikimedia.org/P37502 and previous config saved to /var/cache/conftool/dbconfig/20221101-193323-ladsgroup.json
[19:33:25] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance
[19:33:39] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance
[19:33:40] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[19:33:50] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[19:33:56] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[19:34:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1167 (T318605)', diff saved to https://phabricator.wikimedia.org/P37503 and previous config saved to /var/cache/conftool/dbconfig/20221101-193404-ladsgroup.json
[19:36:01] <logmsgbot>	 !log otto@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Enable rc0.mediawiki.page_change on group0 wikis - T311129 (duration: 03m 38s)
[19:36:06] <jinxer-wm>	 (ConfdResourceFailed) firing: confd resource _srv_config-master_pybal_eqiad_aux-k8s-ctrl.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[19:37:58] <stashbot>	 T311129: [Shared Event Platform] Produce new mediawiki.page-change stream from MediaWiki EventBus - https://phabricator.wikimedia.org/T311129
[19:39:15] <jinxer-wm>	 (KubernetesAPILatency) firing: (9) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[19:41:05] <logmsgbot>	 !log jhathaway@puppetmaster1001 conftool action : set/pooled=yes:weight=1; selector: cluster=aux-k8s,service=kubemaster
[19:44:41] <wikibugs>	 (PS1) JHathaway: aux-k8s: enable LVS config for API server [puppet] - https://gerrit.wikimedia.org/r/851708 (https://phabricator.wikimedia.org/T321137)
[19:46:04] <icinga-wm>	 RECOVERY - SSH on mw1307.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:46:06] <jinxer-wm>	 (ConfdResourceFailed) resolved: confd resource _srv_config-master_pybal_eqiad_aux-k8s-ctrl.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[19:46:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T318605)', diff saved to https://phabricator.wikimedia.org/P37504 and previous config saved to /var/cache/conftool/dbconfig/20221101-194655-ladsgroup.json
[19:46:57] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance
[19:47:00] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[19:47:10] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance
[19:47:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2166 (T318605)', diff saved to https://phabricator.wikimedia.org/P37505 and previous config saved to /var/cache/conftool/dbconfig/20221101-194718-ladsgroup.json
[19:52:03] <wikibugs>	 (CR) JHathaway: [C: +2] aux-k8s: enable LVS config for API server [puppet] - https://gerrit.wikimedia.org/r/851708 (https://phabricator.wikimedia.org/T321137) (owner: JHathaway)
[19:55:05] <wikibugs>	 (PS1) Bking: query_service: Ensure prometheus exporter depends on blazegraph service [puppet] - https://gerrit.wikimedia.org/r/851711 (https://phabricator.wikimedia.org/T322037)
[19:56:42] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/851711 (https://phabricator.wikimedia.org/T322037) (owner: Bking)
[19:57:11] <wikibugs>	 (CR) Ryan Kemper: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37899/console" [puppet] - https://gerrit.wikimedia.org/r/851711 (https://phabricator.wikimedia.org/T322037) (owner: Bking)
[19:59:00] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.74:6443]) https://wikitech.wikimedia.org/wiki/PyBal
[19:59:22] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs1020 is CRITICAL: CRITICAL: 118 connections established with conf1007.eqiad.wmnet:4001 (min=119) https://wikitech.wikimedia.org/wiki/PyBal
[20:00:04] <jouncebot>	 RoanKattouw, Urbanecm, cjming, and TheresNoTime: (Dis)respected human, time to deploy UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221101T2000). Please do the needful.
[20:00:04] <jouncebot>	 MatmaRex and cjming: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:07] <wikibugs>	 (CR) Gehel: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/851711 (https://phabricator.wikimedia.org/T322037) (owner: Bking)
[20:00:13] <cjming>	 o/ i can deploy
[20:00:15] <MatmaRex>	 sup
[20:00:27] <wikibugs>	 (PS4) Clare Ming: Enable DiscussionTools mobile visual enhancements at jawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/843581 (https://phabricator.wikimedia.org/T318870) (owner: Bartosz Dziewoński)
[20:00:29] <MatmaRex>	 thanks
[20:00:58] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ganeti1016.eqiad.wmnet with reason: Remove from cluster for eventual reimage
[20:01:01] <urbanecm>	 thanks cjming 
[20:01:08] <cjming>	 np!
[20:01:13] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ganeti1016.eqiad.wmnet with reason: Remove from cluster for eventual reimage
[20:01:38] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs1019 is CRITICAL: CRITICAL: 72 connections established with conf1007.eqiad.wmnet:4001 (min=73) https://wikitech.wikimedia.org/wiki/PyBal
[20:02:04] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/843581 (https://phabricator.wikimedia.org/T318870) (owner: Bartosz Dziewoński)
[20:02:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[20:02:56] <wikibugs>	 (Merged) jenkins-bot: Enable DiscussionTools mobile visual enhancements at jawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/843581 (https://phabricator.wikimedia.org/T318870) (owner: Bartosz Dziewoński)
[20:03:14] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.74:6443]) https://wikitech.wikimedia.org/wiki/PyBal
[20:03:17] <logmsgbot>	 !log cjming@deploy1002 Started scap: Backport for [[gerrit:843581|Enable DiscussionTools mobile visual enhancements at jawiki (T318870)]]
[20:03:42] <logmsgbot>	 !log cjming@deploy1002 cjming and matmarex: Backport for [[gerrit:843581|Enable DiscussionTools mobile visual enhancements at jawiki (T318870)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet
[20:03:47] <cjming>	 MatmaRex: your 1st patch is up on debug servers - shall i sync?
[20:04:32] <wikibugs>	 (PS1) DDesouza: Remove Research Incentive survey from enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/851713 (https://phabricator.wikimedia.org/T318333)
[20:04:45] <MatmaRex>	 cjming: yup, looks good
[20:05:16] <cjming>	 cool - syncing - moving on to your 2nd patch
[20:06:43] <ryankemper>	 !log T322037 Disabled puppet across `A:wdqs-all` and `A:wcqs-public`
[20:06:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:06:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[20:07:06] <wikibugs>	 (CR) Bking: [C: +2] query_service: Ensure prometheus exporter depends on blazegraph service [puppet] - https://gerrit.wikimedia.org/r/851711 (https://phabricator.wikimedia.org/T322037) (owner: Bking)
[20:07:13] <stashbot>	 T322037: Add blazegraph as systemd dependency of prometheus-blazegraph-exporter service - https://phabricator.wikimedia.org/T322037
[20:07:40] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[20:07:41] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[20:07:56] <wikibugs>	 (PS1) DDesouza: Deploy Research Incentive survey on frwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/851714 (https://phabricator.wikimedia.org/T321930)
[20:08:38] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[20:08:49] <logmsgbot>	 !log cjming@deploy1002 Finished scap: Backport for [[gerrit:843581|Enable DiscussionTools mobile visual enhancements at jawiki (T318870)]] (duration: 05m 31s)
[20:09:53] <wikibugs>	 (PS3) Clare Ming: Enable DiscussionTools visual enhancements beta feature at jawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/851125 (https://phabricator.wikimedia.org/T318127) (owner: Bartosz Dziewoński)
[20:09:55] <stashbot>	 T318870: [Config Change] Enable all DiscussionTools by default at ja.wiki (mobile) - https://phabricator.wikimedia.org/T318870
[20:10:53] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/851125 (https://phabricator.wikimedia.org/T318127) (owner: Bartosz Dziewoński)
[20:11:16] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs1020 is OK: OK: 119 connections established with conf1007.eqiad.wmnet:4001 (min=119) https://wikitech.wikimedia.org/wiki/PyBal
[20:11:41] <wikibugs>	 (Merged) jenkins-bot: Enable DiscussionTools visual enhancements beta feature at jawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/851125 (https://phabricator.wikimedia.org/T318127) (owner: Bartosz Dziewoński)
[20:12:04] <logmsgbot>	 !log cjming@deploy1002 Started scap: Backport for [[gerrit:851125|Enable DiscussionTools visual enhancements beta feature at jawiki (T318127)]]
[20:12:12] <stashbot>	 T318127: [Config Change] Enable Topic Containers as beta feature at Phase 3 wikis (desktop) - https://phabricator.wikimedia.org/T318127
[20:12:27] <logmsgbot>	 !log cjming@deploy1002 cjming and matmarex: Backport for [[gerrit:851125|Enable DiscussionTools visual enhancements beta feature at jawiki (T318127)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet
[20:12:31] <cjming>	 MatmaRex: 2nd patch up on debug servers if you want to check
[20:13:02] <MatmaRex>	 cjming: checked, also looks good
[20:13:09] <cjming>	 going live
[20:14:08] <icinga-wm>	 PROBLEM - Check if pdns-recursor.service has been restarted after /etc/powerdns/recursor.conf was changed on doh2001 is CRITICAL: CRITICAL: Service pdns-recursor.service has not been restarted after /etc/powerdns/recursor.conf was changed (stale by 521310.76s). https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Service_Restart_Check
[20:15:08] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[20:15:32] <sukhe>	 ha
[20:16:59] <logmsgbot>	 !log cjming@deploy1002 Finished scap: Backport for [[gerrit:851125|Enable DiscussionTools visual enhancements beta feature at jawiki (T318127)]] (duration: 04m 55s)
[20:17:15] <cjming>	 MatmaRex: both patches should be live!
[20:17:23] <cjming>	 moving onto my patches next
[20:17:34] <MatmaRex>	 thanks!
[20:17:39] <cjming>	 np!
[20:18:04] <stashbot>	 T318127: [Config Change] Enable Topic Containers as beta feature at Phase 3 wikis (desktop) - https://phabricator.wikimedia.org/T318127
[20:18:24] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/851128 (https://phabricator.wikimedia.org/T309602) (owner: Clare Ming)
[20:18:38] <wikibugs>	 (PS4) Clare Ming: Add MP stream for VisualEditorFeatureUse instrument [mediawiki-config] - https://gerrit.wikimedia.org/r/851128 (https://phabricator.wikimedia.org/T309602)
[20:18:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[20:19:12] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-blazegraph-exporter-wdqs-blazegraph.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:19:28] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs1019 is OK: OK: 73 connections established with conf1007.eqiad.wmnet:4001 (min=73) https://wikitech.wikimedia.org/wiki/PyBal
[20:19:35] <wikibugs>	 (CR) TrainBranchBot: "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/851128 (https://phabricator.wikimedia.org/T309602) (owner: Clare Ming)
[20:19:40] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[20:19:42] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[20:20:20] <wikibugs>	 (Merged) jenkins-bot: Add MP stream for VisualEditorFeatureUse instrument [mediawiki-config] - https://gerrit.wikimedia.org/r/851128 (https://phabricator.wikimedia.org/T309602) (owner: Clare Ming)
[20:20:38] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[20:20:42] <logmsgbot>	 !log cjming@deploy1002 Started scap: Backport for [[gerrit:851128|Add MP stream for VisualEditorFeatureUse instrument (T309602)]]
[20:20:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T318605)', diff saved to https://phabricator.wikimedia.org/P37506 and previous config saved to /var/cache/conftool/dbconfig/20221101-202059-ladsgroup.json
[20:21:05] <logmsgbot>	 !log cjming@deploy1002 cjming and cjming: Backport for [[gerrit:851128|Add MP stream for VisualEditorFeatureUse instrument (T309602)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet
[20:21:09] <sukhe>	 m/win 14
[20:21:26] <stashbot>	 T309602: VisualEditorFeatureUse Migration to MP - https://phabricator.wikimedia.org/T309602
[20:21:34] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[20:22:46] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[20:23:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (9) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:24:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T318605)', diff saved to https://phabricator.wikimedia.org/P37507 and previous config saved to /var/cache/conftool/dbconfig/20221101-202449-ladsgroup.json
[20:25:18] <logmsgbot>	 !log cjming@deploy1002 Finished scap: Backport for [[gerrit:851128|Add MP stream for VisualEditorFeatureUse instrument (T309602)]] (duration: 04m 36s)
[20:25:39] <wikibugs>	 (PS1) JHathaway: aux-k8s: note that LVS config for API server is production [puppet] - https://gerrit.wikimedia.org/r/851717 (https://phabricator.wikimedia.org/T321137)
[20:25:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[20:25:50] <wikibugs>	 (PS2) Clare Ming: Update Edit Attempt Step sampling rate to 1 for group 0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/851652 (https://phabricator.wikimedia.org/T312016)
[20:26:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[20:26:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[20:26:52] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[20:27:27] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/851652 (https://phabricator.wikimedia.org/T312016) (owner: Clare Ming)
[20:27:46] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[20:28:01] <stashbot>	 T309602: VisualEditorFeatureUse Migration to MP - https://phabricator.wikimedia.org/T309602
[20:28:32] <wikibugs>	 (Merged) jenkins-bot: Update Edit Attempt Step sampling rate to 1 for group 0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/851652 (https://phabricator.wikimedia.org/T312016) (owner: Clare Ming)
[20:28:54] <logmsgbot>	 !log cjming@deploy1002 Started scap: Backport for [[gerrit:851652|Update Edit Attempt Step sampling rate to 1 for group 0 wikis (T312016)]]
[20:29:17] <logmsgbot>	 !log cjming@deploy1002 cjming and cjming: Backport for [[gerrit:851652|Update Edit Attempt Step sampling rate to 1 for group 0 wikis (T312016)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet
[20:29:26] <icinga-wm>	 RECOVERY - SSH on mw1330.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:29:53] <stashbot>	 T312016: Increase EditAttemptStep sampling rate(s) to 100% - https://phabricator.wikimedia.org/T312016
[20:31:34] <wikibugs>	 (PS1) Jforrester: onSpecialSearchCreateLink: Handle null from Title::newFromText [extensions/WikimediaIncubator] (wmf/1.40.0-wmf.8) - https://gerrit.wikimedia.org/r/851016 (https://phabricator.wikimedia.org/T320736)
[20:31:38] <wikibugs>	 (PS1) Jforrester: onSpecialSearchCreateLink: Handle another null from Title::newFromText [extensions/WikimediaIncubator] (wmf/1.40.0-wmf.8) - https://gerrit.wikimedia.org/r/851017 (https://phabricator.wikimedia.org/T320736)
[20:32:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[20:33:23] <logmsgbot>	 !log cjming@deploy1002 Finished scap: Backport for [[gerrit:851652|Update Edit Attempt Step sampling rate to 1 for group 0 wikis (T312016)]] (duration: 04m 29s)
[20:33:42] <wikibugs>	 (CR) JHathaway: [C: +2] aux-k8s: note that LVS config for API server is production [puppet] - https://gerrit.wikimedia.org/r/851717 (https://phabricator.wikimedia.org/T321137) (owner: JHathaway)
[20:33:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[20:33:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[20:34:25] <cjming>	 !log end of UTC late backport window
[20:34:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:34:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[20:35:21] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:36:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P37508 and previous config saved to /var/cache/conftool/dbconfig/20221101-203607-ladsgroup.json
[20:37:45] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 131 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:39:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P37509 and previous config saved to /var/cache/conftool/dbconfig/20221101-203957-ladsgroup.json
[20:44:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:45:07] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: (C)100 gt (W)50 gt 14 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:49:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:51:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P37510 and previous config saved to /var/cache/conftool/dbconfig/20221101-205115-ladsgroup.json
[20:53:07] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[20:53:42] <wikibugs>	 (PS1) Bking: Revert "query_service: Ensure prometheus exporter depends on blazegraph service" [puppet] - https://gerrit.wikimedia.org/r/851018
[20:55:02] <wikibugs>	 (CR) Bking: [C: +2] Revert "query_service: Ensure prometheus exporter depends on blazegraph service" [puppet] - https://gerrit.wikimedia.org/r/851018 (owner: Bking)
[20:55:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P37511 and previous config saved to /var/cache/conftool/dbconfig/20221101-205505-ladsgroup.json
[20:56:15] <ryankemper>	 !log T322037 Re-enabled puppet across `A:wdqs-all` and `A:wcqs-public`
[20:56:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:56:30] <stashbot>	 T322037: Add blazegraph as systemd dependency of prometheus-blazegraph-exporter service - https://phabricator.wikimedia.org/T322037
[21:02:00] <wikibugs>	 (CR) Jon Harald Søby: [C: +1] onSpecialSearchCreateLink: Handle null from Title::newFromText [extensions/WikimediaIncubator] (wmf/1.40.0-wmf.8) - https://gerrit.wikimedia.org/r/851016 (https://phabricator.wikimedia.org/T320736) (owner: Jforrester)
[21:02:05] <wikibugs>	 (CR) Jon Harald Søby: [C: +1] onSpecialSearchCreateLink: Handle another null from Title::newFromText [extensions/WikimediaIncubator] (wmf/1.40.0-wmf.8) - https://gerrit.wikimedia.org/r/851017 (https://phabricator.wikimedia.org/T320736) (owner: Jforrester)
[21:06:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T318605)', diff saved to https://phabricator.wikimedia.org/P37512 and previous config saved to /var/cache/conftool/dbconfig/20221101-210622-ladsgroup.json
[21:06:24] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance
[21:06:49] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance
[21:06:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2167:3318 (T318605)', diff saved to https://phabricator.wikimedia.org/P37513 and previous config saved to /var/cache/conftool/dbconfig/20221101-210658-ladsgroup.json
[21:07:37] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[21:10:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T318605)', diff saved to https://phabricator.wikimedia.org/P37514 and previous config saved to /var/cache/conftool/dbconfig/20221101-211013-ladsgroup.json
[21:10:15] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[21:10:28] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[21:15:27] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:17:41] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 124 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:19:39] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: (C)100 gt (W)50 gt 47 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:20:36] <logmsgbot>	 !log jhathaway@cumin1001 START - Cookbook sre.ganeti.makevm for new host aux-k8s-worker1001.eqiad.wmnet
[21:20:37] <logmsgbot>	 !log jhathaway@cumin1001 START - Cookbook sre.dns.netbox
[21:21:23] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:23:51] <wikibugs>	 (PS4) Jdlrobson: WIP: Fix remaining Wikipedia logos [mediawiki-config] - https://gerrit.wikimedia.org/r/849175 (https://phabricator.wikimedia.org/T319223)
[21:27:31] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 117 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:28:18] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:28:23] <logmsgbot>	 !log jhathaway@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:28:23] <logmsgbot>	 !log jhathaway@cumin1001 START - Cookbook sre.dns.wipe-cache aux-k8s-worker1001.eqiad.wmnet on all recursors
[21:28:26] <logmsgbot>	 !log jhathaway@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) aux-k8s-worker1001.eqiad.wmnet on all recursors
[21:28:51] <wikibugs>	 (PS1) Clare Ming: Update config for Metrics Platform VEFU events [mediawiki-config] - https://gerrit.wikimedia.org/r/851723 (https://phabricator.wikimedia.org/T309602)
[21:32:58] <wikibugs>	 (CR) Clare Ming: "gah - somehow during a rebase, i accidentally added the MP stream for vefu events to group0 wikis." [mediawiki-config] - https://gerrit.wikimedia.org/r/851723 (https://phabricator.wikimedia.org/T309602) (owner: Clare Ming)
[21:33:18] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:33:25] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 137 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:34:13] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[21:34:24] <wikibugs>	 (CR) Cwhite: [C: +1] "LGTM, thanks!" [puppet] - https://gerrit.wikimedia.org/r/850154 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi)
[21:35:39] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[21:35:50] <wikibugs>	 (CR) Cwhite: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/851672 (https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi)
[21:37:18] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:37:23] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 241 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:37:31] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1009 is OK: HTTP OK: HTTP/1.1 200 OK - 185 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Swift
[21:38:03] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1009 is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 0.393 second response time https://wikitech.wikimedia.org/wiki/Swift
[21:38:46] <wikibugs>	 (PS1) JHathaway: aux-k8s: add partman config for workers [puppet] - https://gerrit.wikimedia.org/r/851724 (https://phabricator.wikimedia.org/T321137)
[21:39:15] <wikibugs>	 (PS2) Clare Ming: Update config for Metrics Platform VEFU events [mediawiki-config] - https://gerrit.wikimedia.org/r/851723 (https://phabricator.wikimedia.org/T309602)
[21:39:55] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[21:40:32] <wikibugs>	 (CR) JHathaway: [C: +2] aux-k8s: add partman config for workers [puppet] - https://gerrit.wikimedia.org/r/851724 (https://phabricator.wikimedia.org/T321137) (owner: JHathaway)
[21:40:57] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[21:41:17] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.225 second response time https://wikitech.wikimedia.org/wiki/Swift
[21:41:18] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:41:45] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Swift
[21:42:19] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 163 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:42:58] <jhathaway>	 looking
[21:43:09] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.057 second response time https://wikitech.wikimedia.org/wiki/Swift
[21:43:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318 (T318605)', diff saved to https://phabricator.wikimedia.org/P37515 and previous config saved to /var/cache/conftool/dbconfig/20221101-214311-ladsgroup.json
[21:43:17] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[21:43:17] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe1010.eqiad.wmnet, ms-fe1009.eqiad.wmnet are marked down but pooled: swift_80: Servers ms-fe1011.eqiad.wmnet, ms-fe1009.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[21:43:53] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[21:45:07] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:45:15] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[21:45:35] <jinxer-wm>	 (FrontendUnavailable) firing: HAProxy (cache_upload) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[21:45:57] <jhathaway>	 really high swift latency
[21:46:18] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:46:38] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance
[21:46:51] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance
[21:46:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1172 (T318605)', diff saved to https://phabricator.wikimedia.org/P37516 and previous config saved to /var/cache/conftool/dbconfig/20221101-214659-ladsgroup.json
[21:47:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service swift-https:443 has failed probes (http_swift-https_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:47:19] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe1012.eqiad.wmnet, ms-fe1011.eqiad.wmnet are marked down but pooled: swift_80: Servers ms-fe1012.eqiad.wmnet, ms-fe1011.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[21:47:53] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[21:48:15] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 276 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:48:21] <jhathaway>	 any swift experts around?
[21:48:55] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.260 second response time https://wikitech.wikimedia.org/wiki/Swift
[21:48:56] <wikibugs>	 (PS3) Clare Ming: testwiki: Add mediawiki.visual_editor_feature_use stream [mediawiki-config] - https://gerrit.wikimedia.org/r/851723 (https://phabricator.wikimedia.org/T309602)
[21:49:15] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.270 second response time https://wikitech.wikimedia.org/wiki/Swift
[21:49:17] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[21:49:18] <jinxer-wm>	 (ProbeDown) firing: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:49:31] <icinga-wm>	 PROBLEM - Docker registry HTTPS interface on registry1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker
[21:49:34] <jinxer-wm>	 (FrontendUnavailable) firing: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[21:49:35] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[21:49:59] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.178 second response time https://wikitech.wikimedia.org/wiki/Swift
[21:50:37] <wikibugs>	 (CR) Clare Ming: testwiki: Add mediawiki.visual_editor_feature_use stream (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/851723 (https://phabricator.wikimedia.org/T309602) (owner: Clare Ming)
[21:50:41] <icinga-wm>	 PROBLEM - Docker registry HTTPS interface on registry1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker
[21:50:59] <icinga-wm>	 PROBLEM - Check systemd state on ms-fe1009 is CRITICAL: CRITICAL - degraded: The following units failed: swift-account-stats_analytics:admin.service,swift-account-stats_docker:registry.service,swift-account-stats_mw:media.service,swift-container-stats_mw-media.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:51:07] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:51:13] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe1012.eqiad.wmnet, ms-fe1010.eqiad.wmnet, ms-fe1011.eqiad.wmnet, ms-fe1009.eqiad.wmnet are marked down but pooled: swift_80: Servers ms-fe1012.eqiad.wmnet, ms-fe1010.eqiad.wmnet, ms-fe1011.eqiad.wmnet, ms-fe1009.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[21:52:03] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 7.170 second response time https://wikitech.wikimedia.org/wiki/Swift
[21:52:07] <logmsgbot>	 !log jhathaway@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host aux-k8s-worker1001.eqiad.wmnet
[21:53:19] <logmsgbot>	 !log jhathaway@cumin1001 START - Cookbook sre.ganeti.makevm for new host aux-k8s-worker1002.eqiad.wmnet
[21:53:20] <logmsgbot>	 !log jhathaway@cumin1001 START - Cookbook sre.dns.netbox
[21:54:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service docker-registry:443 has failed probes (http_docker-registry_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:54:59] <icinga-wm>	 PROBLEM - Check systemd state on ms-fe1009 is CRITICAL: CRITICAL - degraded: The following units failed: swift-container-stats_mw-media.service,swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:55:35] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1009 is OK: HTTP OK: HTTP/1.1 200 OK - 185 bytes in 9.555 second response time https://wikitech.wikimedia.org/wiki/Swift
[21:55:51] <icinga-wm>	 PROBLEM - Docker registry health on registry1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - pattern not found - 224 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Docker
[21:57:18] <jinxer-wm>	 (ProbeDown) firing: (3) Service docker-registry:443 has failed probes (http_docker-registry_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:58:01] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1009 is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 6.249 second response time https://wikitech.wikimedia.org/wiki/Swift
[21:58:11] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 101 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:58:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318', diff saved to https://phabricator.wikimedia.org/P37517 and previous config saved to /var/cache/conftool/dbconfig/20221101-215820-ladsgroup.json
[21:59:18] <jinxer-wm>	 (ProbeDown) firing: (3) Service docker-registry:443 has failed probes (http_docker-registry_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:59:48] <wikibugs>	 (PS5) Jdlrobson: Fix remaining Wikipedia logos [mediawiki-config] - https://gerrit.wikimedia.org/r/849175 (https://phabricator.wikimedia.org/T319223)
[21:59:51] <icinga-wm>	 RECOVERY - Docker registry health on registry1004 is OK: HTTP OK: HTTP/1.1 200 OK - 143 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Docker
[22:00:07] <icinga-wm>	 PROBLEM - Docker registry health on registry1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - pattern not found - 228 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Docker
[22:02:18] <jinxer-wm>	 (ProbeDown) firing: (3) Service docker-registry:443 has failed probes (http_docker-registry_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:02:27] <icinga-wm>	 RECOVERY - Docker registry HTTPS interface on registry1004 is OK: HTTP OK: HTTP/1.1 200 OK - 3755 bytes in 0.624 second response time https://wikitech.wikimedia.org/wiki/Docker
[22:04:19] <jinxer-wm>	 (ProbeDown) firing: (3) Service docker-registry:443 has failed probes (http_docker-registry_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:06:05] <icinga-wm>	 RECOVERY - Docker registry health on registry1003 is OK: HTTP OK: HTTP/1.1 200 OK - 143 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Docker
[22:07:22] <wikibugs>	 SRE-swift-storage, Community-Tech, MediaWiki-extensions-Phonos, MW-1.40-notes (1.40.0-wmf.6; 2022-10-17): Establish Phonos production storage requirements - https://phabricator.wikimedia.org/T320675 (Eevans) Since {T320835} appears to be in jeopardy (see: [[ https://phabricator.wikimedia.org/T320...
[22:07:33] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[22:08:35] <icinga-wm>	 PROBLEM - Docker registry HTTPS interface on registry1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker
[22:08:56] <logmsgbot>	 !log jhathaway@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:08:56] <logmsgbot>	 !log jhathaway@cumin1001 START - Cookbook sre.dns.wipe-cache aux-k8s-worker1002.eqiad.wmnet on all recursors
[22:09:00] <logmsgbot>	 !log jhathaway@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) aux-k8s-worker1002.eqiad.wmnet on all recursors
[22:09:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:09:18] <jinxer-wm>	 (ProbeDown) firing: (3) Service docker-registry:443 has failed probes (http_docker-registry_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:10:05] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[22:13:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318', diff saved to https://phabricator.wikimedia.org/P37518 and previous config saved to /var/cache/conftool/dbconfig/20221101-221328-ladsgroup.json
[22:14:07] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 122 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[22:16:29] <icinga-wm>	 RECOVERY - Docker registry HTTPS interface on registry1004 is OK: HTTP OK: HTTP/1.1 200 OK - 3755 bytes in 3.194 second response time https://wikitech.wikimedia.org/wiki/Docker
[22:18:31] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:18:53] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:19:18] <jinxer-wm>	 (ProbeDown) firing: (3) Service docker-registry:443 has failed probes (http_docker-registry_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:19:29] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1009 is OK: HTTP OK: HTTP/1.1 200 OK - 185 bytes in 6.078 second response time https://wikitech.wikimedia.org/wiki/Swift
[22:20:02] <logmsgbot>	 !log jhathaway@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=swift,name=eqiad
[22:20:43] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8647 bytes in 5.025 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:20:49] <icinga-wm>	 PROBLEM - Check systemd state on ms-fe1009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:22:15] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48974 bytes in 0.091 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:22:29] <icinga-wm>	 PROBLEM - Docker registry HTTPS interface on registry1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker
[22:22:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T318605)', diff saved to https://phabricator.wikimedia.org/P37519 and previous config saved to /var/cache/conftool/dbconfig/20221101-222247-ladsgroup.json
[22:23:38] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[22:24:01] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[22:24:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service docker-registry:443 has failed probes (http_docker-registry_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:26:19] <icinga-wm>	 RECOVERY - Docker registry HTTPS interface on registry1004 is OK: HTTP OK: HTTP/1.1 200 OK - 3755 bytes in 0.131 second response time https://wikitech.wikimedia.org/wiki/Docker
[22:26:47] <icinga-wm>	 PROBLEM - Check systemd state on ms-fe1009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service,swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:27:09] <icinga-wm>	 RECOVERY - Docker registry HTTPS interface on registry1003 is OK: HTTP OK: HTTP/1.1 200 OK - 3755 bytes in 0.114 second response time https://wikitech.wikimedia.org/wiki/Docker
[22:27:19] <jinxer-wm>	 (ProbeDown) firing: (3) Service docker-registry:443 has failed probes (http_docker-registry_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:27:51] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1009 is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Swift
[22:28:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318 (T318605)', diff saved to https://phabricator.wikimedia.org/P37520 and previous config saved to /var/cache/conftool/dbconfig/20221101-222835-ladsgroup.json
[22:28:38] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance
[22:28:51] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance
[22:28:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3318 (T318605)', diff saved to https://phabricator.wikimedia.org/P37521 and previous config saved to /var/cache/conftool/dbconfig/20221101-222858-ladsgroup.json
[22:29:13] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[22:29:18] <jinxer-wm>	 (ProbeDown) resolved: (2) Service docker-registry:443 has failed probes (http_docker-registry_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:29:35] <jinxer-wm>	 (FrontendUnavailable) resolved: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[22:30:35] <jinxer-wm>	 (FrontendUnavailable) resolved: HAProxy (cache_upload) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[22:31:03] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Swift
[22:31:03] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[22:32:18] <jinxer-wm>	 (ProbeDown) resolved: (2) Service docker-registry:443 has failed probes (http_docker-registry_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:32:35] <Emperor>	 !log rolling restart of eqiad swift front-ends
[22:32:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:32:38] <logmsgbot>	 !log jhathaway@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host aux-k8s-worker1002.eqiad.wmnet
[22:33:01] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.037 second response time https://wikitech.wikimedia.org/wiki/Swift
[22:33:05] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[22:33:35] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Swift
[22:33:35] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Swift
[22:33:45] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Swift
[22:34:41] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[22:34:43] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Swift
[22:34:53] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_compress_logs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:35:59] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[22:36:43] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[22:37:01] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[22:37:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P37522 and previous config saved to /var/cache/conftool/dbconfig/20221101-223754-ladsgroup.json
[22:43:21] <logmsgbot>	 !log krinkle@deploy1002 Started deploy [integration/docroot@2ddd7d9]: (no justification provided)
[22:43:54] <logmsgbot>	 !log krinkle@deploy1002 Finished deploy [integration/docroot@2ddd7d9]: (no justification provided) (duration: 00m 33s)
[22:53:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P37523 and previous config saved to /var/cache/conftool/dbconfig/20221101-225303-ladsgroup.json
[22:55:02] <Emperor>	 !log depool ms-fe2009
[22:55:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:59:13] <jinxer-wm>	 (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[23:00:29] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:00:35] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:02:05] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:04:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318 (T318605)', diff saved to https://phabricator.wikimedia.org/P37524 and previous config saved to /var/cache/conftool/dbconfig/20221101-230411-ladsgroup.json
[23:04:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[23:04:19] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.295 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:04:33] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[23:05:55] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48976 bytes in 1.279 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:06:35] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:08:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T318605)', diff saved to https://phabricator.wikimedia.org/P37525 and previous config saved to /var/cache/conftool/dbconfig/20221101-230811-ladsgroup.json
[23:08:13] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance
[23:08:27] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance
[23:08:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1177 (T318605)', diff saved to https://phabricator.wikimedia.org/P37526 and previous config saved to /var/cache/conftool/dbconfig/20221101-230833-ladsgroup.json
[23:08:45] <icinga-wm>	 PROBLEM - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:10:27] <icinga-wm>	 RECOVERY - Check systemd state on ms-fe1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:18:31] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:19:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318', diff saved to https://phabricator.wikimedia.org/P37527 and previous config saved to /var/cache/conftool/dbconfig/20221101-231919-ladsgroup.json
[23:25:06] <wikibugs>	 (PS10) BCornwall: varnish: Conditionally set WMF-Last-Access cookie [puppet] - https://gerrit.wikimedia.org/r/849184 (https://phabricator.wikimedia.org/T262996)
[23:30:01] <wikibugs>	 (PS11) BCornwall: varnish: Conditionally set WMF-Last-Access cookie [puppet] - https://gerrit.wikimedia.org/r/849184 (https://phabricator.wikimedia.org/T262996)
[23:30:27] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:33:07] <wikibugs>	 (CR) BCornwall: "0 tests failed, 0 tests skipped, 34 tests passed" [puppet] - https://gerrit.wikimedia.org/r/849184 (https://phabricator.wikimedia.org/T262996) (owner: BCornwall)
[23:34:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318', diff saved to https://phabricator.wikimedia.org/P37528 and previous config saved to /var/cache/conftool/dbconfig/20221101-233427-ladsgroup.json
[23:36:25] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:43:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T318605)', diff saved to https://phabricator.wikimedia.org/P37529 and previous config saved to /var/cache/conftool/dbconfig/20221101-234346-ladsgroup.json
[23:43:57] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[23:49:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318 (T318605)', diff saved to https://phabricator.wikimedia.org/P37530 and previous config saved to /var/cache/conftool/dbconfig/20221101-234935-ladsgroup.json
[23:49:37] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance
[23:49:43] <stashbot>	 T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[23:49:50] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance
[23:49:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2181 (T318605)', diff saved to https://phabricator.wikimedia.org/P37531 and previous config saved to /var/cache/conftool/dbconfig/20221101-234957-ladsgroup.json
[23:58:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P37532 and previous config saved to /var/cache/conftool/dbconfig/20221101-235853-ladsgroup.json