[00:00:06] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:05:58] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:09:02] RECOVERY - SSH on mw1312.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:09:28] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [00:11:22] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [00:22:36] RECOVERY - SSH on mw1319.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:41:29] (03CR) 10BCornwall: [C: 03+1] "Looks good to me!" 
[puppet] - 10https://gerrit.wikimedia.org/r/829319 (owner: 10Zabe) [00:45:36] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:51:30] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:13:16] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [01:15:18] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:19:14] (03CR) 10Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [01:19:16] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms [01:21:14] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:38:45] (JobUnavailable) firing: (9) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:39:06] (03PS1) 10Andrew Bogott: Openstack: manifests for glance, nova, keystone, placement version Y [puppet] - 10https://gerrit.wikimedia.org/r/851168 (https://phabricator.wikimedia.org/T305828) [01:39:08] (03PS1) 10Andrew Bogott: Openstack: Add manifests for Neutron version Yoga [puppet] - 10https://gerrit.wikimedia.org/r/851169 (https://phabricator.wikimedia.org/T305828) [01:39:10] (03PS1) 10Andrew Bogott: Openstack: Add 
manifests for Trove version Yoga [puppet] - 10https://gerrit.wikimedia.org/r/851170 (https://phabricator.wikimedia.org/T305828) [01:39:12] (03PS1) 10Andrew Bogott: Openstack: Add manifests for Heat and Magnum version Yoga [puppet] - 10https://gerrit.wikimedia.org/r/851171 (https://phabricator.wikimedia.org/T305828) [01:39:14] (03PS1) 10Andrew Bogott: Openstack: Add manifests for Cinder version Yoga [puppet] - 10https://gerrit.wikimedia.org/r/851172 (https://phabricator.wikimedia.org/T305828) [01:39:16] (03PS1) 10Andrew Bogott: Openstack: Add manifests for Barbican version Yoga [puppet] - 10https://gerrit.wikimedia.org/r/851173 (https://phabricator.wikimedia.org/T305828) [01:39:18] (03PS1) 10Andrew Bogott: codfw1dev openstack -> version 'yoga' [puppet] - 10https://gerrit.wikimedia.org/r/851174 (https://phabricator.wikimedia.org/T305828) [01:45:04] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:47:31] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [01:48:45] (JobUnavailable) firing: (10) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:51:00] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:52:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: 
(2) Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221101T0200) [02:07:32] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.40.0-wmf.8 [core] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/851030 (https://phabricator.wikimedia.org/T320513) [02:07:34] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.40.0-wmf.8 [core] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/851030 (https://phabricator.wikimedia.org/T320513) (owner: 10TrainBranchBot) [02:07:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [02:08:45] (JobUnavailable) firing: (6) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:08:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [02:08:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [02:09:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [02:22:49] (03Merged) 10jenkins-bot: Branch commit for wmf/1.40.0-wmf.8 [core] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/851030 (https://phabricator.wikimedia.org/T320513) (owner: 10TrainBranchBot) [02:29:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [02:30:22] !log mwdebug-deploy@deploy1002 
helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [02:30:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [02:30:38] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:31:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [02:36:34] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:54:13] (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [03:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221101T0300) [03:00:22] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:01:18] (03PS1) 10TrainBranchBot: testwikis wikis to 1.40.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851179 (https://phabricator.wikimedia.org/T320513) [03:01:20] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.40.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851179 (https://phabricator.wikimedia.org/T320513) (owner: 10TrainBranchBot) [03:02:04] (03Merged) 10jenkins-bot: testwikis wikis to 1.40.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851179 (https://phabricator.wikimedia.org/T320513) (owner: 10TrainBranchBot) [03:02:32] !log mwpresync@deploy1002 Started scap: testwikis wikis to 1.40.0-wmf.8 refs T320513 [03:02:41] 
T320513: 1.40.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T320513 [03:03:58] (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [03:06:16] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:06:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [03:07:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [03:07:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [03:08:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [03:30:12] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:34:14] (KubernetesAPILatency) firing: (9) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [03:36:08] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:36:28] !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.40.0-wmf.8 refs T320513 (duration: 33m 56s) [03:36:34] T320513: 1.40.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T320513 [03:38:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [03:42:24] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The 
following units failed: train-presync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:45:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [03:45:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [03:45:48] PROBLEM - SSH on mw1332.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:51:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [04:04:51] (03PS1) 10DLynch: Bump sampling rate to 0.2 for various editing schemas on a/b test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851182 (https://phabricator.wikimedia.org/T321734) [04:06:00] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [04:33:48] PROBLEM - SSH on mw1307.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:45:28] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:51:26] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:05:32] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:07:46] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), No backups: 1 (dispatch-be1001), Fresh: 123 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:15:12] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: 
The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:21:08] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:52:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [06:00:05] kormat, marostegui, and Amir1: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221101T0600). [06:09:00] (JobUnavailable) firing: Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:30:20] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:35:40] RECOVERY - SSH on mw1307.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:36:18] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:48:40] RECOVERY - SSH on mw1332.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:54:13] 
(KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [07:00:05] Amir1 and Urbanecm: May I have your attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221101T0700) [07:00:05] No Gerrit patches in the queue for this window AFAICS. [07:04:13] (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [07:34:14] (KubernetesAPILatency) firing: (9) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:45:24] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:46:48] (03PS2) 10Slyngshede: data.yaml: Move user mfossati from restricted to deployment [puppet] - 10https://gerrit.wikimedia.org/r/850409 (https://phabricator.wikimedia.org/T321772) [07:48:13] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting deployment group membership for mfossati - https://phabricator.wikimedia.org/T321772 (10SLyngshede-WMF) @MarkTraceur Will you approve, so we can move Marco to deployment? 
[07:51:20] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:00:29] (03PS1) 10Cathal Mooney: Remove user faidon from Juniper access [homer/public] - 10https://gerrit.wikimedia.org/r/851590 (https://phabricator.wikimedia.org/T322101) [08:05:32] (03PS3) 10Slyngshede: C:idm::deployment of IDM. [puppet] - 10https://gerrit.wikimedia.org/r/851064 (https://phabricator.wikimedia.org/T320428) [08:08:46] (03CR) 10Slyngshede: C:idm::deployment of IDM. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/851064 (https://phabricator.wikimedia.org/T320428) (owner: 10Slyngshede) [08:13:01] (03CR) 10Cathal Mooney: [C: 03+2] Remove user faidon from Juniper access [homer/public] - 10https://gerrit.wikimedia.org/r/851590 (https://phabricator.wikimedia.org/T322101) (owner: 10Cathal Mooney) [08:13:40] (03Merged) 10jenkins-bot: Remove user faidon from Juniper access [homer/public] - 10https://gerrit.wikimedia.org/r/851590 (https://phabricator.wikimedia.org/T322101) (owner: 10Cathal Mooney) [08:15:08] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:17:40] (03PS1) 10Muehlenhoff: Remove Faidon from Icinga config [puppet] - 10https://gerrit.wikimedia.org/r/851594 [08:19:47] (03CR) 10Muehlenhoff: [C: 03+2] Remove Faidon from Icinga config [puppet] - 10https://gerrit.wikimedia.org/r/851594 (owner: 10Muehlenhoff) [08:21:02] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:22:49] (03PS1) 10Muehlenhoff: Remove access for Faidon [puppet] - 10https://gerrit.wikimedia.org/r/851595 (https://phabricator.wikimedia.org/T322101) [08:25:48] (03CR) 
10Muehlenhoff: [C: 03+2] Remove access for Faidon [puppet] - 10https://gerrit.wikimedia.org/r/851595 (https://phabricator.wikimedia.org/T322101) (owner: 10Muehlenhoff) [08:26:58] PROBLEM - Disk space on cp5007 is CRITICAL: DISK CRITICAL - free space: / 342 MB (3% inode=77%): /tmp 342 MB (3% inode=77%): /var/tmp 342 MB (3% inode=77%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=cp5007&var-datasource=eqsin+prometheus/ops [08:27:57] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Faidon Liambotis out of all services on: 802 hosts [08:28:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Faidon Liambotis out of all services on: 802 hosts [08:28:25] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Faidon Liambotis out of all services on: 1203 hosts [08:28:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Faidon Liambotis out of all services on: 1203 hosts [08:30:26] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:30:43] (03CR) 10David Caro: [V: 03+1] p::toolforge:harbor: use distro docker for bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/848356 (https://phabricator.wikimedia.org/T316541) (owner: 10David Caro) [08:30:48] (03CR) 10David Caro: [V: 03+1 C: 03+2] p::toolforge:harbor: use distro docker for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/848356 (https://phabricator.wikimedia.org/T316541) (owner: 10David Caro) [08:30:55] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: lvs500[1-3] are unable to establish BGP sessions with cr3-eqsin.wikimedia.org - https://phabricator.wikimedia.org/T321545 (10fgiunchedi) [08:31:11] 10SRE, 10Traffic: PyBalBGPUnstable didn't report T321545 - https://phabricator.wikimedia.org/T321547 (10fgiunchedi) 
05Open→03Declined Ok! Declining for now; feel free to reopen as needed [08:32:30] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: don't monitor /run/docker on alerting_host [puppet] - 10https://gerrit.wikimedia.org/r/850993 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [08:32:46] !log draining ganeti1028 for eventual reimage T311687 [08:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:52] dcausse: merged your change too [08:33:25] nope, sorry, I meant dcaro, who's not here [08:33:28] :shrug: [08:33:30] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:34:18] T311687: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 [08:43:19] (03PS3) 10Filippo Giunchedi: smokeping: add ensure parameter, set to present [puppet] - 10https://gerrit.wikimedia.org/r/850154 (https://phabricator.wikimedia.org/T169860) [08:43:21] (03PS3) 10Filippo Giunchedi: profile: absent smokeping [puppet] - 10https://gerrit.wikimedia.org/r/850155 (https://phabricator.wikimedia.org/T169860) [08:43:23] (03PS3) 10Filippo Giunchedi: smokeping: remove module and profile [puppet] - 10https://gerrit.wikimedia.org/r/850156 (https://phabricator.wikimedia.org/T169860) [08:43:25] (03PS3) 10Filippo Giunchedi: smokeping: remove ancillary data [puppet] - 10https://gerrit.wikimedia.org/r/850157 (https://phabricator.wikimedia.org/T169860) [08:43:27] (03CR) 10Filippo Giunchedi: smokeping: add ensure parameter, set to present (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/850154 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [08:45:28] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:49:56] PROBLEM - 
Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:53:43] (03PS1) 10Filippo Giunchedi: safer dnsmasq restart/reload [puppet] - 10https://gerrit.wikimedia.org/r/851597 [08:53:45] (03PS1) 10Filippo Giunchedi: pontoon: limit thanos retention [puppet] - 10https://gerrit.wikimedia.org/r/851598 [08:59:05] (03CR) 10Jgiannelos: "Regarding access to logs I am not sure if this group grants the right access." [puppet] - 10https://gerrit.wikimedia.org/r/850160 (https://phabricator.wikimedia.org/T321722) (owner: 10Awight) [08:59:58] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: limit thanos retention [puppet] - 10https://gerrit.wikimedia.org/r/851598 (owner: 10Filippo Giunchedi) [09:00:03] (03PS2) 10Filippo Giunchedi: pontoon: limit thanos retention [puppet] - 10https://gerrit.wikimedia.org/r/851598 [09:00:30] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:04:26] (03CR) 10JMeybohm: [C: 03+1] safer dnsmasq restart/reload [puppet] - 10https://gerrit.wikimedia.org/r/851597 (owner: 10Filippo Giunchedi) [09:06:28] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:09:46] RECOVERY - Disk space on alert1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=alert1001&var-datasource=eqiad+prometheus/ops [09:11:28] (03CR) 10Filippo Giunchedi: [C: 03+2] safer dnsmasq restart/reload [puppet] - 10https://gerrit.wikimedia.org/r/851597 (owner: 10Filippo Giunchedi) [09:13:23] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/850635 
(https://phabricator.wikimedia.org/T321629) (owner: 10Ahmon Dancy) [09:22:30] (03CR) 10Awight: Invite some of WMDE Tech Wishes team to poke around maps instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/850160 (https://phabricator.wikimedia.org/T321722) (owner: 10Awight) [09:26:07] (03CR) 10Jgiannelos: "It's probably worth the effort to send the postgres logs to logstash instead of manually ssh-ing." [puppet] - 10https://gerrit.wikimedia.org/r/850160 (https://phabricator.wikimedia.org/T321722) (owner: 10Awight) [09:28:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:29:49] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (NOOP 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37875/console" [puppet] - 10https://gerrit.wikimedia.org/r/850593 (owner: 10Dzahn) [09:30:16] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:30:33] (03CR) 10Awight: Invite some of WMDE Tech Wishes team to poke around maps instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/850160 (https://phabricator.wikimedia.org/T321722) (owner: 10Awight) [09:32:28] (03CR) 10Jelto: [V: 03+1 C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/850593 (owner: 10Dzahn) [09:33:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:34:26] (03CR) 10Jelto: [C: 03+2] "lgtm, 
thanks for noticing that!" [puppet] - 10https://gerrit.wikimedia.org/r/850541 (owner: 10Dzahn) [09:34:53] (03PS2) 10Jelto: devtools: set profile::gitlab::runner::registration_token: private [puppet] - 10https://gerrit.wikimedia.org/r/850541 (owner: 10Dzahn) [09:36:12] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:37:50] PROBLEM - SSH on mw1307.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:40:33] (03PS3) 10David Caro: wmcs.toolforge.grid: get also the job logs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/841930 [09:43:49] !log imported quickstack 20161026-1+deb12u1 to apt.wikimedia.org/bookworm-wikimedia T321783 [09:43:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:58] T321783: Setup an initial bookworm host with Puppetdb 7 - https://phabricator.wikimedia.org/T321783 [09:46:05] (03PS3) 10Slyngshede: Bitu IDM, initial checkin [software/bitu] - 10https://gerrit.wikimedia.org/r/850465 (https://phabricator.wikimedia.org/T319410) [09:46:14] (03CR) 10Phuedx: [C: 04-1] "Thanks for submitting this patch!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851128 (https://phabricator.wikimedia.org/T309602) (owner: 10Clare Ming) [09:47:29] 10SRE, 10Infrastructure-Foundations: Create an initial IDM/LDAP image for tests and CI - https://phabricator.wikimedia.org/T320430 (10SLyngshede-WMF) The Docker image has also been included in the Bitu repo and can be built using docker-compose. 
[09:47:38] 10SRE, 10Infrastructure-Foundations: Create an initial IDM/LDAP image for tests and CI - https://phabricator.wikimedia.org/T320430 (10SLyngshede-WMF) 05Open→03Resolved [09:47:40] 10SRE, 10Infrastructure-Foundations: IDM milestone 1 "Initial development work" - https://phabricator.wikimedia.org/T319407 (10SLyngshede-WMF) [09:51:23] (03CR) 10Jelto: doc: add parameters for gitlab_runner and contint hosts, allow them (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: 10Dzahn) [09:52:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [09:53:02] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [09:55:06] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [10:00:12] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:02:28] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10MoritzMuehlenhoff) [10:06:10] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:09:00] (JobUnavailable) firing: Reduced availability for job calico-felix in aux-k8s@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:13:39] (03CR) 10JMeybohm: [C: 03+2] Make Kubernetes version configurable [puppet] - 10https://gerrit.wikimedia.org/r/850449 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [10:19:41] (03PS1) 10Urbanecm: Deploy Growth features to 100% users at all wikis but dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851604 (https://phabricator.wikimedia.org/T320876) [10:29:25] (03PS1) 10Filippo Giunchedi: prometheus: use default for ignored_devices [puppet] - 10https://gerrit.wikimedia.org/r/851605 (https://phabricator.wikimedia.org/T321783) [10:33:01] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/851605 (https://phabricator.wikimedia.org/T321783) (owner: 10Filippo Giunchedi) [10:34:18] (03CR) 10David Caro: "LGTM, just a type hint issue maybe, feel free to ignore the nits" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/746947 (owner: 10Jbond) [10:37:49] RECOVERY - SSH on mw1307.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:39:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1099.eqiad.wmnet with reason: Maintenance [10:39:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance [10:39:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1099.eqiad.wmnet with reason: Maintenance [10:39:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3318 (T318605)', diff saved to https://phabricator.wikimedia.org/P37285 and previous config saved to /var/cache/conftool/dbconfig/20221101-103934-ladsgroup.json [10:39:35] !log ladsgroup@cumin1001 
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance
[10:39:46] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[10:40:06] (CR) Nikerabbit: [V: +2] Localisation updates from https://translatewiki.net. [software/mailman-templates] - https://gerrit.wikimedia.org/r/851065 (owner: L10n-bot)
[10:40:09] (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37876/console" [puppet] - https://gerrit.wikimedia.org/r/851605 (https://phabricator.wikimedia.org/T321783) (owner: Filippo Giunchedi)
[10:41:09] (CR) David Caro: [C: +1] "LGTM, as arturo says, some users might be relying on this for something, checked on the toolforge sge nodes logs to see if there's anythin" [puppet] - https://gerrit.wikimedia.org/r/850633 (owner: Majavah)
[10:41:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[10:41:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[10:41:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2117.codfw.wmnet with reason: Maintenance
[10:41:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3316 (T318955)', diff saved to https://phabricator.wikimedia.org/P37286 and previous config saved to /var/cache/conftool/dbconfig/20221101-104154-ladsgroup.json
[10:42:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2117.codfw.wmnet with reason: Maintenance
[10:42:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2117 (T318955)', diff saved to https://phabricator.wikimedia.org/P37287 and previous config saved to /var/cache/conftool/dbconfig/20221101-104215-ladsgroup.json
[10:42:28] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[10:48:34] !log updating libdatetime-timezone-perl from latest Debian SUA update
[10:48:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:50:44] (CR) Filippo Giunchedi: [V: +1 C: +2] prometheus: use default for ignored_devices [puppet] - https://gerrit.wikimedia.org/r/851605 (https://phabricator.wikimedia.org/T321783) (owner: Filippo Giunchedi)
[10:54:13] (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[10:55:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T318955)', diff saved to https://phabricator.wikimedia.org/P37288 and previous config saved to /var/cache/conftool/dbconfig/20221101-105534-ladsgroup.json
[10:55:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T318955)', diff saved to https://phabricator.wikimedia.org/P37289 and previous config saved to /var/cache/conftool/dbconfig/20221101-105557-ladsgroup.json
[10:56:52] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[10:59:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[10:59:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[10:59:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2097.codfw.wmnet with reason: Maintenance
[10:59:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[11:00:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook
sre.hosts.downtime (exit_code=0) for 6:00:00 on db2097.codfw.wmnet with reason: Maintenance
[11:00:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[11:00:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T318950)', diff saved to https://phabricator.wikimedia.org/P37290 and previous config saved to /var/cache/conftool/dbconfig/20221101-110019-ladsgroup.json
[11:00:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance
[11:00:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance
[11:00:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2104 (T318950)', diff saved to https://phabricator.wikimedia.org/P37291 and previous config saved to /var/cache/conftool/dbconfig/20221101-110045-ladsgroup.json
[11:01:07] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[11:02:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T318950)', diff saved to https://phabricator.wikimedia.org/P37292 and previous config saved to /var/cache/conftool/dbconfig/20221101-110232-ladsgroup.json
[11:03:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T318950)', diff saved to https://phabricator.wikimedia.org/P37293 and previous config saved to /var/cache/conftool/dbconfig/20221101-110311-ladsgroup.json
[11:04:13] (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[11:05:11] (PS1) Muehlenhoff: Cleanup obsolete binary packages after bookworm dist-upgrade [puppet] -
https://gerrit.wikimedia.org/r/851607
[11:06:51] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:08:47] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:10:00] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetdb-test2001.codfw.wmnet
[11:10:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P37294 and previous config saved to /var/cache/conftool/dbconfig/20221101-111042-ladsgroup.json
[11:11:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P37295 and previous config saved to /var/cache/conftool/dbconfig/20221101-111106-ladsgroup.json
[11:11:47] (PS2) Muehlenhoff: thanos: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/850469 (https://phabricator.wikimedia.org/T308013)
[11:14:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: Maintenance
[11:14:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: Maintenance
[11:17:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P37296 and previous config saved to /var/cache/conftool/dbconfig/20221101-111739-ladsgroup.json
[11:17:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318 (T318605)', diff saved to https://phabricator.wikimedia.org/P37297 and previous config saved to /var/cache/conftool/dbconfig/20221101-111753-ladsgroup.json
[11:17:58] T318605:
Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[11:18:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P37298 and previous config saved to /var/cache/conftool/dbconfig/20221101-111819-ladsgroup.json
[11:19:15] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host puppetdb-test2001.codfw.wmnet
[11:21:42] (CR) Muehlenhoff: [C: +2] thanos: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/850469 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[11:25:14] (PS2) Muehlenhoff: dumps: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/850471 (https://phabricator.wikimedia.org/T308013)
[11:25:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P37299 and previous config saved to /var/cache/conftool/dbconfig/20221101-112549-ladsgroup.json
[11:26:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P37300 and previous config saved to /var/cache/conftool/dbconfig/20221101-112612-ladsgroup.json
[11:27:45] SRE, Infrastructure-Foundations, Patch-For-Review: Initial IDM puppetisation - https://phabricator.wikimedia.org/T320428 (SLyngshede-WMF) Open→In progress p:Triage→Low a:SLyngshede-WMF
[11:27:47] SRE, Infrastructure-Foundations: IDM milestone 1 "Initial development work" - https://phabricator.wikimedia.org/T319407 (SLyngshede-WMF)
[11:28:33] (CR) Muehlenhoff: [C: +2] dumps: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/850471 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[11:29:41] (PS2) Muehlenhoff: installserver: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/850473 (https://phabricator.wikimedia.org/T308013)
[11:30:05] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:32:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P37301 and previous config saved to /var/cache/conftool/dbconfig/20221101-113248-ladsgroup.json
[11:32:59] (PS1) Hnowlan: Generate thumbor.key via prod entrypoint script [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/851608 (https://phabricator.wikimedia.org/T233196)
[11:33:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318', diff saved to https://phabricator.wikimedia.org/P37302 and previous config saved to /var/cache/conftool/dbconfig/20221101-113301-ladsgroup.json
[11:33:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P37303 and previous config saved to /var/cache/conftool/dbconfig/20221101-113327-ladsgroup.json
[11:34:06] (CR) CI reject: [V: -1] Generate thumbor.key via prod entrypoint script [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/851608 (https://phabricator.wikimedia.org/T233196) (owner: Hnowlan)
[11:34:40] (PS1) Hnowlan: thumbor: don't manage thumbor.key within Helm [deployment-charts] - https://gerrit.wikimedia.org/r/851609 (https://phabricator.wikimedia.org/T233196)
[11:35:49] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:36:05] (PS2) Hnowlan: Generate thumbor.key via prod entrypoint script [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/851608 (https://phabricator.wikimedia.org/T233196)
[11:37:13] (CR) Muehlenhoff: [C: +2] installserver: Add SPDX headers
[puppet] - https://gerrit.wikimedia.org/r/850473 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[11:38:59] (KubernetesAPILatency) firing: (9) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:40:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T318955)', diff saved to https://phabricator.wikimedia.org/P37304 and previous config saved to /var/cache/conftool/dbconfig/20221101-114057-ladsgroup.json
[11:40:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2124.codfw.wmnet with reason: Maintenance
[11:41:03] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[11:41:09] (Abandoned) Muehlenhoff: Add a stub base file for bookworm [puppet] - https://gerrit.wikimedia.org/r/850487 (owner: Muehlenhoff)
[11:41:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2124.codfw.wmnet with reason: Maintenance
[11:41:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2124 (T318955)', diff saved to https://phabricator.wikimedia.org/P37305 and previous config saved to /var/cache/conftool/dbconfig/20221101-114121-ladsgroup.json
[11:41:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T318955)', diff saved to https://phabricator.wikimedia.org/P37306 and previous config saved to /var/cache/conftool/dbconfig/20221101-114123-ladsgroup.json
[11:41:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[11:41:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[11:41:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling
db1098:3316 (T318955)', diff saved to https://phabricator.wikimedia.org/P37307 and previous config saved to /var/cache/conftool/dbconfig/20221101-114145-ladsgroup.json
[11:45:27] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:45:28] (CR) Hnowlan: [C: +2] Use the PDF cropbox for rendering [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/805476 (https://phabricator.wikimedia.org/T167420) (owner: TheDJ)
[11:47:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[11:47:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T318950)', diff saved to https://phabricator.wikimedia.org/P37308 and previous config saved to /var/cache/conftool/dbconfig/20221101-114755-ladsgroup.json
[11:47:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance
[11:48:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance
[11:48:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318', diff saved to https://phabricator.wikimedia.org/P37309 and previous config saved to /var/cache/conftool/dbconfig/20221101-114811-ladsgroup.json
[11:48:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T318950)', diff saved to https://phabricator.wikimedia.org/P37310 and previous config saved to /var/cache/conftool/dbconfig/20221101-114820-ladsgroup.json
[11:48:35] !log ladsgroup@cumin1001 dbctl
commit (dc=all): 'Repooling after maintenance db2104 (T318950)', diff saved to https://phabricator.wikimedia.org/P37311 and previous config saved to /var/cache/conftool/dbconfig/20221101-114835-ladsgroup.json
[11:48:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2125.codfw.wmnet with reason: Maintenance
[11:48:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2125.codfw.wmnet with reason: Maintenance
[11:48:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2125 (T318950)', diff saved to https://phabricator.wikimedia.org/P37312 and previous config saved to /var/cache/conftool/dbconfig/20221101-114858-ladsgroup.json
[11:49:00] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[11:49:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance
[11:49:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance
[11:49:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2152 (T318605)', diff saved to https://phabricator.wikimedia.org/P37313 and previous config saved to /var/cache/conftool/dbconfig/20221101-114943-ladsgroup.json
[11:51:17] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:51:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T318950)', diff saved to https://phabricator.wikimedia.org/P37314 and previous config saved to /var/cache/conftool/dbconfig/20221101-115122-ladsgroup.json
[11:52:42] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[11:53:31] (Merged) jenkins-bot: Use the PDF
cropbox for rendering [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/805476 (https://phabricator.wikimedia.org/T167420) (owner: TheDJ)
[11:54:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T318955)', diff saved to https://phabricator.wikimedia.org/P37315 and previous config saved to /var/cache/conftool/dbconfig/20221101-115426-ladsgroup.json
[11:56:01] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[11:56:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T318955)', diff saved to https://phabricator.wikimedia.org/P37316 and previous config saved to /var/cache/conftool/dbconfig/20221101-115638-ladsgroup.json
[11:57:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[11:57:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T318950)', diff saved to https://phabricator.wikimedia.org/P37317 and previous config saved to /var/cache/conftool/dbconfig/20221101-115734-ladsgroup.json
[11:57:40] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[12:01:13] (CR) Kosta Harlan: Deploy Growth features to 100% users at all wikis but dewiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/851604 (https://phabricator.wikimedia.org/T320876) (owner: Urbanecm)
[12:03:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318 (T318605)', diff saved to https://phabricator.wikimedia.org/P37318 and previous config saved to
/var/cache/conftool/dbconfig/20221101-120318-ladsgroup.json
[12:03:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[12:03:30] (PS2) Urbanecm: Deploy GrowthExperiments to 100% users at all wikis but dewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/851604 (https://phabricator.wikimedia.org/T320876)
[12:03:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[12:03:35] (CR) Urbanecm: Deploy GrowthExperiments to 100% users at all wikis but dewiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/851604 (https://phabricator.wikimedia.org/T320876) (owner: Urbanecm)
[12:03:41] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[12:03:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3318 (T318605)', diff saved to https://phabricator.wikimedia.org/P37319 and previous config saved to /var/cache/conftool/dbconfig/20221101-120341-ladsgroup.json
[12:06:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P37320 and previous config saved to /var/cache/conftool/dbconfig/20221101-120630-ladsgroup.json
[12:09:27] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:09:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P37321 and previous config saved to /var/cache/conftool/dbconfig/20221101-120934-ladsgroup.json
[12:11:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P37322 and previous config saved to
/var/cache/conftool/dbconfig/20221101-121147-ladsgroup.json
[12:11:55] (CR) Roman Stolar: [C: +1] "LGTM" [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/841183 (https://phabricator.wikimedia.org/T318406) (owner: Vlad.shapik)
[12:12:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P37323 and previous config saved to /var/cache/conftool/dbconfig/20221101-121242-ladsgroup.json
[12:20:49] jouncebot: now
[12:20:49] No deployments scheduled for the next 0 hour(s) and 39 minute(s)
[12:20:53] * urbanecm stashing at mwdebug1001
[12:21:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P37324 and previous config saved to /var/cache/conftool/dbconfig/20221101-122138-ladsgroup.json
[12:21:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host an-presto1006.eqiad.wmnet
[12:23:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T318605)', diff saved to https://phabricator.wikimedia.org/P37325 and previous config saved to /var/cache/conftool/dbconfig/20221101-122329-ladsgroup.json
[12:23:34] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[12:24:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P37326 and previous config saved to /var/cache/conftool/dbconfig/20221101-122442-ladsgroup.json
[12:26:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P37327 and previous config saved to /var/cache/conftool/dbconfig/20221101-122654-ladsgroup.json
[12:27:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P37328 and previous
config saved to /var/cache/conftool/dbconfig/20221101-122750-ladsgroup.json
[12:28:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-presto1006.eqiad.wmnet
[12:30:54] (PS4) Slyngshede: Bitu IDM, initial checkin [software/bitu] - https://gerrit.wikimedia.org/r/850465 (https://phabricator.wikimedia.org/T319410)
[12:32:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[12:36:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T318950)', diff saved to https://phabricator.wikimedia.org/P37329 and previous config saved to /var/cache/conftool/dbconfig/20221101-123646-ladsgroup.json
[12:36:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2126.codfw.wmnet with reason: Maintenance
[12:36:52] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[12:37:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2126.codfw.wmnet with reason: Maintenance
[12:37:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2095.codfw.wmnet with reason: Maintenance
[12:37:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2095.codfw.wmnet with reason: Maintenance
[12:37:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2126 (T318950)', diff saved to https://phabricator.wikimedia.org/P37330 and previous config saved to /var/cache/conftool/dbconfig/20221101-123714-ladsgroup.json
[12:38:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after
maintenance db2152', diff saved to https://phabricator.wikimedia.org/P37331 and previous config saved to /var/cache/conftool/dbconfig/20221101-123839-ladsgroup.json
[12:39:02] (CR) Vlad.shapik: [C: +1] "LGTM" [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/850453 (owner: Hnowlan)
[12:39:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T318950)', diff saved to https://phabricator.wikimedia.org/P37332 and previous config saved to /var/cache/conftool/dbconfig/20221101-123936-ladsgroup.json
[12:39:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T318955)', diff saved to https://phabricator.wikimedia.org/P37333 and previous config saved to /var/cache/conftool/dbconfig/20221101-123949-ladsgroup.json
[12:39:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2129.codfw.wmnet with reason: Maintenance
[12:40:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2129.codfw.wmnet with reason: Maintenance
[12:40:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2129 (T318955)', diff saved to https://phabricator.wikimedia.org/P37334 and previous config saved to /var/cache/conftool/dbconfig/20221101-124012-ladsgroup.json
[12:41:06] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[12:42:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T318955)', diff saved to https://phabricator.wikimedia.org/P37335 and previous config saved to /var/cache/conftool/dbconfig/20221101-124202-ladsgroup.json
[12:42:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[12:42:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[12:42:25]
!log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T318955)', diff saved to https://phabricator.wikimedia.org/P37336 and previous config saved to /var/cache/conftool/dbconfig/20221101-124225-ladsgroup.json
[12:42:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318 (T318605)', diff saved to https://phabricator.wikimedia.org/P37337 and previous config saved to /var/cache/conftool/dbconfig/20221101-124253-ladsgroup.json
[12:42:59] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[12:43:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T318950)', diff saved to https://phabricator.wikimedia.org/P37338 and previous config saved to /var/cache/conftool/dbconfig/20221101-124301-ladsgroup.json
[12:43:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[12:43:07] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[12:43:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[12:43:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[12:43:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[12:43:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T318950)', diff saved to https://phabricator.wikimedia.org/P37339 and previous config saved to /var/cache/conftool/dbconfig/20221101-124334-ladsgroup.json
[12:45:27] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:45:48] !log ladsgroup@cumin1001 dbctl commit
(dc=all): 'Repooling after maintenance db1146:3312 (T318950)', diff saved to https://phabricator.wikimedia.org/P37340 and previous config saved to /var/cache/conftool/dbconfig/20221101-124548-ladsgroup.json
[12:47:41] (CR) Filippo Giunchedi: [C: +1] Cleanup obsolete binary packages after bookworm dist-upgrade [puppet] - https://gerrit.wikimedia.org/r/851607 (owner: Muehlenhoff)
[12:48:46] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host an-presto1008.eqiad.wmnet
[12:49:15] (CR) Kosta Harlan: [C: +1] Deploy GrowthExperiments to 100% users at all wikis but dewiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/851604 (https://phabricator.wikimedia.org/T320876) (owner: Urbanecm)
[12:50:33] (CR) Kosta Harlan: [C: +1] Deploy GrowthExperiments to 100% users at all wikis but dewiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/851604 (https://phabricator.wikimedia.org/T320876) (owner: Urbanecm)
[12:51:17] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:51:37] (PS1) Stang: viwiki: Increase autoconfirmed edit count to 10 [mediawiki-config] - https://gerrit.wikimedia.org/r/851618 (https://phabricator.wikimedia.org/T322105)
[12:52:07] (CR) CI reject: [V: -1] viwiki: Increase autoconfirmed edit count to 10 [mediawiki-config] - https://gerrit.wikimedia.org/r/851618 (https://phabricator.wikimedia.org/T322105) (owner: Stang)
[12:52:34] (PS2) Stang: viwiki: Increase autoconfirmed edit count to 10 [mediawiki-config] - https://gerrit.wikimedia.org/r/851618 (https://phabricator.wikimedia.org/T322105)
[12:53:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T318955)', diff saved to https://phabricator.wikimedia.org/P37341 and previous config saved to
/var/cache/conftool/dbconfig/20221101-125331-ladsgroup.json
[12:53:37] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[12:53:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P37342 and previous config saved to /var/cache/conftool/dbconfig/20221101-125348-ladsgroup.json
[12:54:05] (PS1) Filippo Giunchedi: dispatch: add ipython for 'dispatch server shell' [docker-images/production-images] - https://gerrit.wikimedia.org/r/851619 (https://phabricator.wikimedia.org/T313229)
[12:54:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P37343 and previous config saved to /var/cache/conftool/dbconfig/20221101-125443-ladsgroup.json
[12:55:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T318955)', diff saved to https://phabricator.wikimedia.org/P37344 and previous config saved to /var/cache/conftool/dbconfig/20221101-125516-ladsgroup.json
[12:55:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-presto1008.eqiad.wmnet
[12:56:10] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host an-presto1010.eqiad.wmnet
[12:57:30] (CR) Muehlenhoff: [C: +2] Cleanup obsolete binary packages after bookworm dist-upgrade [puppet] - https://gerrit.wikimedia.org/r/851607 (owner: Muehlenhoff)
[12:57:55] (PS1) Filippo Giunchedi: dispatch: run wrapper with interactive/tty support [puppet] - https://gerrit.wikimedia.org/r/851620 (https://phabricator.wikimedia.org/T313229)
[12:58:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318', diff saved to https://phabricator.wikimedia.org/P37345 and previous config saved to /var/cache/conftool/dbconfig/20221101-125801-ladsgroup.json
[12:58:59] (PS1)
10JMeybohm: Move kubelet configuration to file [puppet] - 10https://gerrit.wikimedia.org/r/851621 [12:59:46] (03CR) 10CI reject: [V: 04-1] Move kubelet configuration to file [puppet] - 10https://gerrit.wikimedia.org/r/851621 (owner: 10JMeybohm) [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221101T1300). [13:00:05] koi: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:05] Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221101T1300) [13:00:14] i can deploy today [13:00:16] o/ [13:00:16] hi koi! [13:00:45] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851618 (https://phabricator.wikimedia.org/T322105) (owner: 10Stang) [13:00:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P37346 and previous config saved to /var/cache/conftool/dbconfig/20221101-130056-ladsgroup.json [13:01:07] (03PS1) 10Stang: zhwikivoyage: Add wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851622 (https://phabricator.wikimedia.org/T322133) [13:02:00] Hi urbanecm, I added another patch for this window ^ [13:02:03] sure, noted [13:02:08] (03Merged) 10jenkins-bot: viwiki: Increase autoconfirmed edit count to 10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851618 (https://phabricator.wikimedia.org/T322105) (owner: 10Stang) [13:02:50] I think the first one is not test-able, so maybe sync directly? 
[13:02:57] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:851618|viwiki: Increase autoconfirmed edit count to 10 (T322105)]] [13:03:06] koi: you can test it via Special:Userrights [13:03:11] T322105: Change the minimum requirements of autoconfirmed users to 10 edits and 4 days old on viwiki - https://phabricator.wikimedia.org/T322105 [13:03:35] !log urbanecm@deploy1002 urbanecm and stang: Backport for [[gerrit:851618|viwiki: Increase autoconfirmed edit count to 10 (T322105)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [13:03:48] aha, sure about this? [13:03:48] koi: at https://vi.wikipedia.org/wiki/%C4%90%E1%BA%B7c_bi%E1%BB%87t:Quy%E1%BB%81n_th%C3%A0nh_vi%C3%AAn/Martin_Urbanec, you see "Implicit member of: Autoconfirmed users" [13:03:57] (w/o mwdebug1001 at least) [13:04:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-presto1010.eqiad.wmnet [13:04:15] i actually have 80 edits, so it'll stay with mwdebug1001 [13:04:24] but if you have between 0 and 10 edits, you can test that way [13:04:32] koi: let me know how it goes [13:04:46] oh got it, let me randomly select one to test [13:05:33] sure [13:05:52] (03PS2) 10Urbanecm: zhwikivoyage: Add wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851622 (https://phabricator.wikimedia.org/T322133) (owner: 10Stang) [13:05:54] (03PS1) 10Btullis: Bump the version of Datahub to v0.9.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/851624 (https://phabricator.wikimedia.org/T321907) [13:05:56] (03CR) 10Urbanecm: [C: 03+2] zhwikivoyage: Add wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851622 (https://phabricator.wikimedia.org/T322133) (owner: 10Stang) [13:06:11] urbanecm: I checked a user with no edits on viwiki, and noticed they are not in the autoconfirmed group on mwdebug1001, so LGTM [13:06:20] yep, lgtm too, syncing! 
[13:06:32] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host an-presto1012.eqiad.wmnet [13:07:34] (03Merged) 10jenkins-bot: zhwikivoyage: Add wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851622 (https://phabricator.wikimedia.org/T322133) (owner: 10Stang) [13:08:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P37347 and previous config saved to /var/cache/conftool/dbconfig/20221101-130839-ladsgroup.json [13:08:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T318605)', diff saved to https://phabricator.wikimedia.org/P37348 and previous config saved to /var/cache/conftool/dbconfig/20221101-130856-ladsgroup.json [13:08:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance [13:09:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance [13:09:13] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [13:09:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2154 (T318605)', diff saved to https://phabricator.wikimedia.org/P37349 and previous config saved to /var/cache/conftool/dbconfig/20221101-130919-ladsgroup.json [13:09:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P37350 and previous config saved to /var/cache/conftool/dbconfig/20221101-130952-ladsgroup.json [13:10:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [13:10:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P37351 and previous config saved to 
/var/cache/conftool/dbconfig/20221101-131026-ladsgroup.json [13:12:05] (03CR) 10BBlack: [C: 03+2] Add digicert-2022 to available unified set [puppet] - 10https://gerrit.wikimedia.org/r/850286 (https://phabricator.wikimedia.org/T313328) (owner: 10BBlack) [13:13:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318', diff saved to https://phabricator.wikimedia.org/P37352 and previous config saved to /var/cache/conftool/dbconfig/20221101-131309-ladsgroup.json [13:13:32] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:851618|viwiki: Increase autoconfirmed edit count to 10 (T322105)]] (duration: 10m 35s) [13:13:39] this took a while [13:13:42] but it's live now koi [13:13:44] T322105: Change the minimum requirements of autoconfirmed users to 10 edits and 4 days old on viwiki - https://phabricator.wikimedia.org/T322105 [13:13:53] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851622 (https://phabricator.wikimedia.org/T322133) (owner: 10Stang) [13:13:58] doing the other one now [13:14:15] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:851622|zhwikivoyage: Add wordmark (T322133)]] [13:14:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [13:14:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [13:14:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-presto1012.eqiad.wmnet [13:14:29] T322133: Add wordmark to zhwikivoyage - https://phabricator.wikimedia.org/T322133 [13:14:41] !log urbanecm@deploy1002 urbanecm and stang: Backport for [[gerrit:851622|zhwikivoyage: Add wordmark (T322133)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [13:14:49] koi: second patch's at mwdebug1001 now [13:14:54] can you test 
please? [13:15:00] looking [13:15:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [13:16:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P37353 and previous config saved to /var/cache/conftool/dbconfig/20221101-131605-ladsgroup.json [13:16:50] urbanecm: tested under vector-2022 and mobile, both zh-hans and its variant work as expected, so LGTM [13:16:54] excellent, syncing [13:17:30] (03CR) 10Urbanecm: Deploy GrowthExperiments to 100% users at all wikis but dewiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851604 (https://phabricator.wikimedia.org/T320876) (owner: 10Urbanecm) [13:17:52] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ganeti1028.eqiad.wmnet with reason: Remove from cluster for eventual reimage [13:18:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ganeti1028.eqiad.wmnet with reason: Remove from cluster for eventual reimage [13:19:16] (03PS1) 10Urbanecm: [GrowthExperiments] Remove wmgGEFeaturesMayBeAvailableToNewcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851626 [13:19:58] (03PS2) 10Clare Ming: Add MP stream for VisualEditorFeatureUse instrument [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851128 (https://phabricator.wikimedia.org/T309602) [13:20:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [13:20:52] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:851622|zhwikivoyage: Add wordmark (T322133)]] (duration: 06m 36s) [13:20:56] koi: and, live! [13:21:11] thanks! 
[13:21:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [13:21:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [13:21:24] (03CR) 10Clare Ming: Add MP stream for VisualEditorFeatureUse instrument (035 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851128 (https://phabricator.wikimedia.org/T309602) (owner: 10Clare Ming) [13:21:32] and also purged the two URIs [13:21:34] koi: anything else? [13:21:36] (or anyone else) [13:21:40] nope [13:21:58] !log UTC afternoon B&C window done [13:22:01] closing the window then :) [13:22:09] T322133: Add wordmark to zhwikivoyage - https://phabricator.wikimedia.org/T322133 [13:22:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [13:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:45] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [13:23:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P37354 and previous config saved to /var/cache/conftool/dbconfig/20221101-132348-ladsgroup.json [13:25:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T318950)', diff saved to https://phabricator.wikimedia.org/P37355 and previous config saved to /var/cache/conftool/dbconfig/20221101-132500-ladsgroup.json [13:25:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2138.codfw.wmnet with reason: Maintenance [13:25:06] 
T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [13:25:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2138.codfw.wmnet with reason: Maintenance [13:25:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3312 (T318950)', diff saved to https://phabricator.wikimedia.org/P37356 and previous config saved to /var/cache/conftool/dbconfig/20221101-132523-ladsgroup.json [13:25:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P37357 and previous config saved to /var/cache/conftool/dbconfig/20221101-132537-ladsgroup.json [13:25:39] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [13:27:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T318950)', diff saved to https://phabricator.wikimedia.org/P37358 and previous config saved to /var/cache/conftool/dbconfig/20221101-132745-ladsgroup.json [13:28:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318 (T318605)', diff saved to https://phabricator.wikimedia.org/P37359 and previous config saved to /var/cache/conftool/dbconfig/20221101-132817-ladsgroup.json [13:28:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1104.eqiad.wmnet with reason: Maintenance [13:28:22] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [13:28:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db1104.eqiad.wmnet with reason: Maintenance [13:28:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1104 (T318605)', diff saved to https://phabricator.wikimedia.org/P37360 and previous config saved to /var/cache/conftool/dbconfig/20221101-132841-ladsgroup.json [13:30:15] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:31:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T318950)', diff saved to https://phabricator.wikimedia.org/P37361 and previous config saved to /var/cache/conftool/dbconfig/20221101-133113-ladsgroup.json [13:31:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [13:31:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [13:31:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [13:31:22] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [13:31:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [13:31:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T318950)', diff saved to https://phabricator.wikimedia.org/P37362 and previous config saved to /var/cache/conftool/dbconfig/20221101-133132-ladsgroup.json [13:32:12] /b 10 [13:32:21] Oops :) [13:33:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T318950)', diff saved to https://phabricator.wikimedia.org/P37363 and previous config saved to 
/var/cache/conftool/dbconfig/20221101-133346-ladsgroup.json [13:35:59] 10SRE, 10Infrastructure-Foundations: IDM milestone 1 "Initial development work" - https://phabricator.wikimedia.org/T319407 (10SLyngshede-WMF) [13:36:03] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:36:26] (03PS1) 10Zabe: "reverse-proxy-staging.php" -> "reverse-staging-labs.php" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851630 [13:36:44] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Initial Django project setup - https://phabricator.wikimedia.org/T319410 (10SLyngshede-WMF) 05Open→03In progress p:05Triage→03Low [13:37:17] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [13:37:46] (03CR) 10Jforrester: [C: 04-1] "This isn't deploy-safe. 
You need to make the new copy as one commit, switch the use in a second commit, and remove the old copy in a third" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851630 (owner: 10Zabe) [13:38:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T318955)', diff saved to https://phabricator.wikimedia.org/P37364 and previous config saved to /var/cache/conftool/dbconfig/20221101-133857-ladsgroup.json [13:39:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2141.codfw.wmnet with reason: Maintenance [13:39:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2141.codfw.wmnet with reason: Maintenance [13:39:28] (03PS1) 10Zabe: Copy reverse-proxy-staging.php to reverse-proxy-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851631 [13:39:31] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [13:39:58] (03PS2) 10Zabe: "reverse-proxy-staging.php" -> "reverse-staging-labs.php" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851630 [13:40:06] (03PS1) 10Filippo Giunchedi: wikimedia.org: add dispatch.w.o [dns] - 10https://gerrit.wikimedia.org/r/851632 (https://phabricator.wikimedia.org/T313229) [13:40:08] (03PS3) 10Zabe: "reverse-proxy-staging.php" -> "reverse-staging-labs.php" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851630 [13:40:15] (03CR) 10Zabe: "reverse-proxy-staging.php" -> "reverse-staging-labs.php" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851630 (owner: 10Zabe) [13:40:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T318955)', diff saved to https://phabricator.wikimedia.org/P37365 and previous config saved to /var/cache/conftool/dbconfig/20221101-134045-ladsgroup.json [13:40:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1131.eqiad.wmnet with 
reason: Maintenance [13:40:51] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4037.ulsfo.wmnet [13:41:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1131.eqiad.wmnet with reason: Maintenance [13:41:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T318955)', diff saved to https://phabricator.wikimedia.org/P37366 and previous config saved to /var/cache/conftool/dbconfig/20221101-134108-ladsgroup.json [13:41:09] (03CR) 10Kosta Harlan: [C: 03+1] [GrowthExperiments] Remove wmgGEFeaturesMayBeAvailableToNewcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851626 (owner: 10Urbanecm) [13:41:12] (03PS1) 10Zabe: Delete "reverse-proxy-staging.php" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851633 [13:41:57] seeking a reviewer for an easy one: https://gerrit.wikimedia.org/r/c/operations/dns/+/851632 [13:42:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [13:42:27] urbanecm, still around? [13:42:29] zabe: fyi, AFAIK scap now handles multi-file changes just fine (there's no technical need to split it into three patches) [13:42:33] heh, i was just writing you [13:42:36] yup [13:42:46] ok [13:42:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P37367 and previous config saved to /var/cache/conftool/dbconfig/20221101-134252-ladsgroup.json [13:42:53] should I squash it back into one? [13:42:57] (03CR) 10Ssingh: [C: 03+1] "Thanks, looks good! 
When I was checking this for Wikidough, I realized that we can remove this safely and hence the unless install_from_co" [puppet] - 10https://gerrit.wikimedia.org/r/851148 (owner: 10Andrew Bogott) [13:43:03] up to you. i can handle both variants :) [13:43:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T318605)', diff saved to https://phabricator.wikimedia.org/P37368 and previous config saved to /var/cache/conftool/dbconfig/20221101-134302-ladsgroup.json [13:43:11] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [13:43:13] PROBLEM - MediaWiki EtcdConfig up-to-date on parse2012 is CRITICAL: etcd last index (1663589) is outdated compared to the master one (1663595) https://wikitech.wikimedia.org/wiki/Etcd [13:43:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T318955)', diff saved to https://phabricator.wikimedia.org/P37369 and previous config saved to /var/cache/conftool/dbconfig/20221101-134318-ladsgroup.json [13:43:26] ftr, i asked Tyler to remove the "single-file" guidance from https://wikitech.wikimedia.org/wiki/Backport_windows [13:43:49] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T319126 (10phaultfinder) [13:44:53] tbh, I don't really care, you can also just merge all three at the same time and then sync through? 
[13:45:06] yeah [13:45:09] RECOVERY - MediaWiki EtcdConfig up-to-date on parse2012 is OK: etcd last index (1663601) matches the master one (1663601) https://wikitech.wikimedia.org/wiki/Etcd [13:45:26] not really sure what that file's about, so I'd prefer a +1 first if possible [13:45:33] (if you asked if i was around for deployment) [13:45:56] wait, seems to be just a rename [13:46:06] yes, also it's kinda beta only [13:46:54] yeah [13:46:55] let's do it [13:47:05] (03CR) 10Urbanecm: [C: 03+2] Copy reverse-proxy-staging.php to reverse-proxy-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851631 (owner: 10Zabe) [13:47:07] (03CR) 10Urbanecm: [C: 03+2] "reverse-proxy-staging.php" -> "reverse-staging-labs.php" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851630 (owner: 10Zabe) [13:47:09] (03CR) 10Urbanecm: [C: 03+2] Delete "reverse-proxy-staging.php" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851633 (owner: 10Zabe) [13:47:13] (03CR) 10Ssingh: [C: 03+1] wikimedia.org: add dispatch.w.o [dns] - 10https://gerrit.wikimedia.org/r/851632 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [13:47:27] (03CR) 10Filippo Giunchedi: [C: 03+2] "Thank you!" 
[dns] - 10https://gerrit.wikimedia.org/r/851632 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [13:47:31] (03PS2) 10Filippo Giunchedi: wikimedia.org: add dispatch.w.o [dns] - 10https://gerrit.wikimedia.org/r/851632 (https://phabricator.wikimedia.org/T313229) [13:48:06] (03CR) 10Urbanecm: [C: 03+2] "reverse-proxy-staging.php" -> "reverse-staging-labs.php" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851630 (owner: 10Zabe) [13:48:40] (03Merged) 10jenkins-bot: Copy reverse-proxy-staging.php to reverse-proxy-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851631 (owner: 10Zabe) [13:48:42] (03Merged) 10jenkins-bot: "reverse-proxy-staging.php" -> "reverse-staging-labs.php" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851630 (owner: 10Zabe) [13:48:44] (03Merged) 10jenkins-bot: Delete "reverse-proxy-staging.php" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851633 (owner: 10Zabe) [13:48:50] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851631 (owner: 10Zabe) [13:48:52] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851630 (owner: 10Zabe) [13:48:54] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851633 (owner: 10Zabe) [13:48:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P37370 and previous config saved to /var/cache/conftool/dbconfig/20221101-134854-ladsgroup.json [13:49:13] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:851631|Copy reverse-proxy-staging.php to reverse-proxy-labs.php]], [[gerrit:851630|"reverse-proxy-staging.php" -> "reverse-staging-labs.php"]], [[gerrit:851633|Delete 
"reverse-proxy-staging.php"]] [13:49:32] zabe: double-checking: it's ok to skip mwdebug, right? i don't see anything to test here [13:49:36] yes [13:49:41] !log urbanecm@deploy1002 urbanecm and zabe: Backport for [[gerrit:851631|Copy reverse-proxy-staging.php to reverse-proxy-labs.php]], [[gerrit:851630|"reverse-proxy-staging.php" -> "reverse-staging-labs.php"]], [[gerrit:851633|Delete "reverse-proxy-staging.php"]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [13:49:46] syncing [13:49:49] (03CR) 10Btullis: [C: 03+2] Bump the version of Datahub to v0.9.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/851624 (https://phabricator.wikimedia.org/T321907) (owner: 10Btullis) [13:50:28] (03PS2) 10Urbanecm: [GrowthExperiments] Remove wmgGEFeaturesMayBeAvailableToNewcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851626 [13:50:33] (03CR) 10Urbanecm: [C: 03+2] [GrowthExperiments] Remove wmgGEFeaturesMayBeAvailableToNewcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851626 (owner: 10Urbanecm) [13:50:38] sneaking ^^ out as well [13:50:46] (03PS1) 10Zabe: scap: Add reverse-staging-labs.php to beta-only files [puppet] - 10https://gerrit.wikimedia.org/r/851634 [13:50:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2158.codfw.wmnet with reason: Maintenance [13:51:08] (03CR) 10Urbanecm: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/851634 (owner: 10Zabe) [13:51:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2158.codfw.wmnet with reason: Maintenance [13:51:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on db2095.codfw.wmnet with reason: Maintenance [13:51:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2095.codfw.wmnet with reason: Maintenance [13:51:20] !log ladsgroup@cumin1001 dbctl 
commit (dc=all): 'Depooling db2158 (T318955)', diff saved to https://phabricator.wikimedia.org/P37371 and previous config saved to /var/cache/conftool/dbconfig/20221101-135120-ladsgroup.json [13:51:35] (03Merged) 10jenkins-bot: [GrowthExperiments] Remove wmgGEFeaturesMayBeAvailableToNewcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851626 (owner: 10Urbanecm) [13:52:09] (03PS2) 10Zabe: scap: Add reverse-staging-labs.php to beta-only files [puppet] - 10https://gerrit.wikimedia.org/r/851634 [13:52:22] (03PS3) 10Zabe: scap: Add reverse-proxy-labs.php to beta-only files [puppet] - 10https://gerrit.wikimedia.org/r/851634 [13:52:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [13:53:07] (03CR) 10Urbanecm: [C: 03+1] "...meh, filename. lgtm now 😄" [puppet] - 10https://gerrit.wikimedia.org/r/851634 (owner: 10Zabe) [13:53:34] (03Merged) 10jenkins-bot: Bump the version of Datahub to v0.9.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/851624 (https://phabricator.wikimedia.org/T321907) (owner: 10Btullis) [13:53:37] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [13:53:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [13:53:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [13:53:43] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:851631|Copy reverse-proxy-staging.php to reverse-proxy-labs.php]], [[gerrit:851630|"reverse-proxy-staging.php" -> "reverse-staging-labs.php"]], [[gerrit:851633|Delete "reverse-proxy-staging.php"]] (duration: 04m 30s) [13:53:50] zabe: it's live now [13:53:56] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851626 (owner: 10Urbanecm) [13:54:08] thanks :) [13:54:17] !log urbanecm@deploy1002 Started scap: Backport for 
[[gerrit:851626|[GrowthExperiments] Remove wmgGEFeaturesMayBeAvailableToNewcomers]] [13:54:32] no problem [13:54:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [13:54:47] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [13:54:53] !log installing exim4 security updates on buster [13:55:10] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4037.ulsfo.wmnet [13:55:56] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [13:56:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:11] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4038.ulsfo.wmnet [13:56:25] (03PS1) 10Ottomata: Enable rc0.mediawiki.page_change stream on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851636 (https://phabricator.wikimedia.org/T311129) [13:56:54] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [13:57:32] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1028.eqiad.wmnet with OS bullseye [13:57:37] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1028.eqiad.wmnet with OS bullseye [13:57:47] !log draining ganeti1016 for eventual reimage T311687 [13:58:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P37372 and previous config saved to /var/cache/conftool/dbconfig/20221101-135800-ladsgroup.json [13:58:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P37373 and previous config saved to /var/cache/conftool/dbconfig/20221101-135811-ladsgroup.json [13:58:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P37374 and previous config saved to /var/cache/conftool/dbconfig/20221101-135827-ladsgroup.json [13:58:49] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:851626|[GrowthExperiments] Remove wmgGEFeaturesMayBeAvailableToNewcomers]] (duration: 04m 32s) [13:58:55] * urbanecm done [13:59:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [13:59:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:53] T311687: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 [14:00:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [14:00:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [14:00:57] (03CR) 10Ottomata: [C: 03+2] Enable rc0.mediawiki.page_change stream on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851636 (https://phabricator.wikimedia.org/T311129) (owner: 10Ottomata) [14:01:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [14:02:28] (03Merged) 10jenkins-bot: Enable rc0.mediawiki.page_change stream on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851636 (https://phabricator.wikimedia.org/T311129) (owner: 10Ottomata) [14:04:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P37375 and previous config saved to /var/cache/conftool/dbconfig/20221101-140402-ladsgroup.json [14:04:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T318955)', diff saved to 
https://phabricator.wikimedia.org/P37376 and previous config saved to /var/cache/conftool/dbconfig/20221101-140430-ladsgroup.json [14:04:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104 (T318605)', diff saved to https://phabricator.wikimedia.org/P37377 and previous config saved to /var/cache/conftool/dbconfig/20221101-140439-ladsgroup.json [14:05:42] (03PS1) 10Ottomata: rc0.mediawiki.page_change stream - use eventgate-analytics-external [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851637 (https://phabricator.wikimedia.org/T311129) [14:06:03] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4038.ulsfo.wmnet [14:06:33] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/datahub: apply on main [14:06:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [14:06:41] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4039.ulsfo.wmnet [14:06:45] !log otto@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Enable rc0.mediawiki.page_change stream on testwiki - T311129 (duration: 03m 30s) [14:07:31] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main [14:07:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [14:07:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [14:08:09] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/datahub: apply on main [14:08:10] (03CR) 10Ottomata: [C: 03+2] rc0.mediawiki.page_change stream - use eventgate-analytics-external [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851637 (https://phabricator.wikimedia.org/T311129) (owner: 10Ottomata) [14:08:59] (03Merged) 10jenkins-bot: rc0.mediawiki.page_change stream - use eventgate-analytics-external [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851637 
(https://phabricator.wikimedia.org/T311129) (owner: 10Ottomata) [14:09:00] (JobUnavailable) firing: Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:09:09] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main [14:10:08] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1028.eqiad.wmnet with reason: host reimage [14:10:11] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ganeti1028.eqiad.wmnet with reason: host reimage [14:11:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [14:12:45] (03CR) 10Dmaza: rewrite.py: changes for Phonos deployment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/831955 (https://phabricator.wikimedia.org/T317417) (owner: 10MusikAnimal) [14:13:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T318950)', diff saved to https://phabricator.wikimedia.org/P37378 and previous config saved to /var/cache/conftool/dbconfig/20221101-141308-ladsgroup.json [14:13:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2148.codfw.wmnet with reason: Maintenance [14:13:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2148.codfw.wmnet with reason: Maintenance [14:13:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P37379 and previous config saved to /var/cache/conftool/dbconfig/20221101-141321-ladsgroup.json [14:13:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2148 (T318950)', diff saved to https://phabricator.wikimedia.org/P37380 and previous config saved to 
/var/cache/conftool/dbconfig/20221101-141322-ladsgroup.json [14:13:24] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [14:13:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P37381 and previous config saved to /var/cache/conftool/dbconfig/20221101-141335-ladsgroup.json [14:14:23] !log otto@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Use eventgate-analytics-external for rc0.mediawiki.page_change stream - T311129 (duration: 03m 42s) [14:15:05] T311129: [Shared Event Platform] Produce new mediawiki.page-change stream from MediaWiki EventBus - https://phabricator.wikimedia.org/T311129 [14:15:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T318950)', diff saved to https://phabricator.wikimedia.org/P37382 and previous config saved to /var/cache/conftool/dbconfig/20221101-141544-ladsgroup.json [14:16:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [14:16:47] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [14:17:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [14:18:08] (03PS1) 10Muehlenhoff: Add a few more package cleanups for bullseye->bookworm ABI changes [puppet] - 10https://gerrit.wikimedia.org/r/851638 [14:18:10] (03PS1) 10Muehlenhoff: Remove LDAP access for cdunn [puppet] - 10https://gerrit.wikimedia.org/r/851639 [14:18:25] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4039.ulsfo.wmnet [14:18:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [14:18:52] !log 
sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4040.ulsfo.wmnet [14:19:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T318950)', diff saved to https://phabricator.wikimedia.org/P37383 and previous config saved to /var/cache/conftool/dbconfig/20221101-141913-ladsgroup.json [14:19:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [14:19:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [14:19:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1162 (T318950)', diff saved to https://phabricator.wikimedia.org/P37384 and previous config saved to /var/cache/conftool/dbconfig/20221101-141924-ladsgroup.json [14:19:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P37385 and previous config saved to /var/cache/conftool/dbconfig/20221101-141936-ladsgroup.json [14:19:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104', diff saved to https://phabricator.wikimedia.org/P37386 and previous config saved to /var/cache/conftool/dbconfig/20221101-141945-ladsgroup.json [14:21:08] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [14:21:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T318950)', diff saved to https://phabricator.wikimedia.org/P37387 and previous config saved to /var/cache/conftool/dbconfig/20221101-142136-ladsgroup.json [14:23:26] (03PS2) 10Muehlenhoff: Remove LDAP access for cdunn [puppet] - 10https://gerrit.wikimedia.org/r/851639 [14:25:31] (03CR) 10Muehlenhoff: [C: 03+2] Remove LDAP access for cdunn [puppet] - 10https://gerrit.wikimedia.org/r/851639 (owner: 10Muehlenhoff) [14:28:32] !log ladsgroup@cumin1001 
dbctl commit (dc=all): 'Repooling after maintenance db2154 (T318605)', diff saved to https://phabricator.wikimedia.org/P37388 and previous config saved to /var/cache/conftool/dbconfig/20221101-142832-ladsgroup.json [14:28:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2161.codfw.wmnet with reason: Maintenance [14:28:37] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [14:28:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T318955)', diff saved to https://phabricator.wikimedia.org/P37389 and previous config saved to /var/cache/conftool/dbconfig/20221101-142842-ladsgroup.json [14:28:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1140.eqiad.wmnet with reason: Maintenance [14:28:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2161.codfw.wmnet with reason: Maintenance [14:28:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2161 (T318605)', diff saved to https://phabricator.wikimedia.org/P37390 and previous config saved to /var/cache/conftool/dbconfig/20221101-142854-ladsgroup.json [14:28:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1140.eqiad.wmnet with reason: Maintenance [14:29:00] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4040.ulsfo.wmnet [14:29:58] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [14:30:25] (03PS1) 10Muehlenhoff: Remove access for ejoseph [puppet] - 10https://gerrit.wikimedia.org/r/851641 [14:30:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P37391 and previous config saved to /var/cache/conftool/dbconfig/20221101-143051-ladsgroup.json [14:32:23] 
!log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1028.eqiad.wmnet with OS bullseye [14:32:27] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1028.eqiad.wmnet with OS bullseye completed: - ganeti1028 (**WARN**) - Downtimed on... [14:34:32] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging EJoseph out of all services on: 1202 hosts [14:34:42] (03CR) 10Andrew Bogott: [C: 03+2] pdns-recursor: remove delegation-only config setting [puppet] - 10https://gerrit.wikimedia.org/r/851148 (owner: 10Andrew Bogott) [14:34:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P37392 and previous config saved to /var/cache/conftool/dbconfig/20221101-143445-ladsgroup.json [14:34:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104', diff saved to https://phabricator.wikimedia.org/P37393 and previous config saved to /var/cache/conftool/dbconfig/20221101-143453-ladsgroup.json [14:34:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging EJoseph out of all services on: 1202 hosts [14:35:04] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging EJoseph out of all services on: 803 hosts [14:35:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging EJoseph out of all services on: 803 hosts [14:36:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P37394 and previous config saved to /var/cache/conftool/dbconfig/20221101-143645-ladsgroup.json [14:37:27] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4041.ulsfo.wmnet [14:40:14] (03CR) 10Ahmon Dancy: [C: 03+1] scap: Add 
reverse-proxy-labs.php to beta-only files [puppet] - 10https://gerrit.wikimedia.org/r/851634 (owner: 10Zabe) [14:40:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1165.eqiad.wmnet with reason: Maintenance [14:40:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1165.eqiad.wmnet with reason: Maintenance [14:40:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [14:40:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [14:40:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T318955)', diff saved to https://phabricator.wikimedia.org/P37395 and previous config saved to /var/cache/conftool/dbconfig/20221101-144053-ladsgroup.json [14:41:21] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:43:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T318955)', diff saved to https://phabricator.wikimedia.org/P37396 and previous config saved to /var/cache/conftool/dbconfig/20221101-144302-ladsgroup.json [14:43:08] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [14:43:13] (03PS1) 10Muehlenhoff: Record extended MOU for Robert West [puppet] - 10https://gerrit.wikimedia.org/r/851644 [14:43:39] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:45:15] (03PS1) 10Filippo Giunchedi: dispatch: configure http header auth provider [puppet] - 10https://gerrit.wikimedia.org/r/851645 (https://phabricator.wikimedia.org/T313229) [14:46:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db2148', diff saved to https://phabricator.wikimedia.org/P37397 and previous config saved to /var/cache/conftool/dbconfig/20221101-144559-ladsgroup.json [14:46:34] (03CR) 10Filippo Giunchedi: [C: 03+2] dispatch: run wrapper with interactive/tty support [puppet] - 10https://gerrit.wikimedia.org/r/851620 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [14:47:49] (03CR) 10Filippo Giunchedi: [C: 03+1] Add a few more package cleanups for bullseye->bookworm ABI changes [puppet] - 10https://gerrit.wikimedia.org/r/851638 (owner: 10Muehlenhoff) [14:48:01] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4041.ulsfo.wmnet [14:48:15] (03CR) 10Filippo Giunchedi: [C: 03+2] dispatch: configure http header auth provider [puppet] - 10https://gerrit.wikimedia.org/r/851645 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [14:48:20] (03PS2) 10Filippo Giunchedi: dispatch: configure http header auth provider [puppet] - 10https://gerrit.wikimedia.org/r/851645 (https://phabricator.wikimedia.org/T313229) [14:48:35] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] dispatch: add ipython for 'dispatch server shell' [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/851619 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [14:48:42] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 15.12 ms [14:49:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T318955)', diff saved to https://phabricator.wikimedia.org/P37398 and previous config saved to /var/cache/conftool/dbconfig/20221101-144954-ladsgroup.json [14:49:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2169.codfw.wmnet with reason: Maintenance [14:50:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1104 (T318605)', diff saved to https://phabricator.wikimedia.org/P37399 and 
previous config saved to /var/cache/conftool/dbconfig/20221101-145004-ladsgroup.json [14:50:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1111.eqiad.wmnet with reason: Maintenance [14:50:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2169.codfw.wmnet with reason: Maintenance [14:50:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1111.eqiad.wmnet with reason: Maintenance [14:50:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3316 (T318955)', diff saved to https://phabricator.wikimedia.org/P37400 and previous config saved to /var/cache/conftool/dbconfig/20221101-145019-ladsgroup.json [14:50:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1111 (T318605)', diff saved to https://phabricator.wikimedia.org/P37401 and previous config saved to /var/cache/conftool/dbconfig/20221101-145026-ladsgroup.json [14:51:10] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [14:51:44] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4042.ulsfo.wmnet [14:51:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P37402 and previous config saved to /var/cache/conftool/dbconfig/20221101-145152-ladsgroup.json [14:52:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [14:52:25] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [14:55:14] PROBLEM - Host 
parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:58:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P37403 and previous config saved to /var/cache/conftool/dbconfig/20221101-145813-ladsgroup.json [14:58:47] (03CR) 10Phuedx: [C: 03+1] Add MP stream for VisualEditorFeatureUse instrument [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851128 (https://phabricator.wikimedia.org/T309602) (owner: 10Clare Ming) [14:58:59] (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:59:10] !log dancy@deploy1002 Started scap: testwikis wikis to 1.40.0-wmf.8 refs T320513 [14:59:25] T320513: 1.40.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T320513 [15:01:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T318950)', diff saved to https://phabricator.wikimedia.org/P37404 and previous config saved to /var/cache/conftool/dbconfig/20221101-150107-ladsgroup.json [15:01:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2170.codfw.wmnet with reason: Maintenance [15:01:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2170.codfw.wmnet with reason: Maintenance [15:01:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3312 (T318950)', diff saved to https://phabricator.wikimedia.org/P37405 and previous config saved to /var/cache/conftool/dbconfig/20221101-150122-ladsgroup.json [15:01:23] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [15:02:40] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4042.ulsfo.wmnet [15:02:55] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db2161 (T318605)', diff saved to https://phabricator.wikimedia.org/P37406 and previous config saved to /var/cache/conftool/dbconfig/20221101-150255-ladsgroup.json [15:03:00] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [15:03:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T318950)', diff saved to https://phabricator.wikimedia.org/P37407 and previous config saved to /var/cache/conftool/dbconfig/20221101-150345-ladsgroup.json [15:04:13] (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:04:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T318955)', diff saved to https://phabricator.wikimedia.org/P37408 and previous config saved to /var/cache/conftool/dbconfig/20221101-150415-ladsgroup.json [15:04:16] !log dancy@deploy1002 Finished scap: testwikis wikis to 1.40.0-wmf.8 refs T320513 (duration: 05m 05s) [15:04:24] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [15:04:29] T320513: 1.40.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T320513 [15:04:46] (03CR) 10Phuedx: [C: 03+1] Add MP stream for VisualEditorFeatureUse instrument (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851128 (https://phabricator.wikimedia.org/T309602) (owner: 10Clare Ming) [15:04:50] (03PS1) 10David Caro: wmcs: add cookbook to add/remove a user to/from a project [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/851650 [15:06:05] !log dancy@deploy1002 Pruned MediaWiki: 1.40.0-wmf.6 (duration: 01m 47s) [15:07:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T318950)', diff saved to 
https://phabricator.wikimedia.org/P37409 and previous config saved to /var/cache/conftool/dbconfig/20221101-150659-ladsgroup.json [15:07:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [15:07:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [15:07:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T318950)', diff saved to https://phabricator.wikimedia.org/P37410 and previous config saved to /var/cache/conftool/dbconfig/20221101-150711-ladsgroup.json [15:07:12] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [15:08:08] (03CR) 10CI reject: [V: 04-1] wmcs: add cookbook to add/remove a user to/from a project [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/851650 (owner: 10David Caro) [15:09:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T318950)', diff saved to https://phabricator.wikimedia.org/P37411 and previous config saved to /var/cache/conftool/dbconfig/20221101-150922-ladsgroup.json [15:09:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [15:12:18] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.70 ms [15:13:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P37412 and previous config saved to /var/cache/conftool/dbconfig/20221101-151320-ladsgroup.json [15:13:42] (03PS2) 10David Caro: wmcs: add cookbook to add/remove a user to/from a project [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/851650 [15:13:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [15:13:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply 
[15:14:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [15:15:21] (03PS2) 10JMeybohm: Move kubelet configuration to file [puppet] - 10https://gerrit.wikimedia.org/r/851621 [15:16:01] (03CR) 10CI reject: [V: 04-1] Move kubelet configuration to file [puppet] - 10https://gerrit.wikimedia.org/r/851621 (owner: 10JMeybohm) [15:16:13] (03PS3) 10Clare Ming: Add MP stream for VisualEditorFeatureUse instrument [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851128 (https://phabricator.wikimedia.org/T309602) [15:18:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P37413 and previous config saved to /var/cache/conftool/dbconfig/20221101-151803-ladsgroup.json [15:18:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P37414 and previous config saved to /var/cache/conftool/dbconfig/20221101-151853-ladsgroup.json [15:19:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P37415 and previous config saved to /var/cache/conftool/dbconfig/20221101-151923-ladsgroup.json [15:21:20] (03PS1) 10Clare Ming: Update Edit Attempt Step sampling rate to 1 for group 0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851652 (https://phabricator.wikimedia.org/T312016) [15:23:13] (03PS3) 10JMeybohm: Move kubelet configuration to file [puppet] - 10https://gerrit.wikimedia.org/r/851621 [15:23:21] (03PS1) 10Ssingh: esitest: add config file dependency on /run/esitest creation [puppet] - 10https://gerrit.wikimedia.org/r/851654 [15:24:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P37416 and previous config saved to /var/cache/conftool/dbconfig/20221101-152430-ladsgroup.json 
[15:26:01] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37878/console" [puppet] - 10https://gerrit.wikimedia.org/r/851654 (owner: 10Ssingh) [15:28:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T318955)', diff saved to https://phabricator.wikimedia.org/P37417 and previous config saved to /var/cache/conftool/dbconfig/20221101-152827-ladsgroup.json [15:28:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1168.eqiad.wmnet with reason: Maintenance [15:28:33] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [15:28:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1168.eqiad.wmnet with reason: Maintenance [15:28:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T318955)', diff saved to https://phabricator.wikimedia.org/P37418 and previous config saved to /var/cache/conftool/dbconfig/20221101-152850-ladsgroup.json [15:30:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111 (T318605)', diff saved to https://phabricator.wikimedia.org/P37419 and previous config saved to /var/cache/conftool/dbconfig/20221101-153049-ladsgroup.json [15:30:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T318955)', diff saved to https://phabricator.wikimedia.org/P37420 and previous config saved to /var/cache/conftool/dbconfig/20221101-153059-ladsgroup.json [15:31:23] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [15:33:11] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for USERNAME and kinit credentials - https://phabricator.wikimedia.org/T322145 (10Hghani) [15:33:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling 
after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P37421 and previous config saved to /var/cache/conftool/dbconfig/20221101-153311-ladsgroup.json [15:33:14] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Hibashaath - https://phabricator.wikimedia.org/T322146 (10HShaath-WMF) [15:33:38] (03PS4) 10JMeybohm: Move kubelet configuration to file [puppet] - 10https://gerrit.wikimedia.org/r/851621 [15:33:53] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Ilooremeta - https://phabricator.wikimedia.org/T322147 (10ILooremeta-WMF) [15:34:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P37422 and previous config saved to /var/cache/conftool/dbconfig/20221101-153400-ladsgroup.json [15:34:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P37423 and previous config saved to /var/cache/conftool/dbconfig/20221101-153430-ladsgroup.json [15:34:40] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Hghani and kinit credentials - https://phabricator.wikimedia.org/T322145 (10Hghani) [15:34:45] (03CR) 10BBlack: [C: 03+1] esitest: add config file dependency on /run/esitest creation [puppet] - 10https://gerrit.wikimedia.org/r/851654 (owner: 10Ssingh) [15:35:37] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Ilooremeta - https://phabricator.wikimedia.org/T322147 (10KCVelaga_WMF) [15:37:25] (03CR) 10Phuedx: [C: 03+1] Update Edit Attempt Step sampling rate to 1 for group 0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851652 (https://phabricator.wikimedia.org/T312016) (owner: 10Clare Ming) [15:38:13] (03PS5) 10JMeybohm: Move kubelet configuration to file 
[puppet] - 10https://gerrit.wikimedia.org/r/851621 [15:39:14] (KubernetesAPILatency) firing: (9) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:39:16] (03CR) 10David Caro: [C: 03+2] "I've duplicated the hiera values on those projects with the correct values, will merge this now" [puppet] - 10https://gerrit.wikimedia.org/r/849483 (owner: 10David Caro) [15:39:21] (03PS4) 10David Caro: p::wmcs:nfs: Fix typo in the hiera lookup [puppet] - 10https://gerrit.wikimedia.org/r/849483 [15:39:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P37424 and previous config saved to /var/cache/conftool/dbconfig/20221101-153938-ladsgroup.json [15:41:35] (03CR) 10Ssingh: [V: 03+1 C: 03+2] esitest: add config file dependency on /run/esitest creation [puppet] - 10https://gerrit.wikimedia.org/r/851654 (owner: 10Ssingh) [15:42:24] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4043.ulsfo.wmnet [15:44:04] (03CR) 10Hnowlan: [C: 03+2] Provide additional tests to cover errors caused by wrong engine commands [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/841183 (https://phabricator.wikimedia.org/T318406) (owner: 10Vlad.shapik) [15:45:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111', diff saved to https://phabricator.wikimedia.org/P37425 and previous config saved to /var/cache/conftool/dbconfig/20221101-154557-ladsgroup.json [15:46:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P37426 and previous config saved to /var/cache/conftool/dbconfig/20221101-154607-ladsgroup.json [15:47:13] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 8): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37879/console" [puppet] - 10https://gerrit.wikimedia.org/r/851621 (owner: 10JMeybohm) [15:47:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [15:47:31] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [15:48:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T318605)', diff saved to https://phabricator.wikimedia.org/P37427 and previous config saved to /var/cache/conftool/dbconfig/20221101-154819-ladsgroup.json [15:48:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2162.codfw.wmnet with reason: Maintenance [15:48:34] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [15:48:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2162.codfw.wmnet with reason: Maintenance [15:48:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2162 (T318605)', diff saved to https://phabricator.wikimedia.org/P37428 and previous config saved to /var/cache/conftool/dbconfig/20221101-154844-ladsgroup.json [15:49:00] (03Merged) 10jenkins-bot: Provide additional tests to cover errors caused by wrong engine commands [software/thumbor-plugins] - 
10https://gerrit.wikimedia.org/r/841183 (https://phabricator.wikimedia.org/T318406) (owner: 10Vlad.shapik) [15:49:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T318950)', diff saved to https://phabricator.wikimedia.org/P37429 and previous config saved to /var/cache/conftool/dbconfig/20221101-154907-ladsgroup.json [15:49:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2175.codfw.wmnet with reason: Maintenance [15:49:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2175.codfw.wmnet with reason: Maintenance [15:49:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2175 (T318950)', diff saved to https://phabricator.wikimedia.org/P37430 and previous config saved to /var/cache/conftool/dbconfig/20221101-154919-ladsgroup.json [15:49:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T318955)', diff saved to https://phabricator.wikimedia.org/P37431 and previous config saved to /var/cache/conftool/dbconfig/20221101-154938-ladsgroup.json [15:49:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2171.codfw.wmnet with reason: Maintenance [15:49:51] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [15:49:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2171.codfw.wmnet with reason: Maintenance [15:50:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3316 (T318955)', diff saved to https://phabricator.wikimedia.org/P37432 and previous config saved to /var/cache/conftool/dbconfig/20221101-155002-ladsgroup.json [15:51:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T318950)', diff saved to https://phabricator.wikimedia.org/P37433 and previous config saved to 
/var/cache/conftool/dbconfig/20221101-155142-ladsgroup.json [15:51:43] (03CR) 10Muehlenhoff: [C: 03+2] Record extended MOU for Robert West [puppet] - 10https://gerrit.wikimedia.org/r/851644 (owner: 10Muehlenhoff) [15:51:44] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [15:51:54] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4043.ulsfo.wmnet [15:52:31] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [15:53:31] (03CR) 10Muehlenhoff: [C: 03+2] Add a few more package cleanups for bullseye->bookworm ABI changes [puppet] - 10https://gerrit.wikimedia.org/r/851638 (owner: 10Muehlenhoff) [15:54:18] (03PS1) 10Andrew Bogott: rsyncd.pp: use gid 'nogroup' rather than 'nobody' [puppet] - 10https://gerrit.wikimedia.org/r/851661 (https://phabricator.wikimedia.org/T322149) [15:54:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T318950)', diff saved to https://phabricator.wikimedia.org/P37434 and previous config saved to /var/cache/conftool/dbconfig/20221101-155446-ladsgroup.json [15:54:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [15:54:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [15:54:54] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [15:54:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T318950)', diff saved to 
https://phabricator.wikimedia.org/P37435 and previous config saved to /var/cache/conftool/dbconfig/20221101-155458-ladsgroup.json [15:55:32] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [15:55:44] (03CR) 10Andrew Bogott: "I'm not sure that this fully resolves the attached bug but I think it's correct regardless." [puppet] - 10https://gerrit.wikimedia.org/r/851661 (https://phabricator.wikimedia.org/T322149) (owner: 10Andrew Bogott) [15:57:28] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [15:57:31] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [16:00:05] jbond and rzl: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221101T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:00:48] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4044.ulsfo.wmnet [16:01:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111', diff saved to https://phabricator.wikimedia.org/P37436 and previous config saved to /var/cache/conftool/dbconfig/20221101-160106-ladsgroup.json [16:01:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P37437 and previous config saved to /var/cache/conftool/dbconfig/20221101-160116-ladsgroup.json [16:03:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T318950)', diff saved to https://phabricator.wikimedia.org/P37438 and previous config saved to /var/cache/conftool/dbconfig/20221101-160308-ladsgroup.json [16:03:13] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [16:03:44] (03PS1) 10Muehlenhoff: Additional MOU extensions [puppet] - 10https://gerrit.wikimedia.org/r/851665 [16:03:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T318955)', diff saved to https://phabricator.wikimedia.org/P37439 and previous config saved to /var/cache/conftool/dbconfig/20221101-160344-ladsgroup.json [16:04:50] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [16:06:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P37440 and previous config saved to /var/cache/conftool/dbconfig/20221101-160649-ladsgroup.json [16:10:55] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4044.ulsfo.wmnet [16:11:58] PROBLEM - Host labstore1007.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:13:43] (03PS1) 10JHathaway: aux-k8s: VIP for kubernetes api server [dns] - 
10https://gerrit.wikimedia.org/r/851668 [16:15:38] (03PS1) 10BCornwall: prometheus: Handle inactive trafficserver service [puppet] - 10https://gerrit.wikimedia.org/r/851669 (https://phabricator.wikimedia.org/T292815) [16:16:04] (03CR) 10JHathaway: [C: 03+2] aux-k8s: VIP for kubernetes api server [dns] - 10https://gerrit.wikimedia.org/r/851668 (owner: 10JHathaway) [16:16:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1111 (T318605)', diff saved to https://phabricator.wikimedia.org/P37441 and previous config saved to /var/cache/conftool/dbconfig/20221101-161614-ladsgroup.json [16:16:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1114.eqiad.wmnet with reason: Maintenance [16:16:20] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [16:16:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T318955)', diff saved to https://phabricator.wikimedia.org/P37442 and previous config saved to /var/cache/conftool/dbconfig/20221101-161625-ladsgroup.json [16:16:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1180.eqiad.wmnet with reason: Maintenance [16:16:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1114.eqiad.wmnet with reason: Maintenance [16:16:32] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [16:16:37] (03CR) 10Alexandros Kosiaris: [C: 04-1] admin: add mw on kubernetes namespaces (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/850095 (https://phabricator.wikimedia.org/T321786) (owner: 10Clément Goubert) [16:16:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1114 (T318605)', diff saved to https://phabricator.wikimedia.org/P37443 and previous config saved to 
/var/cache/conftool/dbconfig/20221101-161636-ladsgroup.json [16:16:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1180.eqiad.wmnet with reason: Maintenance [16:16:44] (03CR) 10CI reject: [V: 04-1] prometheus: Handle inactive trafficserver service [puppet] - 10https://gerrit.wikimedia.org/r/851669 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [16:16:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T318955)', diff saved to https://phabricator.wikimedia.org/P37444 and previous config saved to /var/cache/conftool/dbconfig/20221101-161648-ladsgroup.json [16:16:54] (03CR) 10Muehlenhoff: [C: 03+2] Additional MOU extensions [puppet] - 10https://gerrit.wikimedia.org/r/851665 (owner: 10Muehlenhoff) [16:18:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P37445 and previous config saved to /var/cache/conftool/dbconfig/20221101-161816-ladsgroup.json [16:18:28] (03PS2) 10BCornwall: prometheus: Handle inactive trafficserver service [puppet] - 10https://gerrit.wikimedia.org/r/851669 (https://phabricator.wikimedia.org/T292815) [16:18:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P37446 and previous config saved to /var/cache/conftool/dbconfig/20221101-161851-ladsgroup.json [16:19:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T318955)', diff saved to https://phabricator.wikimedia.org/P37447 and previous config saved to /var/cache/conftool/dbconfig/20221101-161859-ladsgroup.json [16:19:22] !log jhathaway@cumin1001 START - Cookbook sre.dns.netbox [16:21:34] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:21:46] (03PS1) 10Filippo Giunchedi: dispatch: move frontend to its own module [puppet] - 10https://gerrit.wikimedia.org/r/851672 
(https://phabricator.wikimedia.org/T313229) [16:21:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P37448 and previous config saved to /var/cache/conftool/dbconfig/20221101-162158-ladsgroup.json [16:22:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T318605)', diff saved to https://phabricator.wikimedia.org/P37449 and previous config saved to /var/cache/conftool/dbconfig/20221101-162206-ladsgroup.json [16:22:20] (03PS1) 10Ottomata: Declare rc0.mediawiki.page_content_change stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851673 (https://phabricator.wikimedia.org/T307959) [16:22:20] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [16:24:10] (03CR) 10CI reject: [V: 04-1] dispatch: move frontend to its own module [puppet] - 10https://gerrit.wikimedia.org/r/851672 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [16:25:33] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [16:26:00] (03PS2) 10Filippo Giunchedi: dispatch: move frontend to its own module [puppet] - 10https://gerrit.wikimedia.org/r/851672 (https://phabricator.wikimedia.org/T313229) [16:27:11] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T319217 (10Jclark-ctr) [16:27:42] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:28:20] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (Kanban): Replace labstore100[67] with clouddumps100[12] - https://phabricator.wikimedia.org/T309346 (10Jclark-ctr) [16:29:48] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37880/console" [puppet] - 10https://gerrit.wikimedia.org/r/851672 
(https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [16:31:03] 10SRE, 10ops-eqiad: msw-c5-eqiad offline - https://phabricator.wikimedia.org/T321311 (10Jclark-ctr) 05Open→03Resolved [16:33:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P37450 and previous config saved to /var/cache/conftool/dbconfig/20221101-163324-ladsgroup.json [16:33:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P37451 and previous config saved to /var/cache/conftool/dbconfig/20221101-163358-ladsgroup.json [16:34:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P37452 and previous config saved to /var/cache/conftool/dbconfig/20221101-163407-ladsgroup.json [16:34:19] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for KMorgan - https://phabricator.wikimedia.org/T322154 (10KMorgan-WMF) [16:37:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T318950)', diff saved to https://phabricator.wikimedia.org/P37453 and previous config saved to /var/cache/conftool/dbconfig/20221101-163706-ladsgroup.json [16:37:12] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [16:37:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P37454 and previous config saved to /var/cache/conftool/dbconfig/20221101-163713-ladsgroup.json [16:38:16] (03PS1) 10JHathaway: aux-k8s: add LVS config for API server [puppet] - 10https://gerrit.wikimedia.org/r/851676 (https://phabricator.wikimedia.org/T321137) [16:41:10] 10SRE, 10Growth-Team, 10Notifications, 10serviceops, 10Wikimedia-production-error: Failed to fetch API response from {wiki}. 
Error code {code} - https://phabricator.wikimedia.org/T321409 (10LSobanski) [16:41:25] !log jhathaway@cumin1001 START - Cookbook sre.dns.netbox [16:42:36] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:42:39] .12 [16:42:43] err :) [16:44:18] 10SRE, 10Infrastructure-Foundations: service::docker with 'latest' version behaves poorly if the host runs out of disk space - https://phabricator.wikimedia.org/T321851 (10LSobanski) [16:45:44] (03PS1) 10Ssingh: esitest: add explicit require on /run/esitest [puppet] - 10https://gerrit.wikimedia.org/r/851677 [16:46:48] 10SRE, 10Infrastructure-Foundations, 10Goal: Meta task "Revamp user authentication" - https://phabricator.wikimedia.org/T116747 (10LSobanski) [16:47:17] (03CR) 10Ssingh: [C: 03+2] esitest: add explicit require on /run/esitest [puppet] - 10https://gerrit.wikimedia.org/r/851677 (owner: 10Ssingh) [16:48:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T318950)', diff saved to https://phabricator.wikimedia.org/P37455 and previous config saved to /var/cache/conftool/dbconfig/20221101-164832-ladsgroup.json [16:48:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1188.eqiad.wmnet with reason: Maintenance [16:48:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1188.eqiad.wmnet with reason: Maintenance [16:48:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1188 (T318950)', diff saved to https://phabricator.wikimedia.org/P37456 and previous config saved to /var/cache/conftool/dbconfig/20221101-164845-ladsgroup.json [16:48:47] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4045.ulsfo.wmnet [16:49:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T318955)', diff saved to https://phabricator.wikimedia.org/P37457 and previous config saved to 
/var/cache/conftool/dbconfig/20221101-164907-ladsgroup.json [16:49:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2180.codfw.wmnet with reason: Maintenance [16:49:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P37458 and previous config saved to /var/cache/conftool/dbconfig/20221101-164914-ladsgroup.json [16:49:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2180.codfw.wmnet with reason: Maintenance [16:49:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2180 (T318955)', diff saved to https://phabricator.wikimedia.org/P37459 and previous config saved to /var/cache/conftool/dbconfig/20221101-164930-ladsgroup.json [16:49:37] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [16:50:30] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [16:50:42] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/851676 (https://phabricator.wikimedia.org/T321137) (owner: 10JHathaway) [16:50:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T318955)', diff saved to https://phabricator.wikimedia.org/P37460 and previous config saved to /var/cache/conftool/dbconfig/20221101-165042-ladsgroup.json [16:51:00] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/851676 (https://phabricator.wikimedia.org/T321137) (owner: 10JHathaway) [16:51:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T318950)', diff saved to https://phabricator.wikimedia.org/P37461 and previous config saved to /var/cache/conftool/dbconfig/20221101-165100-ladsgroup.json [16:52:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff 
saved to https://phabricator.wikimedia.org/P37462 and previous config saved to /var/cache/conftool/dbconfig/20221101-165221-ladsgroup.json [16:53:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114 (T318605)', diff saved to https://phabricator.wikimedia.org/P37463 and previous config saved to /var/cache/conftool/dbconfig/20221101-165323-ladsgroup.json [16:53:50] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [16:58:26] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4045.ulsfo.wmnet [17:04:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T318955)', diff saved to https://phabricator.wikimedia.org/P37464 and previous config saved to /var/cache/conftool/dbconfig/20221101-170424-ladsgroup.json [17:04:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1187.eqiad.wmnet with reason: Maintenance [17:04:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1187.eqiad.wmnet with reason: Maintenance [17:04:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1187 (T318955)', diff saved to https://phabricator.wikimedia.org/P37465 and previous config saved to /var/cache/conftool/dbconfig/20221101-170447-ladsgroup.json [17:05:00] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4046.ulsfo.wmnet [17:05:39] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [17:05:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P37466 and previous config saved to /var/cache/conftool/dbconfig/20221101-170550-ladsgroup.json [17:06:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to 
https://phabricator.wikimedia.org/P37467 and previous config saved to /var/cache/conftool/dbconfig/20221101-170608-ladsgroup.json [17:06:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T318955)', diff saved to https://phabricator.wikimedia.org/P37468 and previous config saved to /var/cache/conftool/dbconfig/20221101-170656-ladsgroup.json [17:07:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T318605)', diff saved to https://phabricator.wikimedia.org/P37469 and previous config saved to /var/cache/conftool/dbconfig/20221101-170730-ladsgroup.json [17:07:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance [17:07:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance [17:07:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2163 (T318605)', diff saved to https://phabricator.wikimedia.org/P37470 and previous config saved to /var/cache/conftool/dbconfig/20221101-170752-ladsgroup.json [17:08:00] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [17:08:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114', diff saved to https://phabricator.wikimedia.org/P37471 and previous config saved to /var/cache/conftool/dbconfig/20221101-170832-ladsgroup.json [17:12:54] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:14:08] (03PS1) 10David Caro: webservice: add toolforge-* link for it [puppet] - 10https://gerrit.wikimedia.org/r/851685 [17:14:37] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4046.ulsfo.wmnet [17:14:44] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host 
cp4047.ulsfo.wmnet [17:14:44] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48974 bytes in 0.061 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:19:16] PROBLEM - Check systemd state on cp5007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:20:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P37472 and previous config saved to /var/cache/conftool/dbconfig/20221101-172058-ladsgroup.json [17:21:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P37473 and previous config saved to /var/cache/conftool/dbconfig/20221101-172116-ladsgroup.json [17:22:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P37474 and previous config saved to /var/cache/conftool/dbconfig/20221101-172204-ladsgroup.json [17:23:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114', diff saved to https://phabricator.wikimedia.org/P37475 and previous config saved to /var/cache/conftool/dbconfig/20221101-172341-ladsgroup.json [17:24:32] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4047.ulsfo.wmnet [17:24:39] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4048.ulsfo.wmnet [17:26:24] (03CR) 10Majavah: [C: 04-1] "I'd prefer to do this in the Debian packaging and not here." 
[puppet] - 10https://gerrit.wikimedia.org/r/851685 (owner: 10David Caro) [17:31:18] (03CR) 10David Caro: webservice: add toolforge-* link for it (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/851685 (owner: 10David Caro) [17:33:04] RECOVERY - Check systemd state on cp5007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:34:09] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4048.ulsfo.wmnet [17:34:16] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4049.ulsfo.wmnet [17:35:27] (03PS3) 10BBlack: Switch drmrs, eqsin, esams to digicert-2022 [puppet] - 10https://gerrit.wikimedia.org/r/850287 (https://phabricator.wikimedia.org/T313328) [17:35:44] (03PS1) 10Ssingh: haproxy: use systemd::tmpfile to create /run/haproxy (and esitest) [puppet] - 10https://gerrit.wikimedia.org/r/851689 [17:36:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T318955)', diff saved to https://phabricator.wikimedia.org/P37476 and previous config saved to /var/cache/conftool/dbconfig/20221101-173607-ladsgroup.json [17:36:15] (03PS1) 10Btullis: Add a postgresql database for the airflow development [puppet] - 10https://gerrit.wikimedia.org/r/851690 (https://phabricator.wikimedia.org/T319440) [17:36:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T318950)', diff saved to https://phabricator.wikimedia.org/P37477 and previous config saved to /var/cache/conftool/dbconfig/20221101-173624-ladsgroup.json [17:36:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1197.eqiad.wmnet with reason: Maintenance [17:36:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1197.eqiad.wmnet with reason: Maintenance [17:36:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1197 (T318950)', diff 
saved to https://phabricator.wikimedia.org/P37478 and previous config saved to /var/cache/conftool/dbconfig/20221101-173636-ladsgroup.json [17:36:41] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [17:36:46] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [17:37:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P37479 and previous config saved to /var/cache/conftool/dbconfig/20221101-173712-ladsgroup.json [17:37:50] (03CR) 10Btullis: [C: 03+2] Add a postgresql database for the airflow development [puppet] - 10https://gerrit.wikimedia.org/r/851690 (https://phabricator.wikimedia.org/T319440) (owner: 10Btullis) [17:38:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114 (T318605)', diff saved to https://phabricator.wikimedia.org/P37480 and previous config saved to /var/cache/conftool/dbconfig/20221101-173848-ladsgroup.json [17:38:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1116.eqiad.wmnet with reason: Maintenance [17:39:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1116.eqiad.wmnet with reason: Maintenance [17:40:17] (03CR) 10Jelto: [C: 03+2] Allow rsync to doc.discovery.wmnet from trusted runner containers [puppet] - 10https://gerrit.wikimedia.org/r/850635 (https://phabricator.wikimedia.org/T321629) (owner: 10Ahmon Dancy) [17:40:33] Thanks Jelto! [17:40:58] PROBLEM - Check systemd state on cp5007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:41:20] huh! 
[17:41:28] this is new [17:41:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T318605)', diff saved to https://phabricator.wikimedia.org/P37481 and previous config saved to /var/cache/conftool/dbconfig/20221101-174129-ladsgroup.json [17:41:32] (03PS2) 10David Caro: webservice: add toolforge-* link for it [puppet] - 10https://gerrit.wikimedia.org/r/851685 [17:41:34] (03CR) 10David Caro: webservice: add toolforge-* link for it (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/851685 (owner: 10David Caro) [17:42:26] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37886/console" [puppet] - 10https://gerrit.wikimedia.org/r/851685 (owner: 10David Caro) [17:44:04] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4049.ulsfo.wmnet [17:44:10] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4050.ulsfo.wmnet [17:44:52] (03PS1) 10Filippo Giunchedi: dispatch: refactor/simplify db profile [puppet] - 10https://gerrit.wikimedia.org/r/851693 (https://phabricator.wikimedia.org/T313229) [17:47:36] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37887/console" [puppet] - 10https://gerrit.wikimedia.org/r/851693 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [17:48:52] RECOVERY - Check systemd state on cp5007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:51:13] (03CR) 10CDanis: [C: 03+1] haproxy: use systemd::tmpfile to create /run/haproxy (and esitest) [puppet] - 10https://gerrit.wikimedia.org/r/851689 (owner: 10Ssingh) [17:52:14] (03PS6) 10BCornwall: varnish: Conditionally set WMF-Last-Access cookie [puppet] - 10https://gerrit.wikimedia.org/r/849184 (https://phabricator.wikimedia.org/T262996) 
[17:52:16] (03PS2) 10Ssingh: haproxy: use systemd::tmpfile to create /run/haproxy (and esitest) [puppet] - 10https://gerrit.wikimedia.org/r/851689 [17:52:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T318955)', diff saved to https://phabricator.wikimedia.org/P37482 and previous config saved to /var/cache/conftool/dbconfig/20221101-175221-ladsgroup.json [17:52:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1201.eqiad.wmnet with reason: Maintenance [17:52:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1201.eqiad.wmnet with reason: Maintenance [17:52:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1201 (T318955)', diff saved to https://phabricator.wikimedia.org/P37483 and previous config saved to /var/cache/conftool/dbconfig/20221101-175244-ladsgroup.json [17:53:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:53:40] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4050.ulsfo.wmnet [17:53:46] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4051.ulsfo.wmnet [17:53:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T318955)', diff saved to https://phabricator.wikimedia.org/P37484 and previous config saved to /var/cache/conftool/dbconfig/20221101-175353-ladsgroup.json [17:54:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P37485 and previous config saved to /var/cache/conftool/dbconfig/20221101-175405-ladsgroup.json [17:54:12] (03CR) 10Ssingh: [C: 03+2] 
haproxy: use systemd::tmpfile to create /run/haproxy (and esitest) [puppet] - 10https://gerrit.wikimedia.org/r/851689 (owner: 10Ssingh) [17:56:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P37486 and previous config saved to /var/cache/conftool/dbconfig/20221101-175639-ladsgroup.json [17:57:53] 10SRE-swift-storage, 10Community-Tech, 10MediaWiki-extensions-Phonos, 10MW-1.40-notes (1.40.0-wmf.6; 2022-10-17): Establish Phonos production storage requirements - https://phabricator.wikimedia.org/T320675 (10MusikAnimal) [17:58:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:00:04] jeena and jnuche: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Utc-7+Utc-0 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221101T1800). 
[18:03:53] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4051.ulsfo.wmnet [18:04:00] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp4052.ulsfo.wmnet [18:07:07] !log jhuneidi@deploy1002 Started scap: testwikis wikis to 1.40.0-wmf.8 refs T320513 [18:07:13] T320513: 1.40.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T320513 [18:08:14] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.0104 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [18:09:00] (JobUnavailable) firing: Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:09:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P37487 and previous config saved to /var/cache/conftool/dbconfig/20221101-180902-ladsgroup.json [18:09:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P37488 and previous config saved to /var/cache/conftool/dbconfig/20221101-180913-ladsgroup.json [18:09:23] (03PS1) 10BCornwall: DO NOT MERGE: Testing the test suite [puppet] - 10https://gerrit.wikimedia.org/r/851696 [18:11:25] !log jhuneidi@deploy1002 Finished scap: testwikis wikis to 1.40.0-wmf.8 refs T320513 (duration: 04m 18s) [18:11:43] (03PS2) 10BCornwall: DO NOT MERGE: Testing the test suite [puppet] - 10https://gerrit.wikimedia.org/r/851696 [18:11:45] (03PS7) 10BCornwall: varnish: Conditionally set WMF-Last-Access cookie [puppet] - 10https://gerrit.wikimedia.org/r/849184 (https://phabricator.wikimedia.org/T262996) [18:11:48] !log ladsgroup@cumin1001 dbctl 
commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P37489 and previous config saved to /var/cache/conftool/dbconfig/20221101-181148-ladsgroup.json [18:12:46] (03PS1) 10TrainBranchBot: group0 wikis to 1.40.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851697 (https://phabricator.wikimedia.org/T320513) [18:12:48] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.40.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851697 (https://phabricator.wikimedia.org/T320513) (owner: 10TrainBranchBot) [18:12:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1126.eqiad.wmnet with reason: Maintenance [18:13:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1126.eqiad.wmnet with reason: Maintenance [18:13:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1126 (T318605)', diff saved to https://phabricator.wikimedia.org/P37490 and previous config saved to /var/cache/conftool/dbconfig/20221101-181310-ladsgroup.json [18:13:28] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4052.ulsfo.wmnet [18:14:04] (03Merged) 10jenkins-bot: group0 wikis to 1.40.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851697 (https://phabricator.wikimedia.org/T320513) (owner: 10TrainBranchBot) [18:14:16] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [18:18:14] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.40.0-wmf.8 refs T320513 [18:18:20] T320513: 1.40.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T320513 [18:20:08] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), No backups: 1 (dispatch-be1001), Fresh: 123 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [18:21:21] (03PS1) 10Ssingh: esitest: do not require on 
obsoleted file resource [puppet] - 10https://gerrit.wikimedia.org/r/851698 [18:21:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [18:22:13] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp4052.ulsfo.wmnet,service=ats-tls [18:22:14] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp4052.ulsfo.wmnet,service=ats-be [18:22:14] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp4052.ulsfo.wmnet,service=varnish-fe [18:22:23] (03CR) 10Ssingh: [C: 03+2] esitest: do not require on obsoleted file resource [puppet] - 10https://gerrit.wikimedia.org/r/851698 (owner: 10Ssingh) [18:22:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [18:22:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [18:22:52] (03PS2) 10JHathaway: aux-k8s: add LVS config for API server [puppet] - 10https://gerrit.wikimedia.org/r/851676 (https://phabricator.wikimedia.org/T321137) [18:23:09] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/851676 (https://phabricator.wikimedia.org/T321137) (owner: 10JHathaway) [18:23:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [18:24:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P37491 and previous config saved to /var/cache/conftool/dbconfig/20221101-182412-ladsgroup.json [18:24:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T318950)', diff saved to https://phabricator.wikimedia.org/P37492 and previous config saved to /var/cache/conftool/dbconfig/20221101-182421-ladsgroup.json [18:24:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [18:24:26] !log 
ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [18:25:48] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [18:26:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T318605)', diff saved to https://phabricator.wikimedia.org/P37493 and previous config saved to /var/cache/conftool/dbconfig/20221101-182655-ladsgroup.json [18:26:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance [18:27:04] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [18:27:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance [18:27:12] (03PS1) 10Ssingh: esitest: do not require on obsoleted file resource (attempt 2) [puppet] - 10https://gerrit.wikimedia.org/r/851699 [18:27:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [18:27:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [18:27:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2164 (T318605)', diff saved to https://phabricator.wikimedia.org/P37494 and previous config saved to /var/cache/conftool/dbconfig/20221101-182734-ladsgroup.json [18:27:42] PROBLEM - SSH on mw1330.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:27:48] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission frauth1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T314517 (10Jgreen) [18:27:57] (03CR) 10Ssingh: [C: 03+2] esitest: do not require on obsoleted file 
resource (attempt 2) [puppet] - 10https://gerrit.wikimedia.org/r/851699 (owner: 10Ssingh) [18:28:41] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission frlog1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T315924 (10Jgreen) [18:29:05] (03PS5) 10BCornwall: prometheus: Add ats header/body size total metrics [puppet] - 10https://gerrit.wikimedia.org/r/845688 (https://phabricator.wikimedia.org/T284304) [18:29:07] (03PS3) 10BCornwall: DO NOT MERGE: Testing the test suite [puppet] - 10https://gerrit.wikimedia.org/r/851696 [18:29:09] (03PS8) 10BCornwall: varnish: Conditionally set WMF-Last-Access cookie [puppet] - 10https://gerrit.wikimedia.org/r/849184 (https://phabricator.wikimedia.org/T262996) [18:32:14] RECOVERY - Disk space on cp5007 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=cp5007&var-datasource=eqsin+prometheus/ops [18:39:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T318955)', diff saved to https://phabricator.wikimedia.org/P37495 and previous config saved to /var/cache/conftool/dbconfig/20221101-183920-ladsgroup.json [18:39:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [18:39:26] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [18:39:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [18:45:14] PROBLEM - SSH on mw1307.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:45:25] (03PS9) 10BCornwall: varnish: Conditionally set WMF-Last-Access cookie [puppet] - 10https://gerrit.wikimedia.org/r/849184 (https://phabricator.wikimedia.org/T262996) [18:45:33] (03Abandoned) 
10BCornwall: DO NOT MERGE: Testing the test suite [puppet] - 10https://gerrit.wikimedia.org/r/851696 (owner: 10BCornwall) [18:47:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126 (T318605)', diff saved to https://phabricator.wikimedia.org/P37496 and previous config saved to /var/cache/conftool/dbconfig/20221101-184758-ladsgroup.json [18:48:15] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [18:56:54] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.005446 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [18:59:13] (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [19:01:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T318605)', diff saved to https://phabricator.wikimedia.org/P37497 and previous config saved to /var/cache/conftool/dbconfig/20221101-190132-ladsgroup.json [19:01:40] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [19:03:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126', diff saved to https://phabricator.wikimedia.org/P37498 and previous config saved to /var/cache/conftool/dbconfig/20221101-190307-ladsgroup.json [19:04:13] (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [19:04:38] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:08:20] (03CR) 10Ottomata: [C: 03+2] Declare rc0.mediawiki.page_content_change stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851673 (https://phabricator.wikimedia.org/T307959) (owner: 10Ottomata) [19:09:03] (03Merged) 10jenkins-bot: Declare rc0.mediawiki.page_content_change stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851673 (https://phabricator.wikimedia.org/T307959) (owner: 10Ottomata) [19:09:15] ahh jeena i just merged a mediawiki-config change, but i just noticed that the train window is now. it is a total no-op [19:09:59] I've already deployed today's train an hour ago actually [19:10:31] Do you need it backported? [19:10:39] oh okay! no i can deploy it [19:10:46] Okay :) [19:10:51] it's just pre-declaring a new stream so gmodena can work on it later [19:11:05] i also will deploy another that enables some stuff on group0 wikis too, if you don't mind. [19:11:14] thank you! [19:11:43] No problem, that should be fine 👍 [19:11:45] ty [19:11:54] Thanks for checking in!
[19:14:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [19:14:33] (03PS1) 10Ottomata: Enable rc0.mediawiki.page_change on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851705 (https://phabricator.wikimedia.org/T311129) [19:15:30] !log otto@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Declare rc0.mediawiki.page_content_change stream - T307959 T308017 (duration: 03m 42s) [19:15:37] T308017: Design Schema for page state and page state with content (enriched) streams - https://phabricator.wikimedia.org/T308017 [19:15:38] T307959: [Shared Event Platform] Design and Implement POC Flink Service to Combine Existing Streams, Enrich and Output to New Topic - https://phabricator.wikimedia.org/T307959 [19:15:55] (03CR) 10BBlack: [C: 03+1] "Looks right, I think 😊" [puppet] - 10https://gerrit.wikimedia.org/r/851676 (https://phabricator.wikimedia.org/T321137) (owner: 10JHathaway) [19:16:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P37499 and previous config saved to /var/cache/conftool/dbconfig/20221101-191639-ladsgroup.json [19:17:56] (03CR) 10Ottomata: [C: 03+2] Enable rc0.mediawiki.page_change on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851705 (https://phabricator.wikimedia.org/T311129) (owner: 10Ottomata) [19:18:05] (03CR) 10JHathaway: [C: 03+2] aux-k8s: add LVS config for API server [puppet] - 10https://gerrit.wikimedia.org/r/851676 (https://phabricator.wikimedia.org/T321137) (owner: 10JHathaway) [19:18:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126', diff saved to https://phabricator.wikimedia.org/P37500 and previous config saved to /var/cache/conftool/dbconfig/20221101-191815-ladsgroup.json [19:18:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [19:18:24] !log mwdebug-deploy@deploy1002 
helmfile [codfw] START helmfile.d/services/mw-debug: apply [19:19:12] (03Merged) 10jenkins-bot: Enable rc0.mediawiki.page_change on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851705 (https://phabricator.wikimedia.org/T311129) (owner: 10Ottomata) [19:19:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [19:20:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Q1:rerack elastic10[53-67] - https://phabricator.wikimedia.org/T322082 (10RKemper) [19:21:10] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [19:24:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [19:25:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [19:25:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [19:26:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [19:31:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P37501 and previous config saved to /var/cache/conftool/dbconfig/20221101-193148-ladsgroup.json [19:33:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126 (T318605)', diff saved to https://phabricator.wikimedia.org/P37502 and previous config saved to /var/cache/conftool/dbconfig/20221101-193323-ladsgroup.json [19:33:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance [19:33:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance [19:33:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 
0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [19:33:50] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [19:33:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [19:34:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1167 (T318605)', diff saved to https://phabricator.wikimedia.org/P37503 and previous config saved to /var/cache/conftool/dbconfig/20221101-193404-ladsgroup.json [19:36:01] !log otto@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Enable rc0.mediawiki.page_change on group0 wikis - T311129 (duration: 03m 38s) [19:36:06] (ConfdResourceFailed) firing: confd resource _srv_config-master_pybal_eqiad_aux-k8s-ctrl.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [19:37:58] T311129: [Shared Event Platform] Produce new mediawiki.page-change stream from MediaWiki EventBus - https://phabricator.wikimedia.org/T311129 [19:39:15] (KubernetesAPILatency) firing: (9) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:41:05] !log jhathaway@puppetmaster1001 conftool action : set/pooled=yes:weight=1; selector: cluster=aux-k8s,service=kubemaster [19:44:41] (03PS1) 10JHathaway: aux-k8s: enable LVS config for API server [puppet] - 10https://gerrit.wikimedia.org/r/851708 (https://phabricator.wikimedia.org/T321137) [19:46:04] RECOVERY - SSH on mw1307.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:46:06] (ConfdResourceFailed) resolved: confd resource 
_srv_config-master_pybal_eqiad_aux-k8s-ctrl.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [19:46:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T318605)', diff saved to https://phabricator.wikimedia.org/P37504 and previous config saved to /var/cache/conftool/dbconfig/20221101-194655-ladsgroup.json [19:46:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance [19:47:00] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [19:47:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance [19:47:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2166 (T318605)', diff saved to https://phabricator.wikimedia.org/P37505 and previous config saved to /var/cache/conftool/dbconfig/20221101-194718-ladsgroup.json [19:52:03] (03CR) 10JHathaway: [C: 03+2] aux-k8s: enable LVS config for API server [puppet] - 10https://gerrit.wikimedia.org/r/851708 (https://phabricator.wikimedia.org/T321137) (owner: 10JHathaway) [19:55:05] (03PS1) 10Bking: query_service: Ensure prometheus exporter depends on blazegraph service [puppet] - 10https://gerrit.wikimedia.org/r/851711 (https://phabricator.wikimedia.org/T322037) [19:56:42] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/851711 (https://phabricator.wikimedia.org/T322037) (owner: 10Bking) [19:57:11] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37899/console" [puppet] - 10https://gerrit.wikimedia.org/r/851711 (https://phabricator.wikimedia.org/T322037) (owner: 10Bking) [19:59:00] PROBLEM - PyBal IPVS diff check on lvs1019 is 
CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.74:6443]) https://wikitech.wikimedia.org/wiki/PyBal [19:59:22] PROBLEM - PyBal connections to etcd on lvs1020 is CRITICAL: CRITICAL: 118 connections established with conf1007.eqiad.wmnet:4001 (min=119) https://wikitech.wikimedia.org/wiki/PyBal [20:00:04] RoanKattouw, Urbanecm, cjming, and TheresNoTime: (Dis)respected human, time to deploy UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221101T2000). Please do the needful. [20:00:04] MatmaRex and cjming: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:07] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/851711 (https://phabricator.wikimedia.org/T322037) (owner: 10Bking) [20:00:13] o/ i can deploy [20:00:15] sup [20:00:27] (03PS4) 10Clare Ming: Enable DiscussionTools mobile visual enhancements at jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843581 (https://phabricator.wikimedia.org/T318870) (owner: 10Bartosz Dziewoński) [20:00:29] thanks [20:00:58] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ganeti1016.eqiad.wmnet with reason: Remove from cluster for eventual reimage [20:01:01] thanks cjming [20:01:08] np! 
[20:01:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ganeti1016.eqiad.wmnet with reason: Remove from cluster for eventual reimage [20:01:38] PROBLEM - PyBal connections to etcd on lvs1019 is CRITICAL: CRITICAL: 72 connections established with conf1007.eqiad.wmnet:4001 (min=73) https://wikitech.wikimedia.org/wiki/PyBal [20:02:04] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843581 (https://phabricator.wikimedia.org/T318870) (owner: 10Bartosz Dziewoński) [20:02:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (2) Elasticsearch instance elastic1083-production-search-psi-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [20:02:56] (03Merged) 10jenkins-bot: Enable DiscussionTools mobile visual enhancements at jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843581 (https://phabricator.wikimedia.org/T318870) (owner: 10Bartosz Dziewoński) [20:03:14] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.74:6443]) https://wikitech.wikimedia.org/wiki/PyBal [20:03:17] !log cjming@deploy1002 Started scap: Backport for [[gerrit:843581|Enable DiscussionTools mobile visual enhancements at jawiki (T318870)]] [20:03:42] !log cjming@deploy1002 cjming and matmarex: Backport for [[gerrit:843581|Enable DiscussionTools mobile visual enhancements at jawiki (T318870)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [20:03:47] MatmaRex: your 1st patch is up on debug servers - shall i sync? 
[20:04:32] (03PS1) 10DDesouza: Remove Research Incentive survey from enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851713 (https://phabricator.wikimedia.org/T318333) [20:04:45] cjming: yup, looks good [20:05:16] cool - syncing - moving on to your 2nd patch [20:06:43] !log T322037 Disabled puppet across `A:wdqs-all` and `A:wcqs-public` [20:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [20:07:06] (03CR) 10Bking: [C: 03+2] query_service: Ensure prometheus exporter depends on blazegraph service [puppet] - 10https://gerrit.wikimedia.org/r/851711 (https://phabricator.wikimedia.org/T322037) (owner: 10Bking) [20:07:13] T322037: Add blazegraph as systemd dependency of prometheus-blazegraph-exporter service - https://phabricator.wikimedia.org/T322037 [20:07:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [20:07:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [20:07:56] (03PS1) 10DDesouza: Deploy Research Incentive survey on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851714 (https://phabricator.wikimedia.org/T321930) [20:08:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [20:08:49] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:843581|Enable DiscussionTools mobile visual enhancements at jawiki (T318870)]] (duration: 05m 31s) [20:09:53] (03PS3) 10Clare Ming: Enable DiscussionTools visual enhancements beta feature at jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851125 (https://phabricator.wikimedia.org/T318127) (owner: 10Bartosz Dziewoński) [20:09:55] T318870: [Config Change] Enable all DiscussionTools by default at ja.wiki (mobile) - https://phabricator.wikimedia.org/T318870 [20:10:53] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by 
cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851125 (https://phabricator.wikimedia.org/T318127) (owner: 10Bartosz Dziewoński) [20:11:16] RECOVERY - PyBal connections to etcd on lvs1020 is OK: OK: 119 connections established with conf1007.eqiad.wmnet:4001 (min=119) https://wikitech.wikimedia.org/wiki/PyBal [20:11:41] (03Merged) 10jenkins-bot: Enable DiscussionTools visual enhancements beta feature at jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851125 (https://phabricator.wikimedia.org/T318127) (owner: 10Bartosz Dziewoński) [20:12:04] !log cjming@deploy1002 Started scap: Backport for [[gerrit:851125|Enable DiscussionTools visual enhancements beta feature at jawiki (T318127)]] [20:12:12] T318127: [Config Change] Enable Topic Containers as beta feature at Phase 3 wikis (desktop) - https://phabricator.wikimedia.org/T318127 [20:12:27] !log cjming@deploy1002 cjming and matmarex: Backport for [[gerrit:851125|Enable DiscussionTools visual enhancements beta feature at jawiki (T318127)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [20:12:31] MatmaRex: 2nd patch up on debug servers if you want to check [20:13:02] cjming: checked, also looks good [20:13:09] going live [20:14:08] PROBLEM - Check if pdns-recursor.service has been restarted after /etc/powerdns/recursor.conf was changed on doh2001 is CRITICAL: CRITICAL: Service pdns-recursor.service has not been restarted after /etc/powerdns/recursor.conf was changed (stale by 521310.76s).
https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Service_Restart_Check [20:15:08] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [20:15:32] ha [20:16:59] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:851125|Enable DiscussionTools visual enhancements beta feature at jawiki (T318127)]] (duration: 04m 55s) [20:17:15] MatmaRex: both patches should be live! [20:17:23] moving onto my patches next [20:17:34] thanks! [20:17:39] np! [20:18:04] T318127: [Config Change] Enable Topic Containers as beta feature at Phase 3 wikis (desktop) - https://phabricator.wikimedia.org/T318127 [20:18:24] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851128 (https://phabricator.wikimedia.org/T309602) (owner: 10Clare Ming) [20:18:38] (03PS4) 10Clare Ming: Add MP stream for VisualEditorFeatureUse instrument [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851128 (https://phabricator.wikimedia.org/T309602) [20:18:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [20:19:12] PROBLEM - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-blazegraph-exporter-wdqs-blazegraph.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:19:28] RECOVERY - PyBal connections to etcd on lvs1019 is OK: OK: 73 connections established with conf1007.eqiad.wmnet:4001 (min=73) https://wikitech.wikimedia.org/wiki/PyBal [20:19:35] (03CR) 10TrainBranchBot: "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851128 (https://phabricator.wikimedia.org/T309602) (owner: 10Clare Ming) [20:19:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [20:19:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] 
START helmfile.d/services/mw-debug: apply [20:20:20] (03Merged) 10jenkins-bot: Add MP stream for VisualEditorFeatureUse instrument [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851128 (https://phabricator.wikimedia.org/T309602) (owner: 10Clare Ming) [20:20:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [20:20:42] !log cjming@deploy1002 Started scap: Backport for [[gerrit:851128|Add MP stream for VisualEditorFeatureUse instrument (T309602)]] [20:20:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T318605)', diff saved to https://phabricator.wikimedia.org/P37506 and previous config saved to /var/cache/conftool/dbconfig/20221101-202059-ladsgroup.json [20:21:05] !log cjming@deploy1002 cjming and cjming: Backport for [[gerrit:851128|Add MP stream for VisualEditorFeatureUse instrument (T309602)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [20:21:09] m/win 14 [20:21:26] T309602: VisualEditorFeatureUse Migration to MP - https://phabricator.wikimedia.org/T309602 [20:21:34] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [20:22:46] RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [20:23:59] (KubernetesAPILatency) firing: (9) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:24:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T318605)', diff saved to https://phabricator.wikimedia.org/P37507 and previous config saved to /var/cache/conftool/dbconfig/20221101-202449-ladsgroup.json [20:25:18] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:851128|Add MP stream for 
VisualEditorFeatureUse instrument (T309602)]] (duration: 04m 36s) [20:25:39] (03PS1) 10JHathaway: aux-k8s: note that LVS config for API server is production [puppet] - 10https://gerrit.wikimedia.org/r/851717 (https://phabricator.wikimedia.org/T321137) [20:25:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [20:25:50] (03PS2) 10Clare Ming: Update Edit Attempt Step sampling rate to 1 for group 0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851652 (https://phabricator.wikimedia.org/T312016) [20:26:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [20:26:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [20:26:52] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [20:27:27] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851652 (https://phabricator.wikimedia.org/T312016) (owner: 10Clare Ming) [20:27:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [20:28:01] T309602: VisualEditorFeatureUse Migration to MP - https://phabricator.wikimedia.org/T309602 [20:28:32] (03Merged) 10jenkins-bot: Update Edit Attempt Step sampling rate to 1 for group 0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851652 (https://phabricator.wikimedia.org/T312016) (owner: 10Clare Ming) [20:28:54] !log cjming@deploy1002 Started scap: Backport for [[gerrit:851652|Update Edit Attempt Step sampling rate to 1 for group 0 wikis (T312016)]] [20:29:17] !log cjming@deploy1002 cjming and cjming: Backport for [[gerrit:851652|Update Edit Attempt Step sampling rate to 1 for group 0 wikis (T312016)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [20:29:26] RECOVERY - SSH on mw1330.mgmt is OK: 
SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:29:53] T312016: Increase EditAttemptStep sampling rate(s) to 100% - https://phabricator.wikimedia.org/T312016 [20:31:34] (03PS1) 10Jforrester: onSpecialSearchCreateLink: Handle null from Title::newFromText [extensions/WikimediaIncubator] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/851016 (https://phabricator.wikimedia.org/T320736) [20:31:38] (03PS1) 10Jforrester: onSpecialSearchCreateLink: Handle another null from Title::newFromText [extensions/WikimediaIncubator] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/851017 (https://phabricator.wikimedia.org/T320736) [20:32:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [20:33:23] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:851652|Update Edit Attempt Step sampling rate to 1 for group 0 wikis (T312016)]] (duration: 04m 29s) [20:33:42] (03CR) 10JHathaway: [C: 03+2] aux-k8s: note that LVS config for API server is production [puppet] - 10https://gerrit.wikimedia.org/r/851717 (https://phabricator.wikimedia.org/T321137) (owner: 10JHathaway) [20:33:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [20:33:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [20:34:25] !log end of UTC late backport window [20:34:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [20:35:21] RECOVERY - Check systemd state on wdqs1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:36:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P37508 and previous config saved to 
/var/cache/conftool/dbconfig/20221101-203607-ladsgroup.json [20:37:45] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 131 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:39:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P37509 and previous config saved to /var/cache/conftool/dbconfig/20221101-203957-ladsgroup.json [20:44:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:45:07] RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: (C)100 gt (W)50 gt 14 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:49:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:51:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P37510 and previous config saved to /var/cache/conftool/dbconfig/20221101-205115-ladsgroup.json [20:53:07] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes 
https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [20:53:42] (03PS1) 10Bking: Revert "query_service: Ensure prometheus exporter depends on blazegraph service" [puppet] - 10https://gerrit.wikimedia.org/r/851018 [20:55:02] (03CR) 10Bking: [C: 03+2] Revert "query_service: Ensure prometheus exporter depends on blazegraph service" [puppet] - 10https://gerrit.wikimedia.org/r/851018 (owner: 10Bking) [20:55:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P37511 and previous config saved to /var/cache/conftool/dbconfig/20221101-205505-ladsgroup.json [20:56:15] !log T322037 Re-enabled puppet across `A:wdqs-all` and `A:wcqs-public` [20:56:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:30] T322037: Add blazegraph as systemd dependency of prometheus-blazegraph-exporter service - https://phabricator.wikimedia.org/T322037 [21:02:00] (03CR) 10Jon Harald Søby: [C: 03+1] onSpecialSearchCreateLink: Handle null from Title::newFromText [extensions/WikimediaIncubator] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/851016 (https://phabricator.wikimedia.org/T320736) (owner: 10Jforrester) [21:02:05] (03CR) 10Jon Harald Søby: [C: 03+1] onSpecialSearchCreateLink: Handle another null from Title::newFromText [extensions/WikimediaIncubator] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/851017 (https://phabricator.wikimedia.org/T320736) (owner: 10Jforrester) [21:06:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T318605)', diff saved to https://phabricator.wikimedia.org/P37512 and previous config saved to /var/cache/conftool/dbconfig/20221101-210622-ladsgroup.json [21:06:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance [21:06:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance [21:06:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2167:3318 (T318605)', diff saved to https://phabricator.wikimedia.org/P37513 and previous config saved to /var/cache/conftool/dbconfig/20221101-210658-ladsgroup.json [21:07:37] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [21:10:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T318605)', diff saved to https://phabricator.wikimedia.org/P37514 and previous config saved to /var/cache/conftool/dbconfig/20221101-211013-ladsgroup.json [21:10:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [21:10:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [21:15:27] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:17:41] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 124 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:19:39] RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: (C)100 gt (W)50 gt 47 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:20:36] !log jhathaway@cumin1001 START - Cookbook sre.ganeti.makevm for new host aux-k8s-worker1001.eqiad.wmnet [21:20:37] !log jhathaway@cumin1001 START - Cookbook sre.dns.netbox [21:21:23] PROBLEM - Check systemd state on an-launcher1002 is 
CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:23:51] (03PS4) 10Jdlrobson: WIP: Fix remaining Wikipedia logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849175 (https://phabricator.wikimedia.org/T319223) [21:27:31] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 117 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:28:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:28:23] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:28:23] !log jhathaway@cumin1001 START - Cookbook sre.dns.wipe-cache aux-k8s-worker1001.eqiad.wmnet on all recursors [21:28:26] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) aux-k8s-worker1001.eqiad.wmnet on all recursors [21:28:51] (03PS1) 10Clare Ming: Update config for Metrics Platform VEFU events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851723 (https://phabricator.wikimedia.org/T309602) [21:32:58] (03CR) 10Clare Ming: "gah - somehow during a rebase, i accidentally added the MP stream for vefu events to group0 wikis." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/851723 (https://phabricator.wikimedia.org/T309602) (owner: 10Clare Ming) [21:33:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:33:25] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 137 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:34:13] PROBLEM - Swift https backend on ms-fe1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [21:34:24] (03CR) 10Cwhite: [C: 03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/850154 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [21:35:39] PROBLEM - Swift https frontend on ms-fe1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [21:35:50] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/851672 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [21:37:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:37:23] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 241 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops 
[21:37:31] RECOVERY - Swift https frontend on ms-fe1009 is OK: HTTP OK: HTTP/1.1 200 OK - 185 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Swift [21:38:03] RECOVERY - Swift https backend on ms-fe1009 is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 0.393 second response time https://wikitech.wikimedia.org/wiki/Swift [21:38:46] (03PS1) 10JHathaway: aux-k8s: add partman config for workers [puppet] - 10https://gerrit.wikimedia.org/r/851724 (https://phabricator.wikimedia.org/T321137) [21:39:15] (03PS2) 10Clare Ming: Update config for Metrics Platform VEFU events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851723 (https://phabricator.wikimedia.org/T309602) [21:39:55] PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [21:40:32] (03CR) 10JHathaway: [C: 03+2] aux-k8s: add partman config for workers [puppet] - 10https://gerrit.wikimedia.org/r/851724 (https://phabricator.wikimedia.org/T321137) (owner: 10JHathaway) [21:40:57] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [21:41:17] PROBLEM - Swift https backend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.225 second response time https://wikitech.wikimedia.org/wiki/Swift [21:41:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:41:45] RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Swift [21:42:19] PROBLEM - MediaWiki exceptions and fatals per minute for 
api_appserver on alert1001 is CRITICAL: 163 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:42:58] looking [21:43:09] RECOVERY - Swift https backend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.057 second response time https://wikitech.wikimedia.org/wiki/Swift [21:43:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318 (T318605)', diff saved to https://phabricator.wikimedia.org/P37515 and previous config saved to /var/cache/conftool/dbconfig/20221101-214311-ladsgroup.json [21:43:17] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [21:43:17] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe1010.eqiad.wmnet, ms-fe1009.eqiad.wmnet are marked down but pooled: swift_80: Servers ms-fe1011.eqiad.wmnet, ms-fe1009.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:43:53] PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [21:45:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:45:15] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:45:35] (FrontendUnavailable) firing: HAProxy (cache_upload) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [21:45:57] really high swift latency [21:46:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - 
https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:46:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance [21:46:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance [21:46:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1172 (T318605)', diff saved to https://phabricator.wikimedia.org/P37516 and previous config saved to /var/cache/conftool/dbconfig/20221101-214659-ladsgroup.json [21:47:18] (ProbeDown) firing: (2) Service swift-https:443 has failed probes (http_swift-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:47:19] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe1012.eqiad.wmnet, ms-fe1011.eqiad.wmnet are marked down but pooled: swift_80: Servers ms-fe1012.eqiad.wmnet, ms-fe1011.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:47:53] PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [21:48:15] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 276 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:48:21] any swift experts around? 
[21:48:55] PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.260 second response time https://wikitech.wikimedia.org/wiki/Swift [21:48:56] (03PS3) 10Clare Ming: testwiki: Add mediawiki.visual_editor_feature_use stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851723 (https://phabricator.wikimedia.org/T309602) [21:49:15] PROBLEM - Swift https backend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.270 second response time https://wikitech.wikimedia.org/wiki/Swift [21:49:17] PROBLEM - Swift https frontend on ms-fe1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [21:49:18] (ProbeDown) firing: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:49:31] PROBLEM - Docker registry HTTPS interface on registry1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker [21:49:34] (FrontendUnavailable) firing: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [21:49:35] PROBLEM - Swift https frontend on ms-fe1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [21:49:59] PROBLEM - Swift https backend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.178 second response time https://wikitech.wikimedia.org/wiki/Swift [21:50:37] (03CR) 10Clare Ming: testwiki: Add mediawiki.visual_editor_feature_use stream (031 comment) [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/851723 (https://phabricator.wikimedia.org/T309602) (owner: 10Clare Ming) [21:50:41] PROBLEM - Docker registry HTTPS interface on registry1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker [21:50:59] PROBLEM - Check systemd state on ms-fe1009 is CRITICAL: CRITICAL - degraded: The following units failed: swift-account-stats_analytics:admin.service,swift-account-stats_docker:registry.service,swift-account-stats_mw:media.service,swift-container-stats_mw-media.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:51:07] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:51:13] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe1012.eqiad.wmnet, ms-fe1010.eqiad.wmnet, ms-fe1011.eqiad.wmnet, ms-fe1009.eqiad.wmnet are marked down but pooled: swift_80: Servers ms-fe1012.eqiad.wmnet, ms-fe1010.eqiad.wmnet, ms-fe1011.eqiad.wmnet, ms-fe1009.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:52:03] PROBLEM - Swift https backend on ms-fe1009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 7.170 second response time https://wikitech.wikimedia.org/wiki/Swift [21:52:07] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host aux-k8s-worker1001.eqiad.wmnet [21:53:19] !log jhathaway@cumin1001 START - Cookbook sre.ganeti.makevm for new host aux-k8s-worker1002.eqiad.wmnet [21:53:20] !log jhathaway@cumin1001 START - Cookbook sre.dns.netbox [21:54:18] (ProbeDown) firing: (2) Service docker-registry:443 has failed probes (http_docker-registry_ip4) #page - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:54:59] PROBLEM - Check systemd state on ms-fe1009 is CRITICAL: CRITICAL - degraded: The following units failed: swift-container-stats_mw-media.service,swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:55:35] RECOVERY - Swift https frontend on ms-fe1009 is OK: HTTP OK: HTTP/1.1 200 OK - 185 bytes in 9.555 second response time https://wikitech.wikimedia.org/wiki/Swift [21:55:51] PROBLEM - Docker registry health on registry1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - pattern not found - 224 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Docker [21:57:18] (ProbeDown) firing: (3) Service docker-registry:443 has failed probes (http_docker-registry_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:58:01] RECOVERY - Swift https backend on ms-fe1009 is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 6.249 second response time https://wikitech.wikimedia.org/wiki/Swift [21:58:11] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 101 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:58:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318', diff saved to https://phabricator.wikimedia.org/P37517 and previous config saved to /var/cache/conftool/dbconfig/20221101-215820-ladsgroup.json [21:59:18] (ProbeDown) firing: (3) Service docker-registry:443 has failed probes (http_docker-registry_ip4) #page - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:59:48] (03PS5) 10Jdlrobson: Fix remaining Wikipedia logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849175 (https://phabricator.wikimedia.org/T319223) [21:59:51] RECOVERY - Docker registry health on registry1004 is OK: HTTP OK: HTTP/1.1 200 OK - 143 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Docker [22:00:07] PROBLEM - Docker registry health on registry1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - pattern not found - 228 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Docker [22:02:18] (ProbeDown) firing: (3) Service docker-registry:443 has failed probes (http_docker-registry_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:02:27] RECOVERY - Docker registry HTTPS interface on registry1004 is OK: HTTP OK: HTTP/1.1 200 OK - 3755 bytes in 0.624 second response time https://wikitech.wikimedia.org/wiki/Docker [22:04:19] (ProbeDown) firing: (3) Service docker-registry:443 has failed probes (http_docker-registry_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:06:05] RECOVERY - Docker registry health on registry1003 is OK: HTTP OK: HTTP/1.1 200 OK - 143 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Docker [22:07:22] 10SRE-swift-storage, 10Community-Tech, 10MediaWiki-extensions-Phonos, 10MW-1.40-notes (1.40.0-wmf.6; 2022-10-17): Establish Phonos production storage requirements - https://phabricator.wikimedia.org/T320675 (10Eevans) Since {T320835} appears to be in jeopardy (see: [[ https://phabricator.wikimedia.org/T320... 
[22:07:33] PROBLEM - Swift https frontend on ms-fe1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [22:08:35] PROBLEM - Docker registry HTTPS interface on registry1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker [22:08:56] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:08:56] !log jhathaway@cumin1001 START - Cookbook sre.dns.wipe-cache aux-k8s-worker1002.eqiad.wmnet on all recursors [22:09:00] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) aux-k8s-worker1002.eqiad.wmnet on all recursors [22:09:00] (JobUnavailable) firing: Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:09:18] (ProbeDown) firing: (3) Service docker-registry:443 has failed probes (http_docker-registry_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:10:05] PROBLEM - Swift https backend on ms-fe1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [22:13:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318', diff saved to https://phabricator.wikimedia.org/P37518 and previous config saved to /var/cache/conftool/dbconfig/20221101-221328-ladsgroup.json [22:14:07] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 122 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:16:29] RECOVERY - Docker registry HTTPS interface on 
registry1004 is OK: HTTP OK: HTTP/1.1 200 OK - 3755 bytes in 3.194 second response time https://wikitech.wikimedia.org/wiki/Docker [22:18:31] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:18:53] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:19:18] (ProbeDown) firing: (3) Service docker-registry:443 has failed probes (http_docker-registry_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:19:29] RECOVERY - Swift https frontend on ms-fe1009 is OK: HTTP OK: HTTP/1.1 200 OK - 185 bytes in 6.078 second response time https://wikitech.wikimedia.org/wiki/Swift [22:20:02] !log jhathaway@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=swift,name=eqiad [22:20:43] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8647 bytes in 5.025 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:20:49] PROBLEM - Check systemd state on ms-fe1009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:22:15] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48974 bytes in 0.091 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:22:29] PROBLEM - Docker registry HTTPS interface on registry1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker [22:22:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T318605)', diff saved to https://phabricator.wikimedia.org/P37519 and previous config saved to 
/var/cache/conftool/dbconfig/20221101-222247-ladsgroup.json [22:23:38] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [22:24:01] PROBLEM - Swift https backend on ms-fe1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [22:24:18] (ProbeDown) firing: (2) Service docker-registry:443 has failed probes (http_docker-registry_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:26:19] RECOVERY - Docker registry HTTPS interface on registry1004 is OK: HTTP OK: HTTP/1.1 200 OK - 3755 bytes in 0.131 second response time https://wikitech.wikimedia.org/wiki/Docker [22:26:47] PROBLEM - Check systemd state on ms-fe1009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service,swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:27:09] RECOVERY - Docker registry HTTPS interface on registry1003 is OK: HTTP OK: HTTP/1.1 200 OK - 3755 bytes in 0.114 second response time https://wikitech.wikimedia.org/wiki/Docker [22:27:19] (ProbeDown) firing: (3) Service docker-registry:443 has failed probes (http_docker-registry_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:27:51] RECOVERY - Swift https backend on ms-fe1009 is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Swift [22:28:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318 (T318605)', diff saved to https://phabricator.wikimedia.org/P37520 and previous config saved to /var/cache/conftool/dbconfig/20221101-222835-ladsgroup.json [22:28:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 
1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance [22:28:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance [22:28:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3318 (T318605)', diff saved to https://phabricator.wikimedia.org/P37521 and previous config saved to /var/cache/conftool/dbconfig/20221101-222858-ladsgroup.json [22:29:13] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [22:29:18] (ProbeDown) resolved: (2) Service docker-registry:443 has failed probes (http_docker-registry_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:29:35] (FrontendUnavailable) resolved: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [22:30:35] (FrontendUnavailable) resolved: HAProxy (cache_upload) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [22:31:03] RECOVERY - Swift https frontend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Swift [22:31:03] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:32:18] (ProbeDown) resolved: (2) Service docker-registry:443 has failed probes (http_docker-registry_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:32:35] 
!log rolling restart of eqiad swift front-ends [22:32:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:38] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host aux-k8s-worker1002.eqiad.wmnet [22:33:01] RECOVERY - Swift https backend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.037 second response time https://wikitech.wikimedia.org/wiki/Swift [22:33:05] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:33:35] RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Swift [22:33:35] RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Swift [22:33:45] RECOVERY - Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Swift [22:34:41] PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [22:34:43] RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Swift [22:34:53] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_compress_logs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:35:59] RECOVERY - MediaWiki exceptions and fatals per minute for 
api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:36:43] RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [22:37:01] RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:37:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P37522 and previous config saved to /var/cache/conftool/dbconfig/20221101-223754-ladsgroup.json [22:43:21] !log krinkle@deploy1002 Started deploy [integration/docroot@2ddd7d9]: (no justification provided) [22:43:54] !log krinkle@deploy1002 Finished deploy [integration/docroot@2ddd7d9]: (no justification provided) (duration: 00m 33s) [22:53:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P37523 and previous config saved to /var/cache/conftool/dbconfig/20221101-225303-ladsgroup.json [22:55:02] !log depool ms-fe2009 [22:55:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:59:13] (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [23:00:29] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:00:35] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:02:05] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:04:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318 (T318605)', diff saved to https://phabricator.wikimedia.org/P37524 and previous config saved to /var/cache/conftool/dbconfig/20221101-230411-ladsgroup.json [23:04:13] (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:04:19] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.295 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:04:33] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [23:05:55] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48976 bytes in 1.279 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:06:35] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:08:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T318605)', diff saved to https://phabricator.wikimedia.org/P37525 and previous config saved to 
/var/cache/conftool/dbconfig/20221101-230811-ladsgroup.json [23:08:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance [23:08:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance [23:08:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1177 (T318605)', diff saved to https://phabricator.wikimedia.org/P37526 and previous config saved to /var/cache/conftool/dbconfig/20221101-230833-ladsgroup.json [23:08:45] PROBLEM - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:10:27] RECOVERY - Check systemd state on ms-fe1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:18:31] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:19:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318', diff saved to https://phabricator.wikimedia.org/P37527 and previous config saved to /var/cache/conftool/dbconfig/20221101-231919-ladsgroup.json [23:25:06] (PS10) BCornwall: varnish: Conditionally set WMF-Last-Access cookie [puppet] - https://gerrit.wikimedia.org/r/849184 (https://phabricator.wikimedia.org/T262996) [23:30:01] (PS11) BCornwall: varnish: Conditionally set WMF-Last-Access cookie [puppet] - https://gerrit.wikimedia.org/r/849184 (https://phabricator.wikimedia.org/T262996) [23:30:27] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:33:07] (CR) BCornwall: "0 tests failed, 0 tests skipped, 34 tests passed" [puppet] - 
https://gerrit.wikimedia.org/r/849184 (https://phabricator.wikimedia.org/T262996) (owner: BCornwall) [23:34:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318', diff saved to https://phabricator.wikimedia.org/P37528 and previous config saved to /var/cache/conftool/dbconfig/20221101-233427-ladsgroup.json [23:36:25] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:43:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T318605)', diff saved to https://phabricator.wikimedia.org/P37529 and previous config saved to /var/cache/conftool/dbconfig/20221101-234346-ladsgroup.json [23:43:57] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [23:49:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318 (T318605)', diff saved to https://phabricator.wikimedia.org/P37530 and previous config saved to /var/cache/conftool/dbconfig/20221101-234935-ladsgroup.json [23:49:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance [23:49:43] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [23:49:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance [23:49:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2181 (T318605)', diff saved to https://phabricator.wikimedia.org/P37531 and previous config saved to /var/cache/conftool/dbconfig/20221101-234957-ladsgroup.json [23:58:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P37532 and previous config 
saved to /var/cache/conftool/dbconfig/20221101-235853-ladsgroup.json