[00:08:29] (PS1) Cwhite: logstash remove wikifunctions response field [puppet] - https://gerrit.wikimedia.org/r/942799 (https://phabricator.wikimedia.org/T180051)
[00:09:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P49861 and previous config saved to /var/cache/conftool/dbconfig/20230801-000948-ladsgroup.json
[00:11:58] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[00:12:14] (CR) Cwhite: [C: +2] logstash remove wikifunctions response field [puppet] - https://gerrit.wikimedia.org/r/942799 (https://phabricator.wikimedia.org/T180051) (owner: Cwhite)
[00:12:22] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:14:16] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add new cloud nodes DNS and switch config - pt1979@cumin2002"
[00:15:01] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add new cloud nodes DNS and switch config - pt1979@cumin2002"
[00:15:01] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[00:15:03] SRE, ops-codfw, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt200[4-6]-dev - https://phabricator.wikimedia.org/T342459 (Papaul)
[00:15:22] SRE, ops-codfw, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt200[4-6]-dev - https://phabricator.wikimedia.org/T342459 (Papaul)
[00:15:40] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:16:28] SRE, ops-codfw, DC-Ops, User-aborrero, cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol200[6-8]-dev, cloudnet200[7-8]-dev - https://phabricator.wikimedia.org/T342456 (Papaul)
[00:17:54] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host cloudcontrol2006-dev.mgmt.codfw.wmnet with reboot policy FORCED
[00:20:06] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:21:29] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host cloudcontrol2007-dev.mgmt.codfw.wmnet with reboot policy FORCED
[00:24:12] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:24:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P49862 and previous config saved to /var/cache/conftool/dbconfig/20230801-002454-ladsgroup.json
[00:26:12] PROBLEM - cinder-api http on cloudcontrol1005 is CRITICAL: connect to address 10.64.151.3 and port 18776: Connection refused https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[00:26:54] PROBLEM - cinder-scheduler process on cloudcontrol1005 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python.* /usr/bin/cinder-scheduler https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[00:27:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:32:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:35:14] RECOVERY - cinder-api http on cloudcontrol1005 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 663 bytes in 0.015 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[00:35:58] RECOVERY - cinder-scheduler process on cloudcontrol1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python.* /usr/bin/cinder-scheduler https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[00:38:34] ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (phaultfinder)
[00:38:37] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/942800
[00:38:43] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/942800 (owner: TrainBranchBot)
[00:40:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T342617)', diff saved to https://phabricator.wikimedia.org/P49863 and previous config saved to /var/cache/conftool/dbconfig/20230801-004000-ladsgroup.json
[00:40:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[00:40:06] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[00:40:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[00:54:07] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2006-dev.mgmt.codfw.wmnet with reboot policy FORCED
[00:55:06] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/942800 (owner: TrainBranchBot)
[00:58:18] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2007-dev.mgmt.codfw.wmnet with reboot policy FORCED
[01:00:23] SRE, ops-codfw, DC-Ops, User-aborrero, cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol200[6-8]-dev, cloudnet200[7-8]-dev - https://phabricator.wikimedia.org/T342456 (Papaul)
[01:03:35] ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T343180 (phaultfinder)
[01:44:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T342617)', diff saved to https://phabricator.wikimedia.org/P49864 and previous config saved to /var/cache/conftool/dbconfig/20230801-014452-ladsgroup.json
[01:44:55] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[01:52:24] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:52:38] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 82, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:59:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P49865 and previous config saved to /var/cache/conftool/dbconfig/20230801-015958-ladsgroup.json
[02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230801T0200)
[02:06:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:07:12] (PS1) TrainBranchBot: Branch commit for wmf/1.41.0-wmf.20 [core] (wmf/1.41.0-wmf.20) - https://gerrit.wikimedia.org/r/942801 (https://phabricator.wikimedia.org/T340248)
[02:07:18] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.41.0-wmf.20 [core] (wmf/1.41.0-wmf.20) - https://gerrit.wikimedia.org/r/942801 (https://phabricator.wikimedia.org/T340248) (owner: TrainBranchBot)
[02:11:33] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:15:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P49866 and previous config saved to /var/cache/conftool/dbconfig/20230801-021504-ladsgroup.json
[02:18:34] PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:19:30] PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:22:03] (Merged) jenkins-bot: Branch commit for wmf/1.41.0-wmf.20 [core] (wmf/1.41.0-wmf.20) - https://gerrit.wikimedia.org/r/942801 (https://phabricator.wikimedia.org/T340248) (owner: TrainBranchBot)
[02:28:21] SRE, Wikimedia-Site-requests, serviceops, Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (Vladis13) >>! In T275319#9057445, @Reedy wrote: > None of this is helping move the discussion forward. > > Timo's comment in T27...
[02:30:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T342617)', diff saved to https://phabricator.wikimedia.org/P49867 and previous config saved to /var/cache/conftool/dbconfig/20230801-023010-ladsgroup.json
[02:30:14] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[02:30:32] RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:31:33] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:38:35] ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (phaultfinder)
[02:46:32] RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:00:04] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230801T0300)
[03:01:27] (PS1) TrainBranchBot: testwikis wikis to 1.41.0-wmf.20 [mediawiki-config] - https://gerrit.wikimedia.org/r/943654 (https://phabricator.wikimedia.org/T340248)
[03:01:29] (CR) TrainBranchBot: [C: +2] testwikis wikis to 1.41.0-wmf.20 [mediawiki-config] - https://gerrit.wikimedia.org/r/943654 (https://phabricator.wikimedia.org/T340248) (owner: TrainBranchBot)
[03:02:11] (Merged) jenkins-bot: testwikis wikis to 1.41.0-wmf.20 [mediawiki-config] - https://gerrit.wikimedia.org/r/943654 (https://phabricator.wikimedia.org/T340248) (owner: TrainBranchBot)
[03:02:44] !log mwpresync@deploy1002 Started scap: testwikis wikis to 1.41.0-wmf.20 refs T340248
[03:02:48] T340248: 1.41.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T340248
[03:33:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:35:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[03:38:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:40:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[03:54:50] !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.41.0-wmf.20 refs T340248 (duration: 52m 06s)
[03:54:54] T340248: 1.41.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T340248
[03:57:02] !log mwpresync@deploy1002 Pruned MediaWiki: 1.41.0-wmf.18 (duration: 02m 09s)
[04:49:25] (PS2) KartikMistry: cxserver: Remove Youdao MT service [deployment-charts] - https://gerrit.wikimedia.org/r/942748 (https://phabricator.wikimedia.org/T329137)
[04:59:17] * kart_ updating cxserver..
[05:00:04] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:00:35] cxserver-staging-tls-proxy: image: docker-registry.discovery.wmnet/envoy:1.23.10-1 -> image: docker-registry.discovery.wmnet/envoy:1.23.10-2 -- is it OK to go ahead with this? Change applied, not deployed.
[05:03:36] Seems mesh change. I'll go ahead.
[05:04:36] (CR) KartikMistry: [C: +2] Update cxserver to 2023-07-13-063245-production [deployment-charts] - https://gerrit.wikimedia.org/r/937578 (https://phabricator.wikimedia.org/T340953) (owner: Santhosh)
[05:05:21] (Merged) jenkins-bot: Update cxserver to 2023-07-13-063245-production [deployment-charts] - https://gerrit.wikimedia.org/r/937578 (https://phabricator.wikimedia.org/T340953) (owner: Santhosh)
[05:06:39] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply
[05:07:00] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[05:12:20] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[05:12:54] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[05:13:37] (PS5) Slyngshede: Facter: PHP Version [puppet] - https://gerrit.wikimedia.org/r/942628 (https://phabricator.wikimedia.org/T271196)
[05:14:13] (CR) CI reject: [V: -1] Facter: PHP Version [puppet] - https://gerrit.wikimedia.org/r/942628 (https://phabricator.wikimedia.org/T271196) (owner: Slyngshede)
[05:15:16] !log dbmaint s4 testcommonswiki eqiad T343175
[05:15:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:15:19] T343175: Remove old fields 'cuc_user' and 'cuc_user_text' as well as index 'cuc_user_ip_time' from a few production wikis - https://phabricator.wikimedia.org/T343175
[05:16:29] !log dbmaint s4 labswiki (wikitech) eqiad T343175
[05:16:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:18:29] !log dbmaint s4 testcommonswiki eqiad T343174
[05:18:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:18:32] T343174: Add missing column cuc_only_for_read_old to testcommonswiki - https://phabricator.wikimedia.org/T343174
[05:21:34] (PS6) Slyngshede: Facter: PHP Version [puppet] - https://gerrit.wikimedia.org/r/942628 (https://phabricator.wikimedia.org/T271196)
[05:22:10] (CR) CI reject: [V: -1] Facter: PHP Version [puppet] - https://gerrit.wikimedia.org/r/942628 (https://phabricator.wikimedia.org/T271196) (owner: Slyngshede)
[05:23:10] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 129 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:23:49] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[05:24:36] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[05:26:25] (PS3) KartikMistry: cxserver: Remove Youdao MT service [deployment-charts] - https://gerrit.wikimedia.org/r/942748 (https://phabricator.wikimedia.org/T329137)
[05:26:54] !log Updated cxserver to 2023-07-13-063245-production (T340953)
[05:26:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:26:58] T340953: Enable MinT for all the remaining languages supported by NLLB-200 - https://phabricator.wikimedia.org/T340953
[05:30:16] I'm doing another cxserver deployment.
[05:32:43] (CR) KartikMistry: [C: +2] cxserver: Remove Youdao MT service [deployment-charts] - https://gerrit.wikimedia.org/r/942748 (https://phabricator.wikimedia.org/T329137) (owner: KartikMistry)
[05:33:26] (Merged) jenkins-bot: cxserver: Remove Youdao MT service [deployment-charts] - https://gerrit.wikimedia.org/r/942748 (https://phabricator.wikimedia.org/T329137) (owner: KartikMistry)
[05:34:32] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:36:11] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply
[05:36:32] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[05:41:03] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[05:41:40] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[05:45:02] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:45:42] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[05:46:20] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[05:48:22] !log cxserver: Remove Youdao MT service (T329137)
[05:48:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:48:25] T329137: Deprecate Youdao MT service - https://phabricator.wikimedia.org/T329137
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230801T0600)
[06:00:05] kormat, marostegui, and Amir1: That opportune time is upon us again. Time for a Primary database switchover deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230801T0600).
[06:16:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:21:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:21:37] (CR) Elukey: [V: +1 C: +2] role::kafka::jumbo: apply thread settings [puppet] - https://gerrit.wikimedia.org/r/941840 (owner: Elukey)
[06:24:28] !log elukey@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-jumbo-eqiad cluster: Roll restart of jvm daemons.
[06:25:22] (PS7) Slyngshede: Facter: PHP Version [puppet] - https://gerrit.wikimedia.org/r/942628 (https://phabricator.wikimedia.org/T271196)
[06:29:26] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:30:26] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:31:46] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50420 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:31:51] (PS8) Slyngshede: Facter: PHP Version [puppet] - https://gerrit.wikimedia.org/r/942628 (https://phabricator.wikimedia.org/T271196)
[06:32:18] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8646 bytes in 0.272 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:32:35] (CR) Giuseppe Lavagetto: [C: +1] services: upgrade changeprop instances to Buster [deployment-charts] - https://gerrit.wikimedia.org/r/943037 (https://phabricator.wikimedia.org/T341140) (owner: Elukey)
[06:33:20] (CR) Giuseppe Lavagetto: [C: +1] changeprop: allow to tune monitoring container's resources [deployment-charts] - https://gerrit.wikimedia.org/r/943038 (https://phabricator.wikimedia.org/T328683) (owner: Elukey)
[06:34:06] (CR) Giuseppe Lavagetto: [C: +1] services: shift changeprop's cpu resources from main app to the prometheus [deployment-charts] - https://gerrit.wikimedia.org/r/943039 (https://phabricator.wikimedia.org/T328683) (owner: Elukey)
[06:41:43] (CR) Elukey: [C: +2] services: upgrade changeprop instances to Buster [deployment-charts] - https://gerrit.wikimedia.org/r/943037 (https://phabricator.wikimedia.org/T341140) (owner: Elukey)
[06:54:26] !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop: sync
[06:54:50] !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync
[07:00:05] Amir1, Urbanecm, and taavi: Your horoscope predicts another unfortunate UTC morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230801T0700).
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:07:01] !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: sync
[07:07:24] !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync
[07:13:50] (PS2) Elukey: eventgate: set a more performant default for queue.buffering.max.ms [deployment-charts] - https://gerrit.wikimedia.org/r/937432 (https://phabricator.wikimedia.org/T338357)
[07:15:38] RECOVERY - Check systemd state on db2114 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:15:44] (CR) Elukey: [C: +2] changeprop: allow to tune monitoring container's resources [deployment-charts] - https://gerrit.wikimedia.org/r/943038 (https://phabricator.wikimedia.org/T328683) (owner: Elukey)
[07:15:53] (PS6) Elukey: services: shift changeprop's cpu resources from main app to the prometheus [deployment-charts] - https://gerrit.wikimedia.org/r/943039 (https://phabricator.wikimedia.org/T328683)
[07:17:15] (CR) Elukey: [C: +2] services: shift changeprop's cpu resources from main app to the prometheus [deployment-charts] - https://gerrit.wikimedia.org/r/943039 (https://phabricator.wikimedia.org/T328683) (owner: Elukey)
[07:18:58] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/943579 (owner: Elukey)
[07:22:28] (CR) Elukey: [C: +2] aptrepo: add new key for ROCm repositories [puppet] - https://gerrit.wikimedia.org/r/943579 (owner: Elukey)
[07:23:36] (CR) Filippo Giunchedi: profile::pyrra::api: create profile (2 comments) [puppet] - https://gerrit.wikimedia.org/r/929729 (https://phabricator.wikimedia.org/T302995) (owner: Herron)
[07:30:43] (PS1) Muehlenhoff: Remove access for ntsako [puppet] - https://gerrit.wikimedia.org/r/944150
[07:34:27] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/941398 (https://phabricator.wikimedia.org/T338460) (owner: Jelto)
[07:34:49] (CR) Muehlenhoff: [C: +2] Remove access for ntsako [puppet] - https://gerrit.wikimedia.org/r/944150 (owner: Muehlenhoff)
[07:36:08] (CR) Filippo Giunchedi: [C: +1] pyrra: deploy to thanos-fe hosts [puppet] - https://gerrit.wikimedia.org/r/929734 (https://phabricator.wikimedia.org/T302995) (owner: Herron)
[07:36:14] !log root@cumin2002 START - Cookbook sre.idm.logout Logging Nmaphophe out of all services on: 1277 hosts
[07:36:16] (CR) Filippo Giunchedi: [C: +1] profile::pyrra::filesystem: add profile [puppet] - https://gerrit.wikimedia.org/r/929731 (https://phabricator.wikimedia.org/T302995) (owner: Herron)
[07:37:04] !log root@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Nmaphophe out of all services on: 1277 hosts
[07:37:22] !log root@cumin2002 START - Cookbook sre.idm.logout Logging Nmaphophe out of all services on: 24 hosts
[07:37:27] !log root@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Nmaphophe out of all services on: 24 hosts
[07:37:33] !log root@cumin2002 START - Cookbook sre.idm.logout Logging Nmaphophe out of all services on: 732 hosts
[07:37:47] !log root@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Nmaphophe out of all services on: 732 hosts
[07:41:13] !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: sync
[07:41:25] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: sync
[07:42:44] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to restricted for dbrant - https://phabricator.wikimedia.org/T343122 (MoritzMuehlenhoff) @thcipriani This needs your signoff
[07:44:08] !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop: sync
[07:44:23] !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync
[07:49:30] !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: sync
[07:49:43] !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync
[07:51:32] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:56:20] (CR) Vgutierrez: [C: +1] varnish: add requestctl to X-analytics for static actions too [puppet] - https://gerrit.wikimedia.org/r/941448 (https://phabricator.wikimedia.org/T342577) (owner: Giuseppe Lavagetto)
[08:15:02] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:16:41] (PS1) Elukey: services: add higher limits for cp-jobqueue's exporter [deployment-charts] - https://gerrit.wikimedia.org/r/944155 (https://phabricator.wikimedia.org/T328683)
[08:17:41] !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: sync
[08:17:53] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: sync
[08:20:15] jouncebot: nowandnext
[08:20:15] No deployments scheduled for the next 1 hour(s) and 39 minute(s)
[08:20:15] In 1 hour(s) and 39 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230801T1000)
[08:20:26] (PS3) Urbanecm: GrowthExperiments: enable AddLink task frontend in 10th round of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/940347 (https://phabricator.wikimedia.org/T308135) (owner: Sergio Gimeno)
[08:20:31] (CR) Urbanecm: [C: +2] GrowthExperiments: enable AddLink task frontend in 10th round of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/940347 (https://phabricator.wikimedia.org/T308135) (owner: Sergio Gimeno)
[08:21:37] (Merged) jenkins-bot: GrowthExperiments: enable AddLink task frontend in 10th round of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/940347 (https://phabricator.wikimedia.org/T308135) (owner: Sergio Gimeno)
[08:21:53] (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/940347 (https://phabricator.wikimedia.org/T308135) (owner: Sergio Gimeno)
[08:22:27] !log installing Linux 4.19.289 on Buster hosts
[08:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:38] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:940347|GrowthExperiments: enable AddLink task frontend in 10th round of wikis (T308135)]]
[08:22:41] T308135: Deploy "add a link" to 10th round of wikis - https://phabricator.wikimedia.org/T308135
[08:24:20] !log urbanecm@deploy1002 sgimeno and urbanecm: Backport for [[gerrit:940347|GrowthExperiments: enable AddLink task frontend in 10th round of wikis (T308135)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[08:27:31] !log urbanecm@deploy1002 sgimeno and urbanecm: Continuing with sync
[08:27:39] oh, a new log entry!
[08:29:24] !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: sync
[08:30:05] !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync
[08:32:17] (CR) Giuseppe Lavagetto: [C: +1] services: add higher limits for cp-jobqueue's exporter [deployment-charts] - https://gerrit.wikimedia.org/r/944155 (https://phabricator.wikimedia.org/T328683) (owner: Elukey)
[08:33:30] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:940347|GrowthExperiments: enable AddLink task frontend in 10th round of wikis (T308135)]] (duration: 10m 52s)
[08:33:33] T308135: Deploy "add a link" to 10th round of wikis - https://phabricator.wikimedia.org/T308135
[08:33:56] * urbanecm done
[08:38:35] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host analytics1076.eqiad.wmnet with OS bullseye
[08:39:14] PROBLEM - Check systemd state on db1140 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter@s6.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:40:10] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host analytics1077.eqiad.wmnet with OS bullseye
[08:45:55] (CR) Elukey: [C: +2] services: add higher limits for cp-jobqueue's exporter [deployment-charts] - https://gerrit.wikimedia.org/r/944155 (https://phabricator.wikimedia.org/T328683) (owner: Elukey)
[08:48:52] (PS1) Giuseppe Lavagetto: thumbor: do not redeploy for a mcrouter config changes [deployment-charts] - https://gerrit.wikimedia.org/r/944158
[08:48:54] (PS1) Giuseppe Lavagetto: function-orchestrator: add mcrouter support [deployment-charts] - https://gerrit.wikimedia.org/r/944159 (https://phabricator.wikimedia.org/T297815)
[08:48:56] (PS1) Giuseppe Lavagetto: wikifunctions: enable mcrouter for orchestrator [deployment-charts] - https://gerrit.wikimedia.org/r/944160 (https://phabricator.wikimedia.org/T297815)
[08:49:34] (CR) Jbond: [C: +1] "lgtm some minor nits" [puppet] - https://gerrit.wikimedia.org/r/943563 (https://phabricator.wikimedia.org/T320390) (owner: Jelto)
[08:49:48] (CR) CI reject: [V: -1] wikifunctions: enable mcrouter for orchestrator [deployment-charts] - https://gerrit.wikimedia.org/r/944160 (https://phabricator.wikimedia.org/T297815) (owner: Giuseppe Lavagetto)
[08:50:05] (CR) CI reject: [V: -1] function-orchestrator: add mcrouter support [deployment-charts] - https://gerrit.wikimedia.org/r/944159 (https://phabricator.wikimedia.org/T297815) (owner: Giuseppe Lavagetto)
[08:50:35] (CR) Hnowlan: [C: +2] thumbor: do not redeploy for a mcrouter config changes [deployment-charts] - https://gerrit.wikimedia.org/r/944158 (owner: Giuseppe Lavagetto)
[08:51:35] (Merged) jenkins-bot: thumbor: do not redeploy for a mcrouter config changes [deployment-charts] - https://gerrit.wikimedia.org/r/944158 (owner: Giuseppe Lavagetto)
[08:51:49] (CR) Jelto: [C: +2] gitlab: Use gitlab-settings v1.2.0 [puppet] - https://gerrit.wikimedia.org/r/943583 (https://phabricator.wikimedia.org/T320390) (owner: Ahmon Dancy)
[08:59:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2101.codfw.wmnet with reason: Maintenance
[08:59:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: Maintenance
[09:00:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: Maintenance
[09:00:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2101.codfw.wmnet with reason: Maintenance
[09:02:27] (PS1) MVernon: thanos: fake credential for gitlab account [labs/private] - https://gerrit.wikimedia.org/r/944163 (https://phabricator.wikimedia.org/T336234)
[09:03:08] (PS1) MVernon: thanos: add gitlab user [puppet] - https://gerrit.wikimedia.org/r/944164 (https://phabricator.wikimedia.org/T336234)
[09:03:20] !log btullis@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host analytics1077.eqiad.wmnet with OS bullseye
[09:03:44] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host analytics1077.eqiad.wmnet with OS bullseye
[09:06:53] (CR) Marostegui: [C: +1] thanos: fake credential for gitlab account [labs/private] - https://gerrit.wikimedia.org/r/944163 (https://phabricator.wikimedia.org/T336234) (owner: MVernon)
[09:07:09] (CR) Marostegui: [C: +1] thanos: add gitlab user [puppet] - https://gerrit.wikimedia.org/r/944164 (https://phabricator.wikimedia.org/T336234) (owner: MVernon)
[09:08:18] (CR) MVernon: [C: +2] thanos: add gitlab user [puppet] - https://gerrit.wikimedia.org/r/944164 (https://phabricator.wikimedia.org/T336234) (owner: MVernon)
[09:08:42] (CR) MVernon: [V: +2] thanos: fake credential for gitlab account [labs/private] - https://gerrit.wikimedia.org/r/944163 (https://phabricator.wikimedia.org/T336234) (owner: MVernon)
[09:08:45] (CR) MVernon: [V: +2 C: +2] thanos: fake credential for gitlab account [labs/private] - https://gerrit.wikimedia.org/r/944163 (https://phabricator.wikimedia.org/T336234) (owner: MVernon)
[09:11:33] !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: sync
[09:12:01] !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync
[09:15:13] !log mvernon@cumin1001 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling restart_daemons on A:thanos-fe
[09:19:13] Is it intentional that the task-id doesn't end up in that log line?
I specified it on the command line [09:20:25] (03PS1) 10Urbanecm: Revert "Fixes: Echo notification count disappears on load in mobile skin" [extensions/Echo] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/943605 (https://phabricator.wikimedia.org/T335273) [09:20:37] jouncebot: nowandnext [09:20:37] No deployments scheduled for the next 0 hour(s) and 39 minute(s) [09:20:37] In 0 hour(s) and 39 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230801T1000) [09:20:46] (03CR) 10Urbanecm: [C: 03+2] Revert "Fixes: Echo notification count disappears on load in mobile skin" [extensions/Echo] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/943605 (https://phabricator.wikimedia.org/T335273) (owner: 10Urbanecm) [09:20:49] shipping a fix for a train blocker [09:21:10] (cc Lucas_WMDE; thanks for noticing, somewhat didn't think of testing it elsewhere than V22 and minerva) [09:21:32] !log mvernon@cumin1001 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling restart_daemons on A:thanos-fe [09:22:27] np [09:22:41] 10SRE-swift-storage, 10collaboration-services: Investigate object storage for Gitlab - https://phabricator.wikimedia.org/T336234 (10MatthewVernon) @eoghan the account is created in thanos-swift and ready for use (and the credential can be templated via puppet). If for whatever reason you decide not to go ahea... 
[09:22:42] looking at the change I’m also confused how it causes the issue, it looks safe enough for non-minerva skins [09:23:20] Lucas_WMDE: as far as i understand it it's about merging skinSkins in extension.json with https://www.mediawiki.org/wiki/Manual:$wgResourceModuleSkinStyles [09:23:23] I guess it’s indirectly caused by T342907 [09:23:24] T342907: Mobile Echo code scattered between Minerva, Echo and MobileFrontend extensions - https://phabricator.wikimedia.org/T342907 [09:23:32] (03PS2) 10Giuseppe Lavagetto: function-orchestrator: add mcrouter support [deployment-charts] - 10https://gerrit.wikimedia.org/r/944159 (https://phabricator.wikimedia.org/T297815) [09:23:34] (03PS2) 10Giuseppe Lavagetto: wikifunctions: enable mcrouter for orchestrator [deployment-charts] - 10https://gerrit.wikimedia.org/r/944160 (https://phabricator.wikimedia.org/T297815) [09:23:57] sometimes skin's SkinStyles removes Echo-provided default, even though it shouldn't have. [09:31:51] (03PS1) 10Dreamy Jazz: Write new on group0 for event table migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944168 (https://phabricator.wikimedia.org/T330158) [09:32:59] !log elukey@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-jumbo-eqiad cluster: Roll restart of jvm daemons. 
[09:33:05] !log btullis@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host analytics1076.eqiad.wmnet with OS bullseye [09:33:18] (03PS1) 10Fabfur: Release 0.1-3 [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/944169 (https://phabricator.wikimedia.org/T342154) [09:33:26] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host analytics1076.eqiad.wmnet with OS bullseye [09:34:57] (03Merged) 10jenkins-bot: Revert "Fixes: Echo notification count disappears on load in mobile skin" [extensions/Echo] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/943605 (https://phabricator.wikimedia.org/T335273) (owner: 10Urbanecm) [09:35:49] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:943605|Revert "Fixes: Echo notification count disappears on load in mobile skin" (T335273 T343192)]] [09:35:53] T343192: Repeated notification icons on 1.41.0-wmf.20 when using legacy Vector - https://phabricator.wikimedia.org/T343192 [09:35:54] T335273: Echo notification count disappears on load in mobile skin - https://phabricator.wikimedia.org/T335273 [09:36:08] (03CR) 10Fabfur: [C: 03+2] fifo-log-demux: Add socat as companion package [puppet] - 10https://gerrit.wikimedia.org/r/942446 (https://phabricator.wikimedia.org/T342154) (owner: 10Fabfur) [09:36:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance [09:37:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance [09:37:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T342617)', diff saved to https://phabricator.wikimedia.org/P49868 and previous config saved to /var/cache/conftool/dbconfig/20230801-093717-ladsgroup.json [09:37:21] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [09:37:26] !log urbanecm@deploy1002 urbanecm: 
Backport for [[gerrit:943605|Revert "Fixes: Echo notification count disappears on load in mobile skin" (T335273 T343192)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [09:38:02] issue resolved, deploying [09:38:04] !log urbanecm@deploy1002 urbanecm: Continuing with sync [09:39:02] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted for dbrant - https://phabricator.wikimedia.org/T343122 (10fgiunchedi) [09:39:13] !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: sync [09:39:15] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted for dbrant - https://phabricator.wikimedia.org/T343122 (10fgiunchedi) [09:39:54] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted for dbrant - https://phabricator.wikimedia.org/T343122 (10fgiunchedi) 05Open→03In progress [09:40:03] !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: sync [09:40:14] (03CR) 10Muehlenhoff: Release 0.1-3 (031 comment) [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/944169 (https://phabricator.wikimedia.org/T342154) (owner: 10Fabfur) [09:42:07] (03PS2) 10Fabfur: Release 0.1-3 [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/944169 (https://phabricator.wikimedia.org/T342154) [09:43:27] (03CR) 10Fabfur: "Thanks, as suggested I removed the Uploaders section using only Maintainers" [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/944169 (https://phabricator.wikimedia.org/T342154) (owner: 10Fabfur) [09:43:29] (03PS4) 10Jelto: gitlab: remove cas support [puppet] - 10https://gerrit.wikimedia.org/r/943563 (https://phabricator.wikimedia.org/T320390) [09:43:31] (03PS1) 10Jelto: gitlab: remove cas omniauth_provider [puppet] - 
10https://gerrit.wikimedia.org/r/944170 (https://phabricator.wikimedia.org/T320390) [09:43:34] test.wikidata.org looks good to me on mwdebug now [09:43:57] (also on non-mwdebug, but that just means I got lucky with the backend server, I don’t think the scap is finished yet ^^) [09:44:13] 10SRE, 10SRE-Access-Requests: Requesting access to Wiki Replicas end-to-end tiers for dr0ptp4kt - https://phabricator.wikimedia.org/T343039 (10fgiunchedi) [09:45:05] it's at the restart php stage [09:45:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2111.codfw.wmnet with reason: Maintenance [09:45:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2111.codfw.wmnet with reason: Maintenance [09:45:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2111 (T342617)', diff saved to https://phabricator.wikimedia.org/P49869 and previous config saved to /var/cache/conftool/dbconfig/20230801-094538-ladsgroup.json [09:45:41] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42735/console" [puppet] - 10https://gerrit.wikimedia.org/r/944170 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto) [09:45:42] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [09:45:48] 10SRE-swift-storage, 10collaboration-services: Investigate object storage for Gitlab - https://phabricator.wikimedia.org/T336234 (10eoghan) We will of course, thanks for getting that done so fast! 
[09:47:17] (03PS3) 10Fabfur: Release 0.1-3 [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/944169 (https://phabricator.wikimedia.org/T342154) [09:47:20] (03CR) 10Jelto: gitlab: remove cas support (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/943563 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto) [09:47:24] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:943605|Revert "Fixes: Echo notification count disappears on load in mobile skin" (T335273 T343192)]] (duration: 11m 35s) [09:47:28] T343192: Repeated notification icons on 1.41.0-wmf.20 when using legacy Vector - https://phabricator.wikimedia.org/T343192 [09:47:29] T335273: Echo notification count disappears on load in mobile skin - https://phabricator.wikimedia.org/T335273 [09:47:38] yay [09:48:18] PROBLEM - CirrusSearch codfw 95th percentile latency - more_like on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [2000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [09:48:28] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/944170 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto) [09:49:46] (03CR) 10Jbond: [C: 03+1] gitlab: remove cas omniauth_provider (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/944170 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto) [09:49:53] (03CR) 10Jbond: [C: 03+1] gitlab: remove cas support [puppet] - 10https://gerrit.wikimedia.org/r/943563 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto) [09:50:02] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for adee_wmde - https://phabricator.wikimedia.org/T342969 (10fgiunchedi) [09:51:07] (03PS4) 10Fabfur: Release 0.1-3 
[debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/944169 (https://phabricator.wikimedia.org/T342154) [09:51:16] RECOVERY - CirrusSearch codfw 95th percentile latency - more_like on graphite1005 is OK: OK: Less than 20.00% above the threshold [1200.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [09:53:13] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/944169 (https://phabricator.wikimedia.org/T342154) (owner: 10Fabfur) [09:53:56] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for darthmon_wmde - https://phabricator.wikimedia.org/T342968 (10fgiunchedi) @KFrancis hello, we'd need verification that this user has an NDA on file, would you mind checking? Thank you in advance! [09:53:58] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for adee_wmde - https://phabricator.wikimedia.org/T342969 (10fgiunchedi) @KFrancis hello, we'd need verification that this user has an NDA on file, would you mind checking? Thank you in advance! [09:54:16] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for roti_WMDE - https://phabricator.wikimedia.org/T342972 (10fgiunchedi) @KFrancis hello, we'd need verification that this user has an NDA on file, would you mind checking? Thank you in advance! [09:54:43] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for lojo_wmde - https://phabricator.wikimedia.org/T342973 (10fgiunchedi) @KFrancis hello, we'd need verification that this user has an NDA on file, would you mind checking? Thank you in advance! 
[09:57:20] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Maryana Pinchuk - https://phabricator.wikimedia.org/T342797 (10fgiunchedi) @odimitrijevic @Milimetric hello, we're seeking approval for this request -- thank you! [09:57:55] (03CR) 10Jelto: [V: 03+1] gitlab: remove cas omniauth_provider (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/944170 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto) [09:58:13] (03PS1) 10Jbond: idp_test: add datahub as a OIDC service [puppet] - 10https://gerrit.wikimedia.org/r/944172 (https://phabricator.wikimedia.org/T305874) [09:59:12] (03CR) 10Jelto: gitlab: remove cas support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/943563 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto) [09:59:16] (03PS2) 10Jbond: idp_test: add datahub as a OIDC service [puppet] - 10https://gerrit.wikimedia.org/r/944172 (https://phabricator.wikimedia.org/T305874) [10:00:04] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230801T1000) [10:01:17] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/944172 (https://phabricator.wikimedia.org/T305874) (owner: 10Jbond) [10:04:19] (03PS1) 10Filippo Giunchedi: admin: add radimer-ctr to ldap_users [puppet] - 10https://gerrit.wikimedia.org/r/944174 (https://phabricator.wikimedia.org/T342591) [10:04:56] (03CR) 10Fabfur: [C: 03+2] Release 0.1-3 [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/944169 (https://phabricator.wikimedia.org/T342154) (owner: 10Fabfur) [10:06:05] (03PS3) 10Giuseppe Lavagetto: function-orchestrator: add mcrouter support [deployment-charts] - 10https://gerrit.wikimedia.org/r/944159 (https://phabricator.wikimedia.org/T297815) [10:06:07] (03PS3) 10Giuseppe Lavagetto: wikifunctions: enable mcrouter for orchestrator [deployment-charts] - 10https://gerrit.wikimedia.org/r/944160 
(https://phabricator.wikimedia.org/T297815) [10:06:12] 10SRE-tools, 10DBA, 10Infrastructure-Foundations: Create a cookbook for cloning a mariadb database into another - https://phabricator.wikimedia.org/T340048 (10Ladsgroup) 05Open→03Resolved [10:09:37] (03CR) 10Jbond: [C: 03+2] idp_test: add datahub as a OIDC service [puppet] - 10https://gerrit.wikimedia.org/r/944172 (https://phabricator.wikimedia.org/T305874) (owner: 10Jbond) [10:16:08] (03PS5) 10Clément Goubert: mediawiki: set requests based on php.workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/943560 (https://phabricator.wikimedia.org/T342748) [10:17:10] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for lojo_wmde - https://phabricator.wikimedia.org/T342973 (10WMDE-leszek) @fgiunchedi don't know if that suffices as a confirmation, but the person in question has fairly recently started at WMDE and signed the NDA as a part of T335941. [10:18:25] (03CR) 10Clément Goubert: mediawiki: set requests based on php.workers (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/943560 (https://phabricator.wikimedia.org/T342748) (owner: 10Clément Goubert) [10:18:29] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host analytics1077.eqiad.wmnet with OS bullseye [10:20:32] (03CR) 10Ladsgroup: [C: 03+1] noc: don't use on-disk files but etcd directly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942672 (https://phabricator.wikimedia.org/T341859) (owner: 10Giuseppe Lavagetto) [10:21:25] !log imported prometheus-varnishkafka-exporter package into bookworm-wikimedia (https://gerrit.wikimedia.org/r/c/operations/debs/prometheus-varnishkafka-exporter/+/944169) T342154 [10:21:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:28] T342154: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 [10:22:30] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - 
https://phabricator.wikimedia.org/T342154 (10Fabfur) [10:23:13] (03PS9) 10Slyngshede: Facter: PHP Version [puppet] - 10https://gerrit.wikimedia.org/r/942628 (https://phabricator.wikimedia.org/T271196) [10:23:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T342617)', diff saved to https://phabricator.wikimedia.org/P49870 and previous config saved to /var/cache/conftool/dbconfig/20230801-102340-ladsgroup.json [10:23:43] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [10:24:04] (03CR) 10Ladsgroup: [C: 03+1] noc: centralize file list management [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942673 (https://phabricator.wikimedia.org/T341859) (owner: 10Giuseppe Lavagetto) [10:24:19] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10Fabfur) After importing the required dependencies in bookworm-wikimedia I start working on the `purged` package [10:24:47] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for roti_WMDE - https://phabricator.wikimedia.org/T342972 (10WMDE-leszek) @fgiunchedi don't know if that suffices as a confirmation, but the person in question has fairly recently started at WMDE and signed the NDA as a part of T335941. [10:25:32] (03CR) 10Ladsgroup: "Is there a way to avoid to re-inventing the wheel? 
:(((" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942674 (https://phabricator.wikimedia.org/T341859) (owner: 10Giuseppe Lavagetto) [10:27:00] (03CR) 10Slyngshede: Facter: PHP Version (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/942628 (https://phabricator.wikimedia.org/T271196) (owner: 10Slyngshede) [10:27:45] (03CR) 10Hnowlan: [C: 03+2] rest-gateway: add citoid and wikifeeds egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/943609 (https://phabricator.wikimedia.org/T339119) (owner: 10Hnowlan) [10:28:25] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host analytics1076.eqiad.wmnet with OS bullseye [10:28:29] (03Merged) 10jenkins-bot: rest-gateway: add citoid and wikifeeds egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/943609 (https://phabricator.wikimedia.org/T339119) (owner: 10Hnowlan) [10:28:40] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/944174 (https://phabricator.wikimedia.org/T342591) (owner: 10Filippo Giunchedi) [10:30:52] (03CR) 10Filippo Giunchedi: [C: 03+2] admin: add radimer-ctr to ldap_users [puppet] - 10https://gerrit.wikimedia.org/r/944174 (https://phabricator.wikimedia.org/T342591) (owner: 10Filippo Giunchedi) [10:31:23] !log hnowlan@deploy1002 Started deploy [restbase/deploy@8eb62f2]: Add gpewiki and btmwiktionary (T335988, T336116) [10:31:27] T335988: Add gpewiki to RESTBase - https://phabricator.wikimedia.org/T335988 [10:31:27] T336116: Add btmwiktionary to RESTBase - https://phabricator.wikimedia.org/T336116 [10:32:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T342617)', diff saved to https://phabricator.wikimedia.org/P49871 and previous config saved to /var/cache/conftool/dbconfig/20230801-103249-ladsgroup.json [10:32:53] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [10:33:34] (03PS1) 10Filippo Giunchedi: admin: fix radimer 
user name [puppet] - 10https://gerrit.wikimedia.org/r/944176 (https://phabricator.wikimedia.org/T342591) [10:33:53] (03CR) 10Filippo Giunchedi: [C: 03+2] admin: fix radimer user name [puppet] - 10https://gerrit.wikimedia.org/r/944176 (https://phabricator.wikimedia.org/T342591) (owner: 10Filippo Giunchedi) [10:34:42] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant wmf and turnilo/superset access for Rae Adimer - https://phabricator.wikimedia.org/T342591 (10fgiunchedi) @RAdimer-WMF you are now part of `wmf` ldap group, please confirm access is working as expected! [10:36:50] (03PS3) 10Slyngshede: Credit logo artist. [software/bitu] - 10https://gerrit.wikimedia.org/r/934265 (https://phabricator.wikimedia.org/T338828) [10:36:56] (03CR) 10Slyngshede: Credit logo artist. (032 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/934265 (https://phabricator.wikimedia.org/T338828) (owner: 10Slyngshede) [10:38:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P49872 and previous config saved to /var/cache/conftool/dbconfig/20230801-103846-ladsgroup.json [10:38:48] (03PS6) 10Clément Goubert: mediawiki: set requests based on php.workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/943560 (https://phabricator.wikimedia.org/T342748) [10:42:15] (03PS4) 10Slyngshede: Allow users to update their email address. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/934519 (https://phabricator.wikimedia.org/T340637) [10:43:09] (03PS1) 10Fabfur: Release 0.20 [software/purged] - 10https://gerrit.wikimedia.org/r/944177 (https://phabricator.wikimedia.org/T342154) [10:44:37] (03CR) 10Jbond: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42739/console" [puppet] - 10https://gerrit.wikimedia.org/r/944172 (https://phabricator.wikimedia.org/T305874) (owner: 10Jbond) [10:44:40] (03CR) 10Slyngshede: Allow users to update their email address. (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/934519 (https://phabricator.wikimedia.org/T340637) (owner: 10Slyngshede) [10:44:48] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: remove cas omniauth_provider [puppet] - 10https://gerrit.wikimedia.org/r/944170 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto) [10:45:38] !log update d-i images to bookworm 12.1 T343121 [10:45:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:41] T343121: Integrate Bookworm 12.1 point update - https://phabricator.wikimedia.org/T343121 [10:47:09] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for darthmon_wmde - https://phabricator.wikimedia.org/T342968 (10WMDE-leszek) @fgiunchedi not sure if that is good enough but I was able to locate T222788 about Mónica's NDA. 
[10:47:37] (03PS1) 10Jbond: pcc: fix tuple formating [puppet] - 10https://gerrit.wikimedia.org/r/944178 [10:47:45] (03CR) 10Jbond: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42740/console" [puppet] - 10https://gerrit.wikimedia.org/r/944172 (https://phabricator.wikimedia.org/T305874) (owner: 10Jbond) [10:47:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P49873 and previous config saved to /var/cache/conftool/dbconfig/20230801-104755-ladsgroup.json [10:48:00] (03CR) 10CI reject: [V: 04-1] pcc: fix tuple formating [puppet] - 10https://gerrit.wikimedia.org/r/944178 (owner: 10Jbond) [10:50:50] (03PS2) 10Jbond: pcc: fix tuple formating [puppet] - 10https://gerrit.wikimedia.org/r/944178 [10:51:51] !log hnowlan@deploy1002 Finished deploy [restbase/deploy@8eb62f2]: Add gpewiki and btmwiktionary (T335988, T336116) (duration: 20m 29s) [10:51:55] T335988: Add gpewiki to RESTBase - https://phabricator.wikimedia.org/T335988 [10:51:56] T336116: Add btmwiktionary to RESTBase - https://phabricator.wikimedia.org/T336116 [10:53:51] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for darthmon_wmde - https://phabricator.wikimedia.org/T342968 (10MoritzMuehlenhoff) >>! In T342968#9058234, @fgiunchedi wrote: > @KFrancis hello, we'd need verification that this user has an NDA on file, would you mind checking? Thank you in... 
[10:53:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P49874 and previous config saved to /var/cache/conftool/dbconfig/20230801-105352-ladsgroup.json [10:57:29] 10SRE, 10Infrastructure-Foundations: Integrate Bookworm 12.1 point update - https://phabricator.wikimedia.org/T343121 (10MoritzMuehlenhoff) [10:57:41] <_joe_> jouncebot: nowandnext [10:57:42] For the next 0 hour(s) and 2 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230801T1000) [10:57:42] In 1 hour(s) and 2 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230801T1200) [10:58:22] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10Fabfur) [10:59:21] (03CR) 10Vgutierrez: "We can drop strings.Cut backport considering that bookworm ships golang 1.19: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/so" [software/purged] - 10https://gerrit.wikimedia.org/r/944177 (https://phabricator.wikimedia.org/T342154) (owner: 10Fabfur) [11:01:17] (03CR) 10Vgutierrez: Release 0.20 (031 comment) [software/purged] - 10https://gerrit.wikimedia.org/r/944177 (https://phabricator.wikimedia.org/T342154) (owner: 10Fabfur) [11:03:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P49875 and previous config saved to /var/cache/conftool/dbconfig/20230801-110302-ladsgroup.json [11:04:28] (03CR) 10Jbond: [C: 03+2] pcc: fix tuple formating [puppet] - 10https://gerrit.wikimedia.org/r/944178 (owner: 10Jbond) [11:05:34] RECOVERY - Check systemd state on db1140 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:06:42] RECOVERY - Check systemd state on db2141 is OK: OK - running: The system is 
fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:08:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T342617)', diff saved to https://phabricator.wikimedia.org/P49876 and previous config saved to /var/cache/conftool/dbconfig/20230801-110858-ladsgroup.json [11:09:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance [11:09:02] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [11:09:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance [11:09:16] (03CR) 10Giuseppe Lavagetto: [V: 03+1] noc: add static file server (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942674 (https://phabricator.wikimedia.org/T341859) (owner: 10Giuseppe Lavagetto) [11:11:22] (03PS2) 10Fabfur: Release 0.20 [software/purged] - 10https://gerrit.wikimedia.org/r/944177 (https://phabricator.wikimedia.org/T342154) [11:12:26] (03CR) 10Fabfur: Release 0.20 (031 comment) [software/purged] - 10https://gerrit.wikimedia.org/r/944177 (https://phabricator.wikimedia.org/T342154) (owner: 10Fabfur) [11:12:59] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/934265 (https://phabricator.wikimedia.org/T338828) (owner: 10Slyngshede) [11:13:51] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/934519 (https://phabricator.wikimedia.org/T340637) (owner: 10Slyngshede) [11:14:05] (03PS1) 10Btullis: Revert "install_server: drop Bashisms" [puppet] - 10https://gerrit.wikimedia.org/r/944189 [11:15:31] (03PS2) 10Btullis: Revert "install_server: drop Bashisms" [puppet] - 10https://gerrit.wikimedia.org/r/944189 (https://phabricator.wikimedia.org/T95064) [11:16:49] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" 
[software/purged] - 10https://gerrit.wikimedia.org/r/944177 (https://phabricator.wikimedia.org/T342154) (owner: 10Fabfur) [11:17:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:17:45] (03CR) 10Btullis: [C: 03+2] Revert "install_server: drop Bashisms" [puppet] - 10https://gerrit.wikimedia.org/r/944189 (https://phabricator.wikimedia.org/T95064) (owner: 10Btullis) [11:18:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T342617)', diff saved to https://phabricator.wikimedia.org/P49877 and previous config saved to /var/cache/conftool/dbconfig/20230801-111808-ladsgroup.json [11:18:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2123.codfw.wmnet with reason: Maintenance [11:18:12] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [11:18:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2123.codfw.wmnet with reason: Maintenance [11:18:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2123 (T342617)', diff saved to https://phabricator.wikimedia.org/P49878 and previous config saved to /var/cache/conftool/dbconfig/20230801-111829-ladsgroup.json [11:19:02] (03PS3) 10Fabfur: Release 0.20 [software/purged] - 10https://gerrit.wikimedia.org/r/944177 (https://phabricator.wikimedia.org/T342154) [11:21:46] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host analytics1076.eqiad.wmnet with OS bullseye [11:22:25] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host analytics1077.eqiad.wmnet with OS bullseye [11:22:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on 
k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:26:31] 10SRE, 10Traffic-Icebox: Create a second text-lb IP address for test purposes - https://phabricator.wikimedia.org/T237492 (10cmooney) This "temporary testing" IP appears to be the one our public DNS for text-lb in Amsterdam is resolving to: ` cathal@officepc:~$ dig +noall +answer test-lb.esams.wikimedia.org. @... [11:33:06] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1076.eqiad.wmnet with reason: host reimage [11:33:50] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on analytics1077.eqiad.wmnet with reason: host reimage [11:36:08] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1076.eqiad.wmnet with reason: host reimage [11:38:22] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on analytics1077.eqiad.wmnet with reason: host reimage [11:38:42] (03PS2) 10Muehlenhoff: Remove jgreen from ops group [puppet] - 10https://gerrit.wikimedia.org/r/936215 (https://phabricator.wikimedia.org/T336231) [11:40:29] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Credit logo artist. [software/bitu] - 10https://gerrit.wikimedia.org/r/934265 (https://phabricator.wikimedia.org/T338828) (owner: 10Slyngshede) [11:40:46] (03CR) 10Slyngshede: [C: 03+2] Allow users to update their email address. [software/bitu] - 10https://gerrit.wikimedia.org/r/934519 (https://phabricator.wikimedia.org/T340637) (owner: 10Slyngshede) [11:40:48] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Allow users to update their email address. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/934519 (https://phabricator.wikimedia.org/T340637) (owner: 10Slyngshede) [11:41:34] (03CR) 10Jbond: [C: 03+2] sre.hosts.reimage: connect to the micro service port [cookbooks] - 10https://gerrit.wikimedia.org/r/939738 (owner: 10Jbond) [11:42:41] (03CR) 10Ladsgroup: [C: 03+1] noc: add static file server (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942674 (https://phabricator.wikimedia.org/T341859) (owner: 10Giuseppe Lavagetto) [11:42:50] (03CR) 10Jbond: [C: 03+2] httpd: Add a profile for loading httpd [puppet] - 10https://gerrit.wikimedia.org/r/937520 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [11:45:20] (03CR) 10Jbond: "can i get another review on this" [cookbooks] - 10https://gerrit.wikimedia.org/r/924062 (owner: 10Jbond) [11:45:57] (03CR) 10Jbond: sre.ganeti.resource_report: Add cookbook to fetch Ganeti resources (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/924062 (owner: 10Jbond) [11:46:08] (03CR) 10Ladsgroup: noc: remove symlinks and also neutralize createTxtFileSymlinks (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942675 (https://phabricator.wikimedia.org/T341859) (owner: 10Giuseppe Lavagetto) [11:48:09] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: eqiad VM 1 for config-master - https://phabricator.wikimedia.org/T343212 (10jbond) [11:48:21] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:48:24] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: eqiad VM 1 for config-master - https://phabricator.wikimedia.org/T343212 (10jbond) 05Open→03In progress p:05Triage→03Medium a:03jbond [11:48:49] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: codfw VM 1 for config-master - https://phabricator.wikimedia.org/T343213 (10jbond) [11:48:59] 10SRE, 
10Infrastructure-Foundations, 10vm-requests: Site: codfw VM 1 for config-master - https://phabricator.wikimedia.org/T343213 (10jbond) p:05Triage→03Medium a:03jbond [11:49:05] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: codfw VM 1 for config-master - https://phabricator.wikimedia.org/T343213 (10jbond) 05Open→03In progress [11:49:17] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-conf1001.eqiad.wmnet [11:50:30] 10sre-alert-triage, 10serviceops: Alert triage: overdue warning alert - https://phabricator.wikimedia.org/T342761 (10Clement_Goubert) 05Open→03Resolved a:03Clement_Goubert Alert has resolved [11:50:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1161.eqiad.wmnet with reason: Maintenance [11:50:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1161.eqiad.wmnet with reason: Maintenance [11:50:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [11:51:04] (03CR) 10Muehlenhoff: [C: 03+2] Remove jgreen from ops group [puppet] - 10https://gerrit.wikimedia.org/r/936215 (https://phabricator.wikimedia.org/T336231) (owner: 10Muehlenhoff) [11:51:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [11:51:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T342617)', diff saved to https://phabricator.wikimedia.org/P49879 and previous config saved to /var/cache/conftool/dbconfig/20230801-115110-ladsgroup.json [11:51:13] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [11:52:28] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: codfw VM 1 for config-master - https://phabricator.wikimedia.org/T343213 (10jbond) 
[11:52:35] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Add config-master to puppetserver role - https://phabricator.wikimedia.org/T341717 (10jbond) [11:52:43] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: eqiad VM 1 for config-master - https://phabricator.wikimedia.org/T343212 (10jbond) [11:52:47] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Add config-master to puppetserver role - https://phabricator.wikimedia.org/T341717 (10jbond) [11:52:51] 10SRE, 10Scap, 10serviceops-radar, 10Release-Engineering-Team (Seen): Enable scap to roll back broken changes to MediaWiki - https://phabricator.wikimedia.org/T225207 (10Clement_Goubert) [11:53:26] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Move config-master to dedicated VMs - https://phabricator.wikimedia.org/T341717 (10jbond) [11:55:26] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-conf1001.eqiad.wmnet [11:57:09] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-conf1002.eqiad.wmnet [11:59:03] 10SRE, 10Infrastructure-Foundations, 10netops: New IP and Vlan allocations for esams knams move - https://phabricator.wikimedia.org/T343214 (10cmooney) p:05Triage→03Medium [11:59:19] 10SRE, 10Infrastructure-Foundations, 10netops: New IP and Vlan allocations for esams knams move - https://phabricator.wikimedia.org/T343214 (10cmooney) [11:59:24] 10SRE, 10ops-knams, 10DC-Ops: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10cmooney) [11:59:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T342617)', diff saved to https://phabricator.wikimedia.org/P49880 and previous config saved to /var/cache/conftool/dbconfig/20230801-115924-ladsgroup.json [11:59:27] T342617: Make old columns of 
externallinks nullable - https://phabricator.wikimedia.org/T342617 [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230801T1200) [12:01:17] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:03:02] (03CR) 10Fabfur: [C: 03+2] Release 0.20 [software/purged] - 10https://gerrit.wikimedia.org/r/944177 (https://phabricator.wikimedia.org/T342154) (owner: 10Fabfur) [12:03:16] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-conf1002.eqiad.wmnet [12:03:34] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host analytics1077.eqiad.wmnet with OS bullseye [12:06:22] (03CR) 10Muehlenhoff: sre.ganeti.resource_report: Add cookbook to fetch Ganeti resources (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/924062 (owner: 10Jbond) [12:06:44] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host analytics1076.eqiad.wmnet with OS bullseye [12:08:06] (03CR) 10Volans: [C: 04-1] "Few comments inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/924062 (owner: 10Jbond) [12:08:56] 10SRE, 10Infrastructure-Foundations, 10netops: Announce new public IPv6 prefix from Amsterdam for knams migration - https://phabricator.wikimedia.org/T343216 (10cmooney) p:05Triage→03Medium [12:09:05] 10SRE, 10Infrastructure-Foundations, 10netops: Announce new public IPv6 prefix from Amsterdam for knams migration - https://phabricator.wikimedia.org/T343216 (10cmooney) [12:09:11] 10SRE, 10Infrastructure-Foundations, 10netops: New IP and Vlan allocations for esams knams move - https://phabricator.wikimedia.org/T343214 (10cmooney) [12:11:00] !log imported purged package into bookworm-wikimedia (https://gerrit.wikimedia.org/r/c/operations/software/purged/+/944177) T342154 [12:11:02] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:03] T342154: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 [12:11:35] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10Fabfur) [12:11:42] (03PS1) 10Cathal Mooney: Announce new AMS IPv6 range from esams and knams ahead of move [homer/public] - 10https://gerrit.wikimedia.org/r/944184 (https://phabricator.wikimedia.org/T343216) [12:11:58] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/942695 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney) [12:12:15] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10Fabfur) Start working on `python-logstash` package [12:14:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P49881 and previous config saved to /var/cache/conftool/dbconfig/20230801-121430-ladsgroup.json [12:15:56] (03CR) 10Muehlenhoff: sre.ganeti.resource_report: Add cookbook to fetch Ganeti resources (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/924062 (owner: 10Jbond) [12:16:35] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Announce new public IPv6 prefix from Amsterdam for knams migration - https://phabricator.wikimedia.org/T343216 (10cmooney) [12:17:05] (03CR) 10Volans: [C: 04-1] "reply inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/924062 (owner: 10Jbond) [12:17:41] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/931581 (owner: 10Muehlenhoff) [12:18:41] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Announce new public IPv6 prefix from Amsterdam for knams migration - https://phabricator.wikimedia.org/T343216 (10cmooney) IRR route6 object created: ` cathal@officepc:~$ whois -r -T route6 -h 
whois.ripe.net 2a02:ec80:300::/48 % This is the... [12:20:08] (03CR) 10Cathal Mooney: [C: 03+2] Announce new AMS IPv6 range from esams and knams ahead of move [homer/public] - 10https://gerrit.wikimedia.org/r/944184 (https://phabricator.wikimedia.org/T343216) (owner: 10Cathal Mooney) [12:20:41] (03Merged) 10jenkins-bot: Announce new AMS IPv6 range from esams and knams ahead of move [homer/public] - 10https://gerrit.wikimedia.org/r/944184 (https://phabricator.wikimedia.org/T343216) (owner: 10Cathal Mooney) [12:22:31] (03CR) 10Muehlenhoff: [C: 03+2] noc: Pass ports without ferm-specific service constants [puppet] - 10https://gerrit.wikimedia.org/r/931581 (owner: 10Muehlenhoff) [12:23:47] (03PS1) 10Fabfur: Version 0.4.6-4 [debs/python-logstash] - 10https://gerrit.wikimedia.org/r/944209 (https://phabricator.wikimedia.org/T342154) [12:25:19] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-conf1003.eqiad.wmnet [12:29:00] (03PS7) 10Jbond: sre.ganeti.resource_report: Add cookbook to fetch Ganeti resources [cookbooks] - 10https://gerrit.wikimedia.org/r/924062 [12:29:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P49882 and previous config saved to /var/cache/conftool/dbconfig/20230801-122936-ladsgroup.json [12:30:20] !log jbond@cumin1001 START - Cookbook sre.ganeti.resource_report [12:30:20] !log jbond@cumin1001 END (PASS) - Cookbook sre.ganeti.resource_report (exit_code=0) [12:31:23] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-conf1003.eqiad.wmnet [12:31:36] (03PS8) 10Jbond: sre.ganeti.resource_report: Add cookbook to fetch Ganeti resources [cookbooks] - 10https://gerrit.wikimedia.org/r/924062 [12:31:39] (03CR) 10CI reject: [V: 04-1] sre.ganeti.resource_report: Add cookbook to fetch Ganeti resources [cookbooks] - 10https://gerrit.wikimedia.org/r/924062 (owner: 10Jbond) [12:32:22] (03PS9) 10Jbond: 
sre.ganeti.resource_report: Add cookbook to fetch Ganeti resources [cookbooks] - 10https://gerrit.wikimedia.org/r/924062 [12:33:51] (03PS10) 10Jbond: sre.ganeti.resource_report: Add cookbook to fetch Ganeti resources [cookbooks] - 10https://gerrit.wikimedia.org/r/924062 [12:34:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T342617)', diff saved to https://phabricator.wikimedia.org/P49883 and previous config saved to /var/cache/conftool/dbconfig/20230801-123406-ladsgroup.json [12:34:12] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [12:35:14] (03PS11) 10Jbond: sre.ganeti.resource_report: Add cookbook to fetch Ganeti resources [cookbooks] - 10https://gerrit.wikimedia.org/r/924062 [12:36:15] (03PS12) 10Jbond: sre.ganeti.resource_report: Add cookbook to fetch Ganeti resources [cookbooks] - 10https://gerrit.wikimedia.org/r/924062 [12:37:19] (03CR) 10Jbond: "thanks for the quick feedback, updated" [cookbooks] - 10https://gerrit.wikimedia.org/r/924062 (owner: 10Jbond) [12:37:49] (03PS3) 10Muehlenhoff: PCC: Pass ports without ferm-specific service constants [puppet] - 10https://gerrit.wikimedia.org/r/931578 [12:40:32] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: codfw VM 1 for config-master - https://phabricator.wikimedia.org/T343213 (10jbond) going to self approve this for group A [12:41:08] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: eqiad VM 1 for config-master - https://phabricator.wikimedia.org/T343212 (10jbond) going to self approve this for group A [12:43:13] (03PS1) 10Muehlenhoff: profile::aptrepo::wikimedia: Pass ports without Ferm-specific service identifiers [puppet] - 10https://gerrit.wikimedia.org/r/944211 [12:44:23] 10SRE, 10Traffic-Icebox: Create a second text-lb IP address for test purposes - https://phabricator.wikimedia.org/T237492 (10cmooney) I also made the same allocations for the [[ 
https://netbox.wikimedia.org/ipam/prefixes/743/ip-addresses/ | IPv6 range ]] [12:44:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T342617)', diff saved to https://phabricator.wikimedia.org/P49885 and previous config saved to /var/cache/conftool/dbconfig/20230801-124442-ladsgroup.json [12:44:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2128.codfw.wmnet with reason: Maintenance [12:44:46] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [12:44:47] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/944211 (owner: 10Muehlenhoff) [12:44:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2128.codfw.wmnet with reason: Maintenance [12:45:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [12:45:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [12:45:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2128 (T342617)', diff saved to https://phabricator.wikimedia.org/P49886 and previous config saved to /var/cache/conftool/dbconfig/20230801-124508-ladsgroup.json [12:45:46] (03CR) 10CI reject: [V: 04-1] profile::aptrepo::wikimedia: Pass ports without Ferm-specific service identifiers [puppet] - 10https://gerrit.wikimedia.org/r/944211 (owner: 10Muehlenhoff) [12:48:15] (03PS1) 10Jbond: install_server: Add partman config for config-master [puppet] - 10https://gerrit.wikimedia.org/r/944212 (https://phabricator.wikimedia.org/T341717) [12:49:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P49887 and previous config saved to 
/var/cache/conftool/dbconfig/20230801-124912-ladsgroup.json [12:49:37] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/931578 (owner: 10Muehlenhoff) [12:51:27] (03CR) 10Jbond: [C: 03+1] "LGTM suggestion inline. CI is complaining because the first line of the commit msg is too long" [puppet] - 10https://gerrit.wikimedia.org/r/944211 (owner: 10Muehlenhoff) [12:51:47] (03CR) 10Jbond: [C: 03+2] install_server: Add partman config for config-master [puppet] - 10https://gerrit.wikimedia.org/r/944212 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [12:52:37] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for lojo_wmde - https://phabricator.wikimedia.org/T342973 (10fgiunchedi) >>! In T342973#9058334, @WMDE-leszek wrote: > @fgiunchedi don't know if that suffices as a confirmation, but the person in question has fairly recently started at WMDE... [12:52:56] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for roti_WMDE - https://phabricator.wikimedia.org/T342972 (10fgiunchedi) >>! In T342972#9058370, @WMDE-leszek wrote: > @fgiunchedi don't know if that suffices as a confirmation, but the person in question has fairly recently started at WMDE... [12:55:20] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10Jelto) p:05High→03Medium The cas omniauth_provider was removed in the last merged patch. OIDC is the only login available in G...
[12:57:28] (03CR) 10Volans: "LGTM, one nit on the filename" [cookbooks] - 10https://gerrit.wikimedia.org/r/924062 (owner: 10Jbond) [12:57:56] 10SRE, 10Bitu, 10Infrastructure-Foundations: Implement "update email" functionality - https://phabricator.wikimedia.org/T340637 (10SLyngshede-WMF) 05In progress→03Resolved [12:58:00] 10SRE, 10Bitu, 10Infrastructure-Foundations: Build-out for self service - https://phabricator.wikimedia.org/T320801 (10SLyngshede-WMF) [12:58:32] !log jbond@cumin1001 START - Cookbook sre.ganeti.makevm for new host config-master1001.eqiad.wmnet [12:58:34] !log jbond@cumin1001 START - Cookbook sre.dns.netbox [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230801T1300). [13:00:04] aanzx and koi: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:10] o/ [13:00:18] I can be there in a few mins [13:00:23] (might also do a backport myself) [13:00:34] o/ [13:00:40] !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM config-master1001.eqiad.wmnet - jbond@cumin1001" [13:01:59] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM config-master1001.eqiad.wmnet - jbond@cumin1001" [13:01:59] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:01:59] !log jbond@cumin1001 START - Cookbook sre.dns.wipe-cache config-master1001.eqiad.wmnet on all recursors [13:02:02] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master1001.eqiad.wmnet on all recursors [13:02:38] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for adee_wmde - https://phabricator.wikimedia.org/T342969 (10fgiunchedi) >>! In T342969#9058232, @fgiunchedi wrote: > @KFrancis hello, we'd need verification that this user has an NDA on file, would you mind checking? Thank you in advance!... [13:02:43] alright, I can deploy! [13:03:13] hum, new fingerprint for deployment.eqiad.wmnet? 
[13:03:15] * Lucas_WMDE looks on wikitech [13:03:55] hm, it’s the ed25519 one from https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/deploy1002.eqiad.wmnet [13:04:10] maybe I need to check on my fingerprints update timer later [13:04:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P49888 and previous config saved to /var/cache/conftool/dbconfig/20230801-130419-ladsgroup.json [13:04:26] (03PS13) 10Jbond: sre.ganeti.resource_report: Add cookbook to fetch Ganeti resources [cookbooks] - 10https://gerrit.wikimedia.org/r/924062 [13:05:07] !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM config-master1001.eqiad.wmnet - jbond@cumin1001" [13:05:37] (03CR) 10Jbond: "updated thanks" [cookbooks] - 10https://gerrit.wikimedia.org/r/924062 (owner: 10Jbond) [13:05:53] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM config-master1001.eqiad.wmnet - jbond@cumin1001" [13:06:01] !log jbond@cumin2002 START - Cookbook sre.ganeti.makevm for new host config-master2001.codfw.wmnet [13:06:02] !log jbond@cumin2002 START - Cookbook sre.dns.netbox [13:06:10] !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host config-master1001.eqiad.wmnet with OS bookworm [13:06:17] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: eqiad VM 1 for config-master - https://phabricator.wikimedia.org/T343212 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin1001 for host config-master1001.eqiad.wmnet with OS bookworm [13:06:23] (03CR) 10Lucas Werkmeister (WMDE): idwikisource change wgSiteName, wgMetaNamespace and add project namespace alias (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944190 
(https://phabricator.wikimedia.org/T341173) (owner: 10Anzx) [13:06:44] (03CR) 10Filippo Giunchedi: wmcs: Disable Graphite query access (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/942692 (https://phabricator.wikimedia.org/T326266) (owner: 10Majavah) [13:07:17] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/924062 (owner: 10Jbond) [13:07:48] (03PS1) 10Cathal Mooney: Policy and definition updates for post-migration esams ranges [homer/public] - 10https://gerrit.wikimedia.org/r/944216 (https://phabricator.wikimedia.org/T343214) [13:08:11] (03CR) 10Majavah: wmcs: Disable Graphite query access (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/942692 (https://phabricator.wikimedia.org/T326266) (owner: 10Majavah) [13:08:32] (03PS1) 10Jbond: config-master: Add role::insetup to new config masters [puppet] - 10https://gerrit.wikimedia.org/r/944217 (https://phabricator.wikimedia.org/T341717) [13:08:50] (03CR) 10Jbond: [C: 03+2] config-master: Add role::insetup to new config masters [puppet] - 10https://gerrit.wikimedia.org/r/944217 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [13:09:26] (03PS2) 10Lucas Werkmeister (WMDE): btmwiktionary: Add project logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942781 (https://phabricator.wikimedia.org/T343004) (owner: 10Stang) [13:09:43] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942781 (https://phabricator.wikimedia.org/T343004) (owner: 10Stang) [13:09:50] !log jbond@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM config-master2001.codfw.wmnet - jbond@cumin2002" [13:10:06] (03PS3) 10Anzx: idwikisource change wgSiteName, wgMetaNamespace and add project namespace alias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944190 
(https://phabricator.wikimedia.org/T341173) [13:10:24] (03PS11) 10Herron: profile::pyrra::api: create profile [puppet] - 10https://gerrit.wikimedia.org/r/929729 (https://phabricator.wikimedia.org/T302995) [13:10:31] (03Merged) 10jenkins-bot: btmwiktionary: Add project logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942781 (https://phabricator.wikimedia.org/T343004) (owner: 10Stang) [13:10:35] !log jbond@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM config-master2001.codfw.wmnet - jbond@cumin2002" [13:10:35] !log jbond@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:10:35] !log jbond@cumin2002 START - Cookbook sre.dns.wipe-cache config-master2001.codfw.wmnet on all recursors [13:10:38] !log jbond@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master2001.codfw.wmnet on all recursors [13:11:00] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:942781|btmwiktionary: Add project logo (T343004)]] [13:11:03] !log jbond@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM config-master2001.codfw.wmnet - jbond@cumin2002" [13:11:12] T343004: Change logo btm.wikt - https://phabricator.wikimedia.org/T343004 [13:11:45] (03CR) 10Anzx: idwikisource change wgSiteName, wgMetaNamespace and add project namespace alias (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944190 (https://phabricator.wikimedia.org/T341173) (owner: 10Anzx) [13:11:49] !log jbond@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM config-master2001.codfw.wmnet - jbond@cumin2002" [13:12:09] !log jbond@cumin2002 START - Cookbook sre.hosts.reimage for host config-master2001.codfw.wmnet with OS bookworm 
[13:12:15] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: codfw VM 1 for config-master - https://phabricator.wikimedia.org/T343213 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin2002 for host config-master2001.codfw.wmnet with OS bookworm [13:13:20] (03CR) 10Herron: profile::pyrra::api: create profile (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/929729 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [13:14:58] (03PS14) 10Jbond: sre.ganeti.resource-report: Add cookbook to fetch Ganeti resources [cookbooks] - 10https://gerrit.wikimedia.org/r/924062 [13:15:02] (03CR) 10Jbond: [C: 03+2] sre.ganeti.resource-report: Add cookbook to fetch Ganeti resources [cookbooks] - 10https://gerrit.wikimedia.org/r/924062 (owner: 10Jbond) [13:16:41] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on config-master1001.eqiad.wmnet with reason: host reimage [13:17:48] (03Merged) 10jenkins-bot: sre.ganeti.resource-report: Add cookbook to fetch Ganeti resources [cookbooks] - 10https://gerrit.wikimedia.org/r/924062 (owner: 10Jbond) [13:18:10] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T343180 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm known issue with no impact [13:18:22] (03CR) 10Filippo Giunchedi: wmcs: Disable Graphite query access (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/942692 (https://phabricator.wikimedia.org/T326266) (owner: 10Majavah) [13:18:44] `helmfile -e eqiad --selector name=pinkunicorn apply` is taking quite a while [13:18:58] (name=main, and the two codfws, already finished) [13:19:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T342617)', diff saved to https://phabricator.wikimedia.org/P49889 and previous config saved to /var/cache/conftool/dbconfig/20230801-131925-ladsgroup.json [13:19:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 
db1185.eqiad.wmnet with reason: Maintenance [13:19:28] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [13:19:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1185.eqiad.wmnet with reason: Maintenance [13:19:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1185 (T342617)', diff saved to https://phabricator.wikimedia.org/P49890 and previous config saved to /var/cache/conftool/dbconfig/20230801-131946-ladsgroup.json [13:19:47] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] simplewiktionary: Update project logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942806 (https://phabricator.wikimedia.org/T343084) (owner: 10Stang) [13:19:50] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on config-master1001.eqiad.wmnet with reason: host reimage [13:20:05] (03CR) 10Filippo Giunchedi: [C: 03+1] profile::pyrra::api: create profile [puppet] - 10https://gerrit.wikimedia.org/r/929729 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [13:20:26] Lucas_WMDE: Scheduling issues :/ [13:20:36] 7m29s Warning FailedScheduling pod/mw-debug.eqiad.pinkunicorn-fd8bf588d-8lfwm 0/22 nodes are available: 16 Insufficient cpu, 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 4 node(s) had taint {dedicated: kask}, that the pod didn't tolerate. [13:20:52] :/ [13:21:17] I think we're gonna need a bigger boat [13:21:46] Especially with the work I'm doing for mw-on-k8s which will raise substantially their requests [13:22:01] (kubernetes resource requests, not rps) [13:22:19] now scap noticed too [13:22:24] K8s deployment to stage testservers failed: K8s deployment had the following errors: [13:22:29] Rolling back to prior state... 
[13:22:38] (03PS1) 10Anzx: Change idwikisource logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944221 [13:22:50] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and stang: Backport for [[gerrit:942781|btmwiktionary: Add project logo (T343004)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:22:52] T343004: Change logo btm.wikt - https://phabricator.wikimedia.org/T343004 [13:23:22] koi: can you test? [13:23:27] Yeah [13:23:28] looking [13:24:04] I mean for mw-debug we can probably tell it that we don't need to have a rolling scaling (that way we free the resources then take them back, instead of overprovisioning then scaling down) [13:24:28] (03PS2) 10Anzx: Change idwikisource logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944221 (https://phabricator.wikimedia.org/T341173) [13:24:35] Lucas_WMDE, looks good from my side [13:24:38] claime: *nod* [13:24:40] ok, will sync [13:24:41] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and stang: Continuing with sync [13:24:47] oh right, scap logs that now, yay :) [13:24:57] I'll have to redeploy mw-debug eqiad though [13:25:08] Because right now the new image hasn't been deployed there [13:25:36] Or if you have another backport next, it'll get deployed at that point, but we'll probably run into the same issue [13:25:50] I have two more yeah [13:26:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T342617)', diff saved to https://phabricator.wikimedia.org/P49891 and previous config saved to /var/cache/conftool/dbconfig/20230801-132604-ladsgroup.json [13:26:10] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [13:26:26] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 130 jobs 
https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [13:26:49] I need to look at the pod disruption things for mw-debug, but it'll take me more than the time it takes you to move on to the next backport [13:27:09] I'll lower requests manually, see if it's enough for it to deploy correctly [13:27:26] ok thanks [13:27:28] And bump up the priority of more hardware [13:27:43] (it’s currently helmfileing the canaries btw) [13:27:57] (codfw finished, eqiad still ongoing) [13:28:13] We're going to have the same issue [13:28:19] 3m16s Warning FailedScheduling pod/mw-api-ext.eqiad.canary-688764c94-qn4kv 0/22 nodes are available: 16 Insufficient cpu, 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 4 node(s) had taint {dedicated: kask}, that the pod didn't tolerate. [13:28:22] Damn [13:28:29] hm [13:28:32] but this time with user facing traffic? [13:28:36] Yes. [13:28:46] let me know if I should cancel the scap [13:28:47] This is a problem [13:28:53] Won't change anything [13:29:06] It just means mw-on-k8s won't be running up to date code [13:29:11] ok [13:29:17] but k8s will keep the old pods alive? [13:29:27] so no errors for users, just old code? [13:29:54] yep [13:30:23] (03PS1) 10Volans: wmf-update-known-hosts-production: fix CNAMEs [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/944225 [13:30:37] !log jbond@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on config-master2001.codfw.wmnet with reason: host reimage [13:30:56] ah, that ^ by volans looks like it might fix the SSH issue I had half an hour ago ^^ [13:31:40] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-test-presto1001.eqiad.wmnet [13:31:58] claime: I don’t think any of the config changes are urgent, I’m happy to wait once the current scap finishes [13:31:59] (03CR) 10Jbond: [C: 03+1] "lgtm, we could also export the other key?" 
[debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/944225 (owner: 10Volans) [13:32:31] (03PS11) 10Herron: profile::pyrra::filesystem: add profile [puppet] - 10https://gerrit.wikimedia.org/r/929731 (https://phabricator.wikimedia.org/T302995) [13:32:41] (03PS10) 10Herron: pyrra: deploy to thanos-fe hosts [puppet] - 10https://gerrit.wikimedia.org/r/929734 (https://phabricator.wikimedia.org/T302995) [13:32:52] (03PS4) 10Herron: thanos-rule: add pyrra filesystem operator output dir to search path [puppet] - 10https://gerrit.wikimedia.org/r/930628 (https://phabricator.wikimedia.org/T302995) [13:32:53] !log volans@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [13:33:10] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host config-master1001.eqiad.wmnet with OS bookworm [13:33:10] !log jbond@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host config-master1001.eqiad.wmnet [13:33:12] I'll reduce requests anyways, that's quick and dirty but we're overcommited already so it just needs to get us over the bump of deployment [13:33:16] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: eqiad VM 1 for config-master - https://phabricator.wikimedia.org/T343212 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin1001 for host config-master1001.eqiad.wmnet with OS bookworm completed: - config-master1001 (**P... 
[13:33:18] Lucas_WMDE: glad to have proactively fixed it :D [13:33:34] I don't have time to reimage nodes to add hardware anyways [13:33:45] !log jbond@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on config-master2001.codfw.wmnet with reason: host reimage [13:33:48] At least not time enough today [13:35:10] ok, it timed out now, scap is rolling back to prior state [13:35:30] yeah, keep going, I'll batch deploy when you're done [13:35:35] ok thanks [13:35:50] It'll take more time because you'll need to wait for helm to timeout every time, sorry [13:35:55] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-presto1001.eqiad.wmnet [13:36:40] ok, it finished checking canary traffic [13:36:51] so now it’s helmfileing again [13:37:22] (03PS1) 10Clément Goubert: mediawiki: Reduce requests for deployments to go through [deployment-charts] - 10https://gerrit.wikimedia.org/r/944229 [13:37:28] aanzx, koi: I don’t think the idwikisource and simplewiktionary changes will be happening this window, sorry [13:37:39] (btmwiktionary is in progress and should finish eventually) [13:38:05] Lucas_WMDE: The "funny" thing is it'll probably work for main deployments because they have more replicas, so the pod disruption budget doesn't affect them as much [13:38:07] that's ok 0 0 [13:38:15] Lucas_WMDE: ok i will schedule it tomorrow [13:38:31] But with the very limited number of replicas in the canaries/mw-debug, it means it can't free as much resources to scale up [13:39:22] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-test-druid1001.eqiad.wmnet [13:39:50] Huh there's no PDB for mediawiki so I can't play on that [13:39:53] seven finished already [13:40:06] oh, all eight finished [13:40:14] predicted correctly ^^ [13:41:01] yeah but for the wrong reasons apparently lol [13:41:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to 
https://phabricator.wikimedia.org/P49895 and previous config saved to /var/cache/conftool/dbconfig/20230801-134111-ladsgroup.json [13:41:21] (03PS2) 10Muehlenhoff: aptrepo: Pass ports without Ferm-specific service identifiers [puppet] - 10https://gerrit.wikimedia.org/r/944211 [13:41:30] (03CR) 10Muehlenhoff: aptrepo: Pass ports without Ferm-specific service identifiers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/944211 (owner: 10Muehlenhoff) [13:43:33] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:942781|btmwiktionary: Add project logo (T343004)]] (duration: 32m 32s) [13:43:36] T343004: Change logo btm.wikt - https://phabricator.wikimedia.org/T343004 [13:44:18] ok, scap exited nonzero but I don’t see any other errors so I assume that’s for the canaries [13:44:21] claime: I’m done for now [13:44:31] Thanks, I'll try and force deploy the things [13:44:59] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/944211 (owner: 10Muehlenhoff) [13:45:02] koi: btmwiktionary logo should be deployed, except that a small number of requests (1% iiuc) will hit the outdated version on k8s (claime is on it) [13:45:11] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-druid1001.eqiad.wmnet [13:45:14] confirming 1% [13:45:35] (03PS1) 10Volans: cumin: fix installer configuration [puppet] - 10https://gerrit.wikimedia.org/r/944230 (https://phabricator.wikimedia.org/T342345) [13:45:53] !log jbond@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host config-master2001.codfw.wmnet with OS bookworm [13:45:53] !log jbond@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host config-master2001.codfw.wmnet [13:45:58] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: codfw VM 1 for config-master - https://phabricator.wikimedia.org/T343213 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by 
jbond@cumin2002 for host config-master2001.codfw.wmnet with OS bookworm completed: - config-master2001 (**P... [13:46:07] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [13:46:30] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [13:46:57] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-test-worker1001.eqiad.wmnet [13:46:59] it has changed on my side, thx [13:47:00] How can I check one of the changes on mw-debug Lucas_WMDE ? [13:47:05] or koi [13:47:22] !log volans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1002.eqiad.wmnet with reason: host reimage [13:47:28] force-reload https://btm.wiktionary.org/wiki/Wikikamus:Alaman_Utamo and see if the logo has writing under it or not, I think [13:47:32] ack [13:47:36] (03CR) 10Volans: "Reply inline, also I'll leave the release to the reviewers if that's ok as you are the ones usually managing this package." [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/944225 (owner: 10Volans) [13:47:49] yeah, when I select k8s-experimental via mwdebug I get the logo without writing again [13:47:52] Expected is no writing ? 
[13:48:00] cool, that means it worked lol [13:48:00] no, expected new state is with writing [13:48:03] Ah [13:48:07] oops [13:48:21] ok checking [13:48:28] (Wikikamus, Pustaha Siseon is what it should be) [13:48:38] (but it would be quite surprising if you got any other nonempty writing ^^) [13:49:34] !log cgoubert@deploy1002 Started scap: (no justification provided) [13:49:48] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1002.eqiad.wmnet with reason: host reimage [13:49:58] (i'm forcing the helmfile update rn) [13:50:11] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [13:50:42] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [13:50:56] ok, writing's on [13:50:56] (03CR) 10Muehlenhoff: "I have updated our docs to use the cookbook instead of the shell hack which was previously listed:" [cookbooks] - 10https://gerrit.wikimedia.org/r/924062 (owner: 10Jbond) [13:51:08] backporting my dirty changes to a proper patchset, and redeploying the rest [13:52:44] (03PS1) 10Stevemunene: idp_test: add datahub_staging as a OIDC service [puppet] - 10https://gerrit.wikimedia.org/r/944231 (https://phabricator.wikimedia.org/T305874) [13:53:45] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good. 
FWIW, I also have a patch pending for later this week, I can piggyback this change into the updated deb when my patch is merge" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/944225 (owner: 10Volans) [13:53:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T342617)', diff saved to https://phabricator.wikimedia.org/P49896 and previous config saved to /var/cache/conftool/dbconfig/20230801-135350-ladsgroup.json [13:53:53] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [13:53:54] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-worker1001.eqiad.wmnet [13:54:33] (03CR) 10Stevemunene: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/944231 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [13:54:34] !log removing dns3001 from cr2-esams and cr3-esams routing for reboot (T335835) [13:54:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:30] (03PS2) 10Clément Goubert: mediawiki: Reduce requests for canaries [deployment-charts] - 10https://gerrit.wikimedia.org/r/944229 [13:56:13] (03PS3) 10Clément Goubert: mediawiki: Reduce requests for canaries [deployment-charts] - 10https://gerrit.wikimedia.org/r/944229 [13:56:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P49897 and previous config saved to /var/cache/conftool/dbconfig/20230801-135617-ladsgroup.json [13:57:03] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:58:19] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8646 bytes in 0.263 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:02:03] (03PS1) 10Fabfur: Move ntp.esams.wikimedia.org CNAME to reboot dns3001 
[dns] - 10https://gerrit.wikimedia.org/r/944232 (https://phabricator.wikimedia.org/T335835) [14:03:25] (03CR) 10Ssingh: [C: 03+1] Move ntp.esams.wikimedia.org CNAME to reboot dns3001 [dns] - 10https://gerrit.wikimedia.org/r/944232 (https://phabricator.wikimedia.org/T335835) (owner: 10Fabfur) [14:04:10] (03CR) 10Fabfur: [C: 03+2] Move ntp.esams.wikimedia.org CNAME to reboot dns3001 [dns] - 10https://gerrit.wikimedia.org/r/944232 (https://phabricator.wikimedia.org/T335835) (owner: 10Fabfur) [14:05:45] !log running authdns-update on dns1004 to move ntp.esams to dns3002 (https://gerrit.wikimedia.org/r/c/operations/dns/+/944232) (T335835) [14:05:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:15] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1002.eqiad.wmnet with OS bookworm [14:07:29] (03CR) 10Jbond: "See comments inline (also adding simon as im out for a week after today)" [puppet] - 10https://gerrit.wikimedia.org/r/944231 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [14:07:33] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:08:37] (03CR) 10Hnowlan: [C: 03+1] mediawiki: Reduce requests for canaries [deployment-charts] - 10https://gerrit.wikimedia.org/r/944229 (owner: 10Clément Goubert) [14:08:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P49899 and previous config saved to /var/cache/conftool/dbconfig/20230801-140856-ladsgroup.json [14:09:16] (03PS1) 10Muehlenhoff: ferm::service: Fix handling of multiple ports [puppet] - 10https://gerrit.wikimedia.org/r/944233 (https://phabricator.wikimedia.org/T336497) [14:09:38] (03CR) 10CI reject: [V: 04-1] 
ferm::service: Fix handling of multiple ports [puppet] - 10https://gerrit.wikimedia.org/r/944233 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:10:03] (03CR) 10Clément Goubert: [C: 03+2] mediawiki: Reduce requests for canaries [deployment-charts] - 10https://gerrit.wikimedia.org/r/944229 (owner: 10Clément Goubert) [14:10:50] (03Merged) 10jenkins-bot: mediawiki: Reduce requests for canaries [deployment-charts] - 10https://gerrit.wikimedia.org/r/944229 (owner: 10Clément Goubert) [14:11:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T342617)', diff saved to https://phabricator.wikimedia.org/P49900 and previous config saved to /var/cache/conftool/dbconfig/20230801-141123-ladsgroup.json [14:11:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2137.codfw.wmnet with reason: Maintenance [14:11:27] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [14:11:30] (03PS2) 10Muehlenhoff: ferm::service: Fix handling of multiple ports [puppet] - 10https://gerrit.wikimedia.org/r/944233 (https://phabricator.wikimedia.org/T336497) [14:11:38] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [14:11:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2137.codfw.wmnet with reason: Maintenance [14:11:40] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [14:11:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3315 (T342617)', diff saved to https://phabricator.wikimedia.org/P49901 and previous config saved to /var/cache/conftool/dbconfig/20230801-141144-ladsgroup.json [14:13:00] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [14:13:22] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [14:13:32] !log cgoubert@deploy1002 helmfile 
[eqiad] START helmfile.d/services/mw-api-ext: apply [14:13:52] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [14:13:58] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [14:14:06] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [14:14:27] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [14:15:06] (03PS3) 10Muehlenhoff: ferm::service: Fix handling of multiple ports [puppet] - 10https://gerrit.wikimedia.org/r/944233 (https://phabricator.wikimedia.org/T336497) [14:15:30] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [14:15:40] (03CR) 10Ssingh: [C: 03+1] Version 0.4.6-4 [debs/python-logstash] - 10https://gerrit.wikimedia.org/r/944209 (https://phabricator.wikimedia.org/T342154) (owner: 10Fabfur) [14:15:51] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [14:16:01] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-tool1005.eqiad.wmnet [14:16:31] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [14:17:18] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [14:17:33] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:18:04] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [14:18:09] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [14:19:08] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [14:19:52] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [14:19:54] !log btullis@cumin1001 END 
(PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1005.eqiad.wmnet [14:20:42] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [14:20:48] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [14:21:13] (03CR) 10Volans: [C: 03+2] wmf-update-known-hosts-production: fix CNAMEs (031 comment) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/944225 (owner: 10Volans) [14:21:31] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [14:21:51] !log cgoubert@deploy1002 helmfile [codfw] [main] START helmfile.d/services/mw-jobrunner : sync [14:21:52] !log cgoubert@deploy1002 helmfile [codfw] [canary] START helmfile.d/services/mw-jobrunner : sync [14:22:05] !log cgoubert@deploy1002 helmfile [codfw] [canary] DONE helmfile.d/services/mw-jobrunner : sync [14:22:06] !log cgoubert@deploy1002 helmfile [codfw] [main] DONE helmfile.d/services/mw-jobrunner : sync [14:22:19] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-tool1011.eqiad.wmnet [14:23:04] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (FY2023/2024-Q1): tcpircbot: enable logging to #wikimedia-cloud-feed - https://phabricator.wikimedia.org/T342666 (10fnegri) I grouped `logmsgbot_cloud` to the existing `logmsgbot` account: ` 16:16 identify logmsgbot {LOGMSGBO... 
[14:24:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P49902 and previous config saved to /var/cache/conftool/dbconfig/20230801-142403-ladsgroup.json [14:24:04] !log cgoubert@deploy1002 helmfile [eqiad] [canary] START helmfile.d/services/mw-jobrunner : sync [14:24:04] !log cgoubert@deploy1002 helmfile [eqiad] [main] START helmfile.d/services/mw-jobrunner : sync [14:24:17] !log cgoubert@deploy1002 helmfile [eqiad] [main] DONE helmfile.d/services/mw-jobrunner : sync [14:24:17] !log cgoubert@deploy1002 helmfile [eqiad] [canary] DONE helmfile.d/services/mw-jobrunner : sync [14:24:35] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-misc: apply [14:25:10] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-misc: apply [14:25:16] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-misc: apply [14:25:40] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-misc: apply [14:26:03] Lucas_WMDE: Ok, all redeployed now :) [14:26:12] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1011.eqiad.wmnet [14:26:14] That should hold until we add hardware [14:26:40] claime: ok, thanks! [14:26:48] would it make sense to do a test deployment now? 
[14:27:14] (I could pick up one of the config changes if the person is still around, or do a backport of a change that would be nice-but-not-critical to get in before the next train) [14:27:34] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host dse-k8s-ctrl1001.eqiad.wmnet [14:28:02] If you want to move some backports/config changes forward, yes [14:28:34] What I did manually is what scap does on its own, but it wouldn't hurt to check end-to-end [14:29:03] ok [14:29:18] jouncebot: now [14:29:19] No deployments scheduled for the next 0 hour(s) and 30 minute(s) [14:29:24] jouncebot: next [14:29:25] In 0 hour(s) and 30 minute(s): Wikifunctions.org enablement (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230801T1500) [14:29:39] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (FY2023/2024-Q1): tcpircbot: enable logging to #wikimedia-cloud-feed - https://phabricator.wikimedia.org/T342666 (10bd808) >>! In T342666#9059267, @fnegri wrote: > I grouped `logmsgbot_cloud` to the existing `logmsgbot` account: Thank you. :) `... [14:29:43] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host cloudcontrol2008-dev.mgmt.codfw.wmnet with reboot policy FORCED [14:29:45] aanzx: still around? we could try that idwikisource change now [14:29:49] (and hope it takes less than 30 minutes) [14:30:08] (I think that actually rules out doing a backport, that wouldn’t finish in time taking CI into account) [14:30:20] (so a config change seems better) [14:30:43] or koi, are you still around? 
[14:34:23] ok, no deployment I think [14:34:40] It's ok [14:34:50] !log UTC afternoon backport+config window done (one change, then some k8s issues, which are resolved for now) [14:34:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:52] (03CR) 10Jbond: [C: 04-1] (WIP) puppetdb-microservice: update puppetdb micro service so it streams data (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/940403 (https://phabricator.wikimedia.org/T342458) (owner: 10Jbond) [14:34:58] * Lucas_WMDE done [14:35:06] Sorry for the disturbance :( [14:35:15] no problem [14:35:26] I guess James_F’s change will end up testing it then ^^ [14:35:38] *fear* [14:35:59] Oh no. [14:36:27] What might break? [14:36:35] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (FY2023/2024-Q1): Allow wmcs cookbooks running on cloudcuminXXXX to write to the SAL - https://phabricator.wikimedia.org/T325756 (10fnegri) [14:36:51] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (FY2023/2024-Q1): tcpircbot: enable logging to #wikimedia-cloud-feed - https://phabricator.wikimedia.org/T342666 (10fnegri) 05In progress→03Resolved > [14:28] ChanServ sets mode +v logmsgbot_cloud Thanks, I was about to ask you! :) [14:36:57] “Running helmfile” steps might take a long time and eventually time out (though they shouldn’t, claime fixed it for now) [14:36:58] James_F: mw-on-k8s deployments, but they shouldn't anymore, I've tricked them [14:37:05] but even if they do, scap should keep going [14:37:11] Ack. [14:37:16] worst case is that k8s hosts (1% of requests) won’t see the change [14:37:30] Thankfully Wikifunctions.org isn't solely served by k8s. [14:37:34] (well. worst expected case, I suppose) [14:37:51] But sadly it's not part of the WikimediaDebug extension's allow list yet, so we can't actually test changes before they go live. Joy. [14:38:08] ah. 
that’s annoying :/ [14:38:14] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host dse-k8s-ctrl1001.eqiad.wmnet [14:38:37] Yeah, apparently it's waiting to find someone to agree to own it now that Performance is no more. [14:39:07] PROBLEM - Check systemd state on dse-k8s-ctrl1001 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:39:09] (03CR) 10Jbond: ferm::service: Fix handling of multiple ports (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/944233 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:39:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T342617)', diff saved to https://phabricator.wikimedia.org/P49903 and previous config saved to /var/cache/conftool/dbconfig/20230801-143909-ladsgroup.json [14:39:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1200.eqiad.wmnet with reason: Maintenance [14:39:12] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [14:39:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1200.eqiad.wmnet with reason: Maintenance [14:39:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1200 (T342617)', diff saved to https://phabricator.wikimedia.org/P49904 and previous config saved to /var/cache/conftool/dbconfig/20230801-143930-ladsgroup.json [14:41:10] (03PS5) 10Jforrester: Move wikifunctions.org from locked-down to limited deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941515 (https://phabricator.wikimedia.org/T342820) [14:42:20] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: eqiad VM 1 for config-master - https://phabricator.wikimedia.org/T343212 (10jbond) 05In progress→03Resolved System built [14:42:22] 10SRE, 10Infrastructure-Foundations, 
10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Move config-master to dedicated VMs - https://phabricator.wikimedia.org/T341717 (10jbond) [14:42:35] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: codfw VM 1 for config-master - https://phabricator.wikimedia.org/T343213 (10jbond) 05In progress→03Resolved system built [14:42:40] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Move config-master to dedicated VMs - https://phabricator.wikimedia.org/T341717 (10jbond) [14:43:15] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Move config-master to dedicated VMs - https://phabricator.wikimedia.org/T341717 (10jbond) [14:46:11] (03PS1) 10Cathal Mooney: Do not compare speed of disabled interfaces when validating blocks [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/944240 (https://phabricator.wikimedia.org/T303529) [14:46:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P49905 and previous config saved to /var/cache/conftool/dbconfig/20230801-144641-ladsgroup.json [14:46:48] !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add config-master[12]001 - jbond@cumin1001 - T341717" [14:46:50] T341717: Move config-master to dedicated VMs - https://phabricator.wikimedia.org/T341717 [14:47:43] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add config-master[12]001 - jbond@cumin1001 - T341717" [14:55:22] (03PS1) 10Jbond: O:config_master: Add new role for config-master [puppet] - 10https://gerrit.wikimedia.org/r/944242 (https://phabricator.wikimedia.org/T341717) [14:55:26] (03PS1) 10Jbond: site.pp: move config-master hosts to config-master role [puppet] - 10https://gerrit.wikimedia.org/r/944243 
(https://phabricator.wikimedia.org/T341717) [14:55:41] (03CR) 10CI reject: [V: 04-1] O:config_master: Add new role for config-master [puppet] - 10https://gerrit.wikimedia.org/r/944242 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [14:56:49] (03PS2) 10Jbond: O:config_master: Add new role for config-master [puppet] - 10https://gerrit.wikimedia.org/r/944242 (https://phabricator.wikimedia.org/T341717) [14:56:51] (03PS2) 10Jbond: site.pp: move config-master hosts to config-master role [puppet] - 10https://gerrit.wikimedia.org/r/944243 (https://phabricator.wikimedia.org/T341717) [14:57:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T342617)', diff saved to https://phabricator.wikimedia.org/P49906 and previous config saved to /var/cache/conftool/dbconfig/20230801-145702-ladsgroup.json [14:57:09] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [15:00:05] James_F: Time to snap out of that daydream and deploy Wikifunctions.org enablement. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230801T1500). [15:00:10] Ack. 
[15:00:53] (03PS3) 10Jbond: O:config_master: Add new role for config-master [puppet] - 10https://gerrit.wikimedia.org/r/944242 (https://phabricator.wikimedia.org/T341717) [15:00:55] (03PS3) 10Jbond: site.pp: move config-master hosts to config-master role [puppet] - 10https://gerrit.wikimedia.org/r/944243 (https://phabricator.wikimedia.org/T341717) [15:01:18] (03CR) 10CI reject: [V: 04-1] O:config_master: Add new role for config-master [puppet] - 10https://gerrit.wikimedia.org/r/944242 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [15:01:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P49907 and previous config saved to /var/cache/conftool/dbconfig/20230801-150146-ladsgroup.json [15:01:59] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/944240 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney) [15:02:03] RECOVERY - cinder-volume process on cloudcontrol1005 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python.* /usr/bin/cinder-volume https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:02:16] <_joe_> James_F: you should be ok with k8s, but if you have issues, let me know [15:02:23] Will do! 
[15:04:04] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/944211 (owner: 10Muehlenhoff) [15:04:33] (03CR) 10Volans: [V: 03+2 C: 03+2] wmf-update-known-hosts-production: fix CNAMEs [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/944225 (owner: 10Volans) [15:04:51] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by apine@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941515 (https://phabricator.wikimedia.org/T342820) (owner: 10Jforrester) [15:05:11] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/944230 (https://phabricator.wikimedia.org/T342345) (owner: 10Volans) [15:05:23] (03CR) 10Muehlenhoff: ferm::service: Fix handling of multiple ports (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/944233 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [15:05:31] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host cloudnet2007-dev.mgmt.codfw.wmnet with reboot policy FORCED [15:05:44] (03Merged) 10jenkins-bot: Move wikifunctions.org from locked-down to limited deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941515 (https://phabricator.wikimedia.org/T342820) (owner: 10Jforrester) [15:05:52] (03CR) 10Jbond: [C: 03+2] sre.ganeti.resource-report: Add cookbook to fetch Ganeti resources (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/924062 (owner: 10Jbond) [15:06:15] !log apine@deploy1002 Started scap: Backport for [[gerrit:941515|Move wikifunctions.org from locked-down to limited deployment (T342820)]] [15:06:18] T342820: Migrate wikifunctions.org from locked-down to limited mode, letting users edit wikitext pages and some - https://phabricator.wikimedia.org/T342820 [15:06:31] PROBLEM - cinder-volume process on cloudcontrol1005 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python.* /usr/bin/cinder-volume https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting 
[15:06:39] PROBLEM - Check whether ferm is active by checking the default input chain on dse-k8s-ctrl1001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:06:41] PROBLEM - cinder-scheduler process on cloudcontrol1005 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python.* /usr/bin/cinder-scheduler https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:06:42] (03PS4) 10Jbond: O:config_master: Add new role for config-master [puppet] - 10https://gerrit.wikimedia.org/r/944242 (https://phabricator.wikimedia.org/T341717) [15:06:45] (03PS4) 10Jbond: site.pp: move config-master hosts to config-master role [puppet] - 10https://gerrit.wikimedia.org/r/944243 (https://phabricator.wikimedia.org/T341717) [15:07:25] PROBLEM - cinder-api http on cloudcontrol1005 is CRITICAL: connect to address 10.64.151.3 and port 18776: Connection refused https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:07:32] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted for dbrant - https://phabricator.wikimedia.org/T343122 (10thcipriani) >>! In T343122#9057877, @MoritzMuehlenhoff wrote: > @thcipriani This needs your signoff Approved from my side, thanks for the ping! 
[15:07:54] !log apine@deploy1002 jforrester and apine: Backport for [[gerrit:941515|Move wikifunctions.org from locked-down to limited deployment (T342820)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [15:08:10] !log apine@deploy1002 jforrester and apine: Continuing with sync [15:11:35] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2008-dev.mgmt.codfw.wmnet with reboot policy FORCED [15:12:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P49908 and previous config saved to /var/cache/conftool/dbconfig/20230801-151208-ladsgroup.json [15:14:00] !log apine@deploy1002 Finished scap: Backport for [[gerrit:941515|Move wikifunctions.org from locked-down to limited deployment (T342820)]] (duration: 07m 45s) [15:14:03] T342820: Migrate wikifunctions.org from locked-down to limited mode, letting users edit wikitext pages and some - https://phabricator.wikimedia.org/T342820 [15:14:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T342617)', diff saved to https://phabricator.wikimedia.org/P49909 and previous config saved to /var/cache/conftool/dbconfig/20230801-151427-ladsgroup.json [15:14:30] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [15:14:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:15:07] !log bounce ferm on dse-k8s-ctrl1001 [15:15:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:11] RECOVERY - Check 
systemd state on dse-k8s-ctrl1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:16:27] Well done Lucas_WMDE for getting the first edit [15:16:32] :D [15:16:36] (03CR) 10Volans: [C: 03+2] cumin: fix installer configuration [puppet] - 10https://gerrit.wikimedia.org/r/944230 (https://phabricator.wikimedia.org/T342345) (owner: 10Volans) [15:16:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P49910 and previous config saved to /var/cache/conftool/dbconfig/20230801-151650-ladsgroup.json [15:16:53] looks like I managed to get lucky with an mw server before the scap had fully finished ^^ [15:17:06] Yeah. [15:17:29] 10SRE, 10Infrastructure-Foundations: Integrate Bookworm 12.1 point update - https://phabricator.wikimedia.org/T343121 (10MoritzMuehlenhoff) [15:17:29] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host cloudnet2008-dev.mgmt.codfw.wmnet with reboot policy FORCED [15:18:10] James_F: is that site a SUL one? [15:18:23] hauskater: Yes. [15:18:38] hmm, guess firefox messed with my cookies again then [15:19:01] hauskater: New domain, if you haven't logged into prod since Wednesday you'll need to re-do it. [15:19:25] hauskater: It's not often a new wiki domain is added; last one before last week was a little thing call wikidata.org in 2012, after all. :-) [15:19:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:20:05] James_F: nah, I always log-out after I finish for the day [15:20:07] James_F: wasn’t wikivoyage a bit later? 
(Jan 2013 says enwiki) [15:20:16] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (FY2023/2024-Q1): Allow wmcs cookbooks running on cloudcuminXXXX to write to the SAL - https://phabricator.wikimedia.org/T325756 (10fnegri) > we can try to reuse the same logger, but configure the destination host (to be w... [15:20:26] not exactly a new project but a new TLD nonetheless [15:20:27] Lucas_WMDE: Officially beforehand, but yes, the SULification was perhaps later? I forget. [15:20:28] I guess the vpn is blocked so account auto-creation is banned [15:22:00] bingo [15:22:32] hauskater: Tell meta. :-( [15:25:39] (03PS1) 10Giuseppe Lavagetto: mediawiki::wanrouter_cache: add wikifunctions [puppet] - 10https://gerrit.wikimedia.org/r/944247 (https://phabricator.wikimedia.org/T297815) [15:25:41] (03PS1) 10Giuseppe Lavagetto: mediawiki::wancache: add the wikifunctions pools and routes [puppet] - 10https://gerrit.wikimedia.org/r/944248 (https://phabricator.wikimedia.org/T297815) [15:26:00] (03CR) 10Btullis: idp_test: add datahub_staging as a OIDC service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/944231 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [15:26:11] <_joe_> James_F: ^^ server side configuration for the wikifunctions memcached [15:26:31] (03PS1) 10Volans: Revert "validators: temporary support for esams->knams" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/944192 [15:26:36] _joe_: Nice. [15:26:39] James_F: It's probable I made the block myself when trying to stop spambots heh :) [15:26:39] <_joe_> although I still have big doubts about accessing it from two different applications to read/write the same data and use it as a means of communication of sorts [15:26:47] <_joe_> we cna iron that out later [15:26:53] Yeah, not an issue for months. 
[15:27:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P49911 and previous config saved to /var/cache/conftool/dbconfig/20230801-152714-ladsgroup.json [15:27:23] <_joe_> James_F: well it is being written from one side and read from the other right now, or did I misunderstand the diagram? [15:27:37] <_joe_> it would be [15:27:40] No, that diagram is future state. Right now it's only being read and written from MW. [15:27:45] <_joe_> ok [15:27:51] <_joe_> so not an issue at all for now [15:28:02] <_joe_> we can reevaluate what's best once we get around there [15:28:13] (03CR) 10CI reject: [V: 04-1] mediawiki::wancache: add the wikifunctions pools and routes [puppet] - 10https://gerrit.wikimedia.org/r/944248 (https://phabricator.wikimedia.org/T297815) (owner: 10Giuseppe Lavagetto) [15:28:17] +1 [15:28:19] RECOVERY - cinder-api http on cloudcontrol1005 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 663 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:28:20] <_joe_> ok so, with the two patches above, that I will have to review tomorrow [15:28:30] <_joe_> and one mediawiki-config patch, we should be good to go [15:29:05] RECOVERY - cinder-scheduler process on cloudcontrol1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python.* /usr/bin/cinder-scheduler https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:29:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P49912 and previous config saved to /var/cache/conftool/dbconfig/20230801-152933-ladsgroup.json [15:31:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1196 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P49913 and previous config saved to /var/cache/conftool/dbconfig/20230801-153155-ladsgroup.json 
[15:32:36] 10SRE, 10API Platform, 10Anti-Harassment, 10Content-Transform-Team, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10BTullis) [15:33:55] PROBLEM - Host mw2431 is DOWN: PING CRITICAL - Packet loss = 100% [15:34:34] ^ expected? [15:36:07] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [debs/python-logstash] - 10https://gerrit.wikimedia.org/r/944209 (https://phabricator.wikimedia.org/T342154) (owner: 10Fabfur) [15:36:12] (03PS2) 10Giuseppe Lavagetto: mediawiki::wancache: add the wikifunctions pools and routes [puppet] - 10https://gerrit.wikimedia.org/r/944248 (https://phabricator.wikimedia.org/T297815) [15:36:19] <_joe_> sukhe: abs not [15:37:02] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudnet2007-dev.mgmt.codfw.wmnet with reboot policy FORCED [15:37:09] RECOVERY - Check whether ferm is active by checking the default input chain on dse-k8s-ctrl1001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:37:34] _joe_: ok [15:37:41] (in a meeting but wanted to flag it) [15:38:00] mw2431 seems completely borked and needs a dc ops task, I can't even connect to the serial console [15:38:14] <_joe_> ouch [15:38:28] papaul, JennH ^ [15:38:29] moritzm: can take a look i am on site [15:38:43] (03CR) 10CI reject: [V: 04-1] mediawiki::wancache: add the wikifunctions pools and routes [puppet] - 10https://gerrit.wikimedia.org/r/944248 (https://phabricator.wikimedia.org/T297815) (owner: 10Giuseppe Lavagetto) [15:38:58] papaul: ack, that would be good, I can also file a task if that's easier [15:39:43] moritzm: no need let me us check if that server is in the rack that i am working in maybe i touched the network cable while doing other stuffs [15:39:47] i am wokring in b6 [15:39:51] looking now [15:39:54] ack [15:40:19] moritzm: yes that server is in b6 [15:40:23] checking console now [15:41:45] moritzm: console is working 
[15:42:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T342617)', diff saved to https://phabricator.wikimedia.org/P49914 and previous config saved to /var/cache/conftool/dbconfig/20230801-154220-ladsgroup.json [15:42:22] 10SRE, 10Infrastructure-Foundations: Integrate Bookworm 12.1 point update - https://phabricator.wikimedia.org/T343121 (10MoritzMuehlenhoff) [15:42:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2157.codfw.wmnet with reason: Maintenance [15:42:24] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [15:42:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2157.codfw.wmnet with reason: Maintenance [15:42:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2157 (T342617)', diff saved to https://phabricator.wikimedia.org/P49915 and previous config saved to /var/cache/conftool/dbconfig/20230801-154242-ladsgroup.json [15:44:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P49916 and previous config saved to /var/cache/conftool/dbconfig/20230801-154439-ladsgroup.json [15:44:59] (03PS3) 10Giuseppe Lavagetto: mediawiki::wancache: add the wikifunctions pools and routes [puppet] - 10https://gerrit.wikimedia.org/r/944248 (https://phabricator.wikimedia.org/T297815) [15:45:25] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudnet2008-dev.mgmt.codfw.wmnet with reboot policy FORCED [15:46:18] papaul: I can connect to the serial console now, but dmesg tells me the link of the server itself is down, maybe some dislocated cable? 
[15:46:30] 10SRE, 10ops-codfw, 10DC-Ops, 10User-aborrero, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol200[6-8]-dev, cloudnet200[7-8]-dev - https://phabricator.wikimedia.org/T342456 (10Papaul) [15:47:23] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42742/console" [puppet] - 10https://gerrit.wikimedia.org/r/944248 (https://phabricator.wikimedia.org/T297815) (owner: 10Giuseppe Lavagetto) [15:47:33] moritzm: ok checking cable [15:47:43] (03CR) 10Hnowlan: [C: 03+2] images: enable "debug" on memcache, log when servers are dead [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/941901 (https://phabricator.wikimedia.org/T341805) (owner: 10Hnowlan) [15:48:27] RECOVERY - Host mw2431 is UP: PING OK - Packet loss = 0%, RTA = 31.71 ms [15:48:38] moritzm: ^ [15:49:05] papaul: thanks [15:49:14] moritzm: you welcome [15:49:29] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host cloudvirt2004-dev.mgmt.codfw.wmnet with reboot policy FORCED [15:52:20] (03CR) 10Jbond: [C: 03+2] O:config_master: Add new role for config-master [puppet] - 10https://gerrit.wikimedia.org/r/944242 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [15:52:23] (03CR) 10Jbond: [C: 03+2] site.pp: move config-master hosts to config-master role [puppet] - 10https://gerrit.wikimedia.org/r/944243 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [15:52:28] 10SRE-tools, 10Infrastructure-Foundations, 10Goal, 10cloud-services-team (FY2023/2024-Q1): cloudcumin: decide sudoers rules for users without global root - https://phabricator.wikimedia.org/T325067 (10fnegri) [15:52:49] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host cloudvirt2005-dev.mgmt.codfw.wmnet with reboot policy FORCED [15:56:16] (03Merged) 10jenkins-bot: images: enable "debug" on memcache, log when servers are dead [software/thumbor-plugins] - 
10https://gerrit.wikimedia.org/r/941901 (https://phabricator.wikimedia.org/T341805) (owner: 10Hnowlan) [15:59:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T342617)', diff saved to https://phabricator.wikimedia.org/P49917 and previous config saved to /var/cache/conftool/dbconfig/20230801-155945-ladsgroup.json [15:59:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1210.eqiad.wmnet with reason: Maintenance [15:59:50] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [16:00:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1210.eqiad.wmnet with reason: Maintenance [16:00:05] jbond and rzl: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230801T1600). [16:00:05] lucaswerkmeister and Lucas_WMDE: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[16:00:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1210 (T342617)', diff saved to https://phabricator.wikimedia.org/P49918 and previous config saved to /var/cache/conftool/dbconfig/20230801-160006-ladsgroup.json [16:00:23] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10cloud-services-team (FY2023/2024-Q1): [spicerack] support including {project} in SAL messages - https://phabricator.wikimedia.org/T341793 (10fnegri) 05In progress→03Resolved [16:00:26] * jbond looking [16:00:28] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (FY2023/2024-Q1): Allow wmcs cookbooks running on cloudcuminXXXX to write to the SAL - https://phabricator.wikimedia.org/T325756 (10fnegri) [16:01:59] lucaswerkmeister yu will need to get a +1 from someone in toolforge before i can merge [16:04:55] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:05:33] (03PS1) 10Jbond: config_master: add server_aliases: [puppet] - 10https://gerrit.wikimedia.org/r/944253 (https://phabricator.wikimedia.org/T341717) [16:05:44] (03PS1) 10Muehlenhoff: graphite: Pass port without Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/944254 (https://phabricator.wikimedia.org/T336497) [16:06:59] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply [16:07:14] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [16:07:47] 10SRE, 10Infrastructure-Foundations: Integrate Bookworm 12.1 point update - https://phabricator.wikimedia.org/T343121 (10MoritzMuehlenhoff) [16:11:07] (03PS1) 10Elukey: admin_ng: increase resources allocated for knative pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/944256 [16:15:15] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:17:05] (03CR) 10Elukey: [C: 03+2] admin_ng: increase resources allocated for knative pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/944256 (owner: 10Elukey) [16:20:58] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [16:21:22] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [16:22:18] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [16:22:50] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [16:23:20] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [16:23:24] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db21[88-95] - https://phabricator.wikimedia.org/T342174 (10Jhancock.wm) [16:23:43] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [16:25:35] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install lists2001.codfw.wmnet - https://phabricator.wikimedia.org/T342375 (10Jhancock.wm) [16:25:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T342617)', diff saved to https://phabricator.wikimedia.org/P49919 and previous config saved to /var/cache/conftool/dbconfig/20230801-162541-ladsgroup.json [16:25:44] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [16:26:56] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/944192 (owner: 10Volans) [16:27:06] 10SRE, 10ops-codfw, 10DC-Ops, 10observability: Q1:rack/setup/install titan200[12] - https://phabricator.wikimedia.org/T342300 (10Jhancock.wm) [16:28:54] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install pc201[56] - https://phabricator.wikimedia.org/T342163 (10Jhancock.wm) [16:31:26] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42744/console" [puppet] - 10https://gerrit.wikimedia.org/r/944253 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [16:31:53] (03CR) 10Jbond: [V: 03+1 C: 03+2] config_master: add server_aliases: [puppet] - 10https://gerrit.wikimedia.org/r/944253 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [16:35:01] (03PS1) 10Jbond: config-master: we are in yaml now not puppet [puppet] - 10https://gerrit.wikimedia.org/r/944260 (https://phabricator.wikimedia.org/T341717) [16:35:13] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt2004-dev.mgmt.codfw.wmnet with reboot policy FORCED [16:35:16] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt2005-dev.mgmt.codfw.wmnet with reboot policy FORCED [16:38:57] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host cloudvirt2006-dev.mgmt.codfw.wmnet with reboot policy FORCED [16:40:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P49920 and previous config saved to /var/cache/conftool/dbconfig/20230801-164047-ladsgroup.json [16:41:13] (03CR) 10Jbond: [C: 03+2] config-master: we are in yaml now not puppet [puppet] - 10https://gerrit.wikimedia.org/r/944260 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [16:41:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:42:05] 10SRE, 10Traffic, 10observability: HAProxy metrics go down on config reload - 
https://phabricator.wikimedia.org/T343000 (10Vgutierrez) 05Stalled→03In progress getting rid of KA didn't help a lot per https://grafana.wikimedia.org/goto/JcVQsuqVk?orgId=1: {F37158713} any suggestions @fgiunchedi on how to p... [16:45:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T342617)', diff saved to https://phabricator.wikimedia.org/P49921 and previous config saved to /var/cache/conftool/dbconfig/20230801-164550-ladsgroup.json [16:45:57] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [16:46:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:48:19] (03PS1) 10Jbond: O:config_master: use cfssl for tls [puppet] - 10https://gerrit.wikimedia.org/r/944263 [16:49:19] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:55:34] (03CR) 10Jbond: [C: 03+2] O:config_master: use cfssl for tls [puppet] - 10https://gerrit.wikimedia.org/r/944263 (owner: 10Jbond) [16:55:40] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for darthmon_wmde - https://phabricator.wikimedia.org/T342968 (10KFrancis) Hi all, I am confirming there in an NDA on file. Please proceed with the access request. Thanks! 
[16:55:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P49922 and previous config saved to /var/cache/conftool/dbconfig/20230801-165553-ladsgroup.json [16:56:57] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for adee_wmde - https://phabricator.wikimedia.org/T342969 (10KFrancis) No worries! :-) [16:57:59] (03PS1) 10Jbond: config-master: add profile::discovery variables [puppet] - 10https://gerrit.wikimedia.org/r/944264 (https://phabricator.wikimedia.org/T341717) [16:58:34] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for roti_WMDE - https://phabricator.wikimedia.org/T342972 (10KFrancis) Hi all, I am confirming an NDA is on file for Robert Timm. Thanks! [16:59:15] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for lojo_wmde - https://phabricator.wikimedia.org/T342973 (10KFrancis) NDA is confirmed. Thanks! [16:59:51] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42745/console" [puppet] - 10https://gerrit.wikimedia.org/r/944264 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [17:00:04] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230801T1700) [17:00:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P49923 and previous config saved to /var/cache/conftool/dbconfig/20230801-170057-ladsgroup.json [17:01:11] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:01:15] (03CR) 10Jdlrobson: "Oh I see what happened here. 
ext.echo.styles.badge" [extensions/Echo] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/943605 (https://phabricator.wikimedia.org/T335273) (owner: 10Urbanecm) [17:03:18] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:05:19] !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@ee544cb]: Update kartotherian to e28ea7ef (T334668 T332985 T332664 T329924) [17:05:28] T329924: Update kartotherian to use mapdata 0.9.0 (external data is expanded in-place) - https://phabricator.wikimedia.org/T329924 [17:05:28] T332664: Kartotherian "Cannot read property 'coordinates' of null" - https://phabricator.wikimedia.org/T332664 [17:05:28] T334668: Host sprites and glyphs in kartotherian for Android WebGL map - https://phabricator.wikimedia.org/T334668 [17:05:29] T332985: Reduce kartotherian empty group logspam caused by Wikivoyage - https://phabricator.wikimedia.org/T332985 [17:07:50] (03CR) 10Jbond: [V: 03+1 C: 03+2] config-master: add profile::discovery variables [puppet] - 10https://gerrit.wikimedia.org/r/944264 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [17:08:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:09:44] !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@ee544cb]: Update kartotherian to e28ea7ef (T334668 T332985 T332664 T329924) (duration: 04m 25s) [17:11:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T342617)', diff saved to https://phabricator.wikimedia.org/P49924 and previous config saved to 
/var/cache/conftool/dbconfig/20230801-171059-ladsgroup.json [17:11:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance [17:11:03] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [17:11:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance [17:11:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3315 (T342617)', diff saved to https://phabricator.wikimedia.org/P49925 and previous config saved to /var/cache/conftool/dbconfig/20230801-171120-ladsgroup.json [17:11:23] (03PS1) 10Jbond: config_master: ad conftool parameters [puppet] - 10https://gerrit.wikimedia.org/r/944265 (https://phabricator.wikimedia.org/T341717) [17:14:15] 10SRE, 10SRE-Access-Requests: Requesting access to Wiki Replicas end-to-end tiers for dr0ptp4kt - https://phabricator.wikimedia.org/T343039 (10dr0ptp4kt) Hi @fgiunchedi, using the config at https://wikitech.wikimedia.org/wiki/SRE/Production_access#Setting_up_your_SSH_config (after pinning the confirmed fingerp... 
[17:15:17] 10SRE, 10SRE-Access-Requests: Requesting access to Wiki Replicas end-to-end tiers for dr0ptp4kt - https://phabricator.wikimedia.org/T343039 (10dr0ptp4kt) [17:16:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P49926 and previous config saved to /var/cache/conftool/dbconfig/20230801-171603-ladsgroup.json [17:17:31] (03CR) 10Jbond: ferm::service: Fix handling of multiple ports (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/944233 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [17:18:11] (03PS1) 10Fabfur: Remove dns3001 for reboot [puppet] - 10https://gerrit.wikimedia.org/r/944286 (https://phabricator.wikimedia.org/T335835) [17:19:35] (03CR) 10Jbond: idp_test: add datahub_staging as a OIDC service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/944231 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [17:20:56] (03CR) 10Jbond: [C: 03+2] config_master: ad conftool parameters [puppet] - 10https://gerrit.wikimedia.org/r/944265 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [17:21:41] (03PS2) 10Fabfur: Temporary depool dns3001 [puppet] - 10https://gerrit.wikimedia.org/r/944286 (https://phabricator.wikimedia.org/T335835) [17:24:27] (03CR) 10Ssingh: [C: 03+1] Temporary depool dns3001 [puppet] - 10https://gerrit.wikimedia.org/r/944286 (https://phabricator.wikimedia.org/T335835) (owner: 10Fabfur) [17:24:50] (03PS1) 10Herron: wip [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/944287 [17:25:40] (03CR) 10Fabfur: [C: 03+2] Temporary depool dns3001 [puppet] - 10https://gerrit.wikimedia.org/r/944286 (https://phabricator.wikimedia.org/T335835) (owner: 10Fabfur) [17:26:02] (03PS1) 10Jbond: O:config_master: add httpd profile [puppet] - 10https://gerrit.wikimedia.org/r/944288 (https://phabricator.wikimedia.org/T341717) [17:26:45] !log running puppet on 'A:cumin or A:dns-rec or A:netbox' 
(https://gerrit.wikimedia.org/r/c/operations/puppet/+/944286) (T335835) [17:26:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:49] (03CR) 10Jbond: [C: 03+2] O:config_master: add httpd profile [puppet] - 10https://gerrit.wikimedia.org/r/944288 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [17:29:54] (03PS1) 10Andrea Denisse: pontoon: Apply the 'alerting_host' role to the pontoon-alerting-host-01 host [puppet] - 10https://gerrit.wikimedia.org/r/944289 (https://phabricator.wikimedia.org/T333615) [17:31:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T342617)', diff saved to https://phabricator.wikimedia.org/P49927 and previous config saved to /var/cache/conftool/dbconfig/20230801-173109-ladsgroup.json [17:31:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1213.eqiad.wmnet with reason: Maintenance [17:31:13] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [17:31:22] (03CR) 10Muehlenhoff: ferm::service: Fix handling of multiple ports (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/944233 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [17:31:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1213.eqiad.wmnet with reason: Maintenance [17:31:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1213:3315 (T342617)', diff saved to https://phabricator.wikimedia.org/P49928 and previous config saved to /var/cache/conftool/dbconfig/20230801-173130-ladsgroup.json [17:31:33] (03PS4) 10Muehlenhoff: ferm::service: Fix handling of multiple ports [puppet] - 10https://gerrit.wikimedia.org/r/944233 (https://phabricator.wikimedia.org/T336497) [17:33:49] 10SRE, 10serviceops, 10wikitech.wikimedia.org: Install php-ldap on all MW appservers - https://phabricator.wikimedia.org/T237889 (10bd808) 05Open→03Declined 
Closing in favor of {T292707} as it makes little sense at this point to consider putting Wikitech into legacy production hosting. [17:34:47] (ConfdResourceFailed) firing: (223) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [17:35:24] BGP alerts in esams expected [17:35:47] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:36:29] !log stopped bird and disable puppet on dns3001 (T335835) [17:36:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:57] (03CR) 10Btullis: [C: 03+1] airflow-wmde: Add a postgresql database and user for airflow wmde [puppet] - 10https://gerrit.wikimedia.org/r/940961 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [17:37:23] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host dns3001.wikimedia.org [17:39:26] 10ops-codfw, 10DBA: codfw: es2025 lost System Board Fan6 - https://phabricator.wikimedia.org/T343254 (10Papaul) [17:39:37] PROBLEM - BGP status on cr3-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:39:44] 10ops-codfw, 10DBA: codfw: es2025 lost System Board Fan6 - https://phabricator.wikimedia.org/T343254 (10Papaul) p:05Triage→03Medium [17:39:47] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [17:41:32] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dns3001.wikimedia.org [17:41:33] PROBLEM - Host dns3001 
is DOWN: PING CRITICAL - Packet loss = 100% [17:41:35] RECOVERY - Host dns3001 is UP: PING OK - Packet loss = 0%, RTA = 94.09 ms [17:41:37] sigh [17:41:50] definitely a weird race condition with the reboot-single cookbook here [17:42:00] or the command it calls [17:42:41] !log started bird and enabled puppet on dns3001 (T335835) [17:42:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:12] FYI, dns3001 is depooled from authdns_servers so nothing to worry [17:43:20] fwiw? fyi? [17:43:59] PROBLEM - Check if anycast-healthchecker and all configured threads are running on dns3001 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [17:44:20] yeah all good [17:44:25] (03CR) 10Herron: pontoon: Apply the 'alerting_host' role to the pontoon-alerting-host-01 host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/944289 (https://phabricator.wikimedia.org/T333615) (owner: 10Andrea Denisse) [17:45:18] (03PS1) 10Fabfur: Revert "Temporary depool dns3001" [puppet] - 10https://gerrit.wikimedia.org/r/944295 [17:45:27] RECOVERY - Check if anycast-healthchecker and all configured threads are running on dns3001 is OK: OK: UP (pid=3795) and all threads (2) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [17:45:31] RECOVERY - BGP status on cr3-esams is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:46:43] (03CR) 10Btullis: [C: 04-1] airflow-wmde: Create scap deployment source for wmde (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/940939 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [17:46:50] (03CR) 10Ssingh: [C: 03+1] Revert "Temporary depool dns3001" [puppet] - 10https://gerrit.wikimedia.org/r/944295 (owner: 10Fabfur) [17:47:30] 
(03CR) 10Fabfur: [C: 03+2] Revert "Temporary depool dns3001" [puppet] - 10https://gerrit.wikimedia.org/r/944295 (owner: 10Fabfur) [17:48:07] 10SRE, 10ops-codfw, 10DBA: codfw: es2025 lost System Board Fan6 - https://phabricator.wikimedia.org/T343254 (10Marostegui) I will do that tomorrow and leave it ready for you to check [17:48:52] !log running puppet on 'A:cumin or A:dns-rec or A:netbox' (https://gerrit.wikimedia.org/r/c/operations/puppet/+/944286) (T335835) [17:48:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:13] (03PS1) 10Fabfur: Revert "Move ntp.esams.wikimedia.org CNAME to reboot dns3001" [dns] - 10https://gerrit.wikimedia.org/r/944297 [17:53:35] (03CR) 10Ssingh: [C: 03+1] Revert "Move ntp.esams.wikimedia.org CNAME to reboot dns3001" [dns] - 10https://gerrit.wikimedia.org/r/944297 (owner: 10Fabfur) [17:54:16] (03CR) 10Fabfur: [C: 03+2] Revert "Move ntp.esams.wikimedia.org CNAME to reboot dns3001" [dns] - 10https://gerrit.wikimedia.org/r/944297 (owner: 10Fabfur) [17:54:20] (03CR) 10Andrea Denisse: pontoon: Apply the 'alerting_host' role to the pontoon-alerting-host-01 host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/944289 (https://phabricator.wikimedia.org/T333615) (owner: 10Andrea Denisse) [17:55:39] !log running authdns-update on dns1004 to revert ntp.esams to dns3001 (T335835) [17:55:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:55] (03CR) 10Btullis: "I think that there's one other issue, which is that the `analytics-wmde` user to whom the keytabs belong is only created by the statistics" [puppet] - 10https://gerrit.wikimedia.org/r/940938 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [17:56:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T342617)', diff saved to https://phabricator.wikimedia.org/P49929 and previous config saved to /var/cache/conftool/dbconfig/20230801-175641-ladsgroup.json 
[17:56:44] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [17:59:32] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:59:40] PROBLEM - Check systemd state on config-master1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:59:49] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirt2006-dev.mgmt.codfw.wmnet with reboot policy FORCED [18:00:05] dancy and jnuche: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Utc-7+Utc-0 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230801T1800). [18:00:36] (03PS1) 10Jbond: config_master: add docs [puppet] - 10https://gerrit.wikimedia.org/r/944299 (https://phabricator.wikimedia.org/T341717) [18:00:38] (03PS1) 10Jbond: configmaster: add support to proxy the puppet sha1 files [puppet] - 10https://gerrit.wikimedia.org/r/944300 (https://phabricator.wikimedia.org/T336497) [18:04:46] PROBLEM - config-master.wikimedia.org requires authentication on config-master1001 is CRITICAL: connect to address 10.64.0.110 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [18:05:10] !log adding dns3001 on cr2-esams and cr3-esams routing for ns2 (T335835) [18:05:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:19] o/ [18:06:40] PROBLEM - config-master.wikimedia.org tls expiry on config-master1001 is CRITICAL: connect to address 10.64.0.110 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [18:07:03] (03PS1) 10TrainBranchBot: group0 wikis to 1.41.0-wmf.20 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/944302 (https://phabricator.wikimedia.org/T340248) [18:07:05] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.41.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944302 (https://phabricator.wikimedia.org/T340248) (owner: 10TrainBranchBot) [18:07:47] (03Merged) 10jenkins-bot: group0 wikis to 1.41.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944302 (https://phabricator.wikimedia.org/T340248) (owner: 10TrainBranchBot) [18:10:24] PROBLEM - config-master.wikimedia.org requires authentication on config-master2001 is CRITICAL: connect to address 10.192.0.15 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [18:10:47] jbond: are you already looking at the config-master alerts? [18:10:51] ^ is this known/something we should do something? [18:11:25] asking j.bond just because of the recent puppet patches that look relevant, I haven't started properly digging [18:11:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P49930 and previous config saved to /var/cache/conftool/dbconfig/20230801-181147-ladsgroup.json [18:12:18] PROBLEM - config-master.wikimedia.org tls expiry on config-master2001 is CRITICAL: connect to address 10.192.0.15 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [18:12:43] I am guessing the cfssl switch might be it [18:12:58] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt200[4-6]-dev - https://phabricator.wikimedia.org/T342459 (10Papaul) [18:15:00] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcontrol2006-dev'] [18:15:27] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcontrol2007-dev'] [18:15:28] !log dancy@deploy1002 rebuilt and 
synchronized wikiversions files: group0 wikis to 1.41.0-wmf.20 refs T340248 [18:15:31] T340248: 1.41.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T340248 [18:16:36] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcontrol2006-dev'] [18:16:54] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcontrol2007-dev'] [18:17:31] (03PS2) 10Jbond: configmaster: add support to proxy the puppet sha1 files [puppet] - 10https://gerrit.wikimedia.org/r/944300 (https://phabricator.wikimedia.org/T336497) [18:17:58] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudnet2008-dev'] [18:18:41] rzl: sorry, missed the ping; yes, these are not production yet, I'll add a silence [18:18:56] sorry for the noise [18:19:00] jbond: <3 [18:19:20] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42747/console" [puppet] - 10https://gerrit.wikimedia.org/r/944300 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond) [18:20:02] rad, thank you! 
[18:21:00] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudnet2007-dev'] [18:21:42] * jbond done [18:26:25] (03PS3) 10Jbond: configmaster: add support to proxy the puppet sha1 files [puppet] - 10https://gerrit.wikimedia.org/r/944300 (https://phabricator.wikimedia.org/T336497) [18:26:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P49931 and previous config saved to /var/cache/conftool/dbconfig/20230801-182653-ladsgroup.json [18:28:11] (03CR) 10Jbond: [C: 03+2] config_master: add docs [puppet] - 10https://gerrit.wikimedia.org/r/944299 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [18:28:14] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42748/console" [puppet] - 10https://gerrit.wikimedia.org/r/944300 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond) [18:29:04] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudnet2008-dev'] [18:29:11] (03CR) 10Jbond: [V: 03+1 C: 03+2] configmaster: add support to proxy the puppet sha1 files [puppet] - 10https://gerrit.wikimedia.org/r/944300 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond) [18:30:15] (03PS1) 10Jforrester: wikifunctions: Bump to image without stupendous output logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/944304 (https://phabricator.wikimedia.org/T343176) [18:30:20] (03CR) 10Herron: [C: 03+1] pontoon: Apply the 'alerting_host' role to the pontoon-alerting-host-01 host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/944289 (https://phabricator.wikimedia.org/T333615) (owner: 10Andrea Denisse) [18:30:49] jouncebot: nowandnext [18:30:49] For the next 1 hour(s) and 29 minute(s): MediaWiki train - Utc-7+Utc-0 Version 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230801T1800) [18:30:49] In 1 hour(s) and 29 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230801T2000) [18:31:13] dancy: Can I sling out a service update for Wikifunctions? [18:31:30] Yep! [18:31:50] Okie. [18:31:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315 (T342617)', diff saved to https://phabricator.wikimedia.org/P49932 and previous config saved to /var/cache/conftool/dbconfig/20230801-183151-ladsgroup.json [18:31:55] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [18:32:02] (03CR) 10Jforrester: [C: 03+2] wikifunctions: Bump to image without stupendous output logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/944304 (https://phabricator.wikimedia.org/T343176) (owner: 10Jforrester) [18:32:50] stupendous.. haha [18:32:51] (03Merged) 10jenkins-bot: wikifunctions: Bump to image without stupendous output logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/944304 (https://phabricator.wikimedia.org/T343176) (owner: 10Jforrester) [18:33:04] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [18:33:07] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [18:33:17] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [18:33:20] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [18:35:51] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [18:36:29] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [18:36:37] !log jforrester@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [18:37:49] !log jforrester@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: 
apply [18:37:51] !log jforrester@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [18:37:52] 10SRE, 10API Platform, 10Anti-Harassment, 10Content-Transform-Team, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10apaskulin) [18:38:21] 10SRE, 10API Platform, 10Anti-Harassment, 10Content-Transform-Team, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10apaskulin) [18:39:43] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudnet2007-dev'] [18:39:53] 10SRE, 10API Platform, 10Anti-Harassment, 10Content-Transform-Team, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10apaskulin) [18:39:58] !log jforrester@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [18:40:20] 10SRE, 10ops-codfw, 10DC-Ops, 10User-aborrero, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol200[6-8]-dev, cloudnet200[7-8]-dev - https://phabricator.wikimedia.org/T342456 (10Papaul) [18:40:46] 10SRE, 10ops-codfw, 10DBA: codfw: es2025 lost System Board Fan6 - https://phabricator.wikimedia.org/T343254 (10Ladsgroup) I'm around if you want me to do it. 
[18:42:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T342617)', diff saved to https://phabricator.wikimedia.org/P49933 and previous config saved to /var/cache/conftool/dbconfig/20230801-184159-ladsgroup.json [18:42:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2178.codfw.wmnet with reason: Maintenance [18:42:03] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [18:42:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2178.codfw.wmnet with reason: Maintenance [18:42:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2178 (T342617)', diff saved to https://phabricator.wikimedia.org/P49934 and previous config saved to /var/cache/conftool/dbconfig/20230801-184220-ladsgroup.json [18:46:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315', diff saved to https://phabricator.wikimedia.org/P49935 and previous config saved to /var/cache/conftool/dbconfig/20230801-184657-ladsgroup.json [18:50:18] (03PS1) 10Cwhite: Revert "logstash remove wikifunctions response field" [puppet] - 10https://gerrit.wikimedia.org/r/944194 (https://phabricator.wikimedia.org/T343176) [18:51:53] (03PS1) 10Papaul: Add new cloud nodes to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/944306 (https://phabricator.wikimedia.org/T342456) [18:52:16] (03CR) 10CI reject: [V: 04-1] Add new cloud nodes to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/944306 (https://phabricator.wikimedia.org/T342456) (owner: 10Papaul) [18:53:34] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10phaultfinder) [18:55:09] (03PS2) 10Papaul: Add new cloud nodes to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/944306 (https://phabricator.wikimedia.org/T342456) [18:55:17] (03PS3) 10Papaul: 
Add new cloud nodes to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/944306 (https://phabricator.wikimedia.org/T342456) [18:56:22] (03CR) 10Papaul: [C: 03+2] Add new cloud nodes to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/944306 (https://phabricator.wikimedia.org/T342456) (owner: 10Papaul) [18:56:50] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2006-dev.codfw.wmnet with OS bullseye [18:57:02] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, and 2 others: Q1:rack/setup/install cloudcontrol200[6-8]-dev, cloudnet200[7-8]-dev - https://phabricator.wikimedia.org/T342456 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudcontrol2006-dev.codfw.wm... [19:01:45] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudnet2007-dev'] [19:02:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315', diff saved to https://phabricator.wikimedia.org/P49936 and previous config saved to /var/cache/conftool/dbconfig/20230801-190203-ladsgroup.json [19:05:03] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcontrol2006-dev'] [19:07:39] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudnet2007-dev'] [19:10:38] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcontrol2006-dev'] [19:11:23] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcontrol2007-dev'] [19:11:46] (03CR) 10BCornwall: init: Optimize puppet disabling on reboot (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/943620 (https://phabricator.wikimedia.org/T342182) (owner: 10BCornwall) [19:17:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db1213:3315 (T342617)', diff saved to https://phabricator.wikimedia.org/P49937 and previous config saved to /var/cache/conftool/dbconfig/20230801-191709-ladsgroup.json [19:17:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1216.eqiad.wmnet with reason: Maintenance [19:17:13] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [19:17:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1216.eqiad.wmnet with reason: Maintenance [19:18:57] (03PS1) 10Jforrester: WikiLambdaApiBase: Don't explode in dieWithZError() [extensions/WikiLambda] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/944195 (https://phabricator.wikimedia.org/T343253) [19:19:13] (03PS1) 10Jforrester: WikiLambdaApiBase: Don't explode in dieWithZError() [extensions/WikiLambda] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/944196 (https://phabricator.wikimedia.org/T343253) [19:19:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T342617)', diff saved to https://phabricator.wikimedia.org/P49938 and previous config saved to /var/cache/conftool/dbconfig/20230801-191925-ladsgroup.json [19:20:28] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcontrol2007-dev'] [19:28:17] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2007-dev.codfw.wmnet with OS bullseye [19:28:32] 10SRE, 10ops-codfw, 10DC-Ops, 10User-aborrero, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol200[6-8]-dev, cloudnet200[7-8]-dev - https://phabricator.wikimedia.org/T342456 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudcontr... 
[19:28:44] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcontrol2008-dev'] [19:30:29] (03PS1) 10Jforrester: ApiFunctionCall: Actually check 'wikilambda-execute' before proceeding [extensions/WikiLambda] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/944197 [19:30:39] (03PS1) 10Jforrester: ApiFunctionCall: Actually check 'wikilambda-execute' before proceeding [extensions/WikiLambda] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/944198 [19:31:16] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2006-dev.codfw.wmnet with reason: host reimage [19:33:28] (03CR) 10DVrandecic: [C: 03+1] WikiLambdaApiBase: Don't explode in dieWithZError() [extensions/WikiLambda] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/944195 (https://phabricator.wikimedia.org/T343253) (owner: 10Jforrester) [19:33:46] (03CR) 10DVrandecic: [C: 03+1] WikiLambdaApiBase: Don't explode in dieWithZError() [extensions/WikiLambda] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/944196 (https://phabricator.wikimedia.org/T343253) (owner: 10Jforrester) [19:34:25] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2006-dev.codfw.wmnet with reason: host reimage [19:34:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P49939 and previous config saved to /var/cache/conftool/dbconfig/20230801-193432-ladsgroup.json [19:35:16] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcontrol2008-dev'] [19:46:15] (03CR) 10AW GitLab Bot: "PAN-PAN: end-to-end deploy stage failed" [extensions/WikiLambda] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/944198 (owner: 10Jforrester) [19:48:52] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on 
cloudcontrol2007-dev.codfw.wmnet with reason: host reimage [19:49:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P49940 and previous config saved to /var/cache/conftool/dbconfig/20230801-194938-ladsgroup.json [19:50:57] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [19:51:32] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudnet2008-dev'] [19:51:59] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2007-dev.codfw.wmnet with reason: host reimage [19:52:07] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [19:52:08] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2006-dev.codfw.wmnet with OS bullseye [19:52:15] 10SRE, 10ops-codfw, 10DC-Ops, 10User-aborrero, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol200[6-8]-dev, cloudnet200[7-8]-dev - https://phabricator.wikimedia.org/T342456 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudcontrol20... [19:53:33] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2008-dev.codfw.wmnet with OS bullseye [19:53:41] 10SRE, 10ops-codfw, 10DC-Ops, 10User-aborrero, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol200[6-8]-dev, cloudnet200[7-8]-dev - https://phabricator.wikimedia.org/T342456 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudcontr... 
[19:56:10] !nowandnext [19:56:23] nowandnext [19:56:32] jouncebot: nowandnext [19:56:32] For the next 0 hour(s) and 3 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230801T1800) [19:56:32] In 0 hour(s) and 3 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230801T2000) [19:58:23] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudnet2008-dev'] [19:58:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [19:58:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [20:00:07] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230801T2000). nyaa~ [20:00:07] Dreamy_Jazz and Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:12] \o [20:00:19] (03PS1) 10Jforrester: onHtmlPageLinkRendererEnd: Fiddle more carefully with links so we don't over-write non-edit ones [extensions/WikiLambda] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/944199 (https://phabricator.wikimedia.org/T343256) [20:00:28] o/ [20:00:51] let's see [20:01:12] i can deploy if there's no one else :) [20:01:19] (03PS2) 10Urbanecm: Design: Provide wordmarks/taglines for Wikiversity projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/943614 (https://phabricator.wikimedia.org/T341256) (owner: 10Jdlrobson) [20:01:21] (03PS2) 10Urbanecm: Provide wordmarks for Wikivoyage projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/943617 (https://phabricator.wikimedia.org/T341259) (owner: 10Jdlrobson) [20:01:28] (03CR) 10Urbanecm: [C: 03+2] Provide wordmarks for Wikivoyage projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/943617 (https://phabricator.wikimedia.org/T341259) (owner: 10Jdlrobson) [20:01:31] (03CR) 10Urbanecm: [C: 03+2] Design: Provide wordmarks/taglines for Wikiversity projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/943614 (https://phabricator.wikimedia.org/T341256) (owner: 10Jdlrobson) [20:01:48] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/943614 (https://phabricator.wikimedia.org/T341256) (owner: 10Jdlrobson) [20:01:50] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/943617 (https://phabricator.wikimedia.org/T341259) (owner: 10Jdlrobson) [20:02:10] (03PS2) 10Dreamy Jazz: Write new on group0 for event table migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944168 (https://phabricator.wikimedia.org/T330158) [20:02:20] (03Merged) 10jenkins-bot: Provide wordmarks for Wikivoyage projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/943617 
(https://phabricator.wikimedia.org/T341259) (owner: 10Jdlrobson) [20:03:07] (03PS3) 10Dreamy Jazz: Write new on group0 for event table migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944168 (https://phabricator.wikimedia.org/T330158) [20:04:03] (03PS3) 10Urbanecm: Design: Provide wordmarks/taglines for Wikiversity projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/943614 (https://phabricator.wikimedia.org/T341256) (owner: 10Jdlrobson) [20:04:11] (03CR) 10Urbanecm: [C: 03+2] Design: Provide wordmarks/taglines for Wikiversity projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/943614 (https://phabricator.wikimedia.org/T341256) (owner: 10Jdlrobson) [20:04:14] (03PS1) 10Jforrester: onHtmlPageLinkRendererEnd: Fiddle more carefully with links so we don't over-write non-edit ones [extensions/WikiLambda] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/944200 (https://phabricator.wikimedia.org/T343256) [20:04:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T342617)', diff saved to https://phabricator.wikimedia.org/P49941 and previous config saved to /var/cache/conftool/dbconfig/20230801-200444-ladsgroup.json [20:04:48] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [20:04:55] (03Merged) 10jenkins-bot: Design: Provide wordmarks/taglines for Wikiversity projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/943614 (https://phabricator.wikimedia.org/T341256) (owner: 10Jdlrobson) [20:08:19] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [20:09:57] okay, that failed merge stopped scap and i needed to restart. okay. 
[20:10:07] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:943614|Design: Provide wordmarks/taglines for Wikiversity projects (T341256)]], [[gerrit:943617|Provide wordmarks for Wikivoyage projects (T341259)]] [20:10:08] :( [20:10:12] T341256: Design: Provide wordmarks/taglines for Wikiversity projects - https://phabricator.wikimedia.org/T341256 [20:10:13] T341259: Design: Provide wordmarks for Wikivoyage projects - https://phabricator.wikimedia.org/T341259 [20:10:23] jouncebot: nowandnext [20:10:24] For the next 0 hour(s) and 49 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230801T2000) [20:10:24] In 9 hour(s) and 49 minute(s): MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230802T0600) [20:10:44] James_F: want me to ping you once done? [20:10:51] urbanecm: That'd be great, thanks! [20:10:53] will do [20:11:51] !log urbanecm@deploy1002 urbanecm and jdlrobson: Backport for [[gerrit:943614|Design: Provide wordmarks/taglines for Wikiversity projects (T341256)]], [[gerrit:943617|Provide wordmarks for Wikivoyage projects (T341259)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option [20:11:51] ) [20:12:02] Jdlrobson: please go ahead and test :) [20:12:36] looking [20:12:42] (03CR) 10Andrea Denisse: [C: 03+2] pontoon: Apply the 'alerting_host' role to the pontoon-alerting-host-01 host [puppet] - 10https://gerrit.wikimedia.org/r/944289 (https://phabricator.wikimedia.org/T333615) (owner: 10Andrea Denisse) [20:13:43] urbanecm: LGMT [20:13:50] proceeding [20:13:51] !log urbanecm@deploy1002 urbanecm and jdlrobson: Continuing with sync [20:13:56] as scap says :) [20:14:02] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2008-dev.codfw.wmnet with reason: host reimage 
[20:15:23] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: confd_prometheus_metrics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:17:01] yay [20:17:21] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2008-dev.codfw.wmnet with reason: host reimage [20:17:57] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [20:17:58] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2007-dev.codfw.wmnet with OS bullseye [20:18:06] 10SRE, 10ops-codfw, 10DC-Ops, 10User-aborrero, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol200[6-8]-dev, cloudnet200[7-8]-dev - https://phabricator.wikimedia.org/T342456 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudcontrol20... [20:19:19] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudnet2007-dev.codfw.wmnet with OS bullseye [20:19:27] 10SRE, 10ops-codfw, 10DC-Ops, 10User-aborrero, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol200[6-8]-dev, cloudnet200[7-8]-dev - https://phabricator.wikimedia.org/T342456 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudnet20... 
[20:19:49] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:943614|Design: Provide wordmarks/taglines for Wikiversity projects (T341256)]], [[gerrit:943617|Provide wordmarks for Wikivoyage projects (T341259)]] (duration: 09m 41s) [20:19:53] T341256: Design: Provide wordmarks/taglines for Wikiversity projects - https://phabricator.wikimedia.org/T341256 [20:19:54] T341259: Design: Provide wordmarks for Wikivoyage projects - https://phabricator.wikimedia.org/T341259 [20:19:58] and deployed Jdlrobson [20:20:11] Dreamy_Jazz: ready for the CU patch? [20:20:18] Yup [20:20:37] On slower internet than usual, but should still be able to test. [20:20:59] (03PS4) 10Urbanecm: Write new on group0 for event table migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944168 (https://phabricator.wikimedia.org/T330158) (owner: 10Dreamy Jazz) [20:21:04] okay. let's go ahead! [20:21:07] (03CR) 10Urbanecm: [C: 03+2] Write new on group0 for event table migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944168 (https://phabricator.wikimedia.org/T330158) (owner: 10Dreamy Jazz) [20:21:37] (03Merged) 10jenkins-bot: Write new on group0 for event table migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944168 (https://phabricator.wikimedia.org/T330158) (owner: 10Dreamy Jazz) [20:22:19] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:944168|Write new on group0 for event table migration (T330158)]] [20:22:22] T330158: Enable write new for the event table migration - https://phabricator.wikimedia.org/T330158 [20:23:34] thanks urbanecm [20:23:38] much appreciated as usual! 
[20:23:51] !log urbanecm@deploy1002 urbanecm and dreamyjazz: Backport for [[gerrit:944168|Write new on group0 for event table migration (T330158)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:23:58] no problem [20:24:05] Starting testing now. [20:24:07] Dreamy_Jazz: please test! especially at testcommons i guess :)) [20:24:16] :) [20:25:43] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:27:19] Can only perform login attempts to testcommonswiki as it has autocreation of accounts disabled (as the wiki is closed). There should be some events in the table "cu_private_event" [20:28:35] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10phaultfinder) [20:28:59] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudnet2008-dev.codfw.wmnet with OS bullseye [20:29:07] 10SRE, 10ops-codfw, 10DC-Ops, 10User-aborrero, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol200[6-8]-dev, cloudnet200[7-8]-dev - https://phabricator.wikimedia.org/T342456 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudnet20... [20:29:37] Logstash looks clean for the testing on testcommonswiki. Will test on testwikidatawiki for other actions. [20:31:19] okay [20:31:35] cu_private_event is non-empty [20:31:46] urbanecm: Could I be granted confirmed rights on testwikidatawiki again? Can't move my sandbox again. 
[20:31:58] sure [20:32:09] done [20:32:26] granted indefinitely, you're trusted enough :-D [20:33:01] :) [20:33:42] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [20:33:56] Nearly done with tests [20:34:44] Dreamy_Jazz: also autocreated you an account on the closed wiki (in case it comes helpful) [20:34:51] Thanks. [20:34:55] My part of testing is done. [20:35:21] Can you please check if testwikidatawiki has rows in the tables "cu_private_event" and "cu_log_event" [20:35:33] sure [20:35:43] Plus please check if there is a row in "cu_changes" that has the column "cuc_only_for_read_old" set to "1". [20:35:51] (03PS1) 10Jforrester: ApiFunctionCall,ApiPerformTest: Require higher privs for custom execution/test runs [extensions/WikiLambda] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/944201 [20:36:37] 107 cuc_only_for_read_old rows, 105 in private events [20:36:39] some are recent [20:37:00] That makes sense as read new was enabled for a while [20:37:09] yup [20:37:15] nothing dangerous in logstash [20:37:18] Logstash looks clean from what I can see, so test is fine. [20:37:21] Yup. [20:37:32] Thanks! [20:38:21] so, let's go then! 
[20:38:23] !log urbanecm@deploy1002 urbanecm and dreamyjazz: Continuing with sync [20:38:25] syncing :) [20:39:20] (03CR) 10CI reject: [V: 04-1] ApiFunctionCall,ApiPerformTest: Require higher privs for custom execution/test runs [extensions/WikiLambda] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/944201 (owner: 10Jforrester) [20:39:48] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudnet2007-dev.codfw.wmnet with reason: host reimage [20:40:01] (03CR) 10Jforrester: [C: 03+2] WikiLambdaApiBase: Don't explode in dieWithZError() [extensions/WikiLambda] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/944195 (https://phabricator.wikimedia.org/T343253) (owner: 10Jforrester) [20:40:07] (03CR) 10Jforrester: [C: 03+2] ApiFunctionCall: Actually check 'wikilambda-execute' before proceeding [extensions/WikiLambda] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/944197 (owner: 10Jforrester) [20:40:13] (03CR) 10CI reject: [V: 04-1] ApiFunctionCall,ApiPerformTest: Require higher privs for custom execution/test runs [extensions/WikiLambda] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/944202 (owner: 10Jforrester) [20:40:19] (03CR) 10Jforrester: [C: 03+2] onHtmlPageLinkRendererEnd: Fiddle more carefully with links so we don't over-write non-edit ones [extensions/WikiLambda] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/944199 (https://phabricator.wikimedia.org/T343256) (owner: 10Jforrester) [20:40:25] (03CR) 10Jforrester: [C: 03+2] ApiFunctionCall,ApiPerformTest: Require higher privs for custom execution/test runs [extensions/WikiLambda] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/944202 (owner: 10Jforrester) [20:42:43] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [20:42:44] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcontrol2008-dev.codfw.wmnet with OS bullseye [20:42:52] 10SRE, 10ops-codfw, 10DC-Ops, 10User-aborrero, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol200[6-8]-dev, cloudnet200[7-8]-dev - https://phabricator.wikimedia.org/T342456 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudcontrol20... [20:43:17] (03Merged) 10jenkins-bot: WikiLambdaApiBase: Don't explode in dieWithZError() [extensions/WikiLambda] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/944195 (https://phabricator.wikimedia.org/T343253) (owner: 10Jforrester) [20:43:19] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet2007-dev.codfw.wmnet with reason: host reimage [20:43:38] (03Merged) 10jenkins-bot: ApiFunctionCall: Actually check 'wikilambda-execute' before proceeding [extensions/WikiLambda] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/944197 (owner: 10Jforrester) [20:43:44] (03Merged) 10jenkins-bot: onHtmlPageLinkRendererEnd: Fiddle more carefully with links so we don't over-write non-edit ones [extensions/WikiLambda] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/944199 (https://phabricator.wikimedia.org/T343256) (owner: 10Jforrester) [20:43:50] (03Merged) 10jenkins-bot: ApiFunctionCall,ApiPerformTest: Require higher privs for custom execution/test runs [extensions/WikiLambda] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/944202 (owner: 10Jforrester) [20:44:05] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:944168|Write new on group0 for event table migration (T330158)]] (duration: 21m 46s) [20:44:08] T330158: Enable write new for the event table migration - https://phabricator.wikimedia.org/T330158 [20:44:12] and we're live [20:44:14] anything else? [20:44:19] No. Thanks. [20:44:24] no problem [20:44:29] James_F: floor is yours [20:44:34] Ack, thanks! 
[20:45:08] (03CR) 10Jforrester: [C: 03+2] WikiLambdaApiBase: Don't explode in dieWithZError() [extensions/WikiLambda] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/944196 (https://phabricator.wikimedia.org/T343253) (owner: 10Jforrester) [20:45:13] (03CR) 10Jforrester: [C: 03+2] onHtmlPageLinkRendererEnd: Fiddle more carefully with links so we don't over-write non-edit ones [extensions/WikiLambda] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/944200 (https://phabricator.wikimedia.org/T343256) (owner: 10Jforrester) [20:45:17] (03CR) 10Jforrester: [C: 03+2] ApiFunctionCall: Actually check 'wikilambda-execute' before proceeding [extensions/WikiLambda] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/944198 (owner: 10Jforrester) [20:45:20] (03CR) 10Jforrester: [C: 03+2] ApiFunctionCall,ApiPerformTest: Require higher privs for custom execution/test runs [extensions/WikiLambda] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/944201 (owner: 10Jforrester) [20:48:59] (03Merged) 10jenkins-bot: WikiLambdaApiBase: Don't explode in dieWithZError() [extensions/WikiLambda] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/944196 (https://phabricator.wikimedia.org/T343253) (owner: 10Jforrester) [20:49:16] (03Merged) 10jenkins-bot: onHtmlPageLinkRendererEnd: Fiddle more carefully with links so we don't over-write non-edit ones [extensions/WikiLambda] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/944200 (https://phabricator.wikimedia.org/T343256) (owner: 10Jforrester) [20:49:18] (03Merged) 10jenkins-bot: ApiFunctionCall: Actually check 'wikilambda-execute' before proceeding [extensions/WikiLambda] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/944198 (owner: 10Jforrester) [20:49:24] (03Merged) 10jenkins-bot: ApiFunctionCall,ApiPerformTest: Require higher privs for custom execution/test runs [extensions/WikiLambda] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/944201 (owner: 10Jforrester) [20:49:32] !log pt1979@cumin2002 
START - Cookbook sre.hosts.downtime for 2:00:00 on cloudnet2008-dev.codfw.wmnet with reason: host reimage [20:51:19] (03CR) 10AW GitLab Bot: "PAN-PAN: end-to-end deploy stage failed" [extensions/WikiLambda] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/944201 (owner: 10Jforrester) [20:51:28] (03CR) 10AW GitLab Bot: "PAN-PAN: end-to-end deploy stage failed" [extensions/WikiLambda] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/944202 (owner: 10Jforrester) [20:52:44] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet2008-dev.codfw.wmnet with reason: host reimage [20:55:47] !log jforrester@deploy1002 Synchronized ./php-1.41.0-wmf.19/extensions/WikiLambda/: T343253 T343256 (duration: 06m 58s) [20:55:52] T343256: Wikifunction special links are wrongly taking over ?action=history, ?diff=prev etc. links - https://phabricator.wikimedia.org/T343256 [20:55:53] T343253: Some object changes or creations leads to ZErrorException on wikifunctions.org - https://phabricator.wikimedia.org/T343253 [20:56:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:56:41] (03PS1) 10Jforrester: Wikifunctions: Restrict wikilambda-execute to functioneers for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944316 [20:59:15] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [21:01:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:02:09] PROBLEM 
- Disk space on people1004 is CRITICAL: DISK CRITICAL - free space: / 1872MiB (2% inode=91%): /tmp 1872MiB (2% inode=91%): /var/tmp 1872MiB (2% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=people1004&var-datasource=eqiad+prometheus/ops [21:05:09] !log jforrester@deploy1002 Synchronized ./php-1.41.0-wmf.20/extensions/WikiLambda/: T343253 T343256 (duration: 07m 23s) [21:05:14] T343256: Wikifunction special links are wrongly taking over ?action=history, ?diff=prev etc. links - https://phabricator.wikimedia.org/T343256 [21:05:14] T343253: Some object changes or creations leads to ZErrorException on wikifunctions.org - https://phabricator.wikimedia.org/T343253 [21:05:33] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944316 (owner: 10Jforrester) [21:07:00] (03Merged) 10jenkins-bot: Wikifunctions: Restrict wikilambda-execute to functioneers for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944316 (owner: 10Jforrester) [21:07:29] !log jforrester@deploy1002 Started scap: Backport for [[gerrit:944316|Wikifunctions: Restrict wikilambda-execute to functioneers for now]] [21:08:36] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [21:09:05] !log jforrester@deploy1002 jforrester: Backport for [[gerrit:944316|Wikifunctions: Restrict wikilambda-execute to functioneers for now]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [21:10:24] (03PS1) 10Jdlrobson: Fix finnish projects, remove unused SVG/PNGs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944318 (https://phabricator.wikimedia.org/T343278) 
[21:10:31] !log jforrester@deploy1002 jforrester: Continuing with sync [21:11:54] (03PS1) 10Jforrester: Wikifunctions: Log the 'WikiLambda' warnings and above logs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944319 [21:14:42] (KubernetesAPILatency) firing: High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:16:32] !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:944316|Wikifunctions: Restrict wikilambda-execute to functioneers for now]] (duration: 09m 03s) [21:17:56] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944319 (owner: 10Jforrester) [21:18:34] (03Merged) 10jenkins-bot: Wikifunctions: Log the 'WikiLambda' warnings and above logs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944319 (owner: 10Jforrester) [21:19:02] !log jforrester@deploy1002 Started scap: Backport for [[gerrit:944319|Wikifunctions: Log the 'WikiLambda' warnings and above logs]] [21:19:42] (KubernetesAPILatency) resolved: High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:20:46] !log jforrester@deploy1002 jforrester: Backport for [[gerrit:944319|Wikifunctions: Log the 'WikiLambda' warnings and above logs]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [21:23:19] !log jforrester@deploy1002 jforrester: Continuing with sync [21:29:24] !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:944319|Wikifunctions: 
Log the 'WikiLambda' warnings and above logs]] (duration: 10m 22s) [21:40:03] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [21:46:00] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [21:46:01] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudnet2007-dev.codfw.wmnet with OS bullseye [21:46:09] 10SRE, 10ops-codfw, 10DC-Ops, 10User-aborrero, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol200[6-8]-dev, cloudnet200[7-8]-dev - https://phabricator.wikimedia.org/T342456 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudnet2007-d... [21:46:12] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [21:46:13] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudnet2008-dev.codfw.wmnet with OS bullseye [21:46:20] 10SRE, 10ops-codfw, 10DC-Ops, 10User-aborrero, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol200[6-8]-dev, cloudnet200[7-8]-dev - https://phabricator.wikimedia.org/T342456 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudnet2008-d... 
[21:57:10] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt2004-dev'] [22:00:14] 10SRE, 10ops-codfw, 10DC-Ops, 10User-aborrero, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol200[6-8]-dev, cloudnet200[7-8]-dev - https://phabricator.wikimedia.org/T342456 (10Papaul) [22:01:07] 10SRE, 10ops-codfw, 10DC-Ops, 10User-aborrero, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol200[6-8]-dev, cloudnet200[7-8]-dev - https://phabricator.wikimedia.org/T342456 (10Papaul) 05Open→03Resolved @Andrew this is complete [22:01:38] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt2005-dev'] [22:05:43] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt2006-dev'] [22:09:57] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt2004-dev'] [22:10:16] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt2004-dev'] [22:11:17] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt2005-dev'] [22:11:26] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt2005-dev'] [22:11:46] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt2005-dev'] [22:14:10] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt2004-dev'] [22:16:29] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt2006-dev'] [22:17:32] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt2004-dev'] 
[22:18:16] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt2005-dev'] [22:19:16] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt2006-dev'] [22:23:15] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt2004-dev'] [22:23:47] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt2005-dev'] [22:25:32] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudvirt2006-dev'] [22:29:11] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt2004-dev.codfw.wmnet with OS bullseye [22:29:20] 10SRE, 10ops-codfw, 10DC-Ops, 10User-aborrero, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol200[6-8]-dev, cloudnet200[7-8]-dev - https://phabricator.wikimedia.org/T342456 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirt2... 
[22:33:34] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10phaultfinder) [22:38:36] (03PS7) 10Krinkle: noc: don't use on-disk files but etcd directly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942672 (https://phabricator.wikimedia.org/T341859) (owner: 10Giuseppe Lavagetto) [22:40:18] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt2005-dev.codfw.wmnet with OS bullseye [22:40:26] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt200[4-6]-dev - https://phabricator.wikimedia.org/T342459 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirt2005-dev.codfw.wmnet with OS bullseye [22:41:57] (03CR) 10Krinkle: noc: don't use on-disk files but etcd directly (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942672 (https://phabricator.wikimedia.org/T341859) (owner: 10Giuseppe Lavagetto) [22:41:59] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt200[4-6]-dev - https://phabricator.wikimedia.org/T342459 (10Papaul) [22:51:39] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:52:35] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:53:07] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirt2006-dev.codfw.wmnet with OS bullseye [22:53:16] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt200[4-6]-dev - https://phabricator.wikimedia.org/T342459 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by 
pt1979@cumin2002 for host cloudvirt2006-dev.codfw.wmnet with OS bullseye [23:05:15] (03PS1) 10Dreamy Jazz: Write new on group1 except wikidatawiki for event table migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944350 (https://phabricator.wikimedia.org/T330158) [23:06:13] (03PS2) 10Dreamy Jazz: Write new on group1 except wikidatawiki for event table migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944350 (https://phabricator.wikimedia.org/T330158) [23:09:45] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:10:23] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:35:37] (03CR) 10Krinkle: noc: centralize file list management (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942673 (https://phabricator.wikimedia.org/T341859) (owner: 10Giuseppe Lavagetto) [23:42:47] (03CR) 10Krinkle: noc: add static file server (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942674 (https://phabricator.wikimedia.org/T341859) (owner: 10Giuseppe Lavagetto) [23:43:25] (03PS1) 10Krinkle: noc: Fix various PHP errors that prevent db.php from working locally [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944355 (https://phabricator.wikimedia.org/T341859) [23:44:18] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [23:44:35] (03PS2) 10Krinkle: noc: Fix various PHP errors that prevent db.php from working locally 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/944355 (https://phabricator.wikimedia.org/T341859) [23:49:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:49:18] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [23:59:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded