[00:00:02] (JobUnavailable) firing: Reduced availability for job blackbox/pingthing in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:00:35] (JobUnavailable) firing: (11) Reduced availability for job blackbox/icmp in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:01:22] (JobUnavailable) firing: Reduced availability for job blackbox/pingthing in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:04:56] (03CR) 10Cwhite: [C: 03+1] prometheus: Apply prometheus::pop role to prometheus4002 [puppet] - 10https://gerrit.wikimedia.org/r/907984 (https://phabricator.wikimedia.org/T309979) (owner: 10Andrea Denisse) [00:05:13] (03CR) 10Cwhite: [C: 03+1] prometheus: Apply prometheus::pop role to prometheus6002 [puppet] - 10https://gerrit.wikimedia.org/r/907987 (https://phabricator.wikimedia.org/T309979) (owner: 10Andrea Denisse) [00:09:02] (JobUnavailable) resolved: (11) Reduced availability for job blackbox/icmp in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:10:02] (JobUnavailable) resolved: (2) Reduced availability for job blackbox/pingthing in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:21:33] !log krinkle@deploy2002 Started deploy 
[integration/docroot@f68055d]: (no justification provided) [00:22:01] !log krinkle@deploy2002 Finished deploy [integration/docroot@f68055d]: (no justification provided) (duration: 00m 28s) [00:30:56] (03PS1) 10Krinkle: eventlogging: Remove unused FeaturePolicyViolation schema [puppet] - 10https://gerrit.wikimedia.org/r/908382 (https://phabricator.wikimedia.org/T209572) [00:37:21] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:39:22] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/907842 [00:39:28] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/907842 (owner: 10TrainBranchBot) [00:45:19] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:57:01] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/907842 (owner: 10TrainBranchBot) [01:01:01] (03PS1) 10Andrew Bogott: Add a disable_tool job to a cloudcontrol [puppet] - 10https://gerrit.wikimedia.org/r/908386 [01:01:38] (03CR) 10CI reject: [V: 04-1] Add a disable_tool job to a cloudcontrol [puppet] - 10https://gerrit.wikimedia.org/r/908386 (owner: 10Andrew Bogott) [01:03:38] (03PS2) 10Andrew Bogott: Add a disable_tool job to a cloudcontrol [puppet] - 10https://gerrit.wikimedia.org/r/908386 [01:06:25] (03CR) 10Andrew Bogott: [C: 03+2] Add a disable_tool job to a cloudcontrol [puppet] - 10https://gerrit.wikimedia.org/r/908386 (owner: 10Andrew Bogott) [01:09:25] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: 
produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:15:51] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:20:03] (03PS1) 10Andrew Bogott: disable_tool: fix up config file for deleting ldap records [puppet] - 10https://gerrit.wikimedia.org/r/908388 [01:22:05] (03CR) 10Andrew Bogott: [C: 03+2] disable_tool: fix up config file for deleting ldap records [puppet] - 10https://gerrit.wikimedia.org/r/908388 (owner: 10Andrew Bogott) [01:23:32] !log fab@deploy2002 Started deploy [airflow-dags/research@f8dad05]: (no justification provided) [01:23:43] !log fab@deploy2002 Finished deploy [airflow-dags/research@f8dad05]: (no justification provided) (duration: 00m 10s) [01:36:43] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:41:32] !log fab@deploy2002 Started deploy [airflow-dags/research@f8dad05]: (no justification provided) [01:41:43] !log fab@deploy2002 Finished deploy [airflow-dags/research@f8dad05]: (no justification provided) (duration: 00m 10s) [01:46:21] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:48:37] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [01:51:11] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:00:35] RECOVERY - 
Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:01:25] !log fab@deploy2002 Started deploy [airflow-dags/research@f8dad05]: (no justification provided) [02:01:36] !log fab@deploy2002 Finished deploy [airflow-dags/research@f8dad05]: (no justification provided) (duration: 00m 11s) [02:06:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:06:51] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:07:56] !log fab@deploy2002 Started deploy [airflow-dags/research@f8dad05]: (no justification provided) [02:08:07] !log fab@deploy2002 Finished deploy [airflow-dags/research@f8dad05]: (no justification provided) (duration: 00m 10s) [02:16:27] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:26:32] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:28:34] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder) [02:37:05] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:45:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK 
- running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:05:57] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:15:37] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:15:45] PROBLEM - Check systemd state on idm1001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_apache-htcacheclean.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:35:59] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:45:37] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:00:13] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:00:41] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:11:45] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:18:33] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 20 Jun 2023 04:41:39 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:19:47] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.321 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:20:35] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49852 bytes in 0.677 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:35:27] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:36:16] (03CR) 10Ayounsi: "A bit of overlap with I8678223ca801e0c794ebdbfd5af59625435a9eeb (I can abandon it) and probably a good opportunity to standardize the core" [homer/public] - 10https://gerrit.wikimedia.org/r/908346 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [05:36:51] (03CR) 10Ayounsi: [C: 03+1] Expose interface VRF association to templates if present in Netbox [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/908325 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [05:45:39] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:48:37] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [06:00:05] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230413T0600) [06:00:05] kormat, marostegui, and Amir1: #bothumor I ♥ Unicode. All rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230413T0600). 
[06:02:19] 10ops-eqiad, 10Cloud-Services: Move cloudcephosd1021 to cloudsw1-c8-eqiad - https://phabricator.wikimedia.org/T334641 (10ayounsi) [06:04:11] 10ops-eqiad, 10Cloud-Services: Move cloudcephosd1021 to cloudsw1-c8-eqiad - https://phabricator.wikimedia.org/T334641 (10ayounsi) [06:06:44] 10ops-eqiad, 10Cloud-Services: Move cloudcephosd1021 to cloudsw1-c8-eqiad - https://phabricator.wikimedia.org/T334641 (10ayounsi) [06:07:27] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:15:29] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:15:32] (03PS1) 10Marostegui: instances.yaml: Add db1210 [puppet] - 10https://gerrit.wikimedia.org/r/908428 (https://phabricator.wikimedia.org/T326669) [06:15:34] 10ops-eqiad, 10Cloud-Services: Move cloudsw2-d5-eqiad servers to cloudsw1-d5-eqiad - https://phabricator.wikimedia.org/T334644 (10ayounsi) p:05Triage→03Low [06:16:06] 10ops-eqiad, 10Cloud-Services: Move cloudsw2-d5-eqiad servers to cloudsw1-d5-eqiad - https://phabricator.wikimedia.org/T334644 (10ayounsi) [06:16:11] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184 (10ayounsi) [06:16:21] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db1210 [puppet] - 10https://gerrit.wikimedia.org/r/908428 (https://phabricator.wikimedia.org/T326669) (owner: 10Marostegui) [06:17:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1210 to dbctl T326669', diff saved to https://phabricator.wikimedia.org/P46626 and previous config saved to /var/cache/conftool/dbconfig/20230413-061716-marostegui.json [06:17:22] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [06:18:10] (03PS1) 
10Marostegui: db1210: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/908429 (https://phabricator.wikimedia.org/T326669) [06:18:46] (03CR) 10Marostegui: [C: 03+2] db1210: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/908429 (https://phabricator.wikimedia.org/T326669) (owner: 10Marostegui) [06:19:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1210 (re)pooling @ 1%: Pooling T326669', diff saved to https://phabricator.wikimedia.org/P46627 and previous config saved to /var/cache/conftool/dbconfig/20230413-061913-root.json [06:20:51] (03PS1) 10Marostegui: instances.yaml: Add db1221 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/908430 (https://phabricator.wikimedia.org/T326669) [06:21:18] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db1221 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/908430 (https://phabricator.wikimedia.org/T326669) (owner: 10Marostegui) [06:21:53] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:22:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1221 to dbctl T326669', diff saved to https://phabricator.wikimedia.org/P46628 and previous config saved to /var/cache/conftool/dbconfig/20230413-062231-marostegui.json [06:22:36] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [06:23:53] (03PS1) 10Marostegui: db1221: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/908431 (https://phabricator.wikimedia.org/T326669) [06:24:21] (03CR) 10Marostegui: [C: 03+2] db1221: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/908431 (https://phabricator.wikimedia.org/T326669) (owner: 10Marostegui) [06:24:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1221 (re)pooling @ 1%: Pooling', diff saved to https://phabricator.wikimedia.org/P46629 and previous config saved to 
/var/cache/conftool/dbconfig/20230413-062446-root.json [06:31:01] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:31:23] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:32:21] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:32:51] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:34:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1210 (re)pooling @ 2%: Pooling T326669', diff saved to https://phabricator.wikimedia.org/P46630 and previous config saved to /var/cache/conftool/dbconfig/20230413-063417-root.json [06:34:23] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [06:35:53] 10SRE, 10ops-eqiad, 10serviceops-collab, 10GitLab (Infrastructure): Install additional SSDs on gitlab1004.wikimedia.org (B1) - https://phabricator.wikimedia.org/T333997 (10Jelto) Thanks @Jclark-ctr , I can confirm disks are available on `gitlab1004`: ` sdc 8:32 0... 
[06:36:05] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:39:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1221 (re)pooling @ 2%: Pooling', diff saved to https://phabricator.wikimedia.org/P46631 and previous config saved to /var/cache/conftool/dbconfig/20230413-063951-root.json [06:41:20] (03CR) 10Hashar: "Posting comments I forgot to post yesterday morning :/" [puppet] - 10https://gerrit.wikimedia.org/r/867670 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn) [06:43:20] !log jelto@cumin2002 START - Cookbook sre.hosts.reimage for host gitlab2003.wikimedia.org with OS bullseye [06:44:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1114 to clone db1214 T326669', diff saved to https://phabricator.wikimedia.org/P46632 and previous config saved to /var/cache/conftool/dbconfig/20230413-064452-marostegui.json [06:44:57] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [06:45:43] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:46:29] (03PS1) 10Marostegui: db1214: Move to s8 [puppet] - 10https://gerrit.wikimedia.org/r/908448 (https://phabricator.wikimedia.org/T326669) [06:46:35] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49852 bytes in 0.401 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:46:51] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.402 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:47:05] (03CR) 10Marostegui: [C: 03+2] db1214: Move to s8 [puppet] - 10https://gerrit.wikimedia.org/r/908448 (https://phabricator.wikimedia.org/T326669) (owner: 10Marostegui) [06:47:07] RECOVERY 
- mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 20 Jun 2023 04:41:39 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:49:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1210 (re)pooling @ 3%: Pooling T326669', diff saved to https://phabricator.wikimedia.org/P46634 and previous config saved to /var/cache/conftool/dbconfig/20230413-064922-root.json [06:54:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1221 (re)pooling @ 3%: Pooling', diff saved to https://phabricator.wikimedia.org/P46635 and previous config saved to /var/cache/conftool/dbconfig/20230413-065456-root.json [06:56:09] !log jelto@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab2003.wikimedia.org with reason: host reimage [06:59:26] !log jelto@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab2003.wikimedia.org with reason: host reimage [07:00:05] Amir1, apergos, and jnuche: #bothumor I ♥ Unicode. All rise for UTC morning backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230413T0700). [07:00:44] morning! there are no trainees signed up for today's backport window, which is just as well because there are also no patches scheduled for deployment during it. have a nice quiet day! 
[07:04:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1210 (re)pooling @ 4%: Pooling T326669', diff saved to https://phabricator.wikimedia.org/P46636 and previous config saved to /var/cache/conftool/dbconfig/20230413-070428-root.json [07:04:33] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [07:06:04] (03Abandoned) 10Hashar: ci: split contint hosts to different roles [puppet] - 10https://gerrit.wikimedia.org/r/907886 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [07:08:35] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder) [07:09:44] !log update bookworm installer to rc1 T330495 [07:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:48] T330495: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 [07:10:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1221 (re)pooling @ 4%: Pooling', diff saved to https://phabricator.wikimedia.org/P46637 and previous config saved to /var/cache/conftool/dbconfig/20230413-071000-root.json [07:14:20] !log Puppet: move htcacheclean to httpd class https://gerrit.wikimedia.org/r/c/operations/puppet/+/904102 [07:14:21] !log jelto@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab2003.wikimedia.org with OS bullseye [07:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:14:33] (03CR) 10Slyngshede: [C: 03+2] C:httpd move htcacheclean to httpd class [puppet] - 10https://gerrit.wikimedia.org/r/904102 (https://phabricator.wikimedia.org/T334577) (owner: 10Slyngshede) [07:18:34] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder) [07:19:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1210 (re)pooling @ 5%: Pooling T326669', diff saved to https://phabricator.wikimedia.org/P46638 and previous config saved to 
/var/cache/conftool/dbconfig/20230413-071932-root.json [07:19:38] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [07:22:17] (03CR) 10Elukey: profile::kafka::broker: refactor TLS settings (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/905954 (owner: 10Elukey) [07:25:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1221 (re)pooling @ 5%: Pooling', diff saved to https://phabricator.wikimedia.org/P46639 and previous config saved to /var/cache/conftool/dbconfig/20230413-072505-root.json [07:26:03] (03PS5) 10Elukey: profile::kafka::broker: refactor TLS settings [puppet] - 10https://gerrit.wikimedia.org/r/905954 [07:27:31] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on 10 hosts with reason: Cloning db1117 [07:27:44] 10SRE, 10SRE-Access-Requests, 10API Platform (Sprint 07): Requesting access to analytics-privatedata-users for atieno - https://phabricator.wikimedia.org/T333550 (10Atieno) [07:27:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on 10 hosts with reason: Cloning db1117 [07:28:04] (03CR) 10Muehlenhoff: [C: 03+2] Install Puppet 5.5 on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/908202 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff) [07:31:48] (03PS1) 10Marostegui: db1217: Move it to misc [puppet] - 10https://gerrit.wikimedia.org/r/908465 (https://phabricator.wikimedia.org/T326669) [07:32:25] (03PS4) 10Hashar: ci: rename ci::master role to ci [puppet] - 10https://gerrit.wikimedia.org/r/907885 (https://phabricator.wikimedia.org/T254646) [07:33:21] (03CR) 10Hashar: "Changed based on John follow up change which uses role(ci) https://gerrit.wikimedia.org/r/c/operations/puppet/+/908232/11" [puppet] - 10https://gerrit.wikimedia.org/r/907885 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar) [07:33:27] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/907885 
(https://phabricator.wikimedia.org/T254646) (owner: 10Hashar) [07:33:53] (03CR) 10Marostegui: [C: 03+2] db1217: Move it to misc [puppet] - 10https://gerrit.wikimedia.org/r/908465 (https://phabricator.wikimedia.org/T326669) (owner: 10Marostegui) [07:34:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1210 (re)pooling @ 10%: Pooling T326669', diff saved to https://phabricator.wikimedia.org/P46641 and previous config saved to /var/cache/conftool/dbconfig/20230413-073437-root.json [07:34:43] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [07:36:39] (03CR) 10Hashar: "PCC https://puppet-compiler.wmflabs.org/output/907885/1730/" [puppet] - 10https://gerrit.wikimedia.org/r/907885 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar) [07:37:11] (03PS12) 10Hashar: ci: indicate which server is the control server via a hiera param [puppet] - 10https://gerrit.wikimedia.org/r/908232 (https://phabricator.wikimedia.org/T324659) (owner: 10Jbond) [07:37:23] (03PS6) 10Elukey: profile::kafka::broker: refactor TLS settings [puppet] - 10https://gerrit.wikimedia.org/r/905954 [07:38:03] (03CR) 10Hashar: "I have moved the role renaming in the parent change https://gerrit.wikimedia.org/r/c/operations/puppet/+/907885/" [puppet] - 10https://gerrit.wikimedia.org/r/908232 (https://phabricator.wikimedia.org/T324659) (owner: 10Jbond) [07:38:11] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40651/console" [puppet] - 10https://gerrit.wikimedia.org/r/905954 (owner: 10Elukey) [07:38:13] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/908232 (https://phabricator.wikimedia.org/T324659) (owner: 10Jbond) [07:38:48] (03PS1) 10Muehlenhoff: Add entry for bookworm in wikimedia-private repo [puppet] - 10https://gerrit.wikimedia.org/r/908466 (https://phabricator.wikimedia.org/T330495) [07:39:04] (03PS2) 10Muehlenhoff: Add entry for bookworm in 
wikimedia-private repo [puppet] - 10https://gerrit.wikimedia.org/r/908466 (https://phabricator.wikimedia.org/T330495) [07:40:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1221 (re)pooling @ 10%: Pooling', diff saved to https://phabricator.wikimedia.org/P46642 and previous config saved to /var/cache/conftool/dbconfig/20230413-074010-root.json [07:40:34] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40652/console" [puppet] - 10https://gerrit.wikimedia.org/r/905954 (owner: 10Elukey) [07:42:38] (03PS7) 10Elukey: profile::kafka::{broker,mirror}: refactor TLS settings [puppet] - 10https://gerrit.wikimedia.org/r/905954 [07:45:22] (03CR) 10Slyngshede: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/908466 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff) [07:45:40] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40653/console" [puppet] - 10https://gerrit.wikimedia.org/r/905954 (owner: 10Elukey) [07:47:47] (03PS2) 10Muehlenhoff: smart: Disable smart-dump for servers with hpsa [puppet] - 10https://gerrit.wikimedia.org/r/906554 (https://phabricator.wikimedia.org/T313984) [07:48:11] (03CR) 10CI reject: [V: 04-1] smart: Disable smart-dump for servers with hpsa [puppet] - 10https://gerrit.wikimedia.org/r/906554 (https://phabricator.wikimedia.org/T313984) (owner: 10Muehlenhoff) [07:48:17] 10Puppet, 10Infrastructure-Foundations: Investigate why apache-htcacheclean is started - https://phabricator.wikimedia.org/T334577 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This is resolved. 
[07:48:25] (03CR) 10Muehlenhoff: [C: 03+2] Add entry for bookworm in wikimedia-private repo [puppet] - 10https://gerrit.wikimedia.org/r/908466 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff) [07:49:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1210 (re)pooling @ 25%: Pooling T326669', diff saved to https://phabricator.wikimedia.org/P46643 and previous config saved to /var/cache/conftool/dbconfig/20230413-074942-root.json [07:49:48] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [07:49:56] (03CR) 10Elukey: [V: 03+1] "for some reason the change fails on a deployment prep jumbo node, but it seems due to kafka mirror. I think it is something related to pcc" [puppet] - 10https://gerrit.wikimedia.org/r/905954 (owner: 10Elukey) [07:52:46] (03PS1) 10Marostegui: instances.yaml: Add db1223 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/908468 (https://phabricator.wikimedia.org/T326669) [07:54:14] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db1223 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/908468 (https://phabricator.wikimedia.org/T326669) (owner: 10Marostegui) [07:55:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1223 to dbctl T326669', diff saved to https://phabricator.wikimedia.org/P46644 and previous config saved to /var/cache/conftool/dbconfig/20230413-075513-marostegui.json [07:55:19] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [07:55:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1221 (re)pooling @ 25%: Pooling', diff saved to https://phabricator.wikimedia.org/P46645 and previous config saved to /var/cache/conftool/dbconfig/20230413-075522-root.json [07:56:03] (03PS1) 10Marostegui: db1223: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/908469 (https://phabricator.wikimedia.org/T326669) [07:56:52] (03CR) 10Marostegui: [C: 03+2] db1223: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/908469 
(https://phabricator.wikimedia.org/T326669) (owner: 10Marostegui) [07:57:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1223 (re)pooling @ 1%: Pooling db1223 T326669', diff saved to https://phabricator.wikimedia.org/P46646 and previous config saved to /var/cache/conftool/dbconfig/20230413-075722-root.json [08:00:06] ^demon and hashar: Dear deployers, time to do the MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230413T0800). [08:00:34] !log imported perccli 007.1910.0000.000 to bookworm-wikimedia-private T330495 [08:00:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:38] T330495: Prepare our custom installer for Bookworm - https://phabricator.wikimedia.org/T330495 [08:02:59] (03PS3) 10Muehlenhoff: smart: Disable smart-dump for servers with hpsa [puppet] - 10https://gerrit.wikimedia.org/r/906554 (https://phabricator.wikimedia.org/T313984) [08:03:22] (03CR) 10CI reject: [V: 04-1] smart: Disable smart-dump for servers with hpsa [puppet] - 10https://gerrit.wikimedia.org/r/906554 (https://phabricator.wikimedia.org/T313984) (owner: 10Muehlenhoff) [08:04:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1210 (re)pooling @ 50%: Pooling T326669', diff saved to https://phabricator.wikimedia.org/P46647 and previous config saved to /var/cache/conftool/dbconfig/20230413-080447-root.json [08:04:52] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [08:06:59] (03PS4) 10Muehlenhoff: smart: Disable smart-dump for servers with hpsa [puppet] - 10https://gerrit.wikimedia.org/r/906554 (https://phabricator.wikimedia.org/T313984) [08:07:24] (03CR) 10CI reject: [V: 04-1] smart: Disable smart-dump for servers with hpsa [puppet] - 10https://gerrit.wikimedia.org/r/906554 (https://phabricator.wikimedia.org/T313984) (owner: 10Muehlenhoff) [08:09:21] (03CR) 10Jaime Nuche: [C: 
03+1] contint: manage jenkins-ci dsh group from Puppet DB [puppet] - 10https://gerrit.wikimedia.org/r/893484 (https://phabricator.wikimedia.org/T328920) (owner: 10Hashar) [08:09:31] (03PS5) 10Muehlenhoff: smart: Disable smart-dump for servers with hpsa [puppet] - 10https://gerrit.wikimedia.org/r/906554 (https://phabricator.wikimedia.org/T313984) [08:10:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1221 (re)pooling @ 50%: Pooling', diff saved to https://phabricator.wikimedia.org/P46649 and previous config saved to /var/cache/conftool/dbconfig/20230413-081027-root.json [08:11:09] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/906554 (https://phabricator.wikimedia.org/T313984) (owner: 10Muehlenhoff) [08:12:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1223 (re)pooling @ 2%: Pooling db1223 T326669', diff saved to https://phabricator.wikimedia.org/P46650 and previous config saved to /var/cache/conftool/dbconfig/20230413-081227-root.json [08:12:31] (03CR) 10jenkins-bot: smart: Disable smart-dump for servers with hpsa [puppet] - 10https://gerrit.wikimedia.org/r/906554 (https://phabricator.wikimedia.org/T313984) (owner: 10Muehlenhoff) [08:12:33] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [08:15:05] (03PS1) 10Marostegui: site.pp: Productionize db1214 [puppet] - 10https://gerrit.wikimedia.org/r/908471 (https://phabricator.wikimedia.org/T326669) [08:15:51] (03CR) 10Cathal Mooney: Automate DHCP forwarding on Juniper L3 Swithces (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/908346 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [08:17:43] (03CR) 10Stevemunene: [C: 03+2] Decommission an-worker1132 from the Hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/906017 (https://phabricator.wikimedia.org/T334092) (owner: 10Stevemunene) [08:19:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1210 (re)pooling @ 75%: Pooling T326669', diff 
saved to https://phabricator.wikimedia.org/P46651 and previous config saved to /var/cache/conftool/dbconfig/20230413-081952-root.json [08:19:58] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [08:24:38] (03PS1) 10Elukey: aptrepo: add more packages in the update list of rocm54 [puppet] - 10https://gerrit.wikimedia.org/r/908474 (https://phabricator.wikimedia.org/T295661) [08:25:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1221 (re)pooling @ 75%: Pooling', diff saved to https://phabricator.wikimedia.org/P46652 and previous config saved to /var/cache/conftool/dbconfig/20230413-082532-root.json [08:27:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1223 (re)pooling @ 3%: Pooling db1223 T326669', diff saved to https://phabricator.wikimedia.org/P46653 and previous config saved to /var/cache/conftool/dbconfig/20230413-082732-root.json [08:27:38] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [08:34:21] (03CR) 10Marostegui: [C: 03+2] site.pp: Productionize db1214 [puppet] - 10https://gerrit.wikimedia.org/r/908471 (https://phabricator.wikimedia.org/T326669) (owner: 10Marostegui) [08:34:55] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:34:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1210 (re)pooling @ 100%: Pooling T326669', diff saved to https://phabricator.wikimedia.org/P46654 and previous config saved to /var/cache/conftool/dbconfig/20230413-083457-root.json [08:35:02] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [08:36:17] !log installing git security updates [08:36:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:40] (03CR) 10Vgutierrez: [C: 03+2] hieradata: move overrides to role/site part of hiera [puppet] - 
10https://gerrit.wikimedia.org/r/908308 (owner: 10Jbond) [08:40:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1221 (re)pooling @ 100%: Pooling', diff saved to https://phabricator.wikimedia.org/P46655 and previous config saved to /var/cache/conftool/dbconfig/20230413-084036-root.json [08:42:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1223 (re)pooling @ 4%: Pooling db1223 T326669', diff saved to https://phabricator.wikimedia.org/P46656 and previous config saved to /var/cache/conftool/dbconfig/20230413-084238-root.json [08:42:43] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [08:45:09] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:49:35] (03PS1) 10Elukey: adm_rocm: add support for Debian Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/908476 (https://phabricator.wikimedia.org/T295661) [08:49:50] (03PS2) 10Elukey: amd_rocm: add support for Debian Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/908476 (https://phabricator.wikimedia.org/T295661) [08:50:36] (03CR) 10CI reject: [V: 04-1] amd_rocm: add support for Debian Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/908476 (https://phabricator.wikimedia.org/T295661) (owner: 10Elukey) [08:50:42] 10SRE-swift-storage, 10MediaWiki-File-management: FileBackendMultiWrite multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (10MatthewVernon) There are still ongoing issues around thumbs; I noticed this morning that purging an image on commons (i.e. adding `?action=purge` to the commons... 
[08:51:24] (03PS1) 10Filippo Giunchedi: alertmanager: sink notifications for wmcs -dev hosts too [puppet] - 10https://gerrit.wikimedia.org/r/908477 (https://phabricator.wikimedia.org/T333204) [08:52:59] (03PS3) 10Elukey: amd_rocm: add support for Debian Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/908476 (https://phabricator.wikimedia.org/T295661) [08:53:24] (03CR) 10CI reject: [V: 04-1] amd_rocm: add support for Debian Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/908476 (https://phabricator.wikimedia.org/T295661) (owner: 10Elukey) [08:57:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1223 (re)pooling @ 5%: Pooling db1223 T326669', diff saved to https://phabricator.wikimedia.org/P46657 and previous config saved to /var/cache/conftool/dbconfig/20230413-085742-root.json [08:57:48] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [08:59:38] (03PS4) 10Elukey: amd_rocm: add support for Debian Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/908476 (https://phabricator.wikimedia.org/T295661) [09:00:44] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40657/console" [puppet] - 10https://gerrit.wikimedia.org/r/908476 (https://phabricator.wikimedia.org/T295661) (owner: 10Elukey) [09:01:56] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1132.eqiad.wmnet with OS buster [09:02:09] 10SRE, 10ops-eqiad, 10Data-Engineering, 10Patch-For-Review: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1132.eqiad.wmnet with OS buster [09:02:11] (03CR) 10Elukey: [C: 03+2] aptrepo: add more packages in the update list of rocm54 [puppet] - 10https://gerrit.wikimedia.org/r/908474 (https://phabricator.wikimedia.org/T295661) (owner: 10Elukey) [09:02:49] (03CR) 10Elukey: [V: 03+1 C: 03+2] 
amd_rocm: add support for Debian Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/908476 (https://phabricator.wikimedia.org/T295661) (owner: 10Elukey) [09:04:54] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: centrallog1001.eqiad.wmnet [09:04:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: centrallog1001.eqiad.wmnet [09:05:01] 10SRE, 10ops-eqiad, 10SRE Observability (FY2022/2023-Q3): Decommission centrallog1001 - https://phabricator.wikimedia.org/T328803 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: centrallog1001.eqiad.wmnet [09:05:16] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:09:02] (03PS1) 10Marostegui: mariadb 11.1: Add control files [software] - 10https://gerrit.wikimedia.org/r/908481 (https://phabricator.wikimedia.org/T333289) [09:09:44] (03CR) 10Marostegui: [C: 03+2] mariadb 11.1: Add control files [software] - 10https://gerrit.wikimedia.org/r/908481 (https://phabricator.wikimedia.org/T333289) (owner: 10Marostegui) [09:10:09] (03CR) 10Muehlenhoff: [C: 03+1] "Also applies to some non-WMCS hosts; cassandra-dev* and netbox-dev, but that seems fine as well." 
[puppet] - 10https://gerrit.wikimedia.org/r/908477 (https://phabricator.wikimedia.org/T333204) (owner: 10Filippo Giunchedi) [09:11:03] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/905954 (owner: 10Elukey) [09:12:08] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/900621 (https://phabricator.wikimedia.org/T179463) (owner: 10Slyngshede) [09:12:28] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1001.eqiad.wmnet [09:12:39] (03PS1) 10Marostegui: install_server: Do not reimage db1210 [puppet] - 10https://gerrit.wikimedia.org/r/908482 [09:12:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1223 (re)pooling @ 10%: Pooling db1223 T326669', diff saved to https://phabricator.wikimedia.org/P46658 and previous config saved to /var/cache/conftool/dbconfig/20230413-091247-root.json [09:12:52] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [09:12:54] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/908477 (https://phabricator.wikimedia.org/T333204) (owner: 10Filippo Giunchedi) [09:13:14] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db1210 [puppet] - 10https://gerrit.wikimedia.org/r/908482 (owner: 10Marostegui) [09:13:34] (03CR) 10Filippo Giunchedi: [C: 03+2] "Thank you for the quick reviews!" 
[puppet] - 10https://gerrit.wikimedia.org/r/908477 (https://phabricator.wikimedia.org/T333204) (owner: 10Filippo Giunchedi) [09:15:28] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:16:12] (03PS1) 10Muehlenhoff: buster updates [puppet] - 10https://gerrit.wikimedia.org/r/908483 [09:16:21] (03PS6) 10Vgutierrez: cache::haproxy: Support http --> https redirection [puppet] - 10https://gerrit.wikimedia.org/r/855570 (https://phabricator.wikimedia.org/T322774) [09:18:44] (03CR) 10Muehlenhoff: [C: 03+2] buster updates [puppet] - 10https://gerrit.wikimedia.org/r/908483 (owner: 10Muehlenhoff) [09:21:04] 10SRE, 10ops-eqiad, 10Cloud-Services: Move cloudcephosd1021 to cloudsw1-c8-eqiad - https://phabricator.wikimedia.org/T334641 (10dcaro) If as I understand the ips don't change, this is fairly painless (unless jumbos strike again xd), all that's needed is to unlpug and plug again, it could be done without stop... [09:21:12] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:22:15] !log stevemunene@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1132.eqiad.wmnet with OS buster [09:22:25] 10SRE, 10ops-eqiad, 10Data-Engineering, 10Patch-For-Review: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1132.eqiad.wmnet with OS buster executed with errors: - an-wo... 
[09:22:50] (03CR) 10Jbond: ci: indicate which server is the control server via a hiera param (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/908232 (https://phabricator.wikimedia.org/T324659) (owner: 10Jbond) [09:25:47] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host dse-k8s-worker1001.eqiad.wmnet [09:26:19] (03PS7) 10Vgutierrez: cache::haproxy: Support http --> https redirection [puppet] - 10https://gerrit.wikimedia.org/r/855570 (https://phabricator.wikimedia.org/T322774) [09:27:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1223 (re)pooling @ 25%: Pooling db1223 T326669', diff saved to https://phabricator.wikimedia.org/P46659 and previous config saved to /var/cache/conftool/dbconfig/20230413-092752-root.json [09:27:57] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [09:30:31] (03PS1) 10Elukey: prometheus: use python3 for /usr/local/bin/prometheus-amd-rocm-stats [puppet] - 10https://gerrit.wikimedia.org/r/908485 (https://phabricator.wikimedia.org/T295661) [09:30:39] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:32:24] (03CR) 10Elukey: [C: 03+2] prometheus: use python3 for /usr/local/bin/prometheus-amd-rocm-stats [puppet] - 10https://gerrit.wikimedia.org/r/908485 (https://phabricator.wikimedia.org/T295661) (owner: 10Elukey) [09:32:48] 10SRE, 10SRE-swift-storage, 10Commons: Persistent 404 errors for some Commons files, fixable by overwriting - https://phabricator.wikimedia.org/T334346 (10bjh21) It looks like all of these files are now working properly, so at least this instance of the problem is resolved. 
[09:35:58] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40658/console" [puppet] - 10https://gerrit.wikimedia.org/r/855570 (https://phabricator.wikimedia.org/T322774) (owner: 10Vgutierrez) [09:36:29] 10SRE, 10SRE-swift-storage, 10Commons: Persistent 404 errors for some Commons files, fixable by overwriting - https://phabricator.wikimedia.org/T334346 (10MatthewVernon) There's a weekly job that runs on Monday mornings that attempts to pick up any objects (originals only) that got written to the primary DC... [09:40:01] (03CR) 10Hashar: [C: 04-1] "I have to verify the services status. Notably:" [puppet] - 10https://gerrit.wikimedia.org/r/908232 (https://phabricator.wikimedia.org/T324659) (owner: 10Jbond) [09:42:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1223 (re)pooling @ 50%: Pooling db1223 T326669', diff saved to https://phabricator.wikimedia.org/P46660 and previous config saved to /var/cache/conftool/dbconfig/20230413-094257-root.json [09:43:03] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [09:44:01] RECOVERY - Check systemd state on dse-k8s-worker1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:44:33] 10SRE, 10Commons, 10Traffic: Specific PNG thumbnail of SVG file is outdated / stuck (European caching cluster) - https://phabricator.wikimedia.org/T333042 (10MatthewVernon) >>! In T333042#8776941, @Umar wrote: > For more than a month I have not seen new versions of files. > > https://commons.wikimedia.org/w... [09:46:32] 10SRE, 10Commons, 10Traffic: Specific PNG thumbnail of SVG file is outdated / stuck (European caching cluster) - https://phabricator.wikimedia.org/T333042 (10MatthewVernon) >>! In T333042#8764707, @Lionel_Scheepmans wrote: > Hello. I still have a problem with the display of a PDF on this page : https://fr.w... 
[09:47:06] (03PS1) 10DCausse: flink: upgrade to flink 1.16.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/908486 (https://phabricator.wikimedia.org/T334244) [09:48:42] (03PS8) 10Vgutierrez: cache::haproxy: Support http --> https redirection [puppet] - 10https://gerrit.wikimedia.org/r/855570 (https://phabricator.wikimedia.org/T322774) [09:48:47] (03PS1) 10Elukey: admin: add ml-team-admins to the gpu-users group [puppet] - 10https://gerrit.wikimedia.org/r/908487 [09:49:16] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [09:49:19] (03CR) 10CI reject: [V: 04-1] admin: add ml-team-admins to the gpu-users group [puppet] - 10https://gerrit.wikimedia.org/r/908487 (owner: 10Elukey) [09:50:03] (03PS1) 10Muehlenhoff: Failover idp.w.o as part of Tomcat update [dns] - 10https://gerrit.wikimedia.org/r/908488 [09:50:31] RECOVERY - Check systemd state on dse-k8s-worker1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:51:45] (03PS2) 10Elukey: admin: add ml-team-admins to the gpu-users group [puppet] - 10https://gerrit.wikimedia.org/r/908487 [09:52:15] (03CR) 10CI reject: [V: 04-1] admin: add ml-team-admins to the gpu-users group [puppet] - 10https://gerrit.wikimedia.org/r/908487 (owner: 10Elukey) [09:53:14] !log taavi@mwmaint2002 ~ $ mwscript emptyUserGroup.php --wiki frwikinews editor # T333750 [09:53:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:19] T333750: Remove non-existent user group on frwikinews - https://phabricator.wikimedia.org/T333750 [09:53:52] (03PS1) 10Jelto: install_server: use raidid in gitlab-raid1 recipe [puppet] - 10https://gerrit.wikimedia.org/r/908491 (https://phabricator.wikimedia.org/T330172) 
[09:54:46] (03PS3) 10Elukey: admin: add ml-team-admins to the gpu-users group [puppet] - 10https://gerrit.wikimedia.org/r/908487 [09:55:12] (03CR) 10Muehlenhoff: [C: 03+2] Failover idp.w.o as part of Tomcat update [dns] - 10https://gerrit.wikimedia.org/r/908488 (owner: 10Muehlenhoff) [09:58:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1223 (re)pooling @ 75%: Pooling db1223 T326669', diff saved to https://phabricator.wikimedia.org/P46661 and previous config saved to /var/cache/conftool/dbconfig/20230413-095802-root.json [09:58:07] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [10:00:04] mvolz: That opportune time is upon us again. Time for a Services – Citoid / Zotero deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230413T1000). [10:00:04] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230413T1000) [10:06:03] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:07:34] !log installing tomcat security updates [10:07:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:12] (03CR) 10Hashar: [C: 04-1] "I went to download the PCC change catalogues and used jq to find the parameters passed to each of jenkins, zuul-merger and zuul services:" [puppet] - 10https://gerrit.wikimedia.org/r/908232 (https://phabricator.wikimedia.org/T324659) (owner: 10Jbond) [10:08:32] (03CR) 10David Caro: "Looks good, minor comments (formatting, commit msg)" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/908369 (https://phabricator.wikimedia.org/T334586) (owner: 10Raymond Ndibe) [10:09:40] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/908491 (https://phabricator.wikimedia.org/T330172) (owner: 
10Jelto) [10:10:06] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/908487 (owner: 10Elukey) [10:11:31] (03CR) 10Elukey: [C: 03+2] admin: add ml-team-admins to the gpu-users group [puppet] - 10https://gerrit.wikimedia.org/r/908487 (owner: 10Elukey) [10:13:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1223 (re)pooling @ 100%: Pooling db1223 T326669', diff saved to https://phabricator.wikimedia.org/P46662 and previous config saved to /var/cache/conftool/dbconfig/20230413-101307-root.json [10:13:12] T326669: Productionize db1206-db1225 - https://phabricator.wikimedia.org/T326669 [10:15:12] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1132.eqiad.wmnet with OS buster [10:15:25] 10SRE, 10ops-eqiad, 10Data-Engineering, 10Patch-For-Review: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1132.eqiad.wmnet with OS buster [10:20:06] (03CR) 10Majavah: [C: 04-1] tools-webservice: set default for buildservice-image (032 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/908369 (https://phabricator.wikimedia.org/T334586) (owner: 10Raymond Ndibe) [10:20:44] (03CR) 10Jelto: [C: 03+2] install_server: use raidid in gitlab-raid1 recipe [puppet] - 10https://gerrit.wikimedia.org/r/908491 (https://phabricator.wikimedia.org/T330172) (owner: 10Jelto) [10:23:49] !log clear old 2/22/Free-object-universal-property.svg thumbs from wikipedia-commons-local-thumb.22 T334303 [10:23:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:54] T334303: PNG thumbnail of Wikimedia Commons SVG file sometimes not updated - https://phabricator.wikimedia.org/T334303 [10:24:15] (03CR) 10Gmodena: [C: 03+1] flink: upgrade to flink 1.16.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/908486 
(https://phabricator.wikimedia.org/T334244) (owner: 10DCausse) [10:24:59] 10SRE, 10Commons, 10MediaWiki-File-management, 10Traffic: PNG thumbnail of Wikimedia Commons SVG file sometimes not updated - https://phabricator.wikimedia.org/T334303 (10MatthewVernon) I have cleared out the old thumbnails of this image (so as the CDN expires the ones its cached they should get regenerated). [10:29:18] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:32:49] 10SRE, 10MediaWiki-extensions-OAuth, 10The-Wikipedia-Library, 10Datacenter-Switchover, and 2 others: Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650 (10Tgr) >>! In T332650#8774188, @jsn.sherman wrote: > 2. i... 
[10:34:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:35:54] (03PS1) 10Hnowlan: thumbor: make tmp-dir configurable, default disabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/908501 (https://phabricator.wikimedia.org/T334488) [10:36:34] 10SRE, 10Infrastructure-Foundations, 10netops: Verify and Configure ECMP operation for EVPN switches - https://phabricator.wikimedia.org/T334658 (10cmooney) p:05Triage→03Medium [10:36:46] (03CR) 10CI reject: [V: 04-1] thumbor: make tmp-dir configurable, default disabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/908501 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan) [10:38:00] (03PS2) 10Hnowlan: thumbor: make tmp-dir configurable, default disabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/908501 (https://phabricator.wikimedia.org/T334488) [10:38:54] (03CR) 10CI reject: [V: 04-1] thumbor: make tmp-dir configurable, default disabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/908501 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan) [10:39:28] !log updating appservers and api certificates - T334561 [10:39:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:32] T334561: Add mw-on-k8s deployments to mediawiki certificates - https://phabricator.wikimedia.org/T334561 [10:45:16] (03PS3) 10Hnowlan: thumbor: make tmp-dir configurable, default disabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/908501 (https://phabricator.wikimedia.org/T334488) [10:48:20] (03PS1) 10Clément Goubert: ssl: Update api.svc and appservers.svc certs [puppet] - 10https://gerrit.wikimedia.org/r/908502 (https://phabricator.wikimedia.org/T334561) [10:49:31] (03CR) 10Filippo Giunchedi: [C: 03+1] 
ssl: Update api.svc and appservers.svc certs [puppet] - 10https://gerrit.wikimedia.org/r/908502 (https://phabricator.wikimedia.org/T334561) (owner: 10Clément Goubert) [10:51:47] (03CR) 10Clément Goubert: [C: 03+2] ssl: Update api.svc and appservers.svc certs [puppet] - 10https://gerrit.wikimedia.org/r/908502 (https://phabricator.wikimedia.org/T334561) (owner: 10Clément Goubert) [10:52:35] 10SRE, 10ops-eqiad, 10serviceops-collab, 10GitLab (Infrastructure): Install additional SSDs on gitlab1003.wikimedia.org (A3) - https://phabricator.wikimedia.org/T333996 (10Jclark-ctr) 05Open→03Resolved drives installed into gitlab1003 [10:53:18] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:57:04] 10SRE, 10MediaWiki-extensions-OAuth, 10The-Wikipedia-Library, 10Datacenter-Switchover, and 2 others: Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650 (10Samwalton9) >>! In T332650#8778602, @Tgr wrote: > How f... 
[10:58:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:59:11] (03PS2) 10Muehlenhoff: Add krb2002 as additional KDC [puppet] - 10https://gerrit.wikimedia.org/r/906560 (https://phabricator.wikimedia.org/T331695) [10:59:26] (03PS2) 10Cathal Mooney: Automate and update DHCP relay configuration [homer/public] - 10https://gerrit.wikimedia.org/r/908346 (https://phabricator.wikimedia.org/T312635) [10:59:59] (03CR) 10CI reject: [V: 04-1] Automate and update DHCP relay configuration [homer/public] - 10https://gerrit.wikimedia.org/r/908346 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [11:04:53] (03CR) 10Ladsgroup: [C: 03+1] "Let me know if you want me to deploy it." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905626 (https://phabricator.wikimedia.org/T331702) (owner: 10Muehlenhoff) [11:07:48] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q4): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10fnegri) [11:08:46] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review, 10User-ItamarWMDE: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T331899 (10MoritzMuehlenhoff) >>! In T331899#8774740, @ItamarWMDE wrote: > Hello @MoritzMuehlenhoff and @BCornwall, apologies for the delay in the response... 
[11:09:25] (03PS1) 10Marostegui: mariadb: Productionize db1217 [puppet] - 10https://gerrit.wikimedia.org/r/908508 (https://phabricator.wikimedia.org/T326669) [11:09:59] (03PS3) 10Cathal Mooney: Automate and update DHCP relay configuration [homer/public] - 10https://gerrit.wikimedia.org/r/908346 (https://phabricator.wikimedia.org/T312635) [11:10:01] (03PS1) 10Jbond: sre.hardware.upgrade-firmware: update to only upgrade one hardware type [cookbooks] - 10https://gerrit.wikimedia.org/r/908510 [11:10:29] (03CR) 10CI reject: [V: 04-1] Automate and update DHCP relay configuration [homer/public] - 10https://gerrit.wikimedia.org/r/908346 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [11:10:51] (03CR) 10Hokwelum: [C: 03+1] blubber: Bump blubber version to v0.17.0 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/906575 (https://phabricator.wikimedia.org/T334205) (owner: 10Atieno) [11:11:03] (03CR) 10Michael Große: [C: 04-1] "This should only be deployed after T333813 (validation!) 
is done" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908509 (https://phabricator.wikimedia.org/T332725) (owner: 10Michael Große) [11:12:21] (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: update to only upgrade one hardware type [cookbooks] - 10https://gerrit.wikimedia.org/r/908510 (owner: 10Jbond) [11:13:59] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db1217 [puppet] - 10https://gerrit.wikimedia.org/r/908508 (https://phabricator.wikimedia.org/T326669) (owner: 10Marostegui) [11:14:01] (03PS4) 10Cathal Mooney: Automate and update DHCP relay configuration [homer/public] - 10https://gerrit.wikimedia.org/r/908346 (https://phabricator.wikimedia.org/T312635) [11:15:02] (03CR) 10Muehlenhoff: Also broadcast RCFeed/IRC events to irc1002/irc2002 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905626 (https://phabricator.wikimedia.org/T331702) (owner: 10Muehlenhoff) [11:15:25] !log Re-deploying mw-on-k8s to update certificates - T334561 [11:15:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:30] T334561: Add mw-on-k8s deployments to mediawiki certificates - https://phabricator.wikimedia.org/T334561 [11:16:28] (03CR) 10Cathal Mooney: Automate and update DHCP relay configuration (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/908346 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [11:16:32] !log cgoubert@deploy2002 Started scap: Updating mw-on-k8s certificates [11:18:28] !log cgoubert@deploy2002 Finished scap: Updating mw-on-k8s certificates (duration: 01m 56s) [11:23:27] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install ms-be107[2-5] - https://phabricator.wikimedia.org/T326350 (10Jclark-ctr) [11:23:54] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (10Clement_Goubert) [11:24:12] !log installing imagemagick security updates 
[11:24:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P46664 and previous config saved to /var/cache/conftool/dbconfig/20230413-112446-root.json [11:31:06] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review, 10User-ItamarWMDE: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T331899 (10ItamarWMDE) sure, do I need to open a separate request for that? [11:32:13] (03PS2) 10Jbond: sre.hardware.upgrade-firmware: update to only upgrade one hardware type [cookbooks] - 10https://gerrit.wikimedia.org/r/908510 [11:32:43] (03PS1) 10Marostegui: instances.yaml: Remove db1120 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/908522 (https://phabricator.wikimedia.org/T334580) [11:34:07] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Remove db1120 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/908522 (https://phabricator.wikimedia.org/T334580) (owner: 10Marostegui) [11:34:17] (03CR) 10CI reject: [V: 04-1] sre.hardware.upgrade-firmware: update to only upgrade one hardware type [cookbooks] - 10https://gerrit.wikimedia.org/r/908510 (owner: 10Jbond) [11:34:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1120 from dbctl T334580', diff saved to https://phabricator.wikimedia.org/P46665 and previous config saved to /var/cache/conftool/dbconfig/20230413-113435-marostegui.json [11:34:41] T334580: decommission db1120.eqiad.wmnet - https://phabricator.wikimedia.org/T334580 [11:36:22] (03PS1) 10Majavah: P:toolforge::prometheus: scrape harbor metrics [puppet] - 10https://gerrit.wikimedia.org/r/908525 [11:38:16] 10SRE, 10Commons, 10Traffic: Specific PNG thumbnail of SVG file is outdated / stuck (European caching cluster) - https://phabricator.wikimedia.org/T333042 (10Ladsgroup) [11:38:26] 10SRE, 10Commons, 10MediaWiki-File-management, 10Traffic: PNG thumbnail of 
Wikimedia Commons SVG file sometimes not updated - https://phabricator.wikimedia.org/T334303 (10Ladsgroup) [11:38:52] 10SRE-swift-storage, 10MediaWiki-File-management: FileBackendMultiWrite multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (10Ladsgroup) [11:39:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 2%: Repooling', diff saved to https://phabricator.wikimedia.org/P46666 and previous config saved to /var/cache/conftool/dbconfig/20230413-113951-root.json [11:41:20] (03PS13) 10Jbond: ci: indicate which server is the control server via a hiera param [puppet] - 10https://gerrit.wikimedia.org/r/908232 (https://phabricator.wikimedia.org/T324659) [11:42:32] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40659/console" [puppet] - 10https://gerrit.wikimedia.org/r/908232 (https://phabricator.wikimedia.org/T324659) (owner: 10Jbond) [11:42:50] (03PS1) 10Marostegui: db1120: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/908526 (https://phabricator.wikimedia.org/T334580) [11:43:10] (03CR) 10Kamila Součková: [C: 03+2] "kamila stupid question time (I want to make sure I understand): how does the dying workers GC work? 
Is that the default behaviour of k8s w" [deployment-charts] - 10https://gerrit.wikimedia.org/r/908501 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan) [11:45:47] (03CR) 10Marostegui: [C: 03+2] db1120: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/908526 (https://phabricator.wikimedia.org/T334580) (owner: 10Marostegui) [11:49:00] (03Merged) 10jenkins-bot: thumbor: make tmp-dir configurable, default disabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/908501 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan) [11:50:23] 10SRE-swift-storage, 10MediaWiki-File-management: FileBackendMultiWrite multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (10Ladsgroup) The issue is exactly the problem outlined above. People have been complaining that T330942 is not resolved, that is fixed. e.g. If you go to eqiad,... [11:50:53] (03CR) 10Jbond: ci: indicate which server is the control server via a hiera param (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/908232 (https://phabricator.wikimedia.org/T324659) (owner: 10Jbond) [11:51:59] 10SRE-swift-storage, 10MediaWiki-File-management: FileBackendMultiWrite multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (10IKhitron) So, there still is no way to fix an existing thumb? [11:54:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 3%: Repooling', diff saved to https://phabricator.wikimedia.org/P46667 and previous config saved to /var/cache/conftool/dbconfig/20230413-115456-root.json [11:56:06] (03PS3) 10Jbond: sre.hardware.upgrade-firmware: update to only upgrade one hardware type [cookbooks] - 10https://gerrit.wikimedia.org/r/908510 [11:59:31] 10SRE, 10LDAP-Access-Requests, 10User-MarcoAurelio: add MarcoAurelio to LDAP nda group - https://phabricator.wikimedia.org/T333884 (10MarcoAurelio) 05Stalled→03In progress Apologies for the delay. 
I emailed @KFrancis on the day she requested me to do so, however I had some questions before moving forward... [12:00:50] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:02:12] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49851 bytes in 0.253 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:04:46] (03CR) 10Volans: "LGTM in general, couple of minor things inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/908510 (owner: 10Jbond) [12:07:15] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [12:07:37] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [homer/public] - 10https://gerrit.wikimedia.org/r/904177 (owner: 10Ayounsi) [12:08:15] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [12:10:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 4%: Repooling', diff saved to https://phabricator.wikimedia.org/P46668 and previous config saved to /var/cache/conftool/dbconfig/20230413-121001-root.json [12:11:21] !log installing imagemagick security updates for buster T328901 [12:11:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:42] (03CR) 10Cathal Mooney: [C: 03+1] "Overall looks good to me, replied on your comment about the possible additions but overall LGTM." [homer/public] - 10https://gerrit.wikimedia.org/r/905257 (owner: 10Ayounsi) [12:12:16] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [12:13:07] 10SRE, 10MediaWiki-extensions-OAuth, 10The-Wikipedia-Library, 10Datacenter-Switchover, and 2 others: Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650 (10jsn.sherman) >>! In T332650#8778681, @Samwalton9 wrote:... 
[12:13:15] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [12:15:21] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus3001.esams.wmnet [12:18:01] 10SRE-swift-storage, 10MediaWiki-File-management: FileBackendMultiWrite multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (10Ladsgroup) >>! In T331138#8778788, @IKhitron wrote: > So, there still is no way to fix an existing thumb? Until the actual issue is resolved, the community can... [12:21:02] 10SRE-swift-storage, 10MediaWiki-File-management: FileBackendMultiWrite multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (10IKhitron) Ouch. But if just deleting old thumbs works, does this mean one can just add a pixel to the image and reupload it now, and it will work this time? [12:21:03] !log remove imagemagick 8:6.9.10.23+dfsg-2.1+deb10u1+wmf1 from apt.wikimedia.org (obsoleted by 8:6.9.10.23+dfsg-2.1+deb10u4 from the Debian archive) T328901 [12:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:12] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus3001.esams.wmnet [12:25:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P46669 and previous config saved to /var/cache/conftool/dbconfig/20230413-122506-root.json [12:26:50] 10SRE-swift-storage, 10MediaWiki-File-management: FileBackendMultiWrite multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (10Ladsgroup) You can do something simpler, if after reupload e.g. 200px thumb size is not updated, just use 205px thumbsize instead, that should trigger a new thu... 
[12:28:39] (03PS1) 10Aqu: Remove extra check on webrequest _SUCCESS files on HDFS [puppet] - 10https://gerrit.wikimedia.org/r/908529 (https://phabricator.wikimedia.org/T327073) [12:30:20] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review, 10User-ItamarWMDE: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T331899 (10MoritzMuehlenhoff) >>! In T331899#8778731, @ItamarWMDE wrote: > sure, do I need to open a separate request for that? No need, the SRE on our we... [12:31:15] 10SRE-swift-storage, 10MediaWiki-File-management: FileBackendMultiWrite multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (10IKhitron) [[https://he.wikipedia.org/wiki/%D7%9E%D7%A9%D7%AA%D7%9E%D7%A9:IKhitron/%D7%98%D7%99%D7%95%D7%98%D7%94|Does not help]], the bottom image still has lab... [12:32:03] (03CR) 10David Caro: [C: 03+1] P:toolforge::prometheus: scrape harbor metrics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/908525 (owner: 10Majavah) [12:32:18] (03CR) 10David Caro: [C: 03+2] P:toolforge::prometheus: scrape harbor metrics [puppet] - 10https://gerrit.wikimedia.org/r/908525 (owner: 10Majavah) [12:33:42] !log installing Django security updates [12:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:20] (03CR) 10Aqu: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40660/console" [puppet] - 10https://gerrit.wikimedia.org/r/908529 (https://phabricator.wikimedia.org/T327073) (owner: 10Aqu) [12:35:47] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-1] beta wikidata: Enable new EntitySchema datatype (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908509 (https://phabricator.wikimedia.org/T332725) (owner: 10Michael Große) [12:38:55] !log installing systemd security updates on buster [12:38:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:29] 10SRE-swift-storage, 
10MediaWiki-File-management: FileBackendMultiWrite multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (10Ladsgroup) What I mean is something like this https://he.wikipedia.org/w/index.php?title=%D7%9E%D7%A9%D7%AA%D7%9E%D7%A9%3AIKhitron%2F%D7%98%D7%99%D7%95%D7%98%D7... [12:39:39] (03CR) 10Ottomata: Remove extra check on webrequest _SUCCESS files on HDFS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/908529 (https://phabricator.wikimedia.org/T327073) (owner: 10Aqu) [12:40:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P46670 and previous config saved to /var/cache/conftool/dbconfig/20230413-124011-root.json [12:41:14] 10SRE-swift-storage, 10MediaWiki-File-management: FileBackendMultiWrite multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (10IKhitron) Indeed, but how do you know which sizes work and which don't? [12:43:53] (03PS1) 10Clément Goubert: Revert "Revert "cxserver: Switch to mw-api-int-async on k8s"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/908302 [12:44:40] (03PS4) 10Jbond: sre.hardware.upgrade-firmware: update to only upgrade one hardware type [cookbooks] - 10https://gerrit.wikimedia.org/r/908510 [12:44:45] (03CR) 10Jbond: "updated thanks" [cookbooks] - 10https://gerrit.wikimedia.org/r/908510 (owner: 10Jbond) [12:45:41] (03PS1) 10Aqu: Prepare removal of systemd_timer check_webrequest_partitions [puppet] - 10https://gerrit.wikimedia.org/r/908533 (https://phabricator.wikimedia.org/T327073) [12:46:13] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [12:48:05] (03PS2) 10Aqu: Remove extra check on webrequest _SUCCESS files on HDFS [puppet] - 
10https://gerrit.wikimedia.org/r/908529 (https://phabricator.wikimedia.org/T327073) [12:49:57] !log jelto@cumin2002 START - Cookbook sre.hosts.reimage for host gitlab2003.wikimedia.org with OS bullseye [12:50:50] (03CR) 10Aqu: "Thanks for the review Ottomata!" [puppet] - 10https://gerrit.wikimedia.org/r/908529 (https://phabricator.wikimedia.org/T327073) (owner: 10Aqu) [12:51:47] (03CR) 10Clément Goubert: [C: 03+2] Revert "Revert "cxserver: Switch to mw-api-int-async on k8s"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/908302 (owner: 10Clément Goubert) [12:55:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P46671 and previous config saved to /var/cache/conftool/dbconfig/20230413-125516-root.json [12:56:04] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/908328 (https://phabricator.wikimedia.org/T334608) (owner: 10Jameel Kaisar) [12:56:40] (03Merged) 10jenkins-bot: Revert "Revert "cxserver: Switch to mw-api-int-async on k8s"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/908302 (owner: 10Clément Goubert) [12:56:51] !log Migrating cxserver to mw-api-int on kubernetes, take two - T334204 [12:56:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:56] T334204: Migrate cxserver to mw-api-int - https://phabricator.wikimedia.org/T334204 [12:57:28] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [12:57:54] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [12:58:06] (03PS1) 10DCausse: rdf-streaming-updater: disable jemalloc [deployment-charts] - 10https://gerrit.wikimedia.org/r/908539 [12:58:57] (03CR) 10CDanis: Set NEL 'success_fraction: 1.0' on HTTP responses for measurement domains (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/908328 (https://phabricator.wikimedia.org/T334608) (owner: 10Jameel Kaisar) [12:59:19] 
RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230413T1300) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230413T1300) [13:00:05] No Gerrit patches in the queue for this window AFAICS. [13:00:20] (03CR) 10David Caro: [V: 03+1 C: 03+2] maintain_dbusers: move all the files under service [puppet] - 10https://gerrit.wikimedia.org/r/906637 (owner: 10David Caro) [13:00:36] yup, nothing to do [13:00:36] (03CR) 10DCausse: [C: 04-1] rdf-streaming-updater: tune managed memory instead of overhead (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/907934 (owner: 10DCausse) [13:02:02] !log jgiannelos@deploy2002 Started deploy [restbase/deploy@a08f56d]: (no justification provided) [13:03:01] Welp, take two didn't work either lol [13:03:15] * claime continues competing with Amir1 for number of reverts in a commit message [13:03:27] lol [13:04:42] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [13:04:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [13:05:24] claime: Can you log errors to the task if that's helpful for Language team to take a look? 
[13:05:47] kart_: I could, but it's more me fumbling around with ingress/egress settings [13:05:51] It's not applicative [13:06:01] claime: OK :) [13:06:22] (03PS3) 10DCausse: rdf-streaming-updater: tune managed memory instead of overhead [deployment-charts] - 10https://gerrit.wikimedia.org/r/907934 [13:06:57] (03PS1) 10Clément Goubert: Revert "Revert "Revert "cxserver: Switch to mw-api-int-async on k8s""" [deployment-charts] - 10https://gerrit.wikimedia.org/r/908303 [13:08:56] (03CR) 10CDanis: [C: 04-1] Set NEL 'success_fraction: 1.0' on HTTP responses for measurement domains (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/908328 (https://phabricator.wikimedia.org/T334608) (owner: 10Jameel Kaisar) [13:10:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P46672 and previous config saved to /var/cache/conftool/dbconfig/20230413-131021-root.json [13:14:13] (03PS1) 10Andrew Bogott: cloudvirt100[1-3] fix yet another partman typo [puppet] - 10https://gerrit.wikimedia.org/r/908541 (https://phabricator.wikimedia.org/T329863) [13:15:06] (03PS1) 10Clément Goubert: admin_ng: Add mw-on-k8s Egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/908542 (https://phabricator.wikimedia.org/T333120) [13:15:13] (03CR) 10Andrew Bogott: [C: 03+2] cloudvirt100[1-3] fix yet another partman typo [puppet] - 10https://gerrit.wikimedia.org/r/908541 (https://phabricator.wikimedia.org/T329863) (owner: 10Andrew Bogott) [13:19:04] !log jgiannelos@deploy2002 Finished deploy [restbase/deploy@a08f56d]: (no justification provided) (duration: 17m 02s) [13:19:15] 10SRE, 10Observability-Metrics, 10Patch-For-Review, 10User-fgiunchedi: Collect per-cgroup cpu/mem and other system level metrics - https://phabricator.wikimedia.org/T108027 (10fgiunchedi) [13:23:03] (03CR) 10Hnowlan: [C: 03+1] admin_ng: Add mw-on-k8s Egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/908542 
(https://phabricator.wikimedia.org/T333120) (owner: 10Clément Goubert) [13:23:22] (03CR) 10Clément Goubert: [C: 03+2] admin_ng: Add mw-on-k8s Egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/908542 (https://phabricator.wikimedia.org/T333120) (owner: 10Clément Goubert) [13:25:17] (03CR) 10Elukey: Remove extra check on webrequest _SUCCESS files on HDFS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/908529 (https://phabricator.wikimedia.org/T327073) (owner: 10Aqu) [13:25:19] (03PS1) 10MSantos: mobileapps: bump to 2023-04-13-131847-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/908544 [13:25:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P46673 and previous config saved to /var/cache/conftool/dbconfig/20230413-132525-root.json [13:26:57] 10SRE, 10MediaWiki-extensions-OAuth, 10The-Wikipedia-Library, 10Datacenter-Switchover, and 2 others: Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650 (10jsn.sherman) I do wonder if we can identify some of the... 
[13:27:13] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] aptrepo: Drop kubernetes 1.21 components [puppet] - 10https://gerrit.wikimedia.org/r/908275 (https://phabricator.wikimedia.org/T286856) (owner: 10Majavah) [13:27:45] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/908510 (owner: 10Jbond) [13:28:02] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] aptrepo: Add kubeadm 1.23 component [puppet] - 10https://gerrit.wikimedia.org/r/908276 (https://phabricator.wikimedia.org/T298005) (owner: 10Majavah) [13:28:16] (03PS2) 10Arturo Borrero Gonzalez: aptrepo: Add kubeadm 1.23 component [puppet] - 10https://gerrit.wikimedia.org/r/908276 (https://phabricator.wikimedia.org/T298005) (owner: 10Majavah) [13:28:40] (03Merged) 10jenkins-bot: admin_ng: Add mw-on-k8s Egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/908542 (https://phabricator.wikimedia.org/T333120) (owner: 10Clément Goubert) [13:29:32] (03PS1) 10Clément Goubert: Revert "admin_ng: Add mw-on-k8s Egress rules" [deployment-charts] - 10https://gerrit.wikimedia.org/r/908304 [13:29:42] (03CR) 10Clément Goubert: [C: 03+2] Revert "admin_ng: Add mw-on-k8s Egress rules" [deployment-charts] - 10https://gerrit.wikimedia.org/r/908304 (owner: 10Clément Goubert) [13:29:46] (03CR) 10Clément Goubert: [V: 03+2 C: 03+2] Revert "admin_ng: Add mw-on-k8s Egress rules" [deployment-charts] - 10https://gerrit.wikimedia.org/r/908304 (owner: 10Clément Goubert) [13:30:27] (03CR) 10MSantos: [C: 03+2] mobileapps: bump to 2023-04-13-131847-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/908544 (owner: 10MSantos) [13:30:52] !log sukhe@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host lvs2007 [13:31:16] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host lvs2007 [13:33:33] 10SRE, 10MediaWiki-extensions-OAuth, 10The-Wikipedia-Library, 10Datacenter-Switchover, and 2 others: Frequent OAuth failures on 
Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650 (10jsn.sherman) >>! In T332650#8779038, @jsn.sherman wrote... [13:33:40] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1002.eqiad.wmnet with OS bullseye [13:33:44] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Apr-Jun-2023), 10Datacenter-Switchover: CommRel support for April 2023 Datacenter Switchback - https://phabricator.wikimedia.org/T334671 (10Clement_Goubert) [13:33:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirtlocal1002.eqiad.wmnet with OS bullseye [13:35:34] 10SRE, 10MediaWiki-extensions-OAuth, 10The-Wikipedia-Library, 10Datacenter-Switchover, and 2 others: Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650 (10doctaxon) >>! In T332650#8770440, @doctaxon wrote: > @T... 
[13:35:59] (03Merged) 10jenkins-bot: Revert "admin_ng: Add mw-on-k8s Egress rules" [deployment-charts] - 10https://gerrit.wikimedia.org/r/908304 (owner: 10Clément Goubert) [13:36:52] (03Merged) 10jenkins-bot: mobileapps: bump to 2023-04-13-131847-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/908544 (owner: 10MSantos) [13:37:59] !log mbsantos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [13:38:17] !log mbsantos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [13:38:49] (03PS1) 10Vgutierrez: cache::haproxy: Enable coredump configuration [puppet] - 10https://gerrit.wikimedia.org/r/908546 (https://phabricator.wikimedia.org/T334448) [13:38:51] (03PS1) 10Vgutierrez: hiera: Enable coredumps for haproxy at text cache cluster [puppet] - 10https://gerrit.wikimedia.org/r/908547 (https://phabricator.wikimedia.org/T334448) [13:38:55] PROBLEM - DPKG on an-airflow1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [13:39:15] 10SRE, 10MediaWiki-extensions-OAuth, 10The-Wikipedia-Library, 10Datacenter-Switchover, and 2 others: Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650 (10jsn.sherman) >>! In T332650#8779061, @doctaxon wrote: >... 
[13:40:24] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40661/console" [puppet] - 10https://gerrit.wikimedia.org/r/908547 (https://phabricator.wikimedia.org/T334448) (owner: 10Vgutierrez) [13:40:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P46674 and previous config saved to /var/cache/conftool/dbconfig/20230413-134030-root.json [13:42:24] (03CR) 10FNegri: [C: 03+1] "LGTM" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/907888 (owner: 10David Caro) [13:42:31] !log mbsantos@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [13:42:36] ^ fixing an-airflow1001 [13:42:54] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40662/console" [puppet] - 10https://gerrit.wikimedia.org/r/908546 (https://phabricator.wikimedia.org/T334448) (owner: 10Vgutierrez) [13:43:26] !log mbsantos@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [13:43:28] (03CR) 10David Caro: build: add helper scripts (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/907890 (owner: 10David Caro) [13:43:37] !log jelto@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host gitlab2003.wikimedia.org with OS bullseye [13:44:37] !log andrew@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirtlocal1003'] [13:44:46] !log andrew@cumin1001 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['cloudvirtlocal1003'] [13:45:02] !log mbsantos@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [13:45:03] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1003.eqiad.wmnet with OS bullseye [13:45:10] 10SRE, 10ops-eqiad, 10DC-Ops, 
10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirtlocal1003.eqiad.wmnet with OS bullseye [13:45:29] (03PS2) 10David Caro: build: add helper scripts [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/907890 [13:45:47] (03CR) 10David Caro: build: add helper scripts (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/907890 (owner: 10David Caro) [13:45:57] (03Abandoned) 10Samtar: Revert "Remove 50% opacity from notification badges when they are all read" [extensions/Echo] (wmf/1.41.0-wmf.1) - 10https://gerrit.wikimedia.org/r/902155 (https://phabricator.wikimedia.org/T331502) (owner: 10Samtar) [13:45:57] !log mbsantos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [13:46:05] (03Abandoned) 10Samtar: Revert "Remove 50% opacity from notification badges when they are all read" [extensions/Echo] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/902154 (https://phabricator.wikimedia.org/T331502) (owner: 10Samtar) [13:46:40] (03CR) 10Ssingh: [C: 03+1] hiera: Enable coredumps for haproxy at text cache cluster [puppet] - 10https://gerrit.wikimedia.org/r/908547 (https://phabricator.wikimedia.org/T334448) (owner: 10Vgutierrez) [13:46:45] (03CR) 10Ssingh: [C: 03+1] cache::haproxy: Enable coredump configuration [puppet] - 10https://gerrit.wikimedia.org/r/908546 (https://phabricator.wikimedia.org/T334448) (owner: 10Vgutierrez) [13:47:08] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10Patch-For-Review: Receive network latency reports from the browsers - https://phabricator.wikimedia.org/T334417 (10CDanis) [13:48:35] (03PS1) 10Hnowlan: thumbor: set maxUnavailable to a higher number [deployment-charts] - 10https://gerrit.wikimedia.org/r/908549 (https://phabricator.wikimedia.org/T334488) [13:48:38] (03CR) 10Hashar: [C: 03+1] "+1 thank you 
so much for the new pattern!" [puppet] - 10https://gerrit.wikimedia.org/r/908232 (https://phabricator.wikimedia.org/T324659) (owner: 10Jbond) [13:49:16] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [13:49:26] (03PS2) 10Hnowlan: thumbor: set maxUnavailable to a higher number [deployment-charts] - 10https://gerrit.wikimedia.org/r/908549 (https://phabricator.wikimedia.org/T334488) [13:49:31] (03PS14) 10Hashar: ci: indicate which server is the control server via a hiera param [puppet] - 10https://gerrit.wikimedia.org/r/908232 (https://phabricator.wikimedia.org/T324659) (owner: 10Jbond) [13:51:24] (03CR) 10FNegri: [C: 03+1] "It would be nice to share those scripts in a central location, instead of copy/pasting them in different repos... But for now, I'm fine wi" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/907890 (owner: 10David Caro) [13:51:57] (03PS1) 10Ssingh: hiera: remove bgp-med override for lvs2007 [puppet] - 10https://gerrit.wikimedia.org/r/908552 (https://phabricator.wikimedia.org/T321309) [13:54:13] !log [puppetmaster] sudo /usr/local/sbin/puppet-facts-upload --proxy http://webproxy.eqiad.wmnet:8080; failing PCC for recently reimaged node [13:54:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:53] (03CR) 10Kamila Součková: [C: 03+2] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/908549 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan) [14:00:35] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:01:06] (03Merged) 10jenkins-bot: thumbor: set maxUnavailable to a higher number [deployment-charts] - 
10https://gerrit.wikimedia.org/r/908549 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan) [14:01:25] PROBLEM - Host irc2001 is DOWN: PING CRITICAL - Packet loss = 100% [14:01:31] PROBLEM - Host pybal-test2003 is DOWN: PING CRITICAL - Packet loss = 100% [14:01:31] PROBLEM - Host ganeti2019 is DOWN: PING CRITICAL - Packet loss = 100% [14:01:31] PROBLEM - Host kubestagetcd2001 is DOWN: PING CRITICAL - Packet loss = 100% [14:01:37] PROBLEM - Host ncredir2002 is DOWN: PING CRITICAL - Packet loss = 100% [14:01:39] PROBLEM - Host orespoolcounter2004 is DOWN: PING CRITICAL - Packet loss = 100% [14:01:44] what's happening here? [14:01:49] PROBLEM - Host kubemaster2002 is DOWN: PING CRITICAL - Packet loss = 100% [14:01:55] (03CR) 10Majavah: [C: 04-1] build: add helper scripts (032 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/907890 (owner: 10David Caro) [14:02:11] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] hiera: Enable coredumps for haproxy at text cache cluster [puppet] - 10https://gerrit.wikimedia.org/r/908547 (https://phabricator.wikimedia.org/T334448) (owner: 10Vgutierrez) [14:02:19] RECOVERY - Host ganeti2019 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [14:02:21] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] cache::haproxy: Enable coredump configuration [puppet] - 10https://gerrit.wikimedia.org/r/908546 (https://phabricator.wikimedia.org/T334448) (owner: 10Vgutierrez) [14:02:37] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [14:02:38] (ProbeDown) firing: (2) Service irc2001:6667 has failed probes (tcp_mw_rc_irc_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#irc2001:6667 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:02:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install 
cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye executed... [14:02:43] looks like a ganeti node issue? [14:03:07] seems so yeah but not sure if known or not [14:03:18] irc.wm.o is a CNAME to irc1001, not 2001 [14:03:33] (03CR) 10JMeybohm: [C: 03+1] "This change is ready for review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/908553 (https://phabricator.wikimedia.org/T333120) (owner: 10Clément Goubert) [14:04:07] PROBLEM - Check systemd state on ganeti2019 is CRITICAL: CRITICAL - degraded: The following units failed: nic-saturation-exporter.service,prometheus-ganeti-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:04:11] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:04:53] !log rolling restart of HAProxy on A:cp-text - T334448 [14:04:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:58] T334448: HAProxy 2.6.12 segfaults - https://phabricator.wikimedia.org/T334448 [14:05:17] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1002.eqiad.wmnet [14:05:32] (JobUnavailable) firing: (2) Reduced availability for job ncredir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:05:34] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST csidrivers) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency 
[14:05:35] RECOVERY - Host kubemaster2002 is UP: PING OK - Packet loss = 0%, RTA = 6.91 ms [14:05:35] (03Abandoned) 10David Caro: DONOTMERGE tests for pcc [puppet] - 10https://gerrit.wikimedia.org/r/739766 (owner: 10David Caro) [14:05:37] RECOVERY - Host kubestagetcd2001 is UP: PING OK - Packet loss = 0%, RTA = 3.49 ms [14:05:41] RECOVERY - Check systemd state on ganeti2019 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:47] RECOVERY - Host pybal-test2003 is UP: PING OK - Packet loss = 0%, RTA = 8.09 ms [14:06:13] RECOVERY - Host ncredir2002 is UP: PING OK - Packet loss = 0%, RTA = 6.94 ms [14:06:25] !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: apply [14:06:31] !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [14:07:21] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:07:39] PROBLEM - Check systemd state on kubemaster2002 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:09:29] RECOVERY - DPKG on an-airflow1001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [14:09:35] !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [14:09:57] (03CR) 10Bking: [C: 03+2] rdf-streaming-updater: tune managed memory instead of overhead [deployment-charts] - 10https://gerrit.wikimedia.org/r/907934 (owner: 10DCausse) [14:10:07] (03PS4) 10Bking: rdf-streaming-updater: tune managed memory instead of overhead [deployment-charts] - 10https://gerrit.wikimedia.org/r/907934 (owner: 10DCausse) [14:10:32] (JobUnavailable) firing: (2) Reduced availability for job ncredir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:10:34] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (LIST csidrivers) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:10:42] (03PS1) 10Jbond: dcops: add netdev duplex and speed checks [alerts] - 10https://gerrit.wikimedia.org/r/908556 (https://phabricator.wikimedia.org/T333007) [14:12:35] (03CR) 10Jbond: [C: 03+2] ci: rename ci::master role to ci [puppet] - 10https://gerrit.wikimedia.org/r/907885 (https://phabricator.wikimedia.org/T254646) (owner: 10Hashar) [14:12:39] (03CR) 10Jbond: [C: 03+2] ci: indicate which server is the control server via a hiera param [puppet] - 10https://gerrit.wikimedia.org/r/908232 (https://phabricator.wikimedia.org/T324659) (owner: 10Jbond) [14:12:43] (03CR) 10CI reject: [V: 04-1] dcops: add netdev duplex and speed checks [alerts] - 10https://gerrit.wikimedia.org/r/908556 (https://phabricator.wikimedia.org/T333007) (owner: 10Jbond) [14:14:04] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1002.eqiad.wmnet [14:14:05] 10SRE, 10Anti-Harassment, 10Cloud-Services, 10Content-Transform-Team, and 16 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10MSantos) [14:14:39] (03PS2) 10Clément Goubert: cxserver: Add mesh egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/908553 (https://phabricator.wikimedia.org/T333120) [14:14:47] (03PS1) 10Vgutierrez: cache::haproxy: Drop LimitCORESoft [puppet] - 10https://gerrit.wikimedia.org/r/908557 (https://phabricator.wikimedia.org/T334448) [14:15:48] (03CR) 10Bking: rdf-streaming-updater: tune managed memory instead of overhead [deployment-charts] - 10https://gerrit.wikimedia.org/r/907934 (owner: 
10DCausse) [14:15:50] (03CR) 10Ssingh: [C: 03+1] cache::haproxy: Drop LimitCORESoft [puppet] - 10https://gerrit.wikimedia.org/r/908557 (https://phabricator.wikimedia.org/T334448) (owner: 10Vgutierrez) [14:15:52] (03CR) 10Bking: [C: 03+2] rdf-streaming-updater: tune managed memory instead of overhead [deployment-charts] - 10https://gerrit.wikimedia.org/r/907934 (owner: 10DCausse) [14:16:15] (03PS1) 10Hnowlan: rest-gateway: support for proton [deployment-charts] - 10https://gerrit.wikimedia.org/r/908559 (https://phabricator.wikimedia.org/T334611) [14:16:43] (03CR) 10Vgutierrez: [C: 03+2] cache::haproxy: Drop LimitCORESoft [puppet] - 10https://gerrit.wikimedia.org/r/908557 (https://phabricator.wikimedia.org/T334448) (owner: 10Vgutierrez) [14:17:11] (03PS1) 10Hashar: Remove ci::manager ci::worker roles from Hiera [labs/private] - 10https://gerrit.wikimedia.org/r/908560 [14:17:29] (03CR) 10Hashar: [V: 03+2 C: 03+2] Remove ci::manager ci::worker roles from Hiera [labs/private] - 10https://gerrit.wikimedia.org/r/908560 (owner: 10Hashar) [14:18:43] (03CR) 10Ottomata: profile::kafka::{broker,mirror}: refactor TLS settings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/905954 (owner: 10Elukey) [14:19:20] (03PS3) 10Clément Goubert: cxserver: Add mesh egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/908553 (https://phabricator.wikimedia.org/T333120) [14:19:25] (03PS4) 10CDanis: Set NEL 'success_fraction: 1.0' on HTTP responses for measurement domains [puppet] - 10https://gerrit.wikimedia.org/r/908328 (https://phabricator.wikimedia.org/T334608) (owner: 10Jameel Kaisar) [14:19:32] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/908328 (https://phabricator.wikimedia.org/T334608) (owner: 10Jameel Kaisar) [14:19:44] !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [14:20:16] !log installing mariadb-10.3 security updates (as shipped in Debian, not the wmf-mariadb packages) [14:20:19] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:34] (03CR) 10CI reject: [V: 04-1] rest-gateway: support for proton [deployment-charts] - 10https://gerrit.wikimedia.org/r/908559 (https://phabricator.wikimedia.org/T334611) (owner: 10Hnowlan) [14:20:43] (03CR) 10Ottomata: flink: upgrade to flink 1.16.1 (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/908486 (https://phabricator.wikimedia.org/T334244) (owner: 10DCausse) [14:21:08] (03Merged) 10jenkins-bot: rdf-streaming-updater: tune managed memory instead of overhead [deployment-charts] - 10https://gerrit.wikimedia.org/r/907934 (owner: 10DCausse) [14:21:12] (03PS4) 10Clément Goubert: cxserver: Add mesh egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/908553 (https://phabricator.wikimedia.org/T333120) [14:21:51] (03CR) 10Elukey: [V: 03+1] profile::kafka::{broker,mirror}: refactor TLS settings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/905954 (owner: 10Elukey) [14:22:17] (03CR) 10Ottomata: profile::kafka::{broker,mirror}: refactor TLS settings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/905954 (owner: 10Elukey) [14:23:32] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:23:43] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:25:00] (03CR) 10Ssingh: [C: 03+2] hiera: remove bgp-med override for lvs2007 [puppet] - 10https://gerrit.wikimedia.org/r/908552 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [14:25:29] (03CR) 10Clément Goubert: [C: 03+2] contint: manage jenkins-ci dsh group from Puppet DB [puppet] - 10https://gerrit.wikimedia.org/r/893484 (https://phabricator.wikimedia.org/T328920) (owner: 10Hashar) [14:26:15] !log restart pybal on lvs2007 to pick up bgp-med change CR 908552 [14:26:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:35] 
PROBLEM - Check whether ferm is active by checking the default input chain on kubemaster2002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:27:53] (03PS2) 10Raymond Ndibe: tools-webservice: set default for buildservice-image [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/908369 (https://phabricator.wikimedia.org/T334586) [14:28:15] (03CR) 10Raymond Ndibe: tools-webservice: set default for buildservice-image (035 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/908369 (https://phabricator.wikimedia.org/T334586) (owner: 10Raymond Ndibe) [14:29:32] (03CR) 10Cathal Mooney: "Just looking in Prometheus directly I do see these stats exported for Ganeti servers? Do you mean the VMs? For instance:" [alerts] - 10https://gerrit.wikimedia.org/r/908556 (https://phabricator.wikimedia.org/T333007) (owner: 10Jbond) [14:30:09] (03CR) 10JMeybohm: [C: 03+1] cxserver: Add mesh egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/908553 (https://phabricator.wikimedia.org/T333120) (owner: 10Clément Goubert) [14:30:27] (03CR) 10Clément Goubert: [C: 03+2] cxserver: Add mesh egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/908553 (https://phabricator.wikimedia.org/T333120) (owner: 10Clément Goubert) [14:30:57] (03PS2) 10Jbond: dcops: add netdev duplex and speed checks [alerts] - 10https://gerrit.wikimedia.org/r/908556 (https://phabricator.wikimedia.org/T333007) [14:35:39] 10SRE, 10Infrastructure-Foundations, 10netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (10ayounsi) [14:35:47] (03Merged) 10jenkins-bot: cxserver: Add mesh egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/908553 (https://phabricator.wikimedia.org/T333120) (owner: 10Clément Goubert) [14:36:11] (03PS2) 10DCausse: flink: upgrade to flink 1.16.1 [docker-images/production-images] - 
10https://gerrit.wikimedia.org/r/908486 (https://phabricator.wikimedia.org/T334244) [14:36:14] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [14:36:17] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [14:36:20] (03CR) 10DCausse: flink: upgrade to flink 1.16.1 (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/908486 (https://phabricator.wikimedia.org/T334244) (owner: 10DCausse) [14:37:00] jouncebot: now [14:37:00] No deployments scheduled for the next 1 hour(s) and 22 minute(s) [14:37:29] (03PS11) 10MVernon: sre.swift.remove-ghost-objects: new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/905595 (https://phabricator.wikimedia.org/T327253) [14:38:19] (03PS3) 10Jbond: dcops: add netdev duplex and speed checks [alerts] - 10https://gerrit.wikimedia.org/r/908556 (https://phabricator.wikimedia.org/T333007) [14:38:48] (03CR) 10Jbond: "thanks see response inline" [alerts] - 10https://gerrit.wikimedia.org/r/908556 (https://phabricator.wikimedia.org/T333007) (owner: 10Jbond) [14:39:54] (03CR) 10Filippo Giunchedi: "Overall LGTM, see inline too" [alerts] - 10https://gerrit.wikimedia.org/r/908556 (https://phabricator.wikimedia.org/T333007) (owner: 10Jbond) [14:39:55] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [14:40:24] (03CR) 10Jbond: [C: 03+2] sre.hardware.upgrade-firmware: update to only upgrade one hardware type (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/908510 (owner: 10Jbond) [14:40:29] (03CR) 10Bking: [C: 03+2] flink: upgrade to flink 1.16.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/908486 (https://phabricator.wikimedia.org/T334244) (owner: 10DCausse) [14:41:57] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [14:42:32] (03Merged) 10jenkins-bot: sre.hardware.upgrade-firmware: update to only upgrade one hardware type [cookbooks] - 
10https://gerrit.wikimedia.org/r/908510 (owner: 10Jbond) [14:47:10] 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.13 point update - https://phabricator.wikimedia.org/T317413 (10MoritzMuehlenhoff) [14:47:17] (03CR) 10Kamila Součková: rest-gateway: support for proton (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/908559 (https://phabricator.wikimedia.org/T334611) (owner: 10Hnowlan) [14:47:41] 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.13 point update - https://phabricator.wikimedia.org/T317413 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This is complete [14:48:05] (03PS1) 10Vgutierrez: Revert "hiera: Enable ESI testing in cp3064" [puppet] - 10https://gerrit.wikimedia.org/r/908568 (https://phabricator.wikimedia.org/T308799) [14:48:07] (03PS1) 10Vgutierrez: Revert "hiera: Enable esitest on text@eqsin" [puppet] - 10https://gerrit.wikimedia.org/r/908569 (https://phabricator.wikimedia.org/T308799) [14:48:43] 10SRE-swift-storage, 10Observability-Metrics, 10User-fgiunchedi: Investigate HW requirements for Thanos frontend - https://phabricator.wikimedia.org/T312201 (10fgiunchedi) We'll need real hardware since there's significant memory requirements (min 32G), I've put in requests for 2x config A 10G hosts (2x eqia... [14:49:05] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirtlocal1002.eqiad.wmnet with OS bullseye [14:50:14] 10SRE, 10Scap, 10Python3-Porting, 10Release-Engineering-Team (Priority Backlog 📥): git-fat replacement/removal - https://phabricator.wikimedia.org/T279509 (10hashar) [14:50:44] (03PS3) 10Ayounsi: Manage drmrs LVS/bird BGP with Homer [homer/public] - 10https://gerrit.wikimedia.org/r/905257 [14:50:50] (03CR) 10Cathal Mooney: "Overall LGTM, few comments in-line which might help make CI happy. 
Filtering interfaces starting with an 'e' and status 'up' should be fa" [alerts] - 10https://gerrit.wikimedia.org/r/908556 (https://phabricator.wikimedia.org/T333007) (owner: 10Jbond) [14:51:38] 10SRE, 10Scap, 10Python3-Porting, 10Release-Engineering-Team (Seen): git-fat replacement/removal - https://phabricator.wikimedia.org/T279509 (10hashar) a:05demon→03None [14:51:53] 10SRE, 10Scap, 10Python3-Porting, 10Release-Engineering-Team (Seen): git-fat replacement/removal - https://phabricator.wikimedia.org/T279509 (10hashar) [14:53:51] (03PS4) 10Ayounsi: Manage drmrs LVS/bird BGP with Homer [homer/public] - 10https://gerrit.wikimedia.org/r/905257 [14:54:47] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40666/console" [puppet] - 10https://gerrit.wikimedia.org/r/908568 (https://phabricator.wikimedia.org/T308799) (owner: 10Vgutierrez) [14:55:20] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] Revert "hiera: Enable ESI testing in cp3064" [puppet] - 10https://gerrit.wikimedia.org/r/908568 (https://phabricator.wikimedia.org/T308799) (owner: 10Vgutierrez) [14:55:26] (03CR) 10Ayounsi: Manage drmrs LVS/bird BGP with Homer (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/905257 (owner: 10Ayounsi) [14:58:30] (03PS2) 10Michael Große: beta wikidata: Enable new EntitySchema datatype [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908509 (https://phabricator.wikimedia.org/T332725) [14:58:32] (03CR) 10Michael Große: beta wikidata: Enable new EntitySchema datatype (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908509 (https://phabricator.wikimedia.org/T332725) (owner: 10Michael Große) [15:00:03] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "Config change itself looks fine now, though still blocked on T333813 (data type validation)." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/908509 (https://phabricator.wikimedia.org/T332725) (owner: 10Michael Große) [15:00:32] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirtlocal1003.eqiad.wmnet with OS bullseye [15:03:05] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [15:03:06] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [15:03:46] (03CR) 10Cathal Mooney: dcops: add netdev duplex and speed checks (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/908556 (https://phabricator.wikimedia.org/T333007) (owner: 10Jbond) [15:04:24] (03PS8) 10Elukey: profile::kafka::{broker,mirror}: refactor TLS settings [puppet] - 10https://gerrit.wikimedia.org/r/905954 [15:04:43] !log installing unbound security updates on buster [15:04:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:58] (03PS1) 10Hashar: gerrit: remove leftover absent http config [puppet] - 10https://gerrit.wikimedia.org/r/908574 (https://phabricator.wikimedia.org/T326125) [15:05:11] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40667/console" [puppet] - 10https://gerrit.wikimedia.org/r/905954 (owner: 10Elukey) [15:05:22] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [15:06:25] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [15:06:31] (03PS2) 10Hnowlan: rest-gateway: support for proton [deployment-charts] - 10https://gerrit.wikimedia.org/r/908559 (https://phabricator.wikimedia.org/T334611) [15:06:40] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [15:07:03] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [15:07:08] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40668/console" [puppet] - 10https://gerrit.wikimedia.org/r/905954 (owner: 10Elukey) [15:07:18] (03CR) 10Elukey: [V: 03+1] profile::kafka::{broker,mirror}: refactor TLS settings (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/905954 (owner: 10Elukey) [15:07:38] (03CR) 10Elukey: [V: 03+1] "Ottomata: +1 from you to proceed?" [puppet] - 10https://gerrit.wikimedia.org/r/905954 (owner: 10Elukey) [15:08:04] (03PS1) 10Andrew Bogott: Fix hostname for new cloudvirtlocal hosts [puppet] - 10https://gerrit.wikimedia.org/r/908576 (https://phabricator.wikimedia.org/T329863) [15:08:43] (03CR) 10Andrew Bogott: [C: 03+2] Fix hostname for new cloudvirtlocal hosts [puppet] - 10https://gerrit.wikimedia.org/r/908576 (https://phabricator.wikimedia.org/T329863) (owner: 10Andrew Bogott) [15:09:06] 10SRE, 10Phabricator: Remove phabricator Multi-factor Auth for Atieno - https://phabricator.wikimedia.org/T334480 (10sbassett) >>! In T334480#8778684, @Atieno wrote: > Hi @sbassett so can I schedule some time on your calendar for a video chat or can I slack you as @Aklapper has suggested. Though, I might go t... [15:09:21] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 124 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [15:09:26] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1003.eqiad.wmnet with OS bullseye [15:09:28] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1002.eqiad.wmnet with OS bullseye [15:09:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirtlocal1003.eqiad.wmnet... 
[15:09:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirtlocal1002.eqiad.wmnet... [15:09:41] (03CR) 10CI reject: [V: 04-1] rest-gateway: support for proton [deployment-charts] - 10https://gerrit.wikimedia.org/r/908559 (https://phabricator.wikimedia.org/T334611) (owner: 10Hnowlan) [15:10:00] (03CR) 10David Caro: [C: 03+2] debian: add defaults for changelog generation [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/907888 (owner: 10David Caro) [15:10:43] !log kamila@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply [15:10:53] !log Migrating cxserver to mw-api-int on kubernetes, take three - T334204 [15:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:58] T334204: Migrate cxserver to mw-api-int - https://phabricator.wikimedia.org/T334204 [15:11:15] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [15:11:16] (03Merged) 10jenkins-bot: debian: add defaults for changelog generation [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/907888 (owner: 10David Caro) [15:11:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirtlocal1001.eqiad.wmnet... 
[15:12:52] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply [15:13:00] !log remove runc packages installed on mw1349-mw1436, these were once used for a load test with dragonfly and are no longer needed [15:13:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:07] !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [15:13:34] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [15:13:48] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [15:14:11] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [15:14:30] (03CR) 10Filippo Giunchedi: dcops: add netdev duplex and speed checks (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/908556 (https://phabricator.wikimedia.org/T333007) (owner: 10Jbond) [15:17:34] !log cxserver migrated to mw-api-int on kubernetes, take three - T334204 [15:17:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:38] T334204: Migrate cxserver to mw-api-int - https://phabricator.wikimedia.org/T334204 [15:18:48] (03CR) 10Eevans: [C: 03+1] sre.swift.remove-ghost-objects: new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/905595 (https://phabricator.wikimedia.org/T327253) (owner: 10MVernon) [15:19:21] (03CR) 10MVernon: [C: 03+2] sre.swift.remove-ghost-objects: new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/905595 (https://phabricator.wikimedia.org/T327253) (owner: 10MVernon) [15:19:21] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply [15:19:27] (03Abandoned) 10Clément Goubert: Revert "Revert "Revert "cxserver: Switch to mw-api-int-async on k8s""" [deployment-charts] - 10https://gerrit.wikimedia.org/r/908303 (owner: 10Clément Goubert) [15:19:29] (03CR) 10Cathal Mooney: dcops: add netdev duplex and speed checks (032 comments) [alerts] - 
10https://gerrit.wikimedia.org/r/908556 (https://phabricator.wikimedia.org/T333007) (owner: 10Jbond) [15:19:33] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply [15:22:18] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply [15:22:36] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply [15:22:42] (03CR) 10Bking: [V: 03+2 C: 03+2] flink: upgrade to flink 1.16.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/908486 (https://phabricator.wikimedia.org/T334244) (owner: 10DCausse) [15:23:10] !log stevemunene@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1132.eqiad.wmnet with OS buster [15:23:30] 10SRE, 10ops-eqiad, 10Data-Engineering, 10Patch-For-Review: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1132.eqiad.wmnet with OS buster executed with errors: - an-wo... 
[15:23:55] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (10Clement_Goubert) [15:24:15] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply [15:24:33] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply [15:25:05] (03CR) 10Ottomata: [C: 03+1] profile::kafka::{broker,mirror}: refactor TLS settings [puppet] - 10https://gerrit.wikimedia.org/r/905954 (owner: 10Elukey) [15:25:10] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [15:25:13] (03CR) 10Ottomata: [C: 03+1] profile::kafka::{broker,mirror}: refactor TLS settings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/905954 (owner: 10Elukey) [15:25:42] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirtlocal1003.eqiad.wmnet with reason: host reimage [15:25:45] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirtlocal1002.eqiad.wmnet with reason: host reimage [15:25:58] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirtlocal1001.eqiad.wmnet with reason: host reimage [15:26:34] (03CR) 10Filippo Giunchedi: dcops: add netdev duplex and speed checks (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/908556 (https://phabricator.wikimedia.org/T333007) (owner: 10Jbond) [15:29:03] !log deploying analytics refinery-update pageview druid table [15:29:04] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirtlocal1003.eqiad.wmnet with reason: host reimage [15:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:34] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:30:01] PROBLEM - Check systemd state on db1117 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter@m3.service,wmf_auto_restart_prometheus-mysqld-exporter@m5.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:30:14] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1132.eqiad.wmnet with OS buster [15:30:24] 10SRE, 10ops-eqiad, 10Data-Engineering, 10Patch-For-Review: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1132.eqiad.wmnet with OS buster [15:31:15] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirtlocal1001.eqiad.wmnet with reason: host reimage [15:31:39] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [15:33:17] (03CR) 10David Caro: [C: 04-1] "?me like where this is going :)" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/908369 (https://phabricator.wikimedia.org/T334586) (owner: 10Raymond Ndibe) [15:33:56] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirtlocal1002.eqiad.wmnet with reason: host reimage [15:34:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:34:47] (03CR) 10Elukey: [V: 03+1 C: 03+2] profile::kafka::{broker,mirror}: refactor TLS settings [puppet] - 10https://gerrit.wikimedia.org/r/905954 (owner: 10Elukey) [15:34:49] 10SRE, 10Infrastructure-Foundations, 10puppet-compiler: PCC failing for an LVS host (false negative) even after manually updating 
facts - https://phabricator.wikimedia.org/T334680 (10ssingh) [15:36:21] !log sukhe@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host lvs2007 [15:36:25] (03CR) 10David Caro: [C: 04-1] tools-webservice: set default for buildservice-image (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/908369 (https://phabricator.wikimedia.org/T334586) (owner: 10Raymond Ndibe) [15:36:28] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host lvs2007 [15:36:41] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [15:36:53] (03CR) 10Elukey: [C: 03+1] Add krb2002 as additional KDC [puppet] - 10https://gerrit.wikimedia.org/r/906560 (https://phabricator.wikimedia.org/T331695) (owner: 10Muehlenhoff) [15:38:30] (03PS1) 10BCornwall: hiera: lvs2008: update iface names for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/908585 (https://phabricator.wikimedia.org/T321309) [15:40:50] (03CR) 10Ssingh: hiera: lvs2008: update iface names for bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/908585 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall) [15:41:32] !log ebysans@deploy2002 Started deploy [analytics/refinery@4e8f1ac]: Update druid pageview hourly and daily tables [analytics/refinery@4e8f1ac] [15:41:37] (03PS2) 10BCornwall: hiera: lvs2008: update iface names for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/908585 (https://phabricator.wikimedia.org/T321309) [15:42:21] (03CR) 10BCornwall: hiera: lvs2008: update iface names for bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/908585 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall) [15:42:41] !log paused Oozie pageview-druid-hourly job. 
[15:42:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:55] (03PS3) 10Hnowlan: rest-gateway: support for proton [deployment-charts] - 10https://gerrit.wikimedia.org/r/908559 (https://phabricator.wikimedia.org/T334611) [15:44:14] (03CR) 10Ssingh: [C: 03+1] hiera: lvs2008: update iface names for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/908585 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall) [15:44:50] !log andrew@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1001" [15:46:40] !log Disable Puppet/PyBal on lvs2008 in preparation for reimaging - T321309 [15:46:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:44] T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 [15:46:51] !log andrew@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1001" [15:47:15] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:47:27] ^ expected, BGP alerts in codfw [15:47:32] !log andrew@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1001" [15:47:37] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirtlocal1003.eqiad.wmnet with OS bullseye [15:47:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirtlocal1003.eqiad.wmnet with OS bullseye complete... 
[15:47:57] !log ebysans@deploy2002 Finished deploy [analytics/refinery@4e8f1ac]: Update druid pageview hourly and daily tables [analytics/refinery@4e8f1ac] (duration: 06m 24s)
[15:48:03] !log andrew@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1001"
[15:48:04] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye
[15:48:10] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye complete...
[15:49:11] !log stevemunene@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1132.eqiad.wmnet with OS buster
[15:49:19] SRE, ops-eqiad, Data-Engineering, Patch-For-Review: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1132.eqiad.wmnet with OS buster executed with errors: - an-wo...
[15:49:22] !log ebysans@deploy2002 Started deploy [analytics/refinery@4e8f1ac] (thin): Update druid pageview hourly and daily tables THIN [analytics/refinery@4e8f1ac]
[15:49:31] !log ebysans@deploy2002 Finished deploy [analytics/refinery@4e8f1ac] (thin): Update druid pageview hourly and daily tables THIN [analytics/refinery@4e8f1ac] (duration: 00m 08s)
[15:49:40] (CR) David Caro: "A bit old, lgtm, have not tested it though, and probably might be better to rebase, but +1 from me if rebasing works and you tested it xd" [software/puppet-compiler] - https://gerrit.wikimedia.org/r/853403 (https://phabricator.wikimedia.org/T322437) (owner: Jbond)
[15:49:41] PROBLEM - pybal on lvs2008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[15:49:47] ^ expected
[15:49:55] !log andrew@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1001"
[15:50:01] we don't intentionally downtime the host
[15:50:15] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:50:32] !log ebysans@deploy2002 Started deploy [analytics/refinery@4e8f1ac] (hadoop-test): Update druid pageview hourly and daily tables TEST [analytics/refinery@4e8f1ac]
[15:50:41] PROBLEM - PyBal backends health check on lvs2008 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[15:50:50] (CR) Hnowlan: rest-gateway: support for proton (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/908559 (https://phabricator.wikimedia.org/T334611) (owner: Hnowlan)
[15:50:51] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (BCornwall)
[15:51:15] PROBLEM -
PyBal connections to etcd on lvs2008 is CRITICAL: CRITICAL: 0 connections established with conf2004.codfw.wmnet:4001 (min=6) https://wikitech.wikimedia.org/wiki/PyBal
[15:51:19] !log andrew@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1001"
[15:51:24] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirtlocal1002.eqiad.wmnet with OS bullseye
[15:51:32] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirtlocal1002.eqiad.wmnet with OS bullseye complete...
[15:51:58] !log ebysans@deploy2002 Finished deploy [analytics/refinery@4e8f1ac] (hadoop-test): Update druid pageview hourly and daily tables TEST [analytics/refinery@4e8f1ac] (duration: 01m 26s)
[15:52:09] SRE, serviceops: Move Kafka main to the new intermediate PKI CA - https://phabricator.wikimedia.org/T319372 (elukey) Open→Resolved a:elukey Deployment-prep may be migrated in the future, not in scope for this task. Finally closing!
[15:52:18] SRE, Analytics-Radar, Data-Engineering, Event-Platform Value Stream, Patch-For-Review: Allow kafka clients to verify brokers hostnames when using SSL - https://phabricator.wikimedia.org/T291905 (elukey)
[15:55:11] (CR) FNegri: [C: +1] "Thanks Moritz, this can be merged."
[puppet] - https://gerrit.wikimedia.org/r/907717 (owner: Muehlenhoff)
[15:55:55] (PS1) Andrew Bogott: Make cloudvirtlocal100[1-3] into actual cloudvirts [puppet] - https://gerrit.wikimedia.org/r/908587 (https://phabricator.wikimedia.org/T329863)
[15:58:18] (CR) Andrew Bogott: [C: +2] Make cloudvirtlocal100[1-3] into actual cloudvirts [puppet] - https://gerrit.wikimedia.org/r/908587 (https://phabricator.wikimedia.org/T329863) (owner: Andrew Bogott)
[15:58:44] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1132.eqiad.wmnet with OS buster
[15:58:53] SRE, ops-eqiad, Data-Engineering, Patch-For-Review: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1132.eqiad.wmnet with OS buster
[15:58:59] (CR) Cathal Mooney: dcops: add netdev duplex and speed checks (1 comment) [alerts] - https://gerrit.wikimedia.org/r/908556 (https://phabricator.wikimedia.org/T333007) (owner: Jbond)
[15:59:58] (CR) Jbond: dcops: add netdev duplex and speed checks (8 comments) [alerts] - https://gerrit.wikimedia.org/r/908556 (https://phabricator.wikimedia.org/T333007) (owner: Jbond)
[16:00:05] jbond, rzl, and sukhe: My dear minions, it's time we take the moon! Just kidding. Time for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230413T1600).
[16:00:05] tgr: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[16:00:42] tgr_: hi!
I can start now
[16:01:49] I will wait for your signal to start
[16:04:52] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply
[16:05:04] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply
[16:07:27] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[16:08:51] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[16:10:58] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1132.eqiad.wmnet with reason: host reimage
[16:12:30] ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (phaultfinder)
[16:13:31] ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (phaultfinder)
[16:14:26] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1132.eqiad.wmnet with reason: host reimage
[16:18:42] (PS5) Jameel Kaisar: Set NEL 'success_fraction: 1.0' on HTTP responses for measurement domains [puppet] - https://gerrit.wikimedia.org/r/908328 (https://phabricator.wikimedia.org/T334608)
[16:18:44] (PS3) Raymond Ndibe: tools-webservice: set default for buildservice-image [software/tools-webservice] - https://gerrit.wikimedia.org/r/908369 (https://phabricator.wikimedia.org/T334586)
[16:18:54] SRE, ops-codfw, Infrastructure-Foundations, netops: Codfw:row A/B: rack/cable new switches - https://phabricator.wikimedia.org/T332180 (Papaul)
[16:19:46] (CR) Jameel Kaisar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/908328 (https://phabricator.wikimedia.org/T334608) (owner: Jameel Kaisar)
[16:20:22] sukhe: sorry for the delay!
a meeting ran over
[16:20:29] we can start now if that's OK
[16:20:53] np! I just wanted to be sure that it's OK to start
[16:21:03] starting now
[16:21:34] !log disable puppet on A:cp-text to merge CR 907937
[16:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:21:43] PROBLEM - ensure kvm processes are running on cloudvirtlocal1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[16:22:05] SRE-swift-storage, MediaWiki-File-management: FileBackendMultiWrite multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (Ladsgroup) Try and error but again if it's not generated already, it'll be generated correctly so any non-obvious size should work, if you have 180px, 181px wou...
[16:22:17] (CR) Ssingh: [C: +2] multi-dc: Improve OAuth URL patterns for routing to primary [puppet] - https://gerrit.wikimedia.org/r/907937 (https://phabricator.wikimedia.org/T332650) (owner: Gergő Tisza)
[16:23:03] (CR) Jameel Kaisar: Set NEL 'success_fraction: 1.0' on HTTP responses for measurement domains (1 comment) [puppet] - https://gerrit.wikimedia.org/r/908328 (https://phabricator.wikimedia.org/T334608) (owner: Jameel Kaisar)
[16:23:20] (CR) Raymond Ndibe: tools-webservice: set default for buildservice-image (3 comments) [software/tools-webservice] - https://gerrit.wikimedia.org/r/908369 (https://phabricator.wikimedia.org/T334586) (owner: Raymond Ndibe)
[16:23:32] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=5; selector: service=thumbor,name=kubernetes201[0123].codfw.wmnet
[16:23:43] ACKNOWLEDGEMENT - ensure kvm processes are running on cloudvirtlocal1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 Andrew Bogott work in progress https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[16:24:13] ACKNOWLEDGEMENT - ensure
kvm processes are running on cloudvirtlocal1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 Andrew Bogott work in progress https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[16:26:45] tgr_: looks good, will roll out to others and restarts. takes about an hour or more (48 hosts, we sleep/drain 90 seconds per host)
[16:26:50] feel free to leave and I will ping you when done :)
[16:27:03] !log enable puppet on A:cp-text to merge CR 907937
[16:27:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:27:32] sukhe: ack, thanks! seeing whether the patch did much will probably take another hour or so
[16:27:42] ok!
[16:28:29] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2067']
[16:28:34] !log jhancock@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['ms-be2067']
[16:29:08] (CR) CDanis: [C: +1] "thanks!"
[puppet] - https://gerrit.wikimedia.org/r/908328 (https://phabricator.wikimedia.org/T334608) (owner: Jameel Kaisar)
[16:30:20] (CR) Kamila Součková: rest-gateway: support for proton (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/908559 (https://phabricator.wikimedia.org/T334611) (owner: Hnowlan)
[16:30:24] (CR) Majavah: [C: -1] tools-webservice: set default for buildservice-image (1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/908369 (https://phabricator.wikimedia.org/T334586) (owner: Raymond Ndibe)
[16:31:12] (PS4) Raymond Ndibe: tools-webservice: set default for buildservice-image [software/tools-webservice] - https://gerrit.wikimedia.org/r/908369 (https://phabricator.wikimedia.org/T334586)
[16:31:24] (CR) Raymond Ndibe: tools-webservice: set default for buildservice-image (1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/908369 (https://phabricator.wikimedia.org/T334586) (owner: Raymond Ndibe)
[16:31:36] !log sudo cumin -b1 -s30 'A:cp-text' 'ats-backend-restart': T332650
[16:31:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:31:41] T332650: Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650
[16:40:52] SRE, MediaWiki-General: The script file run.php cannot be executed using MaintenanceRunner - https://phabricator.wikimedia.org/T334484 (daniel) >>! In T334484#8771927, @TheresNoTime wrote: > Was that a recent change..? It was a recent fix. Explicit mention of "run.php" was only needed for a months ot so...
[16:46:12] !log hnowlan@puppetmaster1001 conftool action : set/pooled=inactive; selector: service=thumbor,name=kubernetes201[0123].codfw.wmnet
[16:46:35] (PS1) Andrew Bogott: Update profile::openstack::base::nova::instance_dev for new cloudvirtlocals [puppet] - https://gerrit.wikimedia.org/r/908595 (https://phabricator.wikimedia.org/T329863)
[16:47:10] (CR) Andrew Bogott: [C: +2] Update profile::openstack::base::nova::instance_dev for new cloudvirtlocals [puppet] - https://gerrit.wikimedia.org/r/908595 (https://phabricator.wikimedia.org/T329863) (owner: Andrew Bogott)
[16:47:59] Puppet, Cloud-Services, Infrastructure-Foundations: WMCS puppet-enc not working with puppetserver 7 - https://phabricator.wikimedia.org/T334686 (jbond)
[16:48:11] (CR) BCornwall: [C: +2] hiera: lvs2008: update iface names for bullseye [puppet] - https://gerrit.wikimedia.org/r/908585 (https://phabricator.wikimedia.org/T321309) (owner: BCornwall)
[16:49:32] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2008.codfw.wmnet with OS bullseye
[16:49:38] Puppet, Cloud-VPS, Infrastructure-Foundations: WMCS puppet-enc not working with puppetserver 7 - https://phabricator.wikimedia.org/T334686 (taavi)
[16:49:45] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host lvs2008.codfw.wmnet with OS bullseye
[16:52:00] Puppet, Cloud-VPS, Infrastructure-Foundations: WMCS puppet-enc not working with puppetserver 7 - https://phabricator.wikimedia.org/T334686 (taavi) You seem to be missing the configuration to use puppet-enc, particularly `/etc/puppet/puppet.conf` needs `external_nodes = /usr/local/bin/puppet-enc`.
[16:53:29] (PS4) Hnowlan: rest-gateway: support for proton [deployment-charts] - https://gerrit.wikimedia.org/r/908559 (https://phabricator.wikimedia.org/T334611)
[16:53:41] (CR) Hnowlan: rest-gateway: support for proton (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/908559 (https://phabricator.wikimedia.org/T334611) (owner: Hnowlan)
[16:53:53] (PS2) BCornwall: Add itamar to restricted group [puppet] - https://gerrit.wikimedia.org/r/898675 (https://phabricator.wikimedia.org/T331899) (owner: Muehlenhoff)
[16:55:44] SRE, SRE-Access-Requests, Patch-For-Review, User-ItamarWMDE: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T331899 (BCornwall) a:ItamarWMDE→thcipriani @thcipriani can you approve @ItamarWMDE's inclusion to the private group, please? Thanks!
[16:55:47] (CR) Cathal Mooney: [C: +2] Expose interface VRF association to templates if present in Netbox [software/homer/deploy] - https://gerrit.wikimedia.org/r/908325 (https://phabricator.wikimedia.org/T312635) (owner: Cathal Mooney)
[17:00:06] bd808: It is that lovely time of the day again! You are hereby commanded to deploy Technical Engagement weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230413T1700).
[17:00:06] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230413T1700)
[17:01:43] * bd808 looks to see if he has anything to deploy today
[17:04:51] oh I do have a reason to push out a new build of the developer portal.
:)
[17:06:42] (PS1) BryanDavis: developer-portal: Bump container to 2023-04-13-112340-production [deployment-charts] - https://gerrit.wikimedia.org/r/908599
[17:09:20] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs2008.codfw.wmnet with reason: host reimage
[17:11:59] (PS3) Cathal Mooney: Expose additional link information to Homer templates in wmf-netbox.py [software/homer/deploy] - https://gerrit.wikimedia.org/r/884908 (https://phabricator.wikimedia.org/T328313)
[17:12:07] (CR) CI reject: [V: -1] Expose additional link information to Homer templates in wmf-netbox.py [software/homer/deploy] - https://gerrit.wikimedia.org/r/884908 (https://phabricator.wikimedia.org/T328313) (owner: Cathal Mooney)
[17:12:45] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs2008.codfw.wmnet with reason: host reimage
[17:12:50] (CR) Kamila Součková: [C: +1] "LGTM" [deployment-charts] - https://gerrit.wikimedia.org/r/908559 (https://phabricator.wikimedia.org/T334611) (owner: Hnowlan)
[17:13:17] (CR) BryanDavis: [C: +2] developer-portal: Bump container to 2023-04-13-112340-production [deployment-charts] - https://gerrit.wikimedia.org/r/908599 (owner: BryanDavis)
[17:18:23] (Merged) jenkins-bot: developer-portal: Bump container to 2023-04-13-112340-production [deployment-charts] - https://gerrit.wikimedia.org/r/908599 (owner: BryanDavis)
[17:27:12] !log bd808@deploy2002 helmfile [staging] START helmfile.d/services/developer-portal: apply
[17:27:42] !log bd808@deploy2002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply
[17:27:59] !log bd808@deploy2002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply
[17:28:22] !log bd808@deploy2002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply
[17:28:27] !log bd808@deploy2002 helmfile [codfw] START helmfile.d/services/developer-portal: apply
[17:28:51] !log bd808@deploy2002
helmfile [codfw] DONE helmfile.d/services/developer-portal: apply
[17:29:19] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:30:05] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs2008.codfw.wmnet with OS bullseye
[17:30:16] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host lvs2008.codfw.wmnet with OS bullseye completed: - lvs2008 (**PASS**) - Downtimed on Icinga/Aler...
[17:32:17] SRE, ops-eqiad, decommission-hardware: decommission db1103.eqiad.wmnet - https://phabricator.wikimedia.org/T332293 (Jclark-ctr) a:Jclark-ctr
[17:34:05] SRE, ops-eqiad, decommission-hardware: decommission db1103.eqiad.wmnet - https://phabricator.wikimedia.org/T332293 (Jclark-ctr)
[17:36:47] !log sukhe@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host lvs2008
[17:37:04] !log brett@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host lvs2008
[17:37:04] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host lvs2008
[17:37:13] ops-codfw, Data-Persistence-Backup: Inbound interface errors - https://phabricator.wikimedia.org/T331373 (Jhancock.wm) @Papaul can we try port 6/0/22? I will be here on Friday to make the move.
[17:37:31] !log brett@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host lvs2008
[17:38:41] SRE, ops-eqiad, decommission-hardware: decommission db1101.eqiad.wmnet - https://phabricator.wikimedia.org/T331381 (Jclark-ctr)
[17:40:22] (CR) DCausse: [C: +2] rdf-streaming-updater: disable jemalloc [deployment-charts] - https://gerrit.wikimedia.org/r/908539 (owner: DCausse)
[17:40:51] SRE, ops-eqiad, decommission-hardware: decommission db1104.eqiad.wmnet - https://phabricator.wikimedia.org/T329481 (Jclark-ctr)
[17:40:58] SRE, ops-eqiad, decommission-hardware: decommission db1104.eqiad.wmnet - https://phabricator.wikimedia.org/T329481 (Jclark-ctr) Open→Resolved
[17:41:13] SRE, ops-eqiad, decommission-hardware: decommission db1103.eqiad.wmnet - https://phabricator.wikimedia.org/T332293 (Jclark-ctr) Open→Resolved
[17:42:17] SRE, ops-eqiad, decommission-hardware: decommission db1101.eqiad.wmnet - https://phabricator.wikimedia.org/T331381 (Jclark-ctr) Open→Resolved
[17:42:28] (PS4) Cathal Mooney: Expose additional link information to Homer templates in wmf-netbox.py [software/homer/deploy] - https://gerrit.wikimedia.org/r/884908 (https://phabricator.wikimedia.org/T328313)
[17:43:05] (CR) CI reject: [V: -1] Expose additional link information to Homer templates in wmf-netbox.py [software/homer/deploy] - https://gerrit.wikimedia.org/r/884908 (https://phabricator.wikimedia.org/T328313) (owner: Cathal Mooney)
[17:43:07] (PS3) Jbond: environment: add environment.conf file and remove environments dir [puppet] - https://gerrit.wikimedia.org/r/907991
[17:43:09] (PS5) Jbond: wmflib: updat ipresolv to work with puppet7 [puppet] - https://gerrit.wikimedia.org/r/907938 (https://phabricator.wikimedia.org/T294841)
[17:43:11] (PS2) Jbond: core_modules: add core modules [puppet] - https://gerrit.wikimedia.org/r/908326
[17:43:13] (PS30) Jbond:
puppetserver: (WIP) add basic class for puppert server [puppet] - https://gerrit.wikimedia.org/r/895356
[17:43:25] RECOVERY - Check systemd state on db1117 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:43:31] (CR) CI reject: [V: -1] environment: add environment.conf file and remove environments dir [puppet] - https://gerrit.wikimedia.org/r/907991 (owner: Jbond)
[17:43:49] (CR) CI reject: [V: -1] core_modules: add core modules [puppet] - https://gerrit.wikimedia.org/r/908326 (owner: Jbond)
[17:44:20] (CR) CI reject: [V: -1] puppetserver: (WIP) add basic class for puppert server [puppet] - https://gerrit.wikimedia.org/r/895356 (owner: Jbond)
[17:46:14] (PS1) Hashar: gerrit: remove duplicate $gerrit_site definition [puppet] - https://gerrit.wikimedia.org/r/908604 (https://phabricator.wikimedia.org/T333143)
[17:46:33] !log brett@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host lvs2008
[17:46:33] (CR) CI reject: [V: -1] wmflib: updat ipresolv to work with puppet7 [puppet] - https://gerrit.wikimedia.org/r/907938 (https://phabricator.wikimedia.org/T294841) (owner: Jbond)
[17:46:45] !log brett@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host lvs2008
[17:47:47] (PS1) BCornwall: hierdata: Remove bgp-med for lvs2008 [puppet] - https://gerrit.wikimedia.org/r/908605 (https://phabricator.wikimedia.org/T321309)
[17:48:49] !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1102.eqiad.wmnet with reason: T334057
[17:48:53] T334057: Replace db1102 with db1225 - https://phabricator.wikimedia.org/T334057
[17:49:04] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1102.eqiad.wmnet with reason: T334057
[17:49:17] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate
labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[17:51:31] (CR) Ssingh: [C: +1] hierdata: Remove bgp-med for lvs2008 [puppet] - https://gerrit.wikimedia.org/r/908605 (https://phabricator.wikimedia.org/T321309) (owner: BCornwall)
[17:51:37] (CR) BCornwall: [C: +2] hierdata: Remove bgp-med for lvs2008 [puppet] - https://gerrit.wikimedia.org/r/908605 (https://phabricator.wikimedia.org/T321309) (owner: BCornwall)
[17:53:19] (PS1) Bartosz Dziewoński: Enable mobile page tabs for everyone in ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/908607 (https://phabricator.wikimedia.org/T334395)
[17:54:00] (PS2) DCausse: rdf-streaming-updater: disable jemalloc [deployment-charts] - https://gerrit.wikimedia.org/r/908539
[17:55:12] !log restarting pybal on lvs2008 to pick up bgp-med change
[17:55:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:56:26] (PS5) Cathal Mooney: Expose additional link information to Homer templates in wmf-netbox.py [software/homer/deploy] - https://gerrit.wikimedia.org/r/884908 (https://phabricator.wikimedia.org/T328313)
[17:57:09] (PS1) Andrew Bogott: cloudvirtlocal: link nova instances dir to /srv [puppet] - https://gerrit.wikimedia.org/r/908608 (https://phabricator.wikimedia.org/T329863)
[17:57:30] !log Disable Puppet/PyBal on lvs2009 in preparation for reimaging - T321309
[17:57:32] (CR) CI reject: [V: -1] cloudvirtlocal: link nova instances dir to /srv [puppet] - https://gerrit.wikimedia.org/r/908608 (https://phabricator.wikimedia.org/T329863) (owner: Andrew Bogott)
[17:57:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:57:34] T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309
[17:58:18] (PS2) Andrew Bogott:
cloudvirtlocal: link nova instances dir to /srv [puppet] - https://gerrit.wikimedia.org/r/908608 (https://phabricator.wikimedia.org/T329863)
[17:58:21] (PS31) Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - https://gerrit.wikimedia.org/r/895356
[17:59:17] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:59:17] (CR) CI reject: [V: -1] puppetserver: (WIP) add basic class for puppert server [puppet] - https://gerrit.wikimedia.org/r/895356 (owner: Jbond)
[18:00:07] ^demon and hashar: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki train - Utc-7+Utc-0 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230413T1800).
[18:00:41] (CR) Andrew Bogott: [C: +2] cloudvirtlocal: link nova instances dir to /srv [puppet] - https://gerrit.wikimedia.org/r/908608 (https://phabricator.wikimedia.org/T329863) (owner: Andrew Bogott)
[18:00:48] (CR) DCausse: [C: +2] rdf-streaming-updater: disable jemalloc [deployment-charts] - https://gerrit.wikimedia.org/r/908539 (owner: DCausse)
[18:00:53] (PS1) BCornwall: hiera: lvs2009: update iface names for bullseye [puppet] - https://gerrit.wikimedia.org/r/908609 (https://phabricator.wikimedia.org/T321309)
[18:01:07] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[18:01:21] SRE, ops-eqiad, cloud-services-team, decommission-hardware: decommission cloudvirt1017.eqiad.wmnet and cloudvirt102[1-2].eqiad.wmnet - https://phabricator.wikimedia.org/T333169 (Jclark-ctr)
[18:01:37] SRE, ops-eqiad, cloud-services-team, decommission-hardware: decommission cloudvirt1017.eqiad.wmnet and cloudvirt102[1-2].eqiad.wmnet -
https://phabricator.wikimedia.org/T333169 (Jclark-ctr) Open→Resolved
[18:01:53] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:02:25] PROBLEM - pybal on lvs2009 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[18:02:41] (PS32) Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - https://gerrit.wikimedia.org/r/895356
[18:02:53] (ProbeDown) firing: (2) Service irc2001:6667 has failed probes (tcp_mw_rc_irc_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#irc2001:6667 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:03:24] (CR) CI reject: [V: -1] puppetserver: (WIP) add basic class for puppert server [puppet] - https://gerrit.wikimedia.org/r/895356 (owner: Jbond)
[18:04:39] PROBLEM - PyBal connections to etcd on lvs2009 is CRITICAL: CRITICAL: 0 connections established with conf2005.codfw.wmnet:4001 (min=74) https://wikitech.wikimedia.org/wiki/PyBal
[18:05:34] (Merged) jenkins-bot: rdf-streaming-updater: disable jemalloc [deployment-charts] - https://gerrit.wikimedia.org/r/908539 (owner: DCausse)
[18:05:37] (PS33) Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - https://gerrit.wikimedia.org/r/895356
[18:06:08] (PS6) Cathal Mooney: Expose additional link information to Homer templates in wmf-netbox.py [software/homer/deploy] - https://gerrit.wikimedia.org/r/884908 (https://phabricator.wikimedia.org/T328313)
[18:06:20] (CR) CI reject: [V: -1] puppetserver: (WIP) add basic class for puppert server [puppet] - https://gerrit.wikimedia.org/r/895356 (owner: Jbond)
[18:06:51] (PS34) Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] -
https://gerrit.wikimedia.org/r/895356
[18:07:04] (CR) Ssingh: [C: +1] hiera: lvs2009: update iface names for bullseye [puppet] - https://gerrit.wikimedia.org/r/908609 (https://phabricator.wikimedia.org/T321309) (owner: BCornwall)
[18:07:08] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (BCornwall)
[18:07:29] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[18:07:31] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[18:07:37] (CR) CI reject: [V: -1] puppetserver: (WIP) add basic class for puppert server [puppet] - https://gerrit.wikimedia.org/r/895356 (owner: Jbond)
[18:10:47] (JobUnavailable) firing: Reduced availability for job udpmxircecho in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:15:42] (CR) BCornwall: [C: +2] hiera: lvs2009: update iface names for bullseye [puppet] - https://gerrit.wikimedia.org/r/908609 (https://phabricator.wikimedia.org/T321309) (owner: BCornwall)
[18:15:45] tgr_: all done, restarts completed :)
[18:15:59] thanks sukhe!
[18:16:23] SRE, ops-eqiad, decommission-hardware: decommission frdata1001.frack.eqiad.wmnet (WMF7292) - https://phabricator.wikimedia.org/T333971 (Jclark-ctr)
[18:16:23] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2009.codfw.wmnet with OS bullseye
[18:16:31] SRE, ops-eqiad, decommission-hardware: decommission frdata1001.frack.eqiad.wmnet (WMF7292) - https://phabricator.wikimedia.org/T333971 (Jclark-ctr) Open→Resolved
[18:16:35] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host lvs2009.codfw.wmnet with OS bullseye
[18:17:13] (PS35) Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - https://gerrit.wikimedia.org/r/895356
[18:17:54] (CR) CI reject: [V: -1] puppetserver: (WIP) add basic class for puppert server [puppet] - https://gerrit.wikimedia.org/r/895356 (owner: Jbond)
[18:20:19] Puppet, Cloud-VPS, Infrastructure-Foundations: WMCS puppet-enc not working with puppetserver 7 - https://phabricator.wikimedia.org/T334686 (jbond) Open→Resolved a:jbond >>! In T334686#8779836, @taavi wrote: > You seem to be missing the configuration to use puppet-enc, particularly `/etc/p...
[18:20:25] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond) [18:23:01] (03PS1) 10DCausse: rdf-streaming-updater: env is an array not a map [deployment-charts] - 10https://gerrit.wikimedia.org/r/908611 [18:23:08] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply [18:23:19] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply [18:23:40] (03PS1) 10Andrew Bogott: cloudvirtlocal: move instance dir to /srv/instances [puppet] - 10https://gerrit.wikimedia.org/r/908612 (https://phabricator.wikimedia.org/T329863) [18:24:15] (03PS1) 10TrainBranchBot: all wikis to 1.41.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908613 (https://phabricator.wikimedia.org/T330210) [18:24:17] (03CR) 10TrainBranchBot: [C: 03+2] all wikis to 1.41.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908613 (https://phabricator.wikimedia.org/T330210) (owner: 10TrainBranchBot) [18:24:33] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10Jclark-ctr) Rebalanced power between 3 legs [18:24:43] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10Jclark-ctr) 05Open→03Resolved [18:25:05] (03Merged) 10jenkins-bot: all wikis to 1.41.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908613 (https://phabricator.wikimedia.org/T330210) (owner: 10TrainBranchBot) [18:25:51] (03CR) 10Andrew Bogott: [C: 03+2] cloudvirtlocal: move instance dir to /srv/instances [puppet] - 10https://gerrit.wikimedia.org/r/908612 (https://phabricator.wikimedia.org/T329863) (owner: 10Andrew Bogott) [18:26:12] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1002.eqiad.wmnet with OS bullseye [18:26:13] rolling train as backup-backup-backup or something like that. 
[18:26:14] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [18:26:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirtlocal1002.eqiad.wmnet... [18:26:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirtlocal1001.eqiad.wmnet... [18:28:36] (03CR) 10DCausse: [C: 03+2] rdf-streaming-updater: env is an array not a map [deployment-charts] - 10https://gerrit.wikimedia.org/r/908611 (owner: 10DCausse) [18:28:44] 10SRE, 10Infrastructure-Foundations, 10Traffic: Receive network latency reports from the browsers - https://phabricator.wikimedia.org/T334417 (10JameelKaisar) [18:30:34] (03PS1) 10Urbanecm: enwiki: Remove userrights from `founder` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908614 (https://phabricator.wikimedia.org/T334692) [18:30:36] hrm, stack traces on php restarts [18:31:56] 34 failures [18:32:33] 10SRE, 10Traffic, 10Wikidata, 10wdwb-tech: Wikidata seems to still be utilizing insecure HTTP URIs - https://phabricator.wikimedia.org/T331356 (10BBlack) >>! In T331356#8718619, @MisterSynergy wrote: > Some remarks: > * We should consider these canonical HTTP URIs to be //names// in the first place, which... 
[18:34:18] !log brennen@deploy2002 rebuilt and synchronized wikiversions files: all wikis to 1.41.0-wmf.4 refs T330210 [18:34:23] T330210: 1.41.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T330210 [18:34:35] (03Merged) 10jenkins-bot: rdf-streaming-updater: env is an array not a map [deployment-charts] - 10https://gerrit.wikimedia.org/r/908611 (owner: 10DCausse) [18:34:59] hm, was wmf.4 just rolled out to enwiki? [18:35:11] i can't click on anything on any page while logged in [18:35:14] not sure if it's just me [18:35:15] MatmaRex: yeah, something's up in general, 139 php restart failures [18:35:17] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs2009.codfw.wmnet with reason: host reimage [18:36:27] hmm, maybe it's a temporary issue. like incompatible versions of HTML and CSS being served temporarily or something [18:36:53] Hrm [18:37:21] https://phabricator.wikimedia.org/P46678 [18:38:27] I see the same as MatmaRex -- looks like there's a .vector-menu-checkbox that has width: 100% and height: 100% and is covering the page [18:38:39] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs2009.codfw.wmnet with reason: host reimage [18:39:05] a rollback seems indicated, but i'm not sure if that's going to leave things in an equally broken state [18:39:38] Unsure either way. Roll back [18:39:45] rzl: i no longer see that issue [18:40:14] concur, I don't either [18:40:33] Php restart problems make me sorry about partial deployment [18:40:37] which is making me think i was probably getting old CSS with the new HTML [18:40:37] no, wait, I wasn't logged in in that window -- I do still see it [18:40:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team: Tenant networking not working on cloudvirtlocal hosts - https://phabricator.wikimedia.org/T334694 (10Andrew) [18:41:00] rolling back. 
[18:42:04] (03PS1) 10TrainBranchBot: group2 wikis to 1.40.1-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908615 (https://phabricator.wikimedia.org/T330210) [18:42:06] (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.40.1-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908615 (https://phabricator.wikimedia.org/T330210) (owner: 10TrainBranchBot) [18:42:08] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirtlocal1002.eqiad.wmnet with reason: host reimage [18:42:54] (03Merged) 10jenkins-bot: group2 wikis to 1.40.1-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908615 (https://phabricator.wikimedia.org/T330210) (owner: 10TrainBranchBot) [18:44:38] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [18:44:56] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [18:45:44] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirtlocal1002.eqiad.wmnet with reason: host reimage [18:45:56] it also helps to not fat-finger the version. [18:46:12] (03PS1) 10TrainBranchBot: group2 wikis to 1.41.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908616 (https://phabricator.wikimedia.org/T330210) [18:46:14] (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.41.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908616 (https://phabricator.wikimedia.org/T330210) (owner: 10TrainBranchBot) [18:46:14] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply [18:46:57] (03Merged) 10jenkins-bot: group2 wikis to 1.41.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908616 (https://phabricator.wikimedia.org/T330210) (owner: 10TrainBranchBot) [18:48:13] brennen: not having to type the old version is something we can discuss this week. 
[18:48:35] yeah, never typing versions is probably a good ideal to strive for [18:48:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team: Tenant networking not working on cloudvirtlocal hosts - https://phabricator.wikimedia.org/T334694 (10Andrew) [18:48:54] 90% less fingers [18:48:57] key metric [18:49:04] i think this may be the first time i've actually messed one up and thankfully i think no consequences [18:49:16] but i sure have spent an inordinate amount of time staring at them to make sure i'm not about to mess them up [18:50:25] 10SRE-swift-storage, 10MediaWiki-File-management: FileBackendMultiWrite multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (10IKhitron) It works, thanks a lot. I think we should explain this trick in the Tech News. [18:51:08] I'm back at my desk. [18:54:00] * dancy stares at the paste [18:54:05] my current guess is that whether this rollback sees the lvs-related php restart errors or not, it should be status quo (wmf.4 at group1) once it finishes. [18:54:22] right [18:55:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team: cloudvirtlocal1001.eqiad.wmnet tends to get stuck on boot - https://phabricator.wikimedia.org/T334696 (10Andrew) [18:55:46] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs2009.codfw.wmnet with OS bullseye [18:55:57] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host lvs2009.codfw.wmnet with OS bullseye completed: - lvs2009 (**PASS**) - Downtimed on Icinga/Aler... 
[18:56:21] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply [18:57:52] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply [18:58:05] (03PS4) 10Jbond: environment: add environment.conf file and remove environments dir [puppet] - 10https://gerrit.wikimedia.org/r/907991 [18:58:07] (03PS6) 10Jbond: wmflib: updat ipresolv to work with puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/907938 (https://phabricator.wikimedia.org/T294841) [18:58:08] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply [18:58:09] (03PS3) 10Jbond: core_modules: add core modules [puppet] - 10https://gerrit.wikimedia.org/r/908326 [18:58:11] (03PS36) 10Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - 10https://gerrit.wikimedia.org/r/895356 [18:58:29] (03CR) 10CI reject: [V: 04-1] environment: add environment.conf file and remove environments dir [puppet] - 10https://gerrit.wikimedia.org/r/907991 (owner: 10Jbond) [18:58:46] (03CR) 10CI reject: [V: 04-1] core_modules: add core modules [puppet] - 10https://gerrit.wikimedia.org/r/908326 (owner: 10Jbond) [18:59:17] (03CR) 10CI reject: [V: 04-1] puppetserver: (WIP) add basic class for puppert server [puppet] - 10https://gerrit.wikimedia.org/r/895356 (owner: 10Jbond) [18:59:21] (03PS1) 10Hashar: gerrit: relocate LFS data [puppet] - 10https://gerrit.wikimedia.org/r/908617 (https://phabricator.wikimedia.org/T333143) [18:59:48] !log lvs1020: restart pybal for experiment... 
[18:59:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:35] (03CR) 10CI reject: [V: 04-1] wmflib: updat ipresolv to work with puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/907938 (https://phabricator.wikimedia.org/T294841) (owner: 10Jbond) [19:01:59] 10SRE-swift-storage, 10MediaWiki-File-management: FileBackendMultiWrite multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (10doctaxon) I also think it's important to know. Please publish it in Tech News. [19:03:21] (03PS5) 10Jbond: environment: add environment.conf file and remove environments dir [puppet] - 10https://gerrit.wikimedia.org/r/907991 [19:03:23] (03PS7) 10Jbond: wmflib: updat ipresolv to work with puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/907938 (https://phabricator.wikimedia.org/T294841) [19:03:25] (03PS4) 10Jbond: core_modules: add core modules [puppet] - 10https://gerrit.wikimedia.org/r/908326 [19:03:27] !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group2 wikis to 1.41.0-wmf.3 refs T330210 [19:03:27] (03PS37) 10Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - 10https://gerrit.wikimedia.org/r/895356 [19:03:31] T330210: 1.41.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T330210 [19:03:34] (03CR) 10Majavah: puppetserver: (WIP) add basic class for puppert server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/895356 (owner: 10Jbond) [19:04:04] (03CR) 10CI reject: [V: 04-1] core_modules: add core modules [puppet] - 10https://gerrit.wikimedia.org/r/908326 (owner: 10Jbond) [19:04:21] (03CR) 10CI reject: [V: 04-1] puppetserver: (WIP) add basic class for puppert server [puppet] - 10https://gerrit.wikimedia.org/r/895356 (owner: 10Jbond) [19:05:51] rzl, MatmaRex, dancy: so no restart failures that time. i guess the remaining question is do we think the editing issue for logged in users was a side effect. 
[19:06:26] (03PS1) 10Superpes15: [wikitech] Add a logo and a wordmark for Vector 2022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908618 (https://phabricator.wikimedia.org/T334666) [19:06:38] (03CR) 10CI reject: [V: 04-1] wmflib: updat ipresolv to work with puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/907938 (https://phabricator.wikimedia.org/T294841) (owner: 10Jbond) [19:07:22] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:12:43] (03PS38) 10Jbond: puppetserver: (WIP) add basic class for puppert server [puppet] - 10https://gerrit.wikimedia.org/r/895356 [19:13:25] (03CR) 10CI reject: [V: 04-1] puppetserver: (WIP) add basic class for puppert server [puppet] - 10https://gerrit.wikimedia.org/r/895356 (owner: 10Jbond) [19:15:20] (03CR) 10AOkoth: [C: 03+1] vrts: do not use /srv/sqldata as mariadb datadir (cloud, devtools) [puppet] - 10https://gerrit.wikimedia.org/r/908331 (https://phabricator.wikimedia.org/T329571) (owner: 10Dzahn) [19:16:12] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirtlocal1002.eqiad.wmnet with OS bullseye [19:16:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirtlocal1002.eqiad.wmnet with OS bullseye complete... [19:16:57] (03CR) 10Jbond: puppetserver: (WIP) add basic class for puppert server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/895356 (owner: 10Jbond) [19:18:08] 10SRE-swift-storage, 10MediaWiki-File-management: FileBackendMultiWrite multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (10IKhitron) Something as > There is a problem with svg files thumbnails creating. 
If you can see old version of image instead of the current one, change the size... [19:18:26] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [19:18:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye executed... [19:18:41] (03CR) 10Dzahn: [V: 04-1] "https://puppet-compiler.wmflabs.org/output/908331/40673/otrs1001.eqiad.wmnet/change.otrs1001.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/908331 (https://phabricator.wikimedia.org/T329571) (owner: 10Dzahn) [19:22:15] 10SRE-swift-storage, 10MediaWiki-File-management, 10User-notice: FileBackendMultiWrite multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (10RhinosF1) [19:22:49] (03PS3) 10Dzahn: vrts: do not use /srv/sqldata as mariadb datadir (cloud, devtools) [puppet] - 10https://gerrit.wikimedia.org/r/908331 (https://phabricator.wikimedia.org/T329571) [19:25:26] !log brett@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host lvs2009 [19:25:28] !log sukhe@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host lvs2009 [19:25:33] !log brett@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host lvs2009 [19:25:40] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [19:25:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [19:25:50] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host lvs2009
[19:26:57] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/output/908331/40674/otrs1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/908331 (https://phabricator.wikimedia.org/T329571) (owner: 10Dzahn) [19:27:00] (03PS1) 10Ssingh: hiera: remove bgp-med override for lvs2009 [puppet] - 10https://gerrit.wikimedia.org/r/908619 (https://phabricator.wikimedia.org/T321309) [19:27:41] (03CR) 10Ssingh: [C: 03+2] hiera: remove bgp-med override for lvs2009 [puppet] - 10https://gerrit.wikimedia.org/r/908619 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [19:28:04] (03PS1) 10BCornwall: hierdata: Remove bgp-med for lvs2009 [puppet] - 10https://gerrit.wikimedia.org/r/908620 (https://phabricator.wikimedia.org/T321309) [19:28:18] (03CR) 10Hashar: "I think that will do it. Might be better to sync up when I am back from vacations though." [puppet] - 10https://gerrit.wikimedia.org/r/908617 (https://phabricator.wikimedia.org/T333143) (owner: 10Hashar) [19:29:35] (03Abandoned) 10BCornwall: hierdata: Remove bgp-med for lvs2009 [puppet] - 10https://gerrit.wikimedia.org/r/908620 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall) [19:29:37] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "noop in production. 
in cloud vps, we set datadir to /srv/sqldata in Hiera for old instance only, on new instance we will use default /var/" [puppet] - 10https://gerrit.wikimedia.org/r/908331 (https://phabricator.wikimedia.org/T329571) (owner: 10Dzahn) [19:29:57] !log restart pybal on lvs2009 to pick up bgp-med change and pool [19:29:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:36] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/898675 (https://phabricator.wikimedia.org/T331899) (owner: 10Muehlenhoff) [19:37:53] brennen: i don't know, but it might be possible to test it again with wikimediadebug [19:40:12] the reading web team might also have some ideas if you can contact them, or file a task [19:42:35] (03PS1) 10Dzahn: admin: add Marco Aurelio to LDAP-only admins (nda) [puppet] - 10https://gerrit.wikimedia.org/r/908622 (https://phabricator.wikimedia.org/T333884) [19:44:14] (03CR) 10Dzahn: [V: 04-1 C: 04-1] "needs to wait until KFrancis has updated https://phabricator.wikimedia.org/T333884 but that should be soon" [puppet] - 10https://gerrit.wikimedia.org/r/908622 (https://phabricator.wikimedia.org/T333884) (owner: 10Dzahn) [19:46:24] (03CR) 10Dzahn: [C: 03+1] "looks like it just needs Tyler to approve, but lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/898675 (https://phabricator.wikimedia.org/T331899) (owner: 10Muehlenhoff) [19:47:23] i sorta think we should just give it another shot. dancy, thoughts? [19:47:49] I'm down [19:48:53] awright, let's give it a try before the backport window. i'll roll forward. 
[19:49:17] 🍀 [19:49:22] (03PS1) 10TrainBranchBot: all wikis to 1.41.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908623 (https://phabricator.wikimedia.org/T330210) [19:49:22] :D [19:49:24] (03CR) 10TrainBranchBot: [C: 03+2] all wikis to 1.41.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908623 (https://phabricator.wikimedia.org/T330210) (owner: 10TrainBranchBot) [19:50:11] (03Merged) 10jenkins-bot: all wikis to 1.41.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908623 (https://phabricator.wikimedia.org/T330210) (owner: 10TrainBranchBot) [19:53:45] 10SRE-swift-storage, 10MediaWiki-File-management, 10User-notice: FileBackendMultiWrite multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (10doctaxon) But you can't change the size, if the thumbnail is listed on category pages. [19:55:16] !log brennen@deploy2002 rebuilt and synchronized wikiversions files: all wikis to 1.41.0-wmf.4 refs T330210 [19:55:22] T330210: 1.41.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T330210 [19:57:12] welp, looks like the bug is still present. rolling back and filing a blocker. [19:57:34] (03PS1) 10TrainBranchBot: group2 wikis to 1.41.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908624 (https://phabricator.wikimedia.org/T330210) [19:57:36] (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.41.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908624 (https://phabricator.wikimedia.org/T330210) (owner: 10TrainBranchBot) [19:58:32] (03Merged) 10jenkins-bot: group2 wikis to 1.41.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908624 (https://phabricator.wikimedia.org/T330210) (owner: 10TrainBranchBot) [19:58:57] TheresNoTime: note for upcoming backport window - please hold while train rollback finishes. 
[19:59:11] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [20:00:05] brennen and TheresNoTime: Dear deployers, time to do the UTC late backport and config training deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230413T2000). [20:00:05] MatmaRex and Superpes: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:28] hi [20:00:35] MatmaRex, Superpes: see a few lines back in scrollback. [20:00:41] yup [20:00:45] shouldn't take too long. [20:01:17] * urbanecm waves [20:01:42] brennen: happy to deploy once train finishes [20:02:11] I'm unable to deploy this evening, so thank you urbanecm [20:02:21] np [20:03:03] (03CR) 10Legoktm: enwiki: Remove userrights from `founder` (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908614 (https://phabricator.wikimedia.org/T334692) (owner: 10Urbanecm) [20:03:16] thx urbanecm, i'll ping. [20:03:19] ty [20:04:10] (03PS2) 10Urbanecm: enwiki: Remove userrights from `founder` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908614 (https://phabricator.wikimedia.org/T334692) [20:04:24] (03CR) 10Urbanecm: enwiki: Remove userrights from `founder` (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908614 (https://phabricator.wikimedia.org/T334692) (owner: 10Urbanecm) [20:07:50] (03CR) 10Dzahn: [C: 03+1] "Ok, convinced. if it lets us test and then switch to bullseye machines and unblocks that.. I'll go ahead. 
Was just concerned it might crea" [puppet] - 10https://gerrit.wikimedia.org/r/901612 (https://phabricator.wikimedia.org/T322357) (owner: 10Hashar) [20:07:57] (03PS1) 10Jcrespo: mariadb: Setup db1225 as a replacement for db1102 [puppet] - 10https://gerrit.wikimedia.org/r/908627 (https://phabricator.wikimedia.org/T334057) [20:08:32] 10SRE-swift-storage, 10MediaWiki-File-management, 10User-notice: FileBackendMultiWrite multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (10Ladsgroup) It's not just SVGs, it's all file types, the example I gave was a png file but it only happens on half of the traffic depending on t... [20:08:49] urbanecm is stealing Jimbo's powers [20:08:53] (at Jimbo's request :) [20:09:07] :) [20:10:55] (not a "first time offender", tbh) [20:14:56] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [20:15:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye executed... [20:15:08] !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group2 wikis to 1.41.0-wmf.3 refs T330210 [20:15:13] T330210: 1.41.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T330210 [20:15:20] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [20:15:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [20:15:46] urbanecm: all yours [20:15:50] thanks! 
[20:16:41] MatmaRex: Superpes: hi, still around? [20:16:50] sure [20:17:16] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [20:17:26] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908607 (https://phabricator.wikimedia.org/T334395) (owner: 10Bartosz Dziewoński) [20:17:48] my wmf.4 backport isn't testable on mwdebug, but i can watch the logs afterwards to see if it has the desired effect [20:18:11] (03Merged) 10jenkins-bot: Enable mobile page tabs for everyone in ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908607 (https://phabricator.wikimedia.org/T334395) (owner: 10Bartosz Dziewoński) [20:18:11] and my other config changes (except for the ruwiki one) are unimportant, i'm happy to drop them since we're running late [20:18:18] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [20:19:06] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:908607|Enable mobile page tabs for everyone in ruwiki (T334395)]] [20:19:10] T334395: Enable mobile page tabs for everyone in ruwiki - https://phabricator.wikimedia.org/T334395 [20:20:00] MatmaRex: ack, I'll do them at the end if there's time. [20:20:26] !log urbanecm@deploy2002 urbanecm and matmarex: Backport for [[gerrit:908607|Enable mobile page tabs for everyone in ruwiki (T334395)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [20:20:40] MatmaRex: your ruwiki change is at mwdebug, can you test? [20:20:57] testing at https://ru.m.wikipedia.org/wiki/Тост_Мелба (random page), looks good [20:21:25] great, syncing [20:21:41] Superpes: hi, still around for B&C? 
:) [20:23:28] (03CR) 10Dzahn: [C: 03+2] "disabled puppet on doc1002 (active), deploying first doc2001" [puppet] - 10https://gerrit.wikimedia.org/r/901612 (https://phabricator.wikimedia.org/T322357) (owner: 10Hashar) [20:23:41] (03CR) 10Jcrespo: [C: 03+2] mariadb: Setup db1225 as a replacement for db1102 [puppet] - 10https://gerrit.wikimedia.org/r/908627 (https://phabricator.wikimedia.org/T334057) (owner: 10Jcrespo) [20:23:46] (03CR) 10Jcrespo: [C: 03+2] "https://puppet-compiler.wmflabs.org/output/908627/40675/db1225.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/908627 (https://phabricator.wikimedia.org/T334057) (owner: 10Jcrespo) [20:24:09] i wonder if we could deploy their change anyway if they aren't around, fixing the wikitech logo seems entirely uncontroversial [20:24:33] probably [20:24:44] logo changes sound very controversial, actually :) [20:24:50] jk, what is the fix [20:24:57] (03PS2) 10Urbanecm: [wikitech] Add a logo and a wordmark for Vector 2022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908618 (https://phabricator.wikimedia.org/T334666) (owner: 10Superpes15) [20:25:02] (03CR) 10Urbanecm: [C: 03+2] [wikitech] Add a logo and a wordmark for Vector 2022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908618 (https://phabricator.wikimedia.org/T334666) (owner: 10Superpes15) [20:25:13] mutante: adding a V22-compatible logo [20:25:24] so long it's not "switching wikitech to V22", should be uncontroversial :D [20:25:49] (03Merged) 10jenkins-bot: [wikitech] Add a logo and a wordmark for Vector 2022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908618 (https://phabricator.wikimedia.org/T334666) (owner: 10Superpes15) [20:25:53] adds the wordmark? 
ah [20:25:56] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:908607|Enable mobile page tabs for everyone in ruwiki (T334395)]] (duration: 06m 49s) [20:26:00] ok :) [20:26:01] T334395: Enable mobile page tabs for everyone in ruwiki - https://phabricator.wikimedia.org/T334395 [20:26:19] ruwiki change's deployed, proceeding with wikitech's [20:26:25] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:908618|[wikitech] Add a logo and a wordmark for Vector 2022 (T334666)]] [20:26:30] T334666: Add a wordmark + SVG icon logo for wikitech so it appears nicely in Vector 2022 - https://phabricator.wikimedia.org/T334666 [20:26:41] x-wikimedia-debug doesn't work for wikitech, so proceeding and let's hope. [20:27:03] !log doc2001 - switching PHP version from 7.3 to 7.4 for T322357 [20:27:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:07] T322357: OOUI PHP demos page is broken (again) - https://phabricator.wikimedia.org/T322357 [20:27:43] !log urbanecm@deploy2002 superpes and urbanecm: Backport for [[gerrit:908618|[wikitech] Add a logo and a wordmark for Vector 2022 (T334666)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [20:29:27] (03PS3) 10Urbanecm: enwiki: Remove userrights from `founder` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908614 (https://phabricator.wikimedia.org/T334692) [20:29:32] (03CR) 10Urbanecm: [C: 03+2] enwiki: Remove userrights from `founder` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908614 (https://phabricator.wikimedia.org/T334692) (owner: 10Urbanecm) [20:30:12] (03Merged) 10jenkins-bot: enwiki: Remove userrights from `founder` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908614 (https://phabricator.wikimedia.org/T334692) (owner: 10Urbanecm)
[20:30:55] (ProbeDown) firing: Service doc2001.codfw.wmnet:443 has failed probes (http_doc2001_codfw_wmnet_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#doc2001.codfw.wmnet:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:31:21] ^ caused by my merge [20:31:27] but fixing [20:31:42] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirtlocal1001.eqiad.wmnet with reason: host reimage [20:32:07] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:908618|[wikitech] Add a logo and a wordmark for Vector 2022 (T334666)]] (duration: 05m 41s) [20:32:11] T334666: Add a wordmark + SVG icon logo for wikitech so it appears nicely in Vector 2022 - https://phabricator.wikimedia.org/T334666 [20:32:41] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:908614|enwiki: Remove userrights from `founder` (T334692)]] [20:32:45] T334692: Remove "userrights" access from En.WP's Founder flag - https://phabricator.wikimedia.org/T334692 [20:33:21] MatmaRex: wikitech change deployed; if you can spotcheck, would be great. [20:33:22] (03CR) 10Dzahn: [C: 03+2] "My concerns were somewhat confirmed, puppet depdency errors and monitoring was triggered. BUT.. after 3 puppet runs things look better. It" [puppet] - 10https://gerrit.wikimedia.org/r/901612 (https://phabricator.wikimedia.org/T322357) (owner: 10Hashar) [20:33:58] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:908614|enwiki: Remove userrights from `founder` (T334692)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [20:35:01] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirtlocal1001.eqiad.wmnet with reason: host reimage [20:35:01] urbanecm: seems to be working. [20:35:07] awesome, thanks [20:35:22] is it just me, or is the wordmark kind of big? 
i don't think that's a reason to revert though [20:35:28] (03CR) 10Urbanecm: [C: 03+2] Only log 'visualEditorFeatureUse' events if 'editAttemptStep' events are being logged [extensions/WikimediaEvents] (wmf/1.41.0-wmf.4) - 10https://gerrit.wikimedia.org/r/907743 (https://phabricator.wikimedia.org/T334157) (owner: 10Bartosz Dziewoński) [20:35:34] (03CR) 10Jbond: dcops: add netdev duplex and speed checks (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/908556 (https://phabricator.wikimedia.org/T333007) (owner: 10Jbond) [20:35:42] (03PS2) 10Urbanecm: Stop using redundant $wmg variable for MobileFrontend extension (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905750 (owner: 10Bartosz Dziewoński) [20:35:48] (03PS2) 10Urbanecm: Stop using redundant $wmg variable for MobileFrontend extension (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905751 (https://phabricator.wikimedia.org/T119117) (owner: 10Bartosz Dziewoński) [20:35:55] (ProbeDown) resolved: Service doc2001.codfw.wmnet:443 has failed probes (http_doc2001_codfw_wmnet_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#doc2001.codfw.wmnet:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:36:14] I'll finish the Jimbo change and do the low-prio changes for you MatmaRex too [20:36:34] fairly big, yeah. but not terribly big imo [20:36:35] cool, thanks. 
they need to be synced separately, in order [20:36:45] this is what i see https://usercontent.irccloud-cdn.com/file/8sz9SlW9/image.png [20:37:24] (03Merged) 10jenkins-bot: Only log 'visualEditorFeatureUse' events if 'editAttemptStep' events are being logged [extensions/WikimediaEvents] (wmf/1.41.0-wmf.4) - 10https://gerrit.wikimedia.org/r/907743 (https://phabricator.wikimedia.org/T334157) (owner: 10Bartosz Dziewoński) [20:37:32] MatmaRex: fortunately, all deployments (except extensions.json / skin.json changes) now take effect at once, so there's no need to sync patches in order these days. [20:37:53] ooooh. so i didn't need to split it into two patches? [20:38:01] indeed. [20:38:24] that was very annoying, glad it won't be needed. maybe i'll squash them then [20:38:35] when was this fixed? [20:38:36] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:908614|enwiki: Remove userrights from `founder` (T334692)]] (duration: 05m 55s) [20:38:41] T334692: Remove "userrights" access from En.WP's Founder flag - https://phabricator.wikimedia.org/T334692 [20:40:17] (03Abandoned) 10Bartosz Dziewoński: Stop using redundant $wmg variable for MobileFrontend extension (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905750 (owner: 10Bartosz Dziewoński) [20:40:27] (03PS3) 10Bartosz Dziewoński: Stop using redundant $wmg variable for MobileFrontend extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905751 (https://phabricator.wikimedia.org/T119117) [20:40:31] (03CR) 10Dzahn: [C: 03+2] "tested on new PHP version:" [puppet] - 10https://gerrit.wikimedia.org/r/901612 (https://phabricator.wikimedia.org/T322357) (owner: 10Hashar) [20:40:56] afaik it was fixed in mid-2022, but i don't remember exactly.
[20:42:24] (03Abandoned) 10Bartosz Dziewoński: Stop using redundant $wmg variables for VisualEditor extension (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905709 (owner: 10Bartosz Dziewoński) [20:42:30] (03Abandoned) 10Bartosz Dziewoński: Stop using redundant $wmg variables for VisualEditor extension (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905710 (owner: 10Bartosz Dziewoński) [20:42:40] (03PS2) 10Bartosz Dziewoński: Stop using redundant $wmg variables for VisualEditor extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905711 (https://phabricator.wikimedia.org/T119117) [20:46:49] !log doc2001 - systemctl stop php7.3-fpm; systemctl restart php7.4-fpm - needed because after gerrit:901612 we had BOTH PHP versions, 7.3 and 7.4 running their own php-fpm process, also packages for both versions are installed, so also manual package removal needed - apt-get remove php7.3* T322357 T319477 [20:46:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:56] T322357: OOUI PHP demos page is broken (again) - https://phabricator.wikimedia.org/T322357 [20:46:56] T319477: Migrate doc hosts to Bullseye - https://phabricator.wikimedia.org/T319477 [20:47:30] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905751 (https://phabricator.wikimedia.org/T119117) (owner: 10Bartosz Dziewoński) [20:48:13] (03Merged) 10jenkins-bot: Stop using redundant $wmg variable for MobileFrontend extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905751 (https://phabricator.wikimedia.org/T119117) (owner: 10Bartosz Dziewoński) [20:48:16] urbanecm: mind giving me a ping when you're done? going to do a bit of debug with Jdlrobson for the train blocker. 
[20:48:20] sure [20:48:40] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:907743|Only log 'visualEditorFeatureUse' events if 'editAttemptStep' events are being logged (T334157)]], [[gerrit:905751|Stop using redundant $wmg variable for MobileFrontend extension (T119117)]] [20:48:46] T334157: VisualEditorFeatureUse validation errors - https://phabricator.wikimedia.org/T334157 [20:48:46] T119117: Get rid of $wg = $wmg hack for extensions that have been converted to using extension.json - https://phabricator.wikimedia.org/T119117 [20:48:49] Sorry urbanecm I just connected, had a commitment a few minutes ago, if you are still around... [20:49:03] Superpes: yeah, i am, and i actually already deployed your patch with matmarex :) [20:49:05] please check [20:49:41] urbanecm Absolutely fine :) Thanks [20:50:01] !log urbanecm@deploy2002 urbanecm and matmarex: Backport for [[gerrit:907743|Only log 'visualEditorFeatureUse' events if 'editAttemptStep' events are being logged (T334157)]], [[gerrit:905751|Stop using redundant $wmg variable for MobileFrontend extension (T119117)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [20:50:15] MatmaRex: if i understood you correctly, neither the backport nor config are testable? [20:50:24] 10SRE, 10Traffic, 10conftool, 10serviceops: Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki) - https://phabricator.wikimedia.org/T334703 (10CDanis) [20:50:35] yeah. i'll watch the logs for the backport, the config should be no-op [20:50:44] okay, proceeding [20:52:17] (03CR) 10Dzahn: [C: 03+2] "after this change both 7.3 and 7.4 packages were installed and both had a running php-fpm process.
So I manually stopped the 7.3 one, rest" [puppet] - 10https://gerrit.wikimedia.org/r/901612 (https://phabricator.wikimedia.org/T322357) (owner: 10Hashar) [20:52:57] (03CR) 10Dzahn: [C: 03+2] "btw, we are testing all this stuff:" [puppet] - 10https://gerrit.wikimedia.org/r/901612 (https://phabricator.wikimedia.org/T322357) (owner: 10Hashar) [20:53:15] (03PS6) 10Ryan Kemper: wdqs: improve reliability of reboots [cookbooks] - 10https://gerrit.wikimedia.org/r/869334 (https://phabricator.wikimedia.org/T325114) [20:54:33] (03PS3) 10Bartosz Dziewoński: Stop using redundant $wmg variables for VisualEditor extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905711 (https://phabricator.wikimedia.org/T119117) [20:54:35] (03PS2) 10Bartosz Dziewoński: Remove weird VisualEditor config hack from 2015 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905712 [20:54:45] (03CR) 10Ryan Kemper: wdqs: improve reliability of reboots (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/869334 (https://phabricator.wikimedia.org/T325114) (owner: 10Ryan Kemper) [20:55:07] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:907743|Only log 'visualEditorFeatureUse' events if 'editAttemptStep' events are being logged (T334157)]], [[gerrit:905751|Stop using redundant $wmg variable for MobileFrontend extension (T119117)]] (duration: 06m 26s) [20:55:13] T334157: VisualEditorFeatureUse validation errors - https://phabricator.wikimedia.org/T334157 [20:55:13] T119117: Get rid of $wg = $wmg hack for extensions that have been converted to using extension.json - https://phabricator.wikimedia.org/T119117 [20:55:20] MatmaRex: done. and i think that should be all? 
[20:55:31] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 0:15:00 on doc1002.eqiad.wmnet with reason: maintenance [20:55:31] thanks [20:55:42] np [20:55:44] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on doc1002.eqiad.wmnet with reason: maintenance [20:55:48] brennen: over to you! [20:56:09] urbanecm: ty! grabbing a debug box here. [21:01:57] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [21:02:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye executed... [21:02:29] !log doc1002 (doc.wikimedia.org) - switching from PHP 7.3 to 7.4 - systemctl stop php7.3-fpm, restart php7.4-fpm, apt-get remove --purge php7.3*, systemctl restart apache2. 
- all tests still working (on deployment server: httpbb --hosts doc1002.eqiad.wmnet /srv/deployment/httpbb-tests/doc/test_doc.yaml) T322357 T319477 [21:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:35] T322357: OOUI PHP demos page is broken (again) - https://phabricator.wikimedia.org/T322357 [21:02:35] T319477: Migrate doc hosts to Bullseye - https://phabricator.wikimedia.org/T319477 [21:03:16] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [21:03:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [21:03:34] (03CR) 10Ryan Kemper: [C: 03+2] wdqs: improve reliability of reboots [cookbooks] - 10https://gerrit.wikimedia.org/r/869334 (https://phabricator.wikimedia.org/T325114) (owner: 10Ryan Kemper) [21:03:58] rolling train forward again. [21:04:10] (03PS3) 10Ryan Kemper: delete query-preview.wikidata.org [dns] - 10https://gerrit.wikimedia.org/r/905754 (https://phabricator.wikimedia.org/T333656) (owner: 10Dzahn) [21:04:25] (03PS1) 10TrainBranchBot: all wikis to 1.41.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908630 (https://phabricator.wikimedia.org/T330210) [21:04:27] (03CR) 10TrainBranchBot: [C: 03+2] all wikis to 1.41.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908630 (https://phabricator.wikimedia.org/T330210) (owner: 10TrainBranchBot) [21:04:52] (03CR) 10Dzahn: "hey! 
thanks! let me actually check that monitoring is disabled" [dns] - 10https://gerrit.wikimedia.org/r/905754 (https://phabricator.wikimedia.org/T333656) (owner: 10Dzahn) [21:05:11] (03Merged) 10jenkins-bot: all wikis to 1.41.0-wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/908630 (https://phabricator.wikimedia.org/T330210) (owner: 10TrainBranchBot) [21:06:19] (03PS4) 10Ryan Kemper: delete query-preview.wikidata.org [dns] - 10https://gerrit.wikimedia.org/r/905754 (https://phabricator.wikimedia.org/T333656) (owner: 10Dzahn) [21:08:07] (03CR) 10Dzahn: "yea, no monitoring, already skipped it" [dns] - 10https://gerrit.wikimedia.org/r/905754 (https://phabricator.wikimedia.org/T333656) (owner: 10Dzahn) [21:10:25] !log brennen@deploy2002 rebuilt and synchronized wikiversions files: all wikis to 1.41.0-wmf.4 refs T330210 [21:10:30] T330210: 1.41.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T330210 [21:16:02] (03PS1) 10Dzahn: ATS: remove map for query-preview.wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/908631 (https://phabricator.wikimedia.org/T333656) [21:16:29] (03PS11) 10Cathal Mooney: Adjust Netbox PuppetDB import script to set bridge dev and vlan tags [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832) [21:17:41] (03CR) 10Cathal Mooney: Adjust Netbox PuppetDB import script to set bridge dev and vlan tags (035 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832) (owner: 10Cathal Mooney) [21:17:58] (03CR) 10Ryan Kemper: [C: 03+1] ATS: remove map for query-preview.wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/908631 (https://phabricator.wikimedia.org/T333656) (owner: 10Dzahn) [21:19:00] (03CR) 10Dzahn: [C: 03+2] ATS: remove map for query-preview.wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/908631 (https://phabricator.wikimedia.org/T333656) (owner: 10Dzahn) [21:20:15] (03CR) 10Dzahn: 
"In hindsight it was nicer to first do ATS: https://gerrit.wikimedia.org/r/c/operations/puppet/+/908631 then this" [dns] - 10https://gerrit.wikimedia.org/r/905754 (https://phabricator.wikimedia.org/T333656) (owner: 10Dzahn) [21:20:59] (03CR) 10Dzahn: [C: 03+2] "all done on both backends. PHP7.3 packages purged. linked ticket was confirmed as resolved." [puppet] - 10https://gerrit.wikimedia.org/r/901612 (https://phabricator.wikimedia.org/T322357) (owner: 10Hashar) [21:25:06] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [21:25:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye executed... [21:25:44] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [21:25:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [21:26:06] 10SRE, 10Data-Services, 10Traffic: 2022-09-04 Scraping from AS714 (Apple) against dumps.wikimedia.org saturating network links - https://phabricator.wikimedia.org/T317001 (10BCornwall) 05Open→03Stalled [21:26:54] 10SRE, 10Data-Services, 10Traffic: 2022-09-04 Scraping from AS714 (Apple) against dumps.wikimedia.org saturating network links - https://phabricator.wikimedia.org/T317001 (10BCornwall) @CDanis, thank you for your work on this ticket! Would you agree that it's worth closing this ticket? Is there a desire to f... 
[21:28:00] !log https://query-preview.wikidata.org has been deactivated at ATS layer - T333656 [21:28:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:04] T333656: Decommission query-preview.wikidata.org - https://phabricator.wikimedia.org/T333656 [21:30:09] (03PS5) 10Dzahn: delete query-preview.wikidata.org [dns] - 10https://gerrit.wikimedia.org/r/905754 (https://phabricator.wikimedia.org/T333656) [21:37:11] !log Successfully Deployed analytics refinery using scap, then deployed onto hdfs. [21:37:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:49] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [21:38:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye executed... 
[21:39:20] (03CR) 10Cwhite: "This patch:" [puppet] - 10https://gerrit.wikimedia.org/r/907838 (https://phabricator.wikimedia.org/T333732) (owner: 10Cwhite) [21:41:02] (03PS1) 10Dzahn: httpbb: remove query-preview.wikidata.from tests for miscweb [puppet] - 10https://gerrit.wikimedia.org/r/908633 (https://phabricator.wikimedia.org/T333656) [21:41:57] (03CR) 10Dzahn: [C: 03+2] httpbb: remove query-preview.wikidata.from tests for miscweb [puppet] - 10https://gerrit.wikimedia.org/r/908633 (https://phabricator.wikimedia.org/T333656) (owner: 10Dzahn) [21:46:40] (03PS1) 10Dzahn: microsites/query_service: remove query-preview.wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/908634 (https://phabricator.wikimedia.org/T333656) [21:49:17] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [21:51:42] (03PS1) 10Ryan Kemper: wdqs: remove query-preview microsite [puppet] - 10https://gerrit.wikimedia.org/r/908635 (https://phabricator.wikimedia.org/T333656) [21:53:14] (03CR) 10Dzahn: [C: 03+1] "ah, well, if you can remove the entire envoy, cool. wdqs1009/1010 just need checking after merge, right?" 
[puppet] - 10https://gerrit.wikimedia.org/r/908635 (https://phabricator.wikimedia.org/T333656) (owner: 10Ryan Kemper) [21:53:32] (03Abandoned) 10Dzahn: microsites/query_service: remove query-preview.wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/908634 (https://phabricator.wikimedia.org/T333656) (owner: 10Dzahn) [21:57:46] (03CR) 10Ryan Kemper: [C: 03+2] delete query-preview.wikidata.org [dns] - 10https://gerrit.wikimedia.org/r/905754 (https://phabricator.wikimedia.org/T333656) (owner: 10Dzahn) [22:00:30] !log T333656 `ryankemper@dns1001:~$ sudo -i authdns-update` after merge of https://gerrit.wikimedia.org/r/905754 => `OK - authdns-update successful on all nodes!` [22:00:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:35] T333656: Decommission query-preview.wikidata.org - https://phabricator.wikimedia.org/T333656 [22:02:53] (ProbeDown) firing: (2) Service irc2001:6667 has failed probes (tcp_mw_rc_irc_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#irc2001:6667 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:03:18] (03PS2) 10Ryan Kemper: wdqs: remove query-preview microsite [puppet] - 10https://gerrit.wikimedia.org/r/908635 (https://phabricator.wikimedia.org/T333656) [22:03:27] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/908635 (https://phabricator.wikimedia.org/T333656) (owner: 10Ryan Kemper) [22:03:27] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review, 10User-MarcoAurelio: add MarcoAurelio to LDAP nda group - https://phabricator.wikimedia.org/T333884 (10Dzahn) >>! In T333884#8778796, @MarcoAurelio wrote: > Apologies for the delay. I emailed @KFrancis on the day she requested me to do so, however I had so... 
[22:04:17] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review, 10User-MarcoAurelio: add MarcoAurelio to LDAP nda group - https://phabricator.wikimedia.org/T333884 (10Dzahn) a:05MarcoAurelio→03None [22:04:39] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40676/console" [puppet] - 10https://gerrit.wikimedia.org/r/908635 (https://phabricator.wikimedia.org/T333656) (owner: 10Ryan Kemper) [22:10:47] (JobUnavailable) firing: Reduced availability for job udpmxircecho in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:11:42] (03CR) 10Ryan Kemper: [V: 03+1 C: 03+2] wdqs: remove query-preview microsite [puppet] - 10https://gerrit.wikimedia.org/r/908635 (https://phabricator.wikimedia.org/T333656) (owner: 10Ryan Kemper) [22:30:44] (03CR) 10Dzahn: "I see nothing wrong with it and I like the detailed plan. 
Just that the whole "take down gerrit1001" and "moght be better to sync" says to" [puppet] - 10https://gerrit.wikimedia.org/r/908617 (https://phabricator.wikimedia.org/T333143) (owner: 10Hashar) [22:41:21] (03CR) 10Dzahn: [C: 04-1] gerrit: replace Icinga with Prometheus monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902799 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [22:45:10] (03CR) 10Dzahn: "I think this one should be uncontroversial, unlike the https check that is not merged yet, but this just translates previous thing 1:1 I w" [puppet] - 10https://gerrit.wikimedia.org/r/904857 (https://phabricator.wikimedia.org/T331901) (owner: 10Dzahn) [23:13:15] (03PS1) 10Dzahn: sre: update planned quarters and tickets for collab services [puppet] - 10https://gerrit.wikimedia.org/r/908644 [23:15:49] (03PS2) 10Dzahn: sre: update planned quarters and tickets for collab services [puppet] - 10https://gerrit.wikimedia.org/r/908644 [23:25:51] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [23:25:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudvirtlocal1001.eqiad.wmnet with OS bullseye [23:41:37] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirtlocal1001.eqiad.wmnet with reason: host reimage [23:44:56] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirtlocal1001.eqiad.wmnet with reason: host reimage