[00:00:19] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:04:14] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5022.eqsin.wmnet with OS bullseye [00:04:20] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp5022.eqsin.wmnet with OS bullseye completed: - cp5022 (**PASS**) - Removed from Puppet and PuppetDB if present -... [00:04:45] PROBLEM - Check systemd state on grafana1002 is CRITICAL: CRITICAL - degraded: The following units failed: grafana-ldap-users-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:05:33] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:06:10] !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=cp5022.eqsin.wmnet [00:06:43] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [00:25:04] 10SRE, 10ops-codfw: (Need By:TBD) rack/setup/install rack A1 and A8 new PDUs - https://phabricator.wikimedia.org/T327404 (10Papaul) We are postponing the PDU's maintenance once again to a new date. We will update the task once we have the new date and time. 
Thank you [00:25:06] 10SRE, 10ops-codfw: (Need By:TBD) rack/setup/install rack A1 and A8 new PDUs - https://phabricator.wikimedia.org/T327404 (10Papaul) [00:44:42] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5023.eqsin.wmnet with OS bullseye [00:44:50] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp5023.eqsin.wmnet with OS bullseye [00:49:04] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:05:15] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) Using a slight modification of @jbond's script in T328593, the list of cp nodes in eqiad with the outdated firmware (`3.15.17.15`) is basically all the cp nodes in eqiad: ` cp1076.eqiad.wmnet cp1077.eqiad... 
[01:07:49] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp1075.eqiad.wmnet with OS bullseye [01:07:54] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp1075.eqiad.wmnet with OS bullseye [01:18:28] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5023.eqsin.wmnet with reason: host reimage [01:21:54] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5023.eqsin.wmnet with reason: host reimage [01:24:23] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1075.eqiad.wmnet with reason: host reimage [01:27:33] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1075.eqiad.wmnet with reason: host reimage [01:45:28] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:49:20] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1075.eqiad.wmnet with OS bullseye [01:49:26] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp1075.eqiad.wmnet with OS bullseye completed: - cp1075 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pu... 
[01:50:16] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:50:36] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [01:50:45] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1075.eqiad.wmnet,service=cdn [01:50:45] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1075.eqiad.wmnet,service=ats-be [01:55:25] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5023.eqsin.wmnet with OS bullseye [01:55:30] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp5023.eqsin.wmnet with OS bullseye completed: - cp5023 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pu... 
[01:55:45] !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=cp5023.eqsin.wmnet [01:56:28] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5024.eqsin.wmnet with OS bullseye [01:56:35] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp5024.eqsin.wmnet with OS bullseye [01:56:46] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [02:00:10] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:03:14] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:07:02] (03PS2) 10KartikMistry: Update cxserver to 2023-02-02-004918-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/882791 (https://phabricator.wikimedia.org/T129470) [02:10:45] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:14:08] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp5024.eqsin.wmnet with OS bullseye [02:14:15] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp5024.eqsin.wmnet with OS bullseye executed with errors: - cp5024 (**FAIL**) - Downtimed on Icinga/Alertmanager -... 
[02:14:35] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5024.eqsin.wmnet with OS bullseye [02:14:41] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp5024.eqsin.wmnet with OS bullseye [02:15:06] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:19:16] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:20:45] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:27:37] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) Steps to follow for manual upgrade of the iDRAC firmwares for the cp hosts in eqiad for us and in case someone else stumbles on this issue. The TL;DR is that we need to manually update the iDRAC firmware... 
[02:30:12] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:32:52] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:33:36] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:34:40] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:35:22] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:39:52] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 21 Apr 2023 05:11:22 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:41:22] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49565 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:42:08] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.257 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:46:08] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5024.eqsin.wmnet with reason: host reimage [02:49:15] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5024.eqsin.wmnet with reason: host reimage [03:00:35] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:04:31] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:22:30] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5024.eqsin.wmnet with OS bullseye [03:22:38] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp5024.eqsin.wmnet with OS bullseye completed: - cp5024 (**PASS**) - Removed from Puppet and PuppetDB if present -... 
[03:30:13] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:35:23] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:40:05] (03CR) 10RLazarus: [C: 03+2] "LGTM, thanks again! I'll get this deployed -- sorry I didn't get to it today." [software/httpbb] - 10https://gerrit.wikimedia.org/r/884920 (https://phabricator.wikimedia.org/T328280) (owner: 10Ilias Sarantopoulos) [03:41:43] (03Merged) 10jenkins-bot: feat: add json payload capability [software/httpbb] - 10https://gerrit.wikimedia.org/r/884920 (https://phabricator.wikimedia.org/T328280) (owner: 10Ilias Sarantopoulos) [04:00:32] !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=cp5024.eqsin.wmnet [04:01:13] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [04:15:09] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:20:21] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:49:04] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:53:10] 10SRE, 10DBA: db2181 stopped answering ping - https://phabricator.wikimedia.org/T328623 (10Marostegui) a:03Marostegui [05:00:13] RECOVERY - Check systemd state on 
an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:05:25] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:22:01] 10SRE, 10DBA: db2181 stopped answering ping - https://phabricator.wikimedia.org/T328623 (10Marostegui) Thanks for triaging this [05:45:17] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:50:31] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:01:03] * kart_ updating cxserver [06:01:11] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2023-02-02-004918-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/882791 (https://phabricator.wikimedia.org/T129470) (owner: 10KartikMistry) [06:01:21] (03PS1) 10Marostegui: db2181: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/885922 (https://phabricator.wikimedia.org/T328623) [06:01:57] (03CR) 10Marostegui: [C: 03+2] db2181: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/885922 (https://phabricator.wikimedia.org/T328623) (owner: 10Marostegui) [06:03:21] 10ops-codfw, 10DBA, 10Patch-For-Review: db2181 stopped answering ping - https://phabricator.wikimedia.org/T328623 (10Marostegui) Looks like hardware issues - @Papaul can you please reach out to dell? ` ------------------------------------------------------------------------------- Record: 8 Date/Time:... 
[06:06:30] 10ops-codfw, 10DBA, 10Patch-For-Review: db2181 stopped answering ping - https://phabricator.wikimedia.org/T328623 (10Marostegui) a:05Marostegui→03Papaul The host cannot even be powered back ON. [06:06:34] (03Merged) 10jenkins-bot: Update cxserver to 2023-02-02-004918-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/882791 (https://phabricator.wikimedia.org/T129470) (owner: 10KartikMistry) [06:09:08] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [06:09:31] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [06:12:58] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [06:12:59] (03CR) 10Winston Sung: "Thanks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/882791 (https://phabricator.wikimedia.org/T129470) (owner: 10KartikMistry) [06:13:52] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [06:15:38] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [06:16:30] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [06:17:08] !log Updated cxserver to 2023-02-02-004918-production (T129470, T172035, T327842) [06:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:17:14] T172035: Blockers for Wikimedia wiki domain renaming - https://phabricator.wikimedia.org/T172035 [06:17:14] T129470: CX can't load any pages from be-tarask Wikipedia - https://phabricator.wikimedia.org/T129470 [06:17:15] T327842: Post-creation work for gurwiki - https://phabricator.wikimedia.org/T327842 [06:30:33] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:35:45] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:53:07] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [06:54:47] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230202T0700) [07:00:05] kormat, marostegui, and Amir1: My dear minions, it's time we take the moon! Just kidding. Time for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230202T0700). [07:00:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:05:21] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:22:01] (03Abandoned) 10Gergő Tisza: [WIP] Update apache rules for 2.4 [puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/225553 (owner: 10Gergő Tisza) [07:45:25] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:46:57] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] C:idm::deployment collect static files [puppet] - 10https://gerrit.wikimedia.org/r/885787 (owner: 10Slyngshede) [07:47:22] (03CR) 10Slyngshede: [C: 03+2] Add social_auth pipeline for group creation. [software/bitu] - 10https://gerrit.wikimedia.org/r/885813 (owner: 10Slyngshede) [07:47:24] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Add social_auth pipeline for group creation. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/885813 (owner: 10Slyngshede) [07:50:39] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:53:15] (03CR) 10Elukey: [C: 03+1] admin: add user santhosh to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/885842 (https://phabricator.wikimedia.org/T328517) (owner: 10Herron) [07:54:58] (03PS1) 10Gergő Tisza: campaigns: Donor landing page translations for sv, it, ja, fr, nl [extensions/GrowthExperiments] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/885928 (https://phabricator.wikimedia.org/T321370) [07:55:17] (03PS1) 10Gergő Tisza: campaigns: Donor landing page translations for sv, it, ja, fr, nl [extensions/GrowthExperiments] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/885929 (https://phabricator.wikimedia.org/T321370) [07:55:29] (03PS1) 10Muehlenhoff: Point the webproxy in esams to install3002 [dns] - 10https://gerrit.wikimedia.org/r/885982 (https://phabricator.wikimedia.org/T327867) [07:56:44] (03PS1) 10Muehlenhoff: Apply installserver role to install3002 [puppet] - 10https://gerrit.wikimedia.org/r/885983 (https://phabricator.wikimedia.org/T327867) [08:00:05] Amir1, apergos, and jnuche: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230202T0800). [08:00:05] Aishik and tgr: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[08:00:19] (03CR) 10Muehlenhoff: [C: 03+2] Apply installserver role to install3002 [puppet] - 10https://gerrit.wikimedia.org/r/885983 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff) [08:00:33] (03PS26) 10Slyngshede: P:IDM Configure OIDC and LDAP. [puppet] - 10https://gerrit.wikimedia.org/r/884881 [08:00:35] morning! there are no trainees signed up today, but 4 patches from 2 devs are on the calendar [08:01:15] present (none of my patches need checking though) [08:01:20] I don't see Aishik here just yet, so tgr do you want to proceed? [08:01:24] er tgr_ [08:01:40] yeah, thanks [08:01:40] and I assume you would self deploy? [08:01:46] I can, sure [08:02:05] all righty. I've got the logstash dashboards up and all that, go for it. [08:02:29] (03PS2) 10Gergő Tisza: Document the '+' pattern for specifying wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885048 [08:02:38] (03CR) 10Gergő Tisza: [C: 03+2] Document the '+' pattern for specifying wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885048 (owner: 10Gergő Tisza) [08:02:59] (03CR) 10Gergő Tisza: [C: 03+2] campaigns: Donor landing page translations for sv, it, ja, fr, nl [extensions/GrowthExperiments] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/885928 (https://phabricator.wikimedia.org/T321370) (owner: 10Gergő Tisza) [08:03:03] (03CR) 10Gergő Tisza: [C: 03+2] campaigns: Donor landing page translations for sv, it, ja, fr, nl [extensions/GrowthExperiments] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/885929 (https://phabricator.wikimedia.org/T321370) (owner: 10Gergő Tisza) [08:03:23] (03Merged) 10jenkins-bot: Document the '+' pattern for specifying wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885048 (owner: 10Gergő Tisza) [08:08:38] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T327373 (10phaultfinder) [08:15:09] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:17:56] (03PS1) 10Muehlenhoff: Update DHCP config for esams [puppet] - 10https://gerrit.wikimedia.org/r/885984 (https://phabricator.wikimedia.org/T327867) [08:20:25] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:21:05] (03Merged) 10jenkins-bot: campaigns: Donor landing page translations for sv, it, ja, fr, nl [extensions/GrowthExperiments] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/885928 (https://phabricator.wikimedia.org/T321370) (owner: 10Gergő Tisza) [08:21:08] (03Merged) 10jenkins-bot: campaigns: Donor landing page translations for sv, it, ja, fr, nl [extensions/GrowthExperiments] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/885929 (https://phabricator.wikimedia.org/T321370) (owner: 10Gergő Tisza) [08:21:37] tgr_: there's the merge [08:23:27] !log tgr@deploy1002 Started scap: Backport for [[gerrit:885928|campaigns: Donor landing page translations for sv, it, ja, fr, nl (T321370)]], [[gerrit:885929|campaigns: Donor landing page translations for sv, it, ja, fr, nl (T321370)]] [08:23:31] T321370: Thank You Pages: custom account creation pages for sv, it, ja, fr, nl - https://phabricator.wikimedia.org/T321370 [08:27:18] !log tgr@deploy1002 tgr: Backport for [[gerrit:885928|campaigns: Donor landing page translations for sv, it, ja, fr, nl (T321370)]], [[gerrit:885929|campaigns: Donor landing page translations for sv, it, ja, fr, nl (T321370)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [08:27:51] Hey, I am waiting for my turn gerrit 885927 [08:28:38] Aishik: you'll be shortly. do you self-deploy or will you need me to deploy for you, I don't recall? [08:29:36] (03PS27) 10Slyngshede: P:IDM Configure OIDC and LDAP. 
[puppet] - 10https://gerrit.wikimedia.org/r/884881 [08:29:55] (03PS1) 10Alexandros Kosiaris: Adding Kavitha Appakayala to icinga [puppet] - 10https://gerrit.wikimedia.org/r/885985 (https://phabricator.wikimedia.org/T327403) [08:29:57] (03CR) 10CI reject: [V: 04-1] P:IDM Configure OIDC and LDAP. [puppet] - 10https://gerrit.wikimedia.org/r/884881 (owner: 10Slyngshede) [08:29:58] tgr_ when you are done please speak up here so the next patch owner can proceed. [08:30:48] You have to do it, I have no experience with this [08:32:27] (03PS28) 10Slyngshede: P:IDM Configure OIDC and LDAP. [puppet] - 10https://gerrit.wikimedia.org/r/884881 [08:32:33] no problem, I encourage you to sign up for a deployment training at some point, https://wikitech.wikimedia.org/wiki/Deployments/Training [08:34:17] Thanks [08:35:02] (03PS4) 10Aishik Rehman: Enable wgMinervaEnableSiteNotice for bnwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885927 (https://phabricator.wikimedia.org/T328630) [08:37:32] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39362/console" [puppet] - 10https://gerrit.wikimedia.org/r/884881 (owner: 10Slyngshede) [08:37:48] tgr_: since you haven't replied I'm going to assume you checked out early [08:37:54] !log tgr@deploy1002 Finished scap: Backport for [[gerrit:885928|campaigns: Donor landing page translations for sv, it, ja, fr, nl (T321370)]], [[gerrit:885929|campaigns: Donor landing page translations for sv, it, ja, fr, nl (T321370)]] (duration: 14m 26s) [08:37:54] moving ahead with your patch, Aishik [08:37:58] T321370: Thank You Pages: custom account creation pages for sv, it, ja, fr, nl - https://phabricator.wikimedia.org/T321370 [08:37:59] oh, nm [08:38:03] I was impatient [08:38:20] yeah, sorry, it took a while. Done now. [08:38:39] no worries, thanks! 
[08:38:46] now moving ahead with Aishik's patch [08:39:46] !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab Replica gitlab1004 to 15.7.6 [08:42:20] (03CR) 10ArielGlenn: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885927 (https://phabricator.wikimedia.org/T328630) (owner: 10Aishik Rehman) [08:43:04] (03CR) 10Muehlenhoff: [C: 03+2] Point the webproxy in esams to install3002 [dns] - 10https://gerrit.wikimedia.org/r/885982 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff) [08:44:01] (03CR) 10Muehlenhoff: [C: 03+2] Point DHCP server in esams to install3002 [homer/public] - 10https://gerrit.wikimedia.org/r/885805 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff) [08:44:12] oh looks like no pre merge jenkins, fine [08:44:33] (03CR) 10ArielGlenn: [C: 03+2] Enable wgMinervaEnableSiteNotice for bnwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885927 (https://phabricator.wikimedia.org/T328630) (owner: 10Aishik Rehman) [08:44:39] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Point DHCP server in esams to install3002 [homer/public] - 10https://gerrit.wikimedia.org/r/885805 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff) [08:44:41] (03Merged) 10jenkins-bot: Point DHCP server in esams to install3002 [homer/public] - 10https://gerrit.wikimedia.org/r/885805 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff) [08:45:16] ah it was just slow [08:45:21] (03Merged) 10jenkins-bot: Enable wgMinervaEnableSiteNotice for bnwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885927 (https://phabricator.wikimedia.org/T328630) (owner: 10Aishik Rehman) [08:46:54] !log ariel@deploy1002 Started scap: Backport for [[gerrit:885927|Enable wgMinervaEnableSiteNotice for bnwiktionary (T328630)]] [08:46:58] T328630: Enable wgMinervaEnableSiteNotice for bnwiktionary - https://phabricator.wikimedia.org/T328630 [08:47:38] 
(03CR) 10Slyngshede: [V: 03+1] P:IDM Configure OIDC and LDAP. (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/884881 (owner: 10Slyngshede) [08:48:48] !log ariel@deploy1002 ariel and aishik: Backport for [[gerrit:885927|Enable wgMinervaEnableSiteNotice for bnwiktionary (T328630)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [08:49:04] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:49:11] Aishik: please test your path, it is now live on mwdebug1001 [08:49:15] *patch [08:51:07] Everything is alright! [08:51:22] Thanks apergos [08:51:37] ok, I'll complete the scap now [08:57:12] (03CR) 10Muehlenhoff: [C: 03+2] Update DHCP config for esams [puppet] - 10https://gerrit.wikimedia.org/r/885984 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff) [08:57:51] !log ariel@deploy1002 Finished scap: Backport for [[gerrit:885927|Enable wgMinervaEnableSiteNotice for bnwiktionary (T328630)]] (duration: 10m 56s) [08:57:54] T328630: Enable wgMinervaEnableSiteNotice for bnwiktionary - https://phabricator.wikimedia.org/T328630 [08:58:05] Aishik: your patch is live in production, please test [08:58:07] grrrrr [08:58:53] let's hope that was just a network issue or something and that they will be back shortly. 
[09:00:05] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:01:10] 10ops-codfw, 10DBA: db2181 stopped answering ping - https://phabricator.wikimedia.org/T328623 (10jcrespo) I've manually disabled notifications on Icinga, as puppet cannot run on the host to apply T328623#8581258, to prevent further notifications. This will require manual removal later. [09:04:52] (03PS1) 10Muehlenhoff: Remove installserver role from install3001 [puppet] - 10https://gerrit.wikimedia.org/r/885989 (https://phabricator.wikimedia.org/T327867) [09:05:25] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:08:17] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:10:59] (03CR) 10Filippo Giunchedi: [C: 03+1] Adding Kavitha Appakayala to icinga [puppet] - 10https://gerrit.wikimedia.org/r/885985 (https://phabricator.wikimedia.org/T327403) (owner: 10Alexandros Kosiaris) [09:11:25] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: update o11y with opensearch roles and settings [puppet] - 10https://gerrit.wikimedia.org/r/885371 (owner: 10Filippo Giunchedi) [09:11:32] !log roll restart of eventgate-main pods in wikikube eqiad/codfw to pick up new stream configs - T328576 [09:11:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:35] T328576: Implement new mediawiki.revision-score streams with Lift 
Wing - https://phabricator.wikimedia.org/T328576 [09:12:59] !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-main: sync [09:13:03] I think our patch owner is not returning, so I'll call this done, though a bit late [09:13:13] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-main: sync [09:13:20] (03CR) 10Filippo Giunchedi: [C: 03+2] "Thank you for the review -- agreed this doesn't fix the underlying permission problem. I'll followup in a (sub)task" [puppet] - 10https://gerrit.wikimedia.org/r/885373 (owner: 10Filippo Giunchedi) [09:13:29] (03CR) 10Filippo Giunchedi: [C: 03+2] opensearch: move to /run/ [puppet] - 10https://gerrit.wikimedia.org/r/885372 (owner: 10Filippo Giunchedi) [09:13:31] !log UTC morning backport and config training window done [09:13:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:20] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/884881 (owner: 10Slyngshede) [09:16:00] !log jelto@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab Replica gitlab1004 to 15.7.6 [09:19:25] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 398143 [09:19:49] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 398143 [09:21:11] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10Clement_Goubert) >>! In T328287#8579474, @Trizek-WMF wrote: > As you gave 3 dates in the task description, can you... 
[09:22:28] (03CR) 10Filippo Giunchedi: [C: 04-1] "See inline, found an issue while testing" [puppet] - 10https://gerrit.wikimedia.org/r/874891 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond) [09:23:02] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/885441 (https://phabricator.wikimedia.org/T320553) (owner: 10JHathaway) [09:23:22] (03PS1) 10Ilias Sarantopoulos: httpbb: add tests for liftwing (prod/staging) [puppet] - 10https://gerrit.wikimedia.org/r/885990 [09:23:51] (03PS2) 10Ilias Sarantopoulos: httpbb: add tests for liftwing (prod/staging) [puppet] - 10https://gerrit.wikimedia.org/r/885990 [09:25:38] (03CR) 10CI reject: [V: 04-1] httpbb: add tests for liftwing (prod/staging) [puppet] - 10https://gerrit.wikimedia.org/r/885990 (owner: 10Ilias Sarantopoulos) [09:28:17] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:40:09] !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-main: sync [09:40:51] !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: sync [09:45:15] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:46:02] (03PS1) 10Elukey: changeprop: refactor match template for liftwing [deployment-charts] - 10https://gerrit.wikimedia.org/r/885991 (https://phabricator.wikimedia.org/T327302) [09:46:48] (03CR) 10CI reject: [V: 04-1] changeprop: refactor match template for liftwing [deployment-charts] - 
10https://gerrit.wikimedia.org/r/885991 (https://phabricator.wikimedia.org/T327302) (owner: 10Elukey) [09:50:33] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:51:19] (03PS2) 10Elukey: changeprop: refactor match template for liftwing [deployment-charts] - 10https://gerrit.wikimedia.org/r/885991 (https://phabricator.wikimedia.org/T327302) [09:51:59] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host aqs2001.codfw.wmnet [09:53:21] (03CR) 10Muehlenhoff: [C: 03+2] Remove installserver role from install3001 [puppet] - 10https://gerrit.wikimedia.org/r/885989 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff) [09:54:52] !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-main: sync [09:54:59] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] P:IDM Configure OIDC and LDAP. [puppet] - 10https://gerrit.wikimedia.org/r/884881 (owner: 10Slyngshede) [09:55:26] !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: sync [09:59:07] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aqs2001.codfw.wmnet [10:02:43] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate the install servers to Bullseye - https://phabricator.wikimedia.org/T327867 (10MoritzMuehlenhoff) [10:04:26] !log installing tiff security updates [10:04:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:38] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/883913 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche) [10:11:53] !log restarting FPM on mw canaries to pick up tiff security updates [10:11:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:33] (03PS1) 10Slyngshede: C:IDM parse static dir to deployment. 
[puppet] - 10https://gerrit.wikimedia.org/r/885995 [10:17:42] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [10:18:13] (03CR) 10Jaime Nuche: jenkins: add hieradata config for Scap3-based deployments (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/883913 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche) [10:19:53] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39363/console" [puppet] - 10https://gerrit.wikimedia.org/r/885995 (owner: 10Slyngshede) [10:19:53] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host aqs2002.codfw.wmnet [10:20:52] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] C:IDM parse static dir to deployment. [puppet] - 10https://gerrit.wikimedia.org/r/885995 (owner: 10Slyngshede) [10:25:01] (03CR) 10Jelto: [C: 04-1] "I'm not sure if you want to remove old scap/jenkins components when switching to scap3. If yes, you have to use the ensure flags otherwise" [puppet] - 10https://gerrit.wikimedia.org/r/884887 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche) [10:27:21] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aqs2002.codfw.wmnet [10:30:02] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:35:22] 10SRE, 10DBA, 10Datacenter-Switchover, 10Patch-For-Review: switchdc should automatically downtime "Read only" checks on DB masters being switched - https://phabricator.wikimedia.org/T285803 (10Clement_Goubert) Is this still relevant, does it need to be finished for {T327920}, or can it be closed? 
[10:37:35] (03PS1) 10Hokwelum: Update README file [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/885997 [10:39:40] (03CR) 10Filippo Giunchedi: [C: 03+1] "probes will start working once we're back to one centrallog server in eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/882761 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse) [10:40:25] (03CR) 10CI reject: [V: 04-1] Update README file [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/885997 (owner: 10Hokwelum) [10:48:34] 10SRE, 10Infrastructure-Foundations, 10netops, 10serviceops: Optimize k8s same row traffic flows - https://phabricator.wikimedia.org/T328523 (10cmooney) > BGP is smart about it (see '"first party" NEXT_HOP' in section 5.1.3.2 of the RFC), so it should just work on the router side. TIL didn't realise EBGP... [10:49:05] (03PS2) 10Hokwelum: Update README file [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/885997 [10:49:59] (03CR) 10CI reject: [V: 04-1] Update README file [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/885997 (owner: 10Hokwelum) [10:50:17] (03PS1) 10Mvolz: Update zotero to 2023-02-01-144124-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/885998 [10:51:05] (03PS3) 10Hokwelum: Update README file [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/885997 [10:51:58] (03CR) 10CI reject: [V: 04-1] Update README file [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/885997 (owner: 10Hokwelum) [10:55:40] PROBLEM - Check systemd state on sretest1002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:55:50] (03CR) 10Mvolz: [C: 03+2] Update zotero to 2023-02-01-144124-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/885998 (owner: 10Mvolz) [11:00:05] mvolz: Your horoscope predicts another unfortunate Services – Citoid / Zotero deploy. 
May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230202T1100). [11:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230202T1100) [11:00:07] (03CR) 10Jbond: [C: 03+2] wmflib::ssl_ciphersuites: drop suppport for anything less then jessie [puppet] - 10https://gerrit.wikimedia.org/r/640467 (owner: 10Jbond) [11:00:49] (03Merged) 10jenkins-bot: Update zotero to 2023-02-01-144124-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/885998 (owner: 10Mvolz) [11:01:36] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39364/console" [puppet] - 10https://gerrit.wikimedia.org/r/640467 (owner: 10Jbond) [11:02:14] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/zotero: apply [11:03:34] (03PS1) 10Vgutierrez: varnish: Provide a valid DP key [labs/private] - 10https://gerrit.wikimedia.org/r/886000 (https://phabricator.wikimedia.org/T315676) [11:04:08] (03CR) 10Vgutierrez: [V: 03+2 C: 03+2] varnish: Provide a valid DP key [labs/private] - 10https://gerrit.wikimedia.org/r/886000 (https://phabricator.wikimedia.org/T315676) (owner: 10Vgutierrez) [11:05:45] (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:07:34] (03CR) 10Muehlenhoff: [C: 03+1] "One remaining concern was that a process spawned by systemd which has a shell configured (which isn't the case for the majority of service" [puppet] - 10https://gerrit.wikimedia.org/r/879418 (https://phabricator.wikimedia.org/T300977) (owner: 10Jbond) [11:09:04] 10SRE, 10DBA, 10Datacenter-Switchover, 10Patch-For-Review: switchdc should automatically downtime "Read only" checks on DB masters 
being switched - https://phabricator.wikimedia.org/T285803 (10Marostegui) We really need this to be completed yes. I don't know in which state this is at the moment. [11:09:21] (03PS3) 10Elukey: changeprop: refactor match template for liftwing [deployment-charts] - 10https://gerrit.wikimedia.org/r/885991 (https://phabricator.wikimedia.org/T327302) [11:12:19] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/zotero: apply [11:13:19] jouncebot: now [11:13:19] For the next 0 hour(s) and 46 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230202T1100) [11:13:19] For the next 0 hour(s) and 46 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230202T1100) [11:14:42] does anyone mind if I run a maintenance script to fix T328634? [11:14:42] T328634: Lost pages after deployed addtional namespaces on shn.wikibooks - https://phabricator.wikimedia.org/T328634 [11:14:57] I mean, I am trying to deploy right now [11:15:04] but it just failed [11:15:07] Error: UPGRADE FAILED: release staging failed, and has been rolled back due to atomic being set: timed out waiting for the condition [11:15:07] ah, sorry, I didn’t see that [11:15:17] I’ll hold then [11:15:22] I'm actually not sure if I should just try again? [11:15:24] don’t think I can help with helm errors though [11:15:27] That's helm errors [11:15:43] Yeah. [11:15:45] (JobUnavailable) resolved: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:16:02] Does it tell you what namespace failed ? [11:17:50] Oh it's on zotero staging? 
[11:17:59] (03CR) 10Hnowlan: [C: 03+1] changeprop: refactor match template for liftwing [deployment-charts] - 10https://gerrit.wikimedia.org/r/885991 (https://phabricator.wikimedia.org/T327302) (owner: 10Elukey) [11:18:32] claime: yup [11:18:40] ideas? [11:19:22] (03CR) 10Elukey: [C: 03+2] changeprop: refactor match template for liftwing [deployment-charts] - 10https://gerrit.wikimedia.org/r/885991 (https://phabricator.wikimedia.org/T327302) (owner: 10Elukey) [11:19:28] mvolz: you're deploying directly via helmfile on deploy1002 right? [11:19:45] Lucas_WMDE: I think you should go ahead (did these windows always overlap? I feel like this is new) [11:20:17] what I’m doing doesn’t belong to either window, I just thought both windows might be inactive [11:20:19] The MW infra window is recent [11:20:24] claime: yes [11:20:25] (03PS1) 10DCausse: [WIP] rdf-streaming-updater: add a test job using the k8s operator... [deployment-charts] - 10https://gerrit.wikimedia.org/r/886005 [11:20:29] And it's for mw-on-k8s deployments [11:20:45] i.e for now mostly _joe_ and I [11:21:03] Ah okay [11:21:16] So don´t worry about overlap for now :) [11:21:33] Lucas_WMDE: go ahead I'm not touching anything on mw-on-k8s rn [11:21:50] <_joe_> mvolz: what are you trying to deploy? [11:21:54] alright, I’ll run the script and hope it works [11:22:00] shouldn’t affect citoid anyway [11:22:04] yup [11:22:04] good luck with your parts :) [11:22:06] <_joe_> citoid? [11:22:12] I'm trying to deploy zotero [11:22:16] <_joe_> mvolz: where were you deploying it? [11:22:18] <_joe_> staging? 
[11:22:20] *zotero [11:22:24] yes, staging [11:22:26] * Lucas_WMDE was confused [11:22:28] and it just timed out [11:22:29] !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: sync [11:22:30] <_joe_> jayme: ^^ [11:22:40] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: sync [11:22:41] _joe_: https://pastebin.com/m953ABNj [11:22:45] <_joe_> issues in staging bringing up zotero [11:23:00] not terribly helpful message but that's what I got back after about 11 minutes [11:23:02] <_joe_> mvolz: yeah I'm redirecting the debugging to jayme sorry, I have my hands full with other stuff [11:23:12] ok [11:23:21] (03CR) 10CI reject: [V: 04-1] [WIP] rdf-streaming-updater: add a test job using the k8s operator... [deployment-charts] - 10https://gerrit.wikimedia.org/r/886005 (owner: 10DCausse) [11:23:28] <_joe_> mvolz: yes that's what you get when k8s is unable to deploy something within some of its timeouts [11:23:44] <_joe_> mvolz: out of curiosity, were you deploying a new version of the image? [11:23:52] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript namespaceDupes.php shnwikibooks --fix | tee T328634-namespaceDupes.out # T328634 – failed quickly, details in task [11:23:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:55] T328634: Lost pages after deployed addtional namespaces on shn.wikibooks - https://phabricator.wikimedia.org/T328634 [11:23:59] <_joe_> if so, was it significantly larger than the last one? [11:24:22] _joe_: yes, a new version but I don't think much bigger [11:24:36] <_joe_> ok, uhm [11:24:54] (03CR) 10DCausse: [WIP] rdf-streaming-updater: add a test job using the k8s operator... 
(031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/886005 (owner: 10DCausse) [11:26:24] I could try to deploy a very minor change to citoid (package update) and see if it's all services being hinksy if you like or just a zotero specific issue [11:26:55] that I was going to do next [11:27:01] (03PS1) 10Jbond: puppet-merge: try to decode with erros=ignore on failure [puppet] - 10https://gerrit.wikimedia.org/r/886006 [11:27:06] (03CR) 10Mvolz: [C: 03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/865198 (owner: 10PipelineBot) [11:28:09] 15m Warning BackOff pod/zotero-staging-9cb5fcb5d-z68zr Back-off restarting failed container [11:28:10] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript namespaceDupes.php shnwikibooks --fix --add-prefix=T328634/ | tee T328634-namespaceDupes-2.out # T328634 – another error but made more progress [11:28:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:13] Container fails to start [11:28:46] :( [11:29:03] should we revert and see if that works? [11:29:25] (03CR) 10Vgutierrez: [C: 03+2] varnish: Generate a DP subkey daily [puppet] - 10https://gerrit.wikimedia.org/r/857748 (https://phabricator.wikimedia.org/T315676) (owner: 10Vgutierrez) [11:29:33] The service should still be up, but on the former version [11:32:04] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/865198 (owner: 10PipelineBot) [11:32:25] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript namespaceDupes.php shnwikibooks --fix --add-prefix=T328634/ | tee T328634-namespaceDupes-3.out # T328634 – seemed to finish the first 20 pages and then go into an infinite loop, I Ctrl+Ced it [11:32:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:28] T328634: Lost pages after deployed addtional namespaces on shn.wikibooks - https://phabricator.wikimedia.org/T328634 [11:33:00] yeah, it is. 
[11:34:31] claime: would you mind if I tried the citoid deploy, or would that be disruptive [11:36:15] (03CR) 10DCausse: flink-app: add preliminary H/A support (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/885832 (owner: 10DCausse) [11:36:24] (03Abandoned) 10DCausse: flink-app: add preliminary H/A support [deployment-charts] - 10https://gerrit.wikimedia.org/r/885832 (owner: 10DCausse) [11:37:04] mvolz: Nah go [11:37:19] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript namespaceDupes.php shnwikibooks --fix | tee T328634-namespaceDupes-4.out # T328634 – made some progress then errored out again [11:37:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:55] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/citoid: apply [11:38:27] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply [11:39:22] !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/citoid: apply [11:39:25] (03PS1) 10Vgutierrez: varnish: Fix python3-nacl dependency order issue [puppet] - 10https://gerrit.wikimedia.org/r/886008 (https://phabricator.wikimedia.org/T315676) [11:40:05] !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/citoid: apply [11:40:44] I think I’m done with my maintenance script runs for now [11:40:53] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39365/console" [puppet] - 10https://gerrit.wikimedia.org/r/886008 (https://phabricator.wikimedia.org/T315676) (owner: 10Vgutierrez) [11:41:13] !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/citoid: apply [11:41:34] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/886006 (owner: 10Jbond) [11:41:50] !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [11:42:01] !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/citoid: apply [11:42:39] Lucas_WMDE: ack 
[11:42:52] !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [11:43:02] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] varnish: Fix python3-nacl dependency order issue [puppet] - 10https://gerrit.wikimedia.org/r/886008 (https://phabricator.wikimedia.org/T315676) (owner: 10Vgutierrez) [11:44:23] well citoid deploy went fine [11:46:24] !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [11:49:44] 10SRE, 10Traffic, 10Patch-For-Review: Add DP cookie for pageview filtering - https://phabricator.wikimedia.org/T315676 (10Vgutierrez) Initial sanity checks confirms that the daily key generated on two different hosts is the same: ` vgutierrez@cumin1001:~$ sudo -i cumin 'cp[6015,6016].*' 'sha512sum /etc/varni... [11:54:01] (03PS1) 10Slyngshede: Switch to CAS OIDC for login button. [software/bitu] - 10https://gerrit.wikimedia.org/r/886010 [11:57:44] (03PS2) 10Slyngshede: Switch to CAS OIDC for login button. [software/bitu] - 10https://gerrit.wikimedia.org/r/886010 [12:00:06] RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:01:11] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10Clement_Goubert) >>! In T328287#8579336, @Trizek-WMF wrote: > @Clement_Goubert Has anything major changed in your p... [12:05:22] PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: planet_sync_tile_generation-gis.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:08:05] !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
[12:08:35] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [12:09:05] (03PS1) 10Stevemunene: Bump up mediawiki_history_snapshot to 2023-01 [puppet] - 10https://gerrit.wikimedia.org/r/886013 [12:13:50] PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [12:14:36] PROBLEM - haproxy failover on dbproxy1020 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [12:18:23] marostegui: Amir1: doing something on db1117:3323 ? [12:18:24] (03CR) 10Jelto: [C: 03+1] jenkins: add hieradata config for Scap3-based deployments (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/883913 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche) [12:18:34] let me see [12:18:51] 3323? that should be m3 I think [12:19:38] there is another too [12:21:59] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host aqs2003.codfw.wmnet [12:22:18] PROBLEM - haproxy failover on dbproxy1013 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [12:22:23] the mysql has crashed [12:22:30] debugging [12:23:11] Is this an OK time to deploy a security patch? 
[12:23:14] Amir1: Yeah, other port on 1117 is down [12:23:19] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp1076.eqiad.wmnet with OS bullseye [12:23:21] (3322) [12:23:26] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp1076.eqiad.wmnet with OS bullseye [12:23:37] jouncebot: nowandnext [12:23:37] No deployments scheduled for the next 1 hour(s) and 36 minute(s) [12:23:37] In 1 hour(s) and 36 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230202T1400) [12:23:38] In 1 hour(s) and 36 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230202T1400) [12:24:04] RECOVERY - haproxy failover on dbproxy1013 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [12:24:05] (03CR) 10Jaime Nuche: jenkins: use Scap3 deployment for releases instances (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/884887 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche) [12:24:24] RECOVERY - haproxy failover on dbproxy1016 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [12:25:14] RECOVERY - haproxy failover on dbproxy1020 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [12:25:16] The unit mariadb@m3.service has successfully entered the 'dead' state. [12:25:45] ? [12:25:49] kostajh: I think you can go ahead [12:26:06] Amir1: that is me [12:26:07] jynus: m3 only on db1117 died out of nowhere [12:26:16] ah [12:26:24] marostegui: I brought it back, shall I kill it? [12:26:24] oh you started it??? [12:26:34] I need to start again :( [12:26:37] Well yeah, it started making haproxy alert marostegui ... 
[12:26:59] stopped m3 [12:27:19] Amir1: please leave it with me, I don't need it stopped yet [12:27:45] okay, brought it back [12:27:55] Amir1: please stop touching it [12:28:11] * Amir1 hands off now [12:28:57] db1164 is also out, it's m2. Is that you too? [12:29:04] !log btullis@deploy1002 Started deploy [analytics/superset/deploy@5175ad7]: Production deployment for numpy downgrade [12:29:10] yes [12:29:17] noted [12:29:26] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aqs2003.codfw.wmnet [12:29:27] !log Work ongoing on m2 and m3 [12:29:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:45] claime: thanks [12:29:46] !log btullis@deploy1002 Finished deploy [analytics/superset/deploy@5175ad7]: Production deployment for numpy downgrade (duration: 00m 42s) [12:30:14] next question, is anyone able to help me with the security patch deployment, as I have not done that before? [12:30:38] kostajh: https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Security_patches this should be good [12:31:51] Amir1: thanks, I've seen that... I guess I'll try the deployment via script approach and hope for the best [12:32:50] PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [12:33:10] ^ expected [12:33:10] PROBLEM - haproxy failover on dbproxy1020 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [12:33:24] yes, all proxies are expected to irc alert [12:34:56] marostegui: No problem, tell me when you're done so I know I can start worrying again ;) [12:34:56] Amir1: I'm stuck at step 0. "No ED25519 host key is known for deployment.eqiad.wmnet". I've run `scripts/wmf-update-known-hosts-production` from the `wmf-sre-laptop` repo and other servers work... 
am I missing something obvious [12:35:15] kostajh: deploy1002.eqiad.wmnet [12:35:21] yup [12:35:22] claime: I will [12:35:34] marostegui: thanks mate <3 [12:35:35] aha [12:35:36] thanks [12:37:49] What am I missing in my SSH config? https://wikitech.wikimedia.org/wiki/Deploy1002 says to use `deployment.eqiad.wmnet` [12:38:09] My SSH config looks like https://wikitech.wikimedia.org/wiki/SRE/Production_access#Setting_up_your_SSH_config [12:39:29] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [12:39:40] (03CR) 10EoghanGaffney: [C: 03+2] Rotate aphlict logs either daily, or when they reach 1G [puppet] - 10https://gerrit.wikimedia.org/r/885858 (https://phabricator.wikimedia.org/T325246) (owner: 10EoghanGaffney) [12:39:44] !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [12:40:13] (03CR) 10Jelto: [C: 04-1] "comment in line" [puppet] - 10https://gerrit.wikimedia.org/r/884887 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche) [12:40:16] PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [12:41:04] PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [12:41:06] (03PS1) 10Marostegui: db1164: Move it from m2 to m3 [puppet] - 10https://gerrit.wikimedia.org/r/886030 (https://phabricator.wikimedia.org/T328402) [12:41:36] (03CR) 10Marostegui: [C: 03+2] db1164: Move it from m2 to m3 [puppet] - 10https://gerrit.wikimedia.org/r/886030 (https://phabricator.wikimedia.org/T328402) (owner: 10Marostegui) [12:41:43] kostajh: Yeah so, deployment.eqiad.wmnet points to deploy1002.eqiad.wmnet, but apparently we don't create the known_hosts entry for it. 
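(Editorial aside, not part of the channel log.) The "No ED25519 host key is known" error above happens because SSH looks up the name that was typed, not the host the alias resolves to. A minimal Python sketch of that lookup follows, assuming a plain-text (unhashed) known_hosts file; the function name `hosts_with_key`, the sample entries, and the placeholder key material are all hypothetical, and wildcard patterns and `[host]:port` forms are not interpreted.

```python
def hosts_with_key(known_hosts_text, key_type="ssh-ed25519"):
    """Collect every literal host name that has a key of the given type."""
    hosts = set()
    for line in known_hosts_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if fields[0].startswith("@"):
            fields = fields[1:]  # drop an optional marker such as @cert-authority
        if len(fields) < 3 or fields[1] != key_type:
            continue
        if fields[0].startswith("|"):
            continue  # hashed host names cannot be matched by plain text
        hosts.update(fields[0].split(","))  # one entry may list several names
    return hosts


# Hypothetical sample: the canonical host has an entry, the alias does not.
SAMPLE = """\
# placeholder key material, for illustration only
deploy1002.eqiad.wmnet ssh-ed25519 AAAAC3PLACEHOLDERKEY
mwdebug1001.eqiad.wmnet ssh-ed25519 AAAAC3PLACEHOLDERKEY
"""

known = hosts_with_key(SAMPLE)
for name in ("deploy1002.eqiad.wmnet", "deployment.eqiad.wmnet"):
    print(name, "->", "key known" if name in known else "no ED25519 host key")
```

Under these assumptions the alias is reported as missing even though it points at a host whose key is present, which matches the symptom described in the channel.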
[12:41:45] (03PS1) 10Muehlenhoff: Move webproxy in ulsfo to install4002 [dns] - 10https://gerrit.wikimedia.org/r/886031 (https://phabricator.wikimedia.org/T327867) [12:41:54] kostajh: just ssh deploy1002.eqiad.wmnet [12:41:55] eoghan: can I merge your changes? [12:42:01] !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [12:42:14] PROBLEM - haproxy failover on dbproxy1013 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [12:42:16] PROBLEM - haproxy failover on dbproxy1012 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [12:42:18] marostegui: I was just about to but if I'm blocking you from doing it then go right ahead! [12:42:25] eoghan: doing it [12:42:36] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [12:42:44] marostegui: Wonderful, thank you! 
[12:43:34] RECOVERY - haproxy failover on dbproxy1013 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [12:43:36] RECOVERY - haproxy failover on dbproxy1012 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [12:43:42] RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [12:44:12] RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [12:44:41] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1076.eqiad.wmnet with reason: host reimage [12:45:11] claime: ack [12:45:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:46:10] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [12:46:48] (03PS28) 10Stevemunene: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) [12:47:07] !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[12:47:44] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1076.eqiad.wmnet with reason: host reimage [12:51:10] (03CR) 10Stevemunene: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39366/console" [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [12:52:04] Amir1 / claime: about to run the script. There's no verification step via WikimediaDebug, AIUI. Is there a command to revert the deployment in case of unanticipated problems? [12:52:30] I don't remember [12:53:50] (03PS1) 10Muehlenhoff: Setup install4002 as install server [puppet] - 10https://gerrit.wikimedia.org/r/886036 (https://phabricator.wikimedia.org/T327867) [12:54:47] (03PS1) 10Muehlenhoff: Update DHCP config in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/886037 (https://phabricator.wikimedia.org/T327867) [12:55:04] (03PS20) 10Jaime Nuche: jenkins: add hieradata config for Scap3-based deployments [puppet] - 10https://gerrit.wikimedia.org/r/883913 (https://phabricator.wikimedia.org/T323909) [12:55:04] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host aqs2004.codfw.wmnet [12:55:06] (03PS8) 10Jaime Nuche: jenkins: use Scap3 deployment for releases instances [puppet] - 10https://gerrit.wikimedia.org/r/884887 (https://phabricator.wikimedia.org/T323909) [12:55:08] (03PS7) 10Jaime Nuche: jenkins: enable Scap3 deployment for active releases instance [puppet] - 10https://gerrit.wikimedia.org/r/884891 (https://phabricator.wikimedia.org/T323909) [12:55:51] kostajh: which script? [12:55:52] kostajh: I don't know. jnuche do you or someone from your team ? 
[12:55:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:56:09] taavi: deploy_security.py I suppose [12:56:12] I'm running it now. taavi: this one https://gitlab.wikimedia.org/repos/releng/release/-/blob/master/deploy_security.py [12:57:17] scap sync-file --help says there is an argument called --pause-after-testserver-sync, which could probably be used by the script to implement an mwdebug step [12:59:46] ack [12:59:55] I'm on the "When you run it, sometimes it might look like it's stuck. Don't worry, it's doing stuff." step now... [13:00:16] claime, kostajh: if you're running `sync-file` manually, then the flag mentioned by taavi should stop so you can verify [13:00:36] (03PS1) 10Giuseppe Lavagetto: [WIP] Add sre.discovery.datacenter-route [cookbooks] - 10https://gerrit.wikimedia.org/r/886038 [13:00:49] !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [13:00:53] jnuche: kostajh is running the procedure at https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Deployment:_via_script [13:01:40] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aqs2004.codfw.wmnet [13:01:58] (KubernetesRsyslogDown) firing: rsyslog on ml-serve1008:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=ml-serve1008 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:02:06] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
[13:03:11] !log kharlan: Deployed security patch for T328643 [13:03:52] RECOVERY - haproxy failover on dbproxy1016 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [13:03:55] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [13:04:15] claime: ack, we don't seem to have a flag for that in that script unfortunately [13:04:30] !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [13:04:44] claime: all proxies are back, maintenance is finished on m2 and m3 [13:05:09] marostegui: Thanks, and sorry for the earlier disruption [13:06:58] (KubernetesRsyslogDown) resolved: rsyslog on ml-serve1008:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=ml-serve1008 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:06:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:08:21] taavi: I filed your idea about `--pause-after-testserver-sync` as T328667 [13:08:22] T328667: Add --pause-after-testserver-sync option to deploy_security.py - https://phabricator.wikimedia.org/T328667 [13:08:23] (03PS1) 10Slyngshede: C:IDM Add timers and background workers. 
[puppet] - 10https://gerrit.wikimedia.org/r/886039 [13:08:31] (03CR) 10Jaime Nuche: "Latest PCC: https://puppet-compiler.wmflabs.org/output/884887/39367/" [puppet] - 10https://gerrit.wikimedia.org/r/884887 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche) [13:08:44] jnuche: and I wonder if that script should be integrated in scap entirely.. having 'run this script you just downloaded' as the official workflow isn't ideal especially for security stuff [13:08:54] (03CR) 10CI reject: [V: 04-1] C:IDM Add timers and background workers. [puppet] - 10https://gerrit.wikimedia.org/r/886039 (owner: 10Slyngshede) [13:09:08] yeah, would be nice if this was in scap [13:09:36] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1076.eqiad.wmnet with OS bullseye [13:09:42] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp1076.eqiad.wmnet with OS bullseye completed: - cp1076 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled Pu... [13:10:01] !log kharlan: Deployed security patch for T328643 [13:11:05] (03PS2) 10Slyngshede: C:IDM Add timers and background workers. [puppet] - 10https://gerrit.wikimedia.org/r/886039 [13:11:12] seems to be done [13:11:13] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:11:31] (03CR) 10Muehlenhoff: [C: 03+2] Setup install4002 as install server [puppet] - 10https://gerrit.wikimedia.org/r/886036 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff) [13:11:39] (03CR) 10CI reject: [V: 04-1] C:IDM Add timers and background workers. 
[puppet] - 10https://gerrit.wikimedia.org/r/886039 (owner: 10Slyngshede) [13:12:18] taavi, kostajh: agreed [13:13:09] (03PS3) 10Slyngshede: C:IDM Add timers and background workers. [puppet] - 10https://gerrit.wikimedia.org/r/886039 [13:13:30] (03CR) 10CI reject: [V: 04-1] C:IDM Add timers and background workers. [puppet] - 10https://gerrit.wikimedia.org/r/886039 (owner: 10Slyngshede) [13:14:23] (03PS4) 10Slyngshede: C:IDM Add timers and background workers. [puppet] - 10https://gerrit.wikimedia.org/r/886039 [13:14:25] jnuche: I'm following https://wikitech.wikimedia.org/wiki/How_to_deploy_code#After and it says "You may have to ask a releng/SRE person to check that the build worked correctly if you don’t have the necessary access yourself.". I don't have access, could you please look? [13:14:44] (03CR) 10CI reject: [V: 04-1] C:IDM Add timers and background workers. [puppet] - 10https://gerrit.wikimedia.org/r/886039 (owner: 10Slyngshede) [13:15:29] kostajh: I'll do it [13:16:07] that's not really up to date? scap is doing that by itself these days instead of leaving it to the jenkins build [13:16:34] yeah [13:16:39] kostajh: I can see your patch for both active branches in the deployment server [13:16:50] ok, thanks [13:17:25] (I updated the security section for How_to_deploy_code to make it a bit more readable for those who haven't done it before, if there are other things to update please change them!) [13:17:50] then, I think I'm done with this for now. I left a follow-up comment in the phab task for Security to clarify next steps [13:18:32] (03PS5) 10Slyngshede: C:IDM Add timers and background workers. [puppet] - 10https://gerrit.wikimedia.org/r/886039 [13:18:53] (03CR) 10CI reject: [V: 04-1] C:IDM Add timers and background workers. 
[puppet] - 10https://gerrit.wikimedia.org/r/886039 (owner: 10Slyngshede) [13:19:45] All mw-on-k8s deployments have been updated to latest images [13:19:58] claime: thx [13:20:07] (03PS6) 10Slyngshede: C:IDM Add timers and background workers. [puppet] - 10https://gerrit.wikimedia.org/r/886039 [13:20:28] (03CR) 10jenkins-bot: C:IDM Add timers and background workers. [puppet] - 10https://gerrit.wikimedia.org/r/886039 (owner: 10Slyngshede) [13:23:06] (03PS7) 10Slyngshede: C:IDM Add timers and background workers. [puppet] - 10https://gerrit.wikimedia.org/r/886039 [13:23:27] (03CR) 10CI reject: [V: 04-1] C:IDM Add timers and background workers. [puppet] - 10https://gerrit.wikimedia.org/r/886039 (owner: 10Slyngshede) [13:24:34] claime: any ideas how to debug zotero fail? Because it works locally and builds fine, I'm not sure what the next steps for me would be. Should I open a ticket? [13:25:24] (03PS8) 10Slyngshede: C:IDM Add timers and background workers. [puppet] - 10https://gerrit.wikimedia.org/r/886039 [13:29:13] RECOVERY - haproxy failover on dbproxy1020 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [13:32:02] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/884887 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche) [13:34:41] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1076.eqiad.wmnet,service=cdn [13:34:41] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1076.eqiad.wmnet,service=ats-be [13:35:01] 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [13:38:47] (03CR) 10Muehlenhoff: [C: 03+2] Move webproxy in ulsfo to install4002 [dns] - 10https://gerrit.wikimedia.org/r/886031 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff) [13:42:31] (03PS9) 10Slyngshede: C:IDM Add timers and background workers. 
[puppet] - 10https://gerrit.wikimedia.org/r/886039 [13:42:33] (03CR) 10Muehlenhoff: [C: 03+2] Point DHCP server in ulsfo to install4002 [homer/public] - 10https://gerrit.wikimedia.org/r/885806 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff) [13:45:53] (03PS3) 10Ilias Sarantopoulos: httpbb: add tests for liftwing (prod/staging) [puppet] - 10https://gerrit.wikimedia.org/r/885990 (https://phabricator.wikimedia.org/T327787) [13:50:24] (03CR) 10Jelto: [C: 03+2] jenkins: add hieradata config for Scap3-based deployments [puppet] - 10https://gerrit.wikimedia.org/r/883913 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche) [13:55:32] (03CR) 10Muehlenhoff: [C: 03+2] Update DHCP config in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/886037 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff) [13:59:18] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10Jhancock.wm) [13:59:35] (03PS10) 10Slyngshede: C:IDM Add timers and background workers. [puppet] - 10https://gerrit.wikimedia.org/r/886039 [14:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230202T1400) [14:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230202T1400). [14:00:04] No Gerrit patches in the queue for this window AFAICS. 
[14:01:05] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39370/console" [puppet] - 10https://gerrit.wikimedia.org/r/886039 (owner: 10Slyngshede) [14:01:55] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39371/console" [puppet] - 10https://gerrit.wikimedia.org/r/885990 (https://phabricator.wikimedia.org/T327787) (owner: 10Ilias Sarantopoulos) [14:02:40] (03CR) 10Elukey: [V: 03+1 C: 03+2] httpbb: add tests for liftwing (prod/staging) [puppet] - 10https://gerrit.wikimedia.org/r/885990 (https://phabricator.wikimedia.org/T327787) (owner: 10Ilias Sarantopoulos) [14:02:42] (03PS1) 10Muehlenhoff: Remove installserver role from install4001 [puppet] - 10https://gerrit.wikimedia.org/r/886049 (https://phabricator.wikimedia.org/T327867) [14:06:01] (03CR) 10Muehlenhoff: [C: 03+2] Remove installserver role from install4001 [puppet] - 10https://gerrit.wikimedia.org/r/886049 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff) [14:08:22] (03PS1) 10Elukey: profile::httpbb: fix liftwing paths [puppet] - 10https://gerrit.wikimedia.org/r/886050 [14:10:35] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39372/console" [puppet] - 10https://gerrit.wikimedia.org/r/886050 (owner: 10Elukey) [14:10:40] (03CR) 10Jelto: [C: 03+2] jenkins: use Scap3 deployment for releases instances [puppet] - 10https://gerrit.wikimedia.org/r/884887 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche) [14:10:47] (03CR) 10Herron: [C: 03+1] rsyslog: Add centrallog1002 as eqiad TLS rsyslog destination [puppet] - 10https://gerrit.wikimedia.org/r/882761 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse) [14:11:20] (03CR) 10Ilias Sarantopoulos: [C: 03+1] "my bad :)" [puppet] - 10https://gerrit.wikimedia.org/r/886050 (owner: 10Elukey) [14:12:12] 10SRE, 
10Infrastructure-Foundations, 10Patch-For-Review: Migrate the install servers to Bullseye - https://phabricator.wikimedia.org/T327867 (10MoritzMuehlenhoff) [14:13:02] (03CR) 10Elukey: [V: 03+1 C: 03+2] profile::httpbb: fix liftwing paths [puppet] - 10https://gerrit.wikimedia.org/r/886050 (owner: 10Elukey) [14:13:57] (03CR) 10Volans: "I did a very quick first pass" [cookbooks] - 10https://gerrit.wikimedia.org/r/886038 (owner: 10Giuseppe Lavagetto) [14:15:34] (03PS1) 10Muehlenhoff: Point DHCP server in eqsin to install5002 [homer/public] - 10https://gerrit.wikimedia.org/r/886053 (https://phabricator.wikimedia.org/T327867) [14:21:44] (03CR) 10Filippo Giunchedi: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/881839 (https://phabricator.wikimedia.org/T320702) (owner: 10Filippo Giunchedi) [14:24:50] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host aqs2005.codfw.wmnet [14:25:42] !log installing containerd security updates on codfw k8s nodes [14:25:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:42] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10Papaul) a:03Jhancock.wm [14:29:13] !log cgoubert@deploy1002 helmfile [staging] START helmfile.d/services/zotero: apply [14:31:12] 10SRE, 10Discovery-Search: Revise elastic/open search and its /run + tmpfiles creation - https://phabricator.wikimedia.org/T328674 (10fgiunchedi) [14:31:58] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aqs2005.codfw.wmnet [14:34:33] 10SRE, 10Discovery-Search: Revise elastic/open search and its /run + tmpfiles creation - https://phabricator.wikimedia.org/T328674 (10fgiunchedi) [14:37:35] 10SRE, 10DNS, 10Infrastructure-Foundations, 10Mail, and 3 others: Add SPF records for gitlab.wikimedia.org - https://phabricator.wikimedia.org/T328642 (10eoghan) p:05Triage→03Medium a:03eoghan [14:37:43] 
(03PS1) 10Filippo Giunchedi: elasticsearch: move to /run [puppet] - 10https://gerrit.wikimedia.org/r/886055 (https://phabricator.wikimedia.org/T328674) [14:39:16] !log cgoubert@deploy1002 helmfile [staging] DONE helmfile.d/services/zotero: apply [14:42:05] (03PS1) 10Filippo Giunchedi: elasticsearch: service depends on tmpfile [puppet] - 10https://gerrit.wikimedia.org/r/886059 (https://phabricator.wikimedia.org/T328674) [14:42:23] (03PS34) 10Vgutierrez: Varnish analytics: support differential privacy [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [14:43:41] (03PS1) 10Volans: CHANGELOG: add changelogs for release v1.2.1 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/886061 [14:43:52] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 2 others: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544 (10cmooney) >>! In T316544#8575464, @Andrew wrote: > We have a ton of rebalancing to do for each of these switches.... [14:45:43] !log cgoubert@deploy1002 helmfile [staging] START helmfile.d/services/zotero: apply [14:47:46] (03CR) 10Phedenskog: [C: 04-1] "Hi Filippo, I don't have the privileges to abandon the patch, when you have time could you please do it for me? This is something we will " [puppet] - 10https://gerrit.wikimedia.org/r/633202 (https://phabricator.wikimedia.org/T262962) (owner: 10Dave Pifke) [14:49:27] mvolz: Your container goes into crashloopbackoff [14:49:33] (03CR) 10Filippo Giunchedi: "Hi Peter, for sure! 
Easy enough" [puppet] - 10https://gerrit.wikimedia.org/r/633202 (https://phabricator.wikimedia.org/T262962) (owner: 10Dave Pifke) [14:49:37] (03PS35) 10Vgutierrez: varnish: support differential privacy [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [14:49:39] (03Abandoned) 10Filippo Giunchedi: [WIP] Start puppetizing WebPageTest [puppet] - 10https://gerrit.wikimedia.org/r/633202 (https://phabricator.wikimedia.org/T262962) (owner: 10Dave Pifke) [14:50:16] mvolz: https://phabricator.wikimedia.org/P43581 [14:51:19] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39374/console" [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [14:51:51] (03CR) 10Vgutierrez: "varnish tests are happy as well:" [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [14:54:27] (03PS1) 10Ilias Sarantopoulos: profile::httpbb: fix liftwing hosts [puppet] - 10https://gerrit.wikimedia.org/r/886063 (https://phabricator.wikimedia.org/T327787) [14:54:37] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v1.2.1 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/886061 (owner: 10Volans) [14:55:04] claime: thanks! 
[14:55:27] mvolz: np :) [14:55:46] !log cgoubert@deploy1002 helmfile [staging] DONE helmfile.d/services/zotero: apply [14:59:13] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v1.2.1 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/886061 (owner: 10Volans) [14:59:26] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host aqs2006.codfw.wmnet [14:59:59] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [15:00:55] !log rolling restart of varnish in cache::text - T315676 [15:00:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:59] T315676: Add DP cookie for pageview filtering - https://phabricator.wikimedia.org/T315676 [15:01:52] (03CR) 10Elukey: [C: 03+2] profile::httpbb: fix liftwing hosts [puppet] - 10https://gerrit.wikimedia.org/r/886063 (https://phabricator.wikimedia.org/T327787) (owner: 10Ilias Sarantopoulos) [15:02:23] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast3004 was renamed as ganeti4004 - jmm@cumin2002" [15:02:42] 10SRE, 10Infrastructure-Foundations: Repurpose bast3004 as ganeti node - https://phabricator.wikimedia.org/T325361 (10MoritzMuehlenhoff) [15:03:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast3004 was renamed as ganeti4004 - jmm@cumin2002" [15:03:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:03:54] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/886010 (owner: 10Slyngshede) [15:05:46] (03CR) 10Ayounsi: [C: 03+1] Point DHCP server in eqsin to install5002 [homer/public] - 10https://gerrit.wikimedia.org/r/886053 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff) [15:06:08] (03PS1) 10Milimetric: Bump up mediawiki_history_snapshot to 2023-01 [puppet] - 
10https://gerrit.wikimedia.org/r/886065 [15:06:56] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aqs2006.codfw.wmnet [15:07:02] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10MoritzMuehlenhoff) >>! In T321309#8581111, @ssingh wrote: > Steps to follow for manual upgrade of the iDRAC firmwares for the cp hosts in eqiad for us and in case someone else stumbles on th... [15:12:13] 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T321719 (10Jclark-ctr) Looks like all of those are the second connection for these servers Racking ticket T313983 cloudvirt1054 E4 U29 Port 36/37 Cableid 20220045 / 20220041 cloudvirt1055 E4 U30 ort 38/39 Cableid 20220... [15:17:22] !log jmm@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti3004 [15:17:38] (03Abandoned) 10Milimetric: Bump up mediawiki_history_snapshot to 2023-01 [puppet] - 10https://gerrit.wikimedia.org/r/886065 (owner: 10Milimetric) [15:20:22] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) >>! In T321309#8582443, @MoritzMuehlenhoff wrote: >>>! In T321309#8581111, @ssingh wrote: >> Steps to follow for manual upgrade of the iDRAC firmwares for the cp hosts in eqiad for u... 
[15:21:19] 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T321719 (10Papaul) @Jclark-ctr thank you I will fix it in Netbox [15:24:39] !log jmm@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host ganeti3004 [15:25:08] (03PS1) 10Ssingh: Release 3.8.0-1~wmf2 [debs/gdnsd] - 10https://gerrit.wikimedia.org/r/886068 (https://phabricator.wikimedia.org/T321309) [15:27:33] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: RAID controller battery for an-worker1087.eqiad.wmnet - https://phabricator.wikimedia.org/T328119 (10Jclark-ctr) RAID controller battery for an-worker1087 Replaced @BTullis [15:30:08] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 46 probes of 794 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:30:09] (03CR) 10Clément Goubert: [V: 03+1] "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/886069 (https://phabricator.wikimedia.org/T327663) (owner: 10Clément Goubert) [15:30:33] (03PS2) 10Bking: [WIP] wdqs: switch to using NFS for dump files [cookbooks] - 10https://gerrit.wikimedia.org/r/868465 (owner: 10Ryan Kemper) [15:30:46] (03CR) 10CI reject: [V: 04-1] [WIP] wdqs: switch to using NFS for dump files [cookbooks] - 10https://gerrit.wikimedia.org/r/868465 (owner: 10Ryan Kemper) [15:33:15] (03PS1) 10Volans: Upstream release v1.2.1 [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/886074 [15:33:32] !log jnuche@deploy1002 Started deploy [releng/jenkins-deploy@e38efa6] (releasing): (no justification provided) [15:34:44] !log dzahn@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab Replica gitlab2002 to 15.7.6-ce.0 [15:35:29] !log aokoth@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Security Release [15:35:47] !log aokoth@cumin1001 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab2002.wikimedia.org with reason: Security Release [15:35:54] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 4 probes of 794 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:37:52] !log aokoth@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Security Release [15:37:54] (03CR) 10Volans: [C: 03+2] Upstream release v1.2.1 [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/886074 (owner: 10Volans) [15:38:10] !log aokoth@cumin1001 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab2002.wikimedia.org with reason: Security Release [15:40:34] !log jnuche@deploy1002 Finished deploy [releng/jenkins-deploy@e38efa6] (releasing): (no 
justification provided) (duration: 07m 01s) [15:41:58] (03Merged) 10jenkins-bot: Upstream release v1.2.1 [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/886074 (owner: 10Volans) [15:42:03] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T327373 (10Jclark-ctr) Rebalanced pdu ports will monitor for a little bit before closing ticket [15:43:44] (03PS1) 10Mvolz: Revert "Update zotero to 2023-02-01-144124-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/885945 [15:43:59] (03CR) 10Mvolz: [C: 03+2] Revert "Update zotero to 2023-02-01-144124-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/885945 (owner: 10Mvolz) [15:47:25] (03PS11) 10Slyngshede: C:IDM Add timers and background workers. [puppet] - 10https://gerrit.wikimedia.org/r/886039 [15:48:48] (03Merged) 10jenkins-bot: Revert "Update zotero to 2023-02-01-144124-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/885945 (owner: 10Mvolz) [15:52:41] (03PS1) 10Muehlenhoff: Add bookworm to pbuilder setup [puppet] - 10https://gerrit.wikimedia.org/r/886078 [15:53:35] (03PS1) 10Hnowlan: Revert "changeprop: remove remaining blocklist entries" [deployment-charts] - 10https://gerrit.wikimedia.org/r/886086 [15:53:43] (03PS2) 10Hnowlan: Revert "changeprop: remove remaining blocklist entries" [deployment-charts] - 10https://gerrit.wikimedia.org/r/886086 [15:54:49] (03CR) 10Volans: [C: 03+1] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/886078 (owner: 10Muehlenhoff) [15:55:12] (03CR) 10Muehlenhoff: [C: 03+2] Add bookworm to pbuilder setup [puppet] - 10https://gerrit.wikimedia.org/r/886078 (owner: 10Muehlenhoff) [15:59:04] (ProbeDown) firing: Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:59:46] claime: I'm an idiot -> https://gerrit.wikimedia.org/r/c/mediawiki/services/zotero/+/886080/1/config/production.json thanks for your help again [16:00:03] jouncebot: now [16:00:03] No deployments scheduled for the next 0 hour(s) and 59 minute(s) [16:00:31] (03CR) 10Ssingh: "recheck" [debs/gdnsd] - 10https://gerrit.wikimedia.org/r/886068 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [16:00:43] (03CR) 10CI reject: [V: 04-1] Release 3.8.0-1~wmf2 [debs/gdnsd] - 10https://gerrit.wikimedia.org/r/886068 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [16:01:11] mvolz: I have a list of 10 "fix json" commits in a rsyslog config somewhere, so we're on the same idiot-level then :p [16:01:21] (03PS12) 10Slyngshede: C:IDM Add timers and background workers. [puppet] - 10https://gerrit.wikimedia.org/r/886039 [16:01:29] hehe [16:02:54] (03PS2) 10Ssingh: Release 3.8.0-1~wmf2 [debs/gdnsd] - 10https://gerrit.wikimedia.org/r/886068 (https://phabricator.wikimedia.org/T321309) [16:03:02] mvolz: I've added a quick troubleshooting 101 to the kubernetes docs https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting [16:03:13] It should be helpful in the future [16:03:59] (03PS13) 10Slyngshede: C:IDM Add timers and background workers. 
[puppet] - 10https://gerrit.wikimedia.org/r/886039 [16:04:04] (ProbeDown) resolved: Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:04:27] 🎉 [16:06:43] (03CR) 10Ssingh: "Ready for review but please note: I modified the gbp.conf as in Debian proper, to better suit our environment, so please check!" [debs/gdnsd] - 10https://gerrit.wikimedia.org/r/886068 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [16:07:09] (03PS1) 10Mvolz: Update zotero to 2023-02-02-155709-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/886083 [16:07:52] (03CR) 10Mvolz: [C: 03+2] Update zotero to 2023-02-02-155709-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/886083 (owner: 10Mvolz) [16:10:32] !log dzahn@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab Replica gitlab2002 to 15.7.6-ce.0 [16:10:34] !log uploaded python3-wmflib_1.2.1 to apt.wikimedia.org buster-wikimedia,bullseye-wikimedia [16:10:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:22] (03Merged) 10jenkins-bot: Update zotero to 2023-02-02-155709-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/886083 (owner: 10Mvolz) [16:15:36] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/zotero: apply [16:16:11] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/zotero: apply [16:16:46] !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/zotero: apply [16:17:26] !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/zotero: apply [16:17:34] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host aqs2007.codfw.wmnet [16:17:46] !log 
mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/zotero: apply [16:18:27] !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/zotero: apply [16:25:21] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aqs2007.codfw.wmnet [16:29:06] 10SRE, 10Infrastructure-Foundations: Repurpose bast3004 as ganeti node - https://phabricator.wikimedia.org/T325361 (10MoritzMuehlenhoff) [16:32:22] (03CR) 10Elukey: [C: 03+1] Revert "changeprop: remove remaining blocklist entries" [deployment-charts] - 10https://gerrit.wikimedia.org/r/886086 (owner: 10Hnowlan) [16:35:46] (03PS3) 10Hnowlan: Revert "changeprop: remove remaining blocklist entries" [deployment-charts] - 10https://gerrit.wikimedia.org/r/886086 [16:36:41] (03PS2) 10DCausse: [WIP] rdf-streaming-updater: add a test job using the k8s operator... [deployment-charts] - 10https://gerrit.wikimedia.org/r/886005 (https://phabricator.wikimedia.org/T328675) [16:39:53] 10SRE, 10API Platform, 10GrowthExperiments-ImpactModule, 10Growth-Team (Current Sprint), 10MW-1.40-notes (1.40.0-wmf.21; 2023-01-30): UserImpact: Fetch information for more articles when calculating most-viewed-articles data ponit - https://phabricator.wikimedia.org/T324675 (10Aklapper) Would it be worth... 
[16:42:13] (03PS1) 10Mvolz: Update zotero to node 14 [deployment-charts] - 10https://gerrit.wikimedia.org/r/886107 [16:42:42] (03CR) 10Elukey: [C: 03+2] Revert "changeprop: remove remaining blocklist entries" [deployment-charts] - 10https://gerrit.wikimedia.org/r/886086 (owner: 10Hnowlan) [16:43:04] jouncebot: now [16:43:04] No deployments scheduled for the next 0 hour(s) and 16 minute(s) [16:43:59] (03CR) 10Mvolz: [C: 03+2] Update zotero to node 14 [deployment-charts] - 10https://gerrit.wikimedia.org/r/886107 (owner: 10Mvolz) [16:46:45] !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop: sync [16:47:16] !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync [16:48:11] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/zotero: apply [16:48:13] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/zotero: apply [16:49:40] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/zotero: apply [16:50:07] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/zotero: apply [16:50:17] !log dancy@deploy1002 Installing scap version "4.34.0" for 561 hosts [16:50:39] !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/zotero: apply [16:50:45] !log dancy@deploy1002 Installation of scap version "4.34.0" completed for 561 hosts [16:51:12] (03PS1) 10Cwhite: profile: pass haproxy silent-drop logs [puppet] - 10https://gerrit.wikimedia.org/r/885477 [16:51:28] !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/zotero: apply [16:52:11] !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/zotero: apply [16:53:46] !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/zotero: apply [16:55:20] (03CR) 10Ottomata: flink-app: add preliminary H/A support (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/885832 (owner: 10DCausse) [17:00:04] jbond and rzl: OwO what's this, a deployment window?? Puppet request window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230202T1700). nyaa~ [17:00:04] No Gerrit patches in the queue for this window AFAICS. [17:08:31] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10RZamora-WMF) a:05Trizek-WMF→03None [17:08:37] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10RZamora-WMF) a:03Trizek-WMF [17:12:09] !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: sync [17:12:40] !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync [17:13:57] (03CR) 10Ottomata: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [17:18:49] (03CR) 10Giuseppe Lavagetto: [WIP] Add sre.discovery.datacenter-route (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/886038 (owner: 10Giuseppe Lavagetto) [17:19:04] (03PS2) 10Giuseppe Lavagetto: [WIP] Add sre.discovery.datacenter-route [cookbooks] - 10https://gerrit.wikimedia.org/r/886038 [17:20:45] (03CR) 10BBlack: [C: 03+1] Release 3.8.0-1~wmf2 [debs/gdnsd] - 10https://gerrit.wikimedia.org/r/886068 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [17:23:42] (03CR) 10Ottomata: [WIP] rdf-streaming-updater: add a test job using the k8s operator... 
(035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/886005 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse) [17:29:34] (03PS1) 10Nray: Enable client preferences everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886118 (https://phabricator.wikimedia.org/T327979) [17:29:58] !log aokoth@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab Production (gitlab1004) to 15.7.6-ce.0 [17:30:54] (03PS3) 10Giuseppe Lavagetto: [WIP] Add sre.discovery.datacenter-route [cookbooks] - 10https://gerrit.wikimedia.org/r/886038 [17:31:39] (03PS4) 10Giuseppe Lavagetto: Add sre.discovery.datacenter-route [cookbooks] - 10https://gerrit.wikimedia.org/r/886038 [17:32:19] (03PS1) 10Phuedx: Revert "Request high-entropy Sec-CH-UA* client hints" [puppet] - 10https://gerrit.wikimedia.org/r/886119 [17:32:45] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc1037.eqiad.wmnet with OS bullseye [17:33:12] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc2043.codfw.wmnet with OS bullseye [17:34:26] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:45:05] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1037.eqiad.wmnet with reason: host reimage [17:47:41] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1037.eqiad.wmnet with reason: host reimage [17:49:22] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2043.codfw.wmnet with reason: host reimage [17:51:46] (03CR) 10DCausse: [WIP] rdf-streaming-updater: add a test job using the k8s operator... 
(035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/886005 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse) [17:52:02] (03PS3) 10DCausse: [WIP] rdf-streaming-updater: add a test job using the k8s operator... [deployment-charts] - 10https://gerrit.wikimedia.org/r/886005 (https://phabricator.wikimedia.org/T328675) [17:52:16] (03PS1) 10BryanDavis: developer-portal: Bump container to 2023-01-30-121726-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/886122 [17:52:29] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2043.codfw.wmnet with reason: host reimage [17:53:15] (03PS4) 10DCausse: [WIP] rdf-streaming-updater: add a test job using the k8s operator... [deployment-charts] - 10https://gerrit.wikimedia.org/r/886005 (https://phabricator.wikimedia.org/T328675) [17:59:02] (03CR) 10BryanDavis: [C: 03+2] developer-portal: Bump container to 2023-01-30-121726-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/886122 (owner: 10BryanDavis) [18:00:04] bd808: Time to snap out of that daydream and deploy Technical Engagement weekly deploy (Toolhub, Developer portal, Striker). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230202T1800). 
[18:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230202T1800) [18:02:57] * bd808 twiddles thumbs waiting on the merge [18:03:20] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1037.eqiad.wmnet with OS bullseye [18:04:09] (03Merged) 10jenkins-bot: developer-portal: Bump container to 2023-01-30-121726-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/886122 (owner: 10BryanDavis) [18:05:25] !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/developer-portal: apply [18:05:44] !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [18:06:14] (03CR) 10Ottomata: [WIP] rdf-streaming-updater: add a test job using the k8s operator... (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/886005 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse) [18:06:22] !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [18:07:00] !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [18:08:24] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2043.codfw.wmnet with OS bullseye [18:08:24] !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [18:08:54] !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [18:08:58] !log aokoth@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab Production (gitlab1004) to 15.7.6-ce.0 [18:16:16] looking good, arnoldokoth :) [18:19:34] Yeah. 
:D [18:21:11] (03CR) 10Stevemunene: [V: 03+1] Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [18:26:12] (03PS1) 10Zabe: Stop writing to cuc_comment in group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886127 (https://phabricator.wikimedia.org/T233004) [18:26:40] (03PS14) 10Slyngshede: C:IDM Add timers and background workers. [puppet] - 10https://gerrit.wikimedia.org/r/886039 [18:27:44] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39379/console" [puppet] - 10https://gerrit.wikimedia.org/r/886039 (owner: 10Slyngshede) [18:28:33] (03PS1) 10Brennen Bearnes: gitlab shared runners: add dependabot-gitlab [puppet] - 10https://gerrit.wikimedia.org/r/886128 (https://phabricator.wikimedia.org/T326507) [18:33:12] (03CR) 10Zabe: [C: 03+2] Stop writing to cuc_comment in group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886127 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [18:33:23] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by zabe@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886127 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [18:34:02] (03Merged) 10jenkins-bot: Stop writing to cuc_comment in group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886127 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [18:34:26] !log zabe@deploy1002 Started scap: Backport for [[gerrit:886127|Stop writing to cuc_comment in group1 wikis (T233004)]] [18:34:29] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [18:36:23] !log zabe@deploy1002 zabe: Backport for [[gerrit:886127|Stop writing to cuc_comment in group1 wikis (T233004)]] synced to the testservers: mwdebug2002.codfw.wmnet, 
mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [18:42:45] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:886127|Stop writing to cuc_comment in group1 wikis (T233004)]] (duration: 08m 19s) [18:42:48] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [18:49:59] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/881839 (https://phabricator.wikimedia.org/T320702) (owner: 10Filippo Giunchedi) [18:51:54] (03PS1) 10Zabe: Stop writing to cuc_user and cuc_user_text everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886135 (https://phabricator.wikimedia.org/T233004) [19:00:05] dancy and brennen: Your horoscope predicts another unfortunate MediaWiki train - Utc-7 Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230202T1900). [19:00:10] o/ [19:01:26] (03PS1) 10EoghanGaffney: Add spf record for gitlab.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/886137 (https://phabricator.wikimedia.org/T328642) [19:02:10] (03PS2) 10EoghanGaffney: Add spf record for gitlab.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/886137 (https://phabricator.wikimedia.org/T328642) [19:02:15] (03CR) 10CI reject: [V: 04-1] Add spf record for gitlab.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/886137 (https://phabricator.wikimedia.org/T328642) (owner: 10EoghanGaffney) [19:02:27] (03CR) 10Andrea Denisse: [C: 03+2] rsyslog: Add centrallog1002 as eqiad TLS rsyslog destination [puppet] - 10https://gerrit.wikimedia.org/r/882761 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse) [19:02:59] (03CR) 10CI reject: [V: 04-1] Add spf record for gitlab.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/886137 (https://phabricator.wikimedia.org/T328642) (owner: 10EoghanGaffney) [19:05:33] (03PS3) 10EoghanGaffney: Add spf record for gitlab.wikimedia.org [dns] - 
10https://gerrit.wikimedia.org/r/886137 (https://phabricator.wikimedia.org/T328642) [19:08:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [19:10:01] (03PS4) 10EoghanGaffney: Add spf record for gitlab.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/886137 (https://phabricator.wikimedia.org/T328642) [19:13:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [19:19:20] (03PS2) 10Brennen Bearnes: gitlab runners: add dependabot-gitlab & elasticsearch to allowed_images [puppet] - 10https://gerrit.wikimedia.org/r/886128 (https://phabricator.wikimedia.org/T326507) [19:21:32] o/ Sorry got distracted. Rolling the train! [19:21:56] (03PS1) 10TrainBranchBot: group2 wikis to 1.40.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886142 (https://phabricator.wikimedia.org/T325584) [19:21:58] (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.40.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886142 (https://phabricator.wikimedia.org/T325584) (owner: 10TrainBranchBot) [19:22:41] (03Merged) 10jenkins-bot: group2 wikis to 1.40.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886142 (https://phabricator.wikimedia.org/T325584) (owner: 10TrainBranchBot) [19:24:03] (03CR) 10Ryan Kemper: [C: 03+1] "Looks good. ty for the clear explanation!" 
[puppet] - 10https://gerrit.wikimedia.org/r/886055 (https://phabricator.wikimedia.org/T328674) (owner: 10Filippo Giunchedi) [19:27:33] (03PS2) 10Jcrespo: Add unit tests & coverage report [software/mediabackups] - 10https://gerrit.wikimedia.org/r/885428 [19:28:29] !log zabe@deploy1002 say aborted: (duration: 00m 03s) [19:30:26] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.40.0-wmf.21 refs T325584 [19:30:29] T325584: 1.40.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T325584 [19:48:28] (03PS2) 10Ryan Kemper: elasticsearch: service depends on tmpfile [puppet] - 10https://gerrit.wikimedia.org/r/886059 (https://phabricator.wikimedia.org/T328674) (owner: 10Filippo Giunchedi) [19:48:36] (03PS3) 10Ryan Kemper: elasticsearch: service depends on tmpfile [puppet] - 10https://gerrit.wikimedia.org/r/886059 (https://phabricator.wikimedia.org/T328674) (owner: 10Filippo Giunchedi) [19:49:07] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/886059 (https://phabricator.wikimedia.org/T328674) (owner: 10Filippo Giunchedi) [19:49:21] (03CR) 10Bking: [C: 03+1] elasticsearch: move to /run [puppet] - 10https://gerrit.wikimedia.org/r/886055 (https://phabricator.wikimedia.org/T328674) (owner: 10Filippo Giunchedi) [19:49:39] (03CR) 10Bking: [C: 03+1] elasticsearch: service depends on tmpfile [puppet] - 10https://gerrit.wikimedia.org/r/886059 (https://phabricator.wikimedia.org/T328674) (owner: 10Filippo Giunchedi) [19:49:54] (03CR) 10Ryan Kemper: [C: 03+2] elasticsearch: move to /run [puppet] - 10https://gerrit.wikimedia.org/r/886055 (https://phabricator.wikimedia.org/T328674) (owner: 10Filippo Giunchedi) [19:52:14] 10SRE, 10Infrastructure-Foundations, 10Traffic-Icebox, 10netbox: Make Netbox Active/Active - https://phabricator.wikimedia.org/T234997 (10BCornwall) [19:54:05] !log T328674 [Elastic] With puppet disabled on elastic* fleet, `ryankemper@elastic2037:~$ sudo run-puppet-agent --force` to 
verify changes in https://gerrit.wikimedia.org/r/886055 [19:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:08] T328674: Revise elastic/open search and its /run + tmpfiles creation - https://phabricator.wikimedia.org/T328674 [19:55:26] 10SRE, 10Infrastructure-Foundations, 10User-Elukey: Investigate janitor, maintenance emails parser - https://phabricator.wikimedia.org/T230835 (10ayounsi) New cool tool on the block: https://github.com/jasonyates/netbox-circuitmaintenance [19:55:30] !log bking@cumin1001 START - Cookbook sre.hosts.reboot-single for host elastic2037.codfw.wmnet [19:59:29] dancy, is it ok if I deploy a config patch? [19:59:37] OK w/ me [20:00:01] thanks :) [20:00:05] (03CR) 10Zabe: [C: 03+2] Stop writing to cuc_user and cuc_user_text everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886135 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [20:01:13] (03Merged) 10jenkins-bot: Stop writing to cuc_user and cuc_user_text everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886135 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [20:01:29] !log zabe@deploy1002 Started scap: Backport for [[gerrit:886135|Stop writing to cuc_user and cuc_user_text everywhere (T233004)]] [20:01:44] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [20:02:09] (03CR) 10Ryan Kemper: [C: 03+1] "PCC looks reasonable" [puppet] - 10https://gerrit.wikimedia.org/r/886059 (https://phabricator.wikimedia.org/T328674) (owner: 10Filippo Giunchedi) [20:02:25] (03PS4) 10Ryan Kemper: elasticsearch: service depends on tmpfile [puppet] - 10https://gerrit.wikimedia.org/r/886059 (https://phabricator.wikimedia.org/T328674) (owner: 10Filippo Giunchedi) [20:02:37] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] elasticsearch: service depends on tmpfile [puppet] - 10https://gerrit.wikimedia.org/r/886059 (https://phabricator.wikimedia.org/T328674) (owner: 10Filippo Giunchedi) 
[20:02:59] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host elastic2037.codfw.wmnet [20:03:18] !log zabe@deploy1002 zabe: Backport for [[gerrit:886135|Stop writing to cuc_user and cuc_user_text everywhere (T233004)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [20:04:15] (03PS1) 10RLazarus: Release v0.0.3. [software/httpbb] - 10https://gerrit.wikimedia.org/r/886148 (https://phabricator.wikimedia.org/T328280) [20:06:18] (03CR) 10RLazarus: [C: 03+2] Release v0.0.3. [software/httpbb] - 10https://gerrit.wikimedia.org/r/886148 (https://phabricator.wikimedia.org/T328280) (owner: 10RLazarus) [20:11:09] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:886135|Stop writing to cuc_user and cuc_user_text everywhere (T233004)]] (duration: 09m 39s) [20:11:11] 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T321719 (10Papaul) 05Open→03Resolved Disable all the second interfaces after talking with @Andrew on IRC ` papaul: sorry, was in a meeting. We are trying to transition to a single-NIC connection... 
[20:11:12] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [20:12:40] (03PS1) 10Zabe: Stop writing to cuc_comment everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886149 (https://phabricator.wikimedia.org/T233004) [20:21:03] !log rzl@apt1001:~$ sudo -i reprepro -C main include buster-wikimedia ${HOME}/httpbb/buster/httpbb_${VERSION?}-1_amd64.changes # T328280 [20:21:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:06] T328280: httpbb with HTTP POSTs and json payload - https://phabricator.wikimedia.org/T328280 [20:21:31] ^ that version should read 0.0.3-1, will edit the SAL [20:23:03] !log rzl@apt1001:~$ sudo -i reprepro -C main include bullseye-wikimedia /home/rzl/httpbb/bullseye/httpbb_0.0.3-1+deb11u1_amd64.changes # T328280 [20:23:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:25] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1077.eqiad.wmnet with OS bullseye [20:28:35] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp1077.eqiad.wmnet with OS bullseye [20:28:58] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1078.eqiad.wmnet with OS bullseye [20:29:07] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp1078.eqiad.wmnet with OS bullseye [20:30:42] 10SRE, 10vm-requests: eqiad: 1 VMs requested for airflow on behalf of the Search Platform Team - https://phabricator.wikimedia.org/T328702 (10bking) [20:33:50] 10SRE-tools, 10Infrastructure-Foundations, 10Machine-Learning-Team: httpbb with HTTP POSTs and json payload - https://phabricator.wikimedia.org/T328280 
(10RLazarus) 05Open→03Resolved This is deployed! Thanks again for the patch, let me know if you need anything else. [20:49:52] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1077.eqiad.wmnet with reason: host reimage [20:52:39] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1077.eqiad.wmnet with reason: host reimage [20:59:06] (03PS2) 10Dreamy Jazz: Disable write old for CheckUserLog reason everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885359 (https://phabricator.wikimedia.org/T233004) [20:59:08] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1078.eqiad.wmnet with OS bullseye [20:59:17] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp1078.eqiad.wmnet with OS bullseye executed with errors: - cp1078 (**FAIL**) - Downtimed on Ic... [20:59:20] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1078.eqiad.wmnet with OS bullseye [20:59:29] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp1078.eqiad.wmnet with OS bullseye [21:00:04] brennen and TheresNoTime: Your horoscope predicts another unfortunate UTC late backport and config training deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230202T2100). [21:00:04] Dreamy_Jazz and nray: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[21:00:08] \o [21:00:12] o/ [21:01:02] o/ [21:02:57] 10SRE, 10Traffic-Icebox, 10Sustainability (Incident Followup): Investigate varnishd child crashes when multiple nodes get depooled/pooled concurrently - https://phabricator.wikimedia.org/T154801 (10BCornwall) @Vgutierrez and @BBlack: Is this still an issue? 6 years is a long time. :) [21:03:05] Dreamy_Jazz: starting with yours [21:03:13] Thanks. I can self test. [21:04:15] But will need someone to inspect the DB row for the check I make [21:04:56] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885359 (https://phabricator.wikimedia.org/T233004) (owner: 10Dreamy Jazz) [21:05:48] (03Merged) 10jenkins-bot: Disable write old for CheckUserLog reason everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885359 (https://phabricator.wikimedia.org/T233004) (owner: 10Dreamy Jazz) [21:06:04] !log brennen@deploy1002 Started scap: Backport for [[gerrit:885359|Disable write old for CheckUserLog reason everywhere (T233004)]] [21:06:07] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [21:07:48] !log brennen@deploy1002 brennen and dreamyjazz: Backport for [[gerrit:885359|Disable write old for CheckUserLog reason everywhere (T233004)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [21:08:13] Dreamy_Jazz: on test servers. i can query, but you'll have to tell me _what_ to query. 
:) [21:08:53] The SQL query that needs to be run on enwiki when I say is "SELECT cul_reason FROM `cu_log` JOIN `actor` `cu_log_actor` ON ((actor_id = cul_actor)) WHERE actor_name = 'Dreamy Jazz' ORDER BY cul_timestamp DESC LIMIT 1" [21:09:02] But let me check first [21:09:04] ack, thx [21:09:28] 10SRE, 10DNS, 10Traffic-Icebox: DNS domains registered to WMF no longer redirecting - https://phabricator.wikimedia.org/T146619 (10BCornwall) 05Open→03Resolved a:03BCornwall Thanks for bringing this ticket to our attention, Nick! It's been quite a long time since you brought this up. As the actionable... [21:10:41] 10SRE, 10vm-requests: eqiad: 1 VMs requested for airflow on behalf of the Search Platform Team - https://phabricator.wikimedia.org/T328702 (10EBernhardson) This will be replacing an-airflow1001 which is also a ganeti VM, but it may take a month or two after provisioning for the old instance to be shut down. [21:10:46] Okay. Ran a check on mwdebug1001 [21:10:49] Please run that query [21:11:02] If everything is as expected cul_reason should be the empty string [21:12:10] Dreamy_Jazz: yep, as expected. proceeding with sync. [21:12:20] Thanks :) Good news. [21:16:30] (03PS2) 10Brennen Bearnes: Enable client preferences everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886118 (https://phabricator.wikimedia.org/T327979) (owner: 10Nray) [21:16:52] nray: you're up next, soon as this sync finishes. [21:17:03] @brennen sounds good! 
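Editorial note on the check discussed between 21:08:53 and 21:12:10: the idea is to read the most recent `cu_log` row for one actor and confirm that `cul_reason` is the empty string once old-column writes are disabled. The sketch below reproduces that check against a toy in-memory SQLite database. The two-table schema, the sample rows, and the helper names `build_toy_db` and `latest_cul_reason` are assumptions made for illustration and are not the production CheckUser schema or tooling; the query is adapted from the one quoted in the log, with the actor name bound as a parameter.

```python
import sqlite3

# Adapted from the query quoted in the log; the actor name is a bound
# parameter here instead of a literal.
QUERY = """
SELECT cul_reason
FROM cu_log
JOIN actor AS cu_log_actor ON actor_id = cul_actor
WHERE actor_name = ?
ORDER BY cul_timestamp DESC
LIMIT 1
"""


def build_toy_db():
    """Create a hypothetical, heavily simplified schema with sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE actor (actor_id INTEGER PRIMARY KEY, actor_name TEXT);
        CREATE TABLE cu_log (
            cul_id INTEGER PRIMARY KEY,
            cul_timestamp TEXT,   -- 14-character timestamp string
            cul_actor INTEGER,
            cul_reason TEXT
        );
        INSERT INTO actor VALUES (1, 'Dreamy Jazz');
        -- Older row written before the config change: reason still populated.
        INSERT INTO cu_log VALUES (1, '20230201120000', 1, 'old style reason');
        -- Newer row written after the change: old column left empty.
        INSERT INTO cu_log VALUES (2, '20230202211000', 1, '');
        """
    )
    return conn


def latest_cul_reason(conn, actor_name):
    """Return cul_reason of the newest log row for actor_name, or None."""
    row = conn.execute(QUERY, (actor_name,)).fetchone()
    return None if row is None else row[0]


if __name__ == "__main__":
    conn = build_toy_db()
    reason = latest_cul_reason(conn, "Dreamy Jazz")
    print("as expected" if reason == "" else f"unexpected value: {reason!r}")
```

Ordering by `cul_timestamp DESC` with `LIMIT 1` is what makes the check look only at the row created by the test action, so older rows that still carry a reason do not affect the result.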
[21:18:07] !log brennen@deploy1002 Finished scap: Backport for [[gerrit:885359|Disable write old for CheckUserLog reason everywhere (T233004)]] (duration: 12m 02s) [21:18:10] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [21:18:40] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886118 (https://phabricator.wikimedia.org/T327979) (owner: 10Nray) [21:19:24] (03Merged) 10jenkins-bot: Enable client preferences everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886118 (https://phabricator.wikimedia.org/T327979) (owner: 10Nray) [21:19:38] !log brennen@deploy1002 Started scap: Backport for [[gerrit:886118|Enable client preferences everywhere (T327979)]] [21:19:41] T327979: Enable persistent fixed width setting for anonymous users - https://phabricator.wikimedia.org/T327979 [21:20:31] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1078.eqiad.wmnet with reason: host reimage [21:21:23] !log brennen@deploy1002 brennen and nray: Backport for [[gerrit:886118|Enable client preferences everywhere (T327979)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [21:21:52] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1077.eqiad.wmnet with OS bullseye [21:22:01] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp1077.eqiad.wmnet with OS bullseye completed: - cp1077 (**PASS**) - Downtimed on Icinga/Alertm... 
[21:22:18] !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=cp1077.eqiad.wmnet [21:22:38] thank you @brennen , I'm checking now [21:22:38] nray: await your test [21:22:44] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [21:22:45] cool cool [21:22:49] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1079.eqiad.wmnet with OS bullseye [21:22:58] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp1079.eqiad.wmnet with OS bullseye [21:23:58] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1078.eqiad.wmnet with reason: host reimage [21:24:55] @brennen things look good, you can proceed! [21:25:01] ack, going ahead. [21:30:53] !log brennen@deploy1002 Finished scap: Backport for [[gerrit:886118|Enable client preferences everywhere (T327979)]] (duration: 11m 14s) [21:30:56] T327979: Enable persistent fixed width setting for anonymous users - https://phabricator.wikimedia.org/T327979 [21:30:59] !log end of utc late backport & config window [21:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:14] thanks for your help @brennen ! 
[21:31:20] sure thing [21:44:20] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1079.eqiad.wmnet with reason: host reimage [21:47:29] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1079.eqiad.wmnet with reason: host reimage [21:47:44] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1078.eqiad.wmnet with OS bullseye [21:47:53] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp1078.eqiad.wmnet with OS bullseye completed: - cp1078 (**WARN**) - Removed from Puppet and Pu... [21:49:34] (03PS2) 10Zabe: Stop writing to cuc_comment everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886149 (https://phabricator.wikimedia.org/T233004) [21:49:41] (03CR) 10Zabe: [C: 03+2] Stop writing to cuc_comment everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886149 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [21:49:57] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by zabe@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886149 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [21:50:25] (03Merged) 10jenkins-bot: Stop writing to cuc_comment everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886149 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [21:50:39] !log zabe@deploy1002 Started scap: Backport for [[gerrit:886149|Stop writing to cuc_comment everywhere (T233004)]] [21:50:42] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [21:52:26] !log zabe@deploy1002 zabe: Backport for [[gerrit:886149|Stop writing to cuc_comment everywhere (T233004)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, 
mwdebug2001.codfw.wmnet [21:58:38] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:886149|Stop writing to cuc_comment everywhere (T233004)]] (duration: 07m 58s) [21:58:41] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [22:00:51] !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=cp1078.eqiad.wmnet [22:01:18] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [22:01:48] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1080.eqiad.wmnet with OS bullseye [22:01:58] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp1080.eqiad.wmnet with OS bullseye [22:12:13] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1079.eqiad.wmnet with OS bullseye [22:12:22] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp1079.eqiad.wmnet with OS bullseye completed: - cp1079 (**PASS**) - Downtimed on Icinga/Alertm... 
[22:15:50] !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=cp1079.eqiad.wmnet [22:16:18] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall) [22:58:20] !log brett@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp1080.eqiad.wmnet with OS bullseye [22:58:31] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp1080.eqiad.wmnet with OS bullseye executed with errors: - cp1080 (**FAIL**) - Downtimed on Ic... [23:09:12] (03PS1) 10Bking: wdqs/data-reload.py: validate dump date (WIP) [cookbooks] - 10https://gerrit.wikimedia.org/r/886173 (https://phabricator.wikimedia.org/T325114) [23:10:13] (03Abandoned) 10Bking: [WIP] wdqs: switch to using NFS for dump files [cookbooks] - 10https://gerrit.wikimedia.org/r/868465 (owner: 10Ryan Kemper) [23:11:03] (03CR) 10CI reject: [V: 04-1] wdqs/data-reload.py: validate dump date (WIP) [cookbooks] - 10https://gerrit.wikimedia.org/r/886173 (https://phabricator.wikimedia.org/T325114) (owner: 10Bking) [23:12:38] (03PS2) 10Bking: wdqs/data-reload.py: validate dump date (WIP) [cookbooks] - 10https://gerrit.wikimedia.org/r/886173 (https://phabricator.wikimedia.org/T325114) [23:14:23] (03CR) 10CI reject: [V: 04-1] wdqs/data-reload.py: validate dump date (WIP) [cookbooks] - 10https://gerrit.wikimedia.org/r/886173 (https://phabricator.wikimedia.org/T325114) (owner: 10Bking) [23:15:52] (03CR) 10Herron: opensearch: reverse-proxy access to opensearch API (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/881839 (https://phabricator.wikimedia.org/T320702) (owner: 10Filippo Giunchedi) [23:30:45] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state