[00:00:19] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:04:14] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5022.eqsin.wmnet with OS bullseye
[00:04:20] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp5022.eqsin.wmnet with OS bullseye completed: - cp5022 (**PASS**)   - Removed from Puppet and PuppetDB if present   -...
[00:04:45] <icinga-wm>	 PROBLEM - Check systemd state on grafana1002 is CRITICAL: CRITICAL - degraded: The following units failed: grafana-ldap-users-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:33] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:06:10] <logmsgbot>	 !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=cp5022.eqsin.wmnet
[00:06:43] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall)
[00:25:04] <wikibugs>	 10SRE, 10ops-codfw: (Need By:TBD) rack/setup/install rack A1 and A8 new PDUs - https://phabricator.wikimedia.org/T327404 (10Papaul) We are postponing the PDU maintenance once again to a new date. We will update the task once we have the new date and time.   Thank you
[00:25:06] <wikibugs>	 10SRE, 10ops-codfw: (Need By:TBD) rack/setup/install rack A1 and A8 new PDUs  - https://phabricator.wikimedia.org/T327404 (10Papaul)
[00:44:42] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5023.eqsin.wmnet with OS bullseye
[00:44:50] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp5023.eqsin.wmnet with OS bullseye
[00:49:04] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:05:15] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) Using a slight modification of @jbond's script in T328593, the list of cp nodes in eqiad with the outdated firmware (`3.15.17.15`) is basically all the cp nodes in eqiad:  ` cp1076.eqiad.wmnet cp1077.eqiad...
[01:07:49] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp1075.eqiad.wmnet with OS bullseye
[01:07:54] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp1075.eqiad.wmnet with OS bullseye
[01:18:28] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5023.eqsin.wmnet with reason: host reimage
[01:21:54] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5023.eqsin.wmnet with reason: host reimage
[01:24:23] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1075.eqiad.wmnet with reason: host reimage
[01:27:33] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1075.eqiad.wmnet with reason: host reimage
[01:45:28] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:20] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1075.eqiad.wmnet with OS bullseye
[01:49:26] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp1075.eqiad.wmnet with OS bullseye completed: - cp1075 (**PASS**)   - Downtimed on Icinga/Alertmanager   - Disabled Pu...
[01:50:16] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:50:36] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh)
[01:50:45] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1075.eqiad.wmnet,service=cdn
[01:50:45] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1075.eqiad.wmnet,service=ats-be
[01:55:25] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5023.eqsin.wmnet with OS bullseye
[01:55:30] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp5023.eqsin.wmnet with OS bullseye completed: - cp5023 (**PASS**)   - Downtimed on Icinga/Alertmanager   - Disabled Pu...
[01:55:45] <logmsgbot>	 !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=cp5023.eqsin.wmnet
[01:56:28] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5024.eqsin.wmnet with OS bullseye
[01:56:35] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp5024.eqsin.wmnet with OS bullseye
[01:56:46] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall)
[02:00:10] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:03:14] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:07:02] <wikibugs>	 (03PS2) 10KartikMistry: Update cxserver to 2023-02-02-004918-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/882791 (https://phabricator.wikimedia.org/T129470)
[02:10:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:14:08] <logmsgbot>	 !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp5024.eqsin.wmnet with OS bullseye
[02:14:15] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp5024.eqsin.wmnet with OS bullseye executed with errors: - cp5024 (**FAIL**)   - Downtimed on Icinga/Alertmanager   -...
[02:14:35] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp5024.eqsin.wmnet with OS bullseye
[02:14:41] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp5024.eqsin.wmnet with OS bullseye
[02:15:06] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:19:16] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:20:45] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:27:37] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) Steps to follow for manual upgrade of the iDRAC firmwares for the cp hosts in eqiad for us and in case someone else stumbles on this issue.  The TL;DR is that we need to manually update the iDRAC firmware...
[02:30:12] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:32:52] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:33:36] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:34:40] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:35:22] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:39:52] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 21 Apr 2023 05:11:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:41:22] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49565 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:42:08] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.257 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:46:08] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5024.eqsin.wmnet with reason: host reimage
[02:49:15] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5024.eqsin.wmnet with reason: host reimage
[03:00:35] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:04:31] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:22:30] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5024.eqsin.wmnet with OS bullseye
[03:22:38] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp5024.eqsin.wmnet with OS bullseye completed: - cp5024 (**PASS**)   - Removed from Puppet and PuppetDB if present   -...
[03:30:13] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:35:23] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:40:05] <wikibugs>	 (03CR) 10RLazarus: [C: 03+2] "LGTM, thanks again! I'll get this deployed -- sorry I didn't get to it today." [software/httpbb] - 10https://gerrit.wikimedia.org/r/884920 (https://phabricator.wikimedia.org/T328280) (owner: 10Ilias Sarantopoulos)
[03:41:43] <wikibugs>	 (03Merged) 10jenkins-bot: feat: add json payload capability [software/httpbb] - 10https://gerrit.wikimedia.org/r/884920 (https://phabricator.wikimedia.org/T328280) (owner: 10Ilias Sarantopoulos)
[04:00:32] <logmsgbot>	 !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=cp5024.eqsin.wmnet
[04:01:13] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall)
[04:15:09] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:20:21] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:49:04] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:53:10] <wikibugs>	 10SRE, 10DBA: db2181 stopped answering ping - https://phabricator.wikimedia.org/T328623 (10Marostegui) a:03Marostegui
[05:00:13] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:05:25] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:22:01] <wikibugs>	 10SRE, 10DBA: db2181 stopped answering ping - https://phabricator.wikimedia.org/T328623 (10Marostegui) Thanks for triaging this
[05:45:17] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:50:31] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:01:03] * kart_ updating cxserver
[06:01:11] <wikibugs>	 (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2023-02-02-004918-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/882791 (https://phabricator.wikimedia.org/T129470) (owner: 10KartikMistry)
[06:01:21] <wikibugs>	 (03PS1) 10Marostegui: db2181: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/885922 (https://phabricator.wikimedia.org/T328623)
[06:01:57] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db2181: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/885922 (https://phabricator.wikimedia.org/T328623) (owner: 10Marostegui)
[06:03:21] <wikibugs>	 10ops-codfw, 10DBA, 10Patch-For-Review: db2181 stopped answering ping - https://phabricator.wikimedia.org/T328623 (10Marostegui) Looks like hardware issues - @Papaul can you please reach out to dell? ` ------------------------------------------------------------------------------- Record:      8 Date/Time:...
[06:06:30] <wikibugs>	 10ops-codfw, 10DBA, 10Patch-For-Review: db2181 stopped answering ping - https://phabricator.wikimedia.org/T328623 (10Marostegui) a:05Marostegui→03Papaul The host cannot even be powered back on.
[06:06:34] <wikibugs>	 (03Merged) 10jenkins-bot: Update cxserver to 2023-02-02-004918-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/882791 (https://phabricator.wikimedia.org/T129470) (owner: 10KartikMistry)
[06:09:08] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply
[06:09:31] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[06:12:58] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[06:12:59] <wikibugs>	 (03CR) 10Winston Sung: "Thanks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/882791 (https://phabricator.wikimedia.org/T129470) (owner: 10KartikMistry)
[06:13:52] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[06:15:38] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[06:16:30] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[06:17:08] <kart_>	 !log Updated cxserver to 2023-02-02-004918-production (T129470, T172035, T327842)
[06:17:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:14] <stashbot>	 T172035: Blockers for Wikimedia wiki domain renaming - https://phabricator.wikimedia.org/T172035
[06:17:14] <stashbot>	 T129470: CX can't load any pages from be-tarask Wikipedia - https://phabricator.wikimedia.org/T129470
[06:17:15] <stashbot>	 T327842: Post-creation work for gurwiki - https://phabricator.wikimedia.org/T327842
[06:30:33] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:35:45] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:53:07] <icinga-wm>	 PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[06:54:47] <icinga-wm>	 RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[07:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230202T0700)
[07:00:05] <jouncebot>	 kormat, marostegui, and Amir1: My dear minions, it's time we take the moon! Just kidding. Time for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230202T0700).
[07:00:07] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:05:21] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:22:01] <wikibugs>	 (03Abandoned) 10Gergő Tisza: [WIP] Update apache rules for 2.4 [puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/225553 (owner: 10Gergő Tisza)
[07:45:25] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:46:57] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1 C: 03+2] C:idm::deployment collect static files [puppet] - 10https://gerrit.wikimedia.org/r/885787 (owner: 10Slyngshede)
[07:47:22] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] Add social_auth pipeline for group creation. [software/bitu] - 10https://gerrit.wikimedia.org/r/885813 (owner: 10Slyngshede)
[07:47:24] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Add social_auth pipeline for group creation. [software/bitu] - 10https://gerrit.wikimedia.org/r/885813 (owner: 10Slyngshede)
[07:50:39] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:53:15] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] admin: add user santhosh to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/885842 (https://phabricator.wikimedia.org/T328517) (owner: 10Herron)
[07:54:58] <wikibugs>	 (03PS1) 10Gergő Tisza: campaigns: Donor landing page translations for sv, it, ja, fr, nl [extensions/GrowthExperiments] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/885928 (https://phabricator.wikimedia.org/T321370)
[07:55:17] <wikibugs>	 (03PS1) 10Gergő Tisza: campaigns: Donor landing page translations for sv, it, ja, fr, nl [extensions/GrowthExperiments] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/885929 (https://phabricator.wikimedia.org/T321370)
[07:55:29] <wikibugs>	 (03PS1) 10Muehlenhoff: Point the webproxy in esams to install3002 [dns] - 10https://gerrit.wikimedia.org/r/885982 (https://phabricator.wikimedia.org/T327867)
[07:56:44] <wikibugs>	 (03PS1) 10Muehlenhoff: Apply installserver role to install3002 [puppet] - 10https://gerrit.wikimedia.org/r/885983 (https://phabricator.wikimedia.org/T327867)
[08:00:05] <jouncebot>	 Amir1, apergos, and jnuche: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230202T0800).
[08:00:05] <jouncebot>	 Aishik and tgr: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:19] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Apply installserver role to install3002 [puppet] - 10https://gerrit.wikimedia.org/r/885983 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff)
[08:00:33] <wikibugs>	 (03PS26) 10Slyngshede: P:IDM Configure OIDC and LDAP. [puppet] - 10https://gerrit.wikimedia.org/r/884881
[08:00:35] <apergos>	 morning!  there are no trainees signed up today, but 4 patches from 2 devs are on the calendar
[08:01:15] <tgr_>	 present (none of my patches need checking though)
[08:01:20] <apergos>	 I don't see Aishik here just yet, so tgr do you want to proceed?
[08:01:24] <apergos>	 er tgr_
[08:01:40] <tgr_>	 yeah, thanks
[08:01:40] <apergos>	 and I assume you would self deploy? 
[08:01:46] <tgr_>	 I can, sure
[08:02:05] <apergos>	 all righty. I've got the logstash dashboards up and all that, go for it.
[08:02:29] <wikibugs>	 (03PS2) 10Gergő Tisza: Document the '+' pattern for specifying wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885048
[08:02:38] <wikibugs>	 (03CR) 10Gergő Tisza: [C: 03+2] Document the '+' pattern for specifying wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885048 (owner: 10Gergő Tisza)
[08:02:59] <wikibugs>	 (03CR) 10Gergő Tisza: [C: 03+2] campaigns: Donor landing page translations for sv, it, ja, fr, nl [extensions/GrowthExperiments] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/885928 (https://phabricator.wikimedia.org/T321370) (owner: 10Gergő Tisza)
[08:03:03] <wikibugs>	 (03CR) 10Gergő Tisza: [C: 03+2] campaigns: Donor landing page translations for sv, it, ja, fr, nl [extensions/GrowthExperiments] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/885929 (https://phabricator.wikimedia.org/T321370) (owner: 10Gergő Tisza)
[08:03:23] <wikibugs>	 (03Merged) 10jenkins-bot: Document the '+' pattern for specifying wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885048 (owner: 10Gergő Tisza)
[08:08:38] <wikibugs>	 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T327373 (10phaultfinder)
[08:15:09] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:17:56] <wikibugs>	 (03PS1) 10Muehlenhoff: Update DHCP config for esams [puppet] - 10https://gerrit.wikimedia.org/r/885984 (https://phabricator.wikimedia.org/T327867)
[08:20:25] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:21:05] <wikibugs>	 (03Merged) 10jenkins-bot: campaigns: Donor landing page translations for sv, it, ja, fr, nl [extensions/GrowthExperiments] (wmf/1.40.0-wmf.20) - 10https://gerrit.wikimedia.org/r/885928 (https://phabricator.wikimedia.org/T321370) (owner: 10Gergő Tisza)
[08:21:08] <wikibugs>	 (03Merged) 10jenkins-bot: campaigns: Donor landing page translations for sv, it, ja, fr, nl [extensions/GrowthExperiments] (wmf/1.40.0-wmf.21) - 10https://gerrit.wikimedia.org/r/885929 (https://phabricator.wikimedia.org/T321370) (owner: 10Gergő Tisza)
[08:21:37] <apergos>	 tgr_:   there's the merge
[08:23:27] <logmsgbot>	 !log tgr@deploy1002 Started scap: Backport for [[gerrit:885928|campaigns: Donor landing page translations for sv, it, ja, fr, nl (T321370)]], [[gerrit:885929|campaigns: Donor landing page translations for sv, it, ja, fr, nl (T321370)]]
[08:23:31] <stashbot>	 T321370: Thank You Pages: custom account creation pages for sv, it, ja, fr, nl - https://phabricator.wikimedia.org/T321370
[08:27:18] <logmsgbot>	 !log tgr@deploy1002 tgr: Backport for [[gerrit:885928|campaigns: Donor landing page translations for sv, it, ja, fr, nl (T321370)]], [[gerrit:885929|campaigns: Donor landing page translations for sv, it, ja, fr, nl (T321370)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[08:27:51] <Aishik>	 Hey, I am waiting for my turn gerrit 885927
[08:28:38] <apergos>	 Aishik:  you'll be shortly.  do you self-deploy or will you need me to deploy for you, I don't recall?
[08:29:36] <wikibugs>	 (03PS27) 10Slyngshede: P:IDM Configure OIDC and LDAP. [puppet] - 10https://gerrit.wikimedia.org/r/884881
[08:29:55] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Adding Kavitha Appakayala to icinga [puppet] - 10https://gerrit.wikimedia.org/r/885985 (https://phabricator.wikimedia.org/T327403)
[08:29:57] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:IDM Configure OIDC and LDAP. [puppet] - 10https://gerrit.wikimedia.org/r/884881 (owner: 10Slyngshede)
[08:29:58] <apergos>	 tgr_ when you are done please speak up here so the next patch owner can proceed.
[08:30:48] <Aishik>	 You have to do it, I have no experience with this
[08:32:27] <wikibugs>	 (03PS28) 10Slyngshede: P:IDM Configure OIDC and LDAP. [puppet] - 10https://gerrit.wikimedia.org/r/884881
[08:32:33] <apergos>	 no problem, I encourage you to sign up for a deployment training at some point, https://wikitech.wikimedia.org/wiki/Deployments/Training
[08:34:17] <Aishik>	 Thanks
[08:35:02] <wikibugs>	 (03PS4) 10Aishik Rehman: Enable wgMinervaEnableSiteNotice for bnwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885927 (https://phabricator.wikimedia.org/T328630)
[08:37:32] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39362/console" [puppet] - 10https://gerrit.wikimedia.org/r/884881 (owner: 10Slyngshede)
[08:37:48] <apergos>	 tgr_:   since you haven't replied I'm going to assume you checked out early
[08:37:54] <logmsgbot>	 !log tgr@deploy1002 Finished scap: Backport for [[gerrit:885928|campaigns: Donor landing page translations for sv, it, ja, fr, nl (T321370)]], [[gerrit:885929|campaigns: Donor landing page translations for sv, it, ja, fr, nl (T321370)]] (duration: 14m 26s)
[08:37:54] <apergos>	 moving ahead with your patch, Aishik
[08:37:58] <stashbot>	 T321370: Thank You Pages: custom account creation pages for sv, it, ja, fr, nl - https://phabricator.wikimedia.org/T321370
[08:37:59] <apergos>	 oh, nm
[08:38:03] <apergos>	 I was impatient
[08:38:20] <tgr_>	 yeah, sorry, it took a while. Done now.
[08:38:39] <apergos>	 no worries, thanks!
[08:38:46] <apergos>	 now moving ahead with Aishik's patch
[08:39:46] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab Replica gitlab1004 to 15.7.6
[08:42:20] <wikibugs>	 (03CR) 10ArielGlenn: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885927 (https://phabricator.wikimedia.org/T328630) (owner: 10Aishik Rehman)
[08:43:04] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Point the webproxy in esams to install3002 [dns] - 10https://gerrit.wikimedia.org/r/885982 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff)
[08:44:01] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Point DHCP server in esams to install3002 [homer/public] - 10https://gerrit.wikimedia.org/r/885805 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff)
[08:44:12] <apergos>	 oh looks like no pre merge jenkins, fine
[08:44:33] <wikibugs>	 (03CR) 10ArielGlenn: [C: 03+2] Enable wgMinervaEnableSiteNotice for bnwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885927 (https://phabricator.wikimedia.org/T328630) (owner: 10Aishik Rehman)
[08:44:39] <wikibugs>	 (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Point DHCP server in esams to install3002 [homer/public] - 10https://gerrit.wikimedia.org/r/885805 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff)
[08:44:41] <wikibugs>	 (03Merged) 10jenkins-bot: Point DHCP server in esams to install3002 [homer/public] - 10https://gerrit.wikimedia.org/r/885805 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff)
[08:45:16] <apergos>	 ah it was just slow
[08:45:21] <wikibugs>	 (03Merged) 10jenkins-bot: Enable wgMinervaEnableSiteNotice for bnwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885927 (https://phabricator.wikimedia.org/T328630) (owner: 10Aishik Rehman)
[08:46:54] <logmsgbot>	 !log ariel@deploy1002 Started scap: Backport for [[gerrit:885927|Enable wgMinervaEnableSiteNotice for bnwiktionary (T328630)]]
[08:46:58] <stashbot>	 T328630: Enable wgMinervaEnableSiteNotice for bnwiktionary - https://phabricator.wikimedia.org/T328630
[08:47:38] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] P:IDM Configure OIDC and LDAP. (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/884881 (owner: 10Slyngshede)
[08:48:48] <logmsgbot>	 !log ariel@deploy1002 ariel and aishik: Backport for [[gerrit:885927|Enable wgMinervaEnableSiteNotice for bnwiktionary (T328630)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet
[08:49:04] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:49:11] <apergos>	 Aishik:  please test your path, it is now live on mwdebug1001 
[08:49:15] <apergos>	 *patch
[08:51:07] <Aishik>	 Everything is alright!
[08:51:22] <Aishik>	 Thanks apergos
[08:51:37] <apergos>	 ok, I'll complete the scap now
[08:57:12] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Update DHCP config for esams [puppet] - 10https://gerrit.wikimedia.org/r/885984 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff)
[08:57:51] <logmsgbot>	 !log ariel@deploy1002 Finished scap: Backport for [[gerrit:885927|Enable wgMinervaEnableSiteNotice for bnwiktionary (T328630)]] (duration: 10m 56s)
[08:57:54] <stashbot>	 T328630: Enable wgMinervaEnableSiteNotice for bnwiktionary - https://phabricator.wikimedia.org/T328630
[08:58:05] <apergos>	 Aishik:  your patch is live in production, please test
[08:58:07] <apergos>	 grrrrr
[08:58:53] <apergos>	 let's hope that was just a network issue or something and that they will be back shortly.
[09:00:05] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:01:10] <wikibugs>	 10ops-codfw, 10DBA: db2181 stopped answering ping - https://phabricator.wikimedia.org/T328623 (10jcrespo) I've manually disabled notifications on Icinga, as puppet cannot run on the host to apply T328623#8581258, to prevent further notifications. This will require manual removal later.
[09:04:52] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove installserver role from install3001 [puppet] - 10https://gerrit.wikimedia.org/r/885989 (https://phabricator.wikimedia.org/T327867)
[09:05:25] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:08:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:10:59] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] Adding Kavitha Appakayala to icinga [puppet] - 10https://gerrit.wikimedia.org/r/885985 (https://phabricator.wikimedia.org/T327403) (owner: 10Alexandros Kosiaris)
[09:11:25] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: update o11y with opensearch roles and settings [puppet] - 10https://gerrit.wikimedia.org/r/885371 (owner: 10Filippo Giunchedi)
[09:11:32] <elukey>	 !log roll restart of eventgate-main pods in wikikube eqiad/codfw to pick up new stream configs - T328576
[09:11:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:35] <stashbot>	 T328576: Implement new mediawiki.revision-score streams with Lift Wing - https://phabricator.wikimedia.org/T328576
[09:12:59] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-main: sync
[09:13:03] <apergos>	 I think our patch owner is not returning, so I'll call this done, though a bit late
[09:13:13] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-main: sync
[09:13:20] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] "Thank you for the review -- agreed this doesn't fix the underlying permission problem. I'll followup in a (sub)task" [puppet] - 10https://gerrit.wikimedia.org/r/885373 (owner: 10Filippo Giunchedi)
[09:13:29] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] opensearch: move to /run/ [puppet] - 10https://gerrit.wikimedia.org/r/885372 (owner: 10Filippo Giunchedi)
[09:13:31] <apergos>	 !log UTC morning backport and config training window done
[09:13:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:20] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/884881 (owner: 10Slyngshede)
[09:16:00] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab Replica gitlab1004 to 15.7.6
[09:19:25] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 398143
[09:19:49] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 398143
[09:21:11] <wikibugs>	 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10Clement_Goubert) >>! In T328287#8579474, @Trizek-WMF wrote: > As you gave 3 dates in the task description, can you...
[09:22:28] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 04-1] "See inline, found an issue while testing" [puppet] - 10https://gerrit.wikimedia.org/r/874891 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond)
[09:23:02] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/885441 (https://phabricator.wikimedia.org/T320553) (owner: 10JHathaway)
[09:23:22] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: httpbb: add tests for liftwing (prod/staging) [puppet] - 10https://gerrit.wikimedia.org/r/885990
[09:23:51] <wikibugs>	 (03PS2) 10Ilias Sarantopoulos: httpbb: add tests for liftwing (prod/staging) [puppet] - 10https://gerrit.wikimedia.org/r/885990
[09:25:38] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] httpbb: add tests for liftwing (prod/staging) [puppet] - 10https://gerrit.wikimedia.org/r/885990 (owner: 10Ilias Sarantopoulos)
[09:28:17] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:40:09] <logmsgbot>	 !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-main: sync
[09:40:51] <logmsgbot>	 !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: sync
[09:45:15] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:46:02] <wikibugs>	 (03PS1) 10Elukey: changeprop: refactor match template for liftwing [deployment-charts] - 10https://gerrit.wikimedia.org/r/885991 (https://phabricator.wikimedia.org/T327302)
[09:46:48] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] changeprop: refactor match template for liftwing [deployment-charts] - 10https://gerrit.wikimedia.org/r/885991 (https://phabricator.wikimedia.org/T327302) (owner: 10Elukey)
[09:50:33] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:51:19] <wikibugs>	 (03PS2) 10Elukey: changeprop: refactor match template for liftwing [deployment-charts] - 10https://gerrit.wikimedia.org/r/885991 (https://phabricator.wikimedia.org/T327302)
[09:51:59] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host aqs2001.codfw.wmnet
[09:53:21] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove installserver role from install3001 [puppet] - 10https://gerrit.wikimedia.org/r/885989 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff)
[09:54:52] <logmsgbot>	 !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-main: sync
[09:54:59] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1 C: 03+2] P:IDM Configure OIDC and LDAP. [puppet] - 10https://gerrit.wikimedia.org/r/884881 (owner: 10Slyngshede)
[09:55:26] <logmsgbot>	 !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: sync
[09:59:07] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aqs2001.codfw.wmnet
[10:02:43] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate the install servers to Bullseye - https://phabricator.wikimedia.org/T327867 (10MoritzMuehlenhoff)
[10:04:26] <moritzm>	 !log installing tiff security updates
[10:04:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:09:38] <wikibugs>	 (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/883913 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche)
[10:11:53] <moritzm>	 !log restarting FPM on mw canaries to pick up tiff security updates
[10:11:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:33] <wikibugs>	 (03PS1) 10Slyngshede: C:IDM parse static dir to deployment. [puppet] - 10https://gerrit.wikimedia.org/r/885995
[10:17:42] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[10:18:13] <wikibugs>	 (03CR) 10Jaime Nuche: jenkins: add hieradata config for Scap3-based deployments (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/883913 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche)
[10:19:53] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39363/console" [puppet] - 10https://gerrit.wikimedia.org/r/885995 (owner: 10Slyngshede)
[10:19:53] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host aqs2002.codfw.wmnet
[10:20:52] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1 C: 03+2] C:IDM parse static dir to deployment. [puppet] - 10https://gerrit.wikimedia.org/r/885995 (owner: 10Slyngshede)
[10:25:01] <wikibugs>	 (03CR) 10Jelto: [C: 04-1] "I'm not sure if you want to remove old scap/jenkins components when switching to scap3. If yes, you have to use the ensure flags otherwise" [puppet] - 10https://gerrit.wikimedia.org/r/884887 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche)
[10:27:21] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aqs2002.codfw.wmnet
[10:30:02] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:35:22] <wikibugs>	 10SRE, 10DBA, 10Datacenter-Switchover, 10Patch-For-Review: switchdc should automatically downtime "Read only" checks on DB masters being switched - https://phabricator.wikimedia.org/T285803 (10Clement_Goubert) Is this still relevant, does it need to be finished for {T327920}, or can it be closed?
[10:37:35] <wikibugs>	 (03PS1) 10Hokwelum: Update README file [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/885997
[10:39:40] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "probes will start working once we're back to one centrallog server in eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/882761 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse)
[10:40:25] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Update README file [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/885997 (owner: 10Hokwelum)
[10:48:34] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10serviceops: Optimize k8s same row traffic flows - https://phabricator.wikimedia.org/T328523 (10cmooney) >  BGP is smart about it (see '"first party" NEXT_HOP' in section 5.1.3.2 of the RFC), so it should just work on the router side.  TIL didn't realise EBGP...
[10:49:05] <wikibugs>	 (03PS2) 10Hokwelum: Update README file [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/885997
[10:49:59] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Update README file [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/885997 (owner: 10Hokwelum)
[10:50:17] <wikibugs>	 (03PS1) 10Mvolz: Update zotero to 2023-02-01-144124-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/885998
[10:51:05] <wikibugs>	 (03PS3) 10Hokwelum: Update README file [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/885997
[10:51:58] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Update README file [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/885997 (owner: 10Hokwelum)
[10:55:40] <icinga-wm>	 PROBLEM - Check systemd state on sretest1002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:55:50] <wikibugs>	 (03CR) 10Mvolz: [C: 03+2] Update zotero to 2023-02-01-144124-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/885998 (owner: 10Mvolz)
[11:00:05] <jouncebot>	 mvolz: Your horoscope predicts another unfortunate Services – Citoid / Zotero deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230202T1100).
[11:00:05] <jouncebot>	 Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230202T1100)
[11:00:07] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] wmflib::ssl_ciphersuites: drop suppport for anything less then jessie [puppet] - 10https://gerrit.wikimedia.org/r/640467 (owner: 10Jbond)
[11:00:49] <wikibugs>	 (03Merged) 10jenkins-bot: Update zotero to 2023-02-01-144124-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/885998 (owner: 10Mvolz)
[11:01:36] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39364/console" [puppet] - 10https://gerrit.wikimedia.org/r/640467 (owner: 10Jbond)
[11:02:14] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/zotero: apply
[11:03:34] <wikibugs>	 (03PS1) 10Vgutierrez: varnish: Provide a valid DP key [labs/private] - 10https://gerrit.wikimedia.org/r/886000 (https://phabricator.wikimedia.org/T315676)
[11:04:08] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+2 C: 03+2] varnish: Provide a valid DP key [labs/private] - 10https://gerrit.wikimedia.org/r/886000 (https://phabricator.wikimedia.org/T315676) (owner: 10Vgutierrez)
[11:05:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:07:34] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "One remaining concern was that a process spawned by systemd which has a shell configured (which isn't the case for the majority of service" [puppet] - 10https://gerrit.wikimedia.org/r/879418 (https://phabricator.wikimedia.org/T300977) (owner: 10Jbond)
[11:09:04] <wikibugs>	 10SRE, 10DBA, 10Datacenter-Switchover, 10Patch-For-Review: switchdc should automatically downtime "Read only" checks on DB masters being switched - https://phabricator.wikimedia.org/T285803 (10Marostegui) We really need this to be completed yes. I don't know in which state this is at the moment.
[11:09:21] <wikibugs>	 (03PS3) 10Elukey: changeprop: refactor match template for liftwing [deployment-charts] - 10https://gerrit.wikimedia.org/r/885991 (https://phabricator.wikimedia.org/T327302)
[11:12:19] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/zotero: apply
[11:13:19] <Lucas_WMDE>	 jouncebot: now
[11:13:19] <jouncebot>	 For the next 0 hour(s) and 46 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230202T1100)
[11:13:19] <jouncebot>	 For the next 0 hour(s) and 46 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230202T1100)
[11:14:42] <Lucas_WMDE>	 does anyone mind if I run a maintenance script to fix T328634?
[11:14:42] <stashbot>	 T328634: Lost pages after deployed addtional namespaces on  shn.wikibooks - https://phabricator.wikimedia.org/T328634
[11:14:57] <mvolz>	 I mean, I am trying to deploy right now
[11:15:04] <mvolz>	 but it just failed
[11:15:07] <mvolz>	 Error: UPGRADE FAILED: release staging failed, and has been rolled back due to atomic being set: timed out waiting for the condition
[11:15:07] <Lucas_WMDE>	 ah, sorry, I didn’t see that
[11:15:17] <Lucas_WMDE>	 I’ll hold then
[11:15:22] <mvolz>	 I'm actually not sure if I should just try again? 
[11:15:24] <Lucas_WMDE>	 don’t think I can help with helm errors though
[11:15:27] <claime>	 That's helm errors
[11:15:43] <mvolz>	 Yeah.
[11:15:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:16:02] <claime>	 Does it tell you what namespace failed ?
[11:17:50] <claime>	 Oh it's on zotero staging?
[11:17:59] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] changeprop: refactor match template for liftwing [deployment-charts] - 10https://gerrit.wikimedia.org/r/885991 (https://phabricator.wikimedia.org/T327302) (owner: 10Elukey)
[11:18:32] <mvolz>	 claime: yup
[11:18:40] <mvolz>	 ideas? 
[11:19:22] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] changeprop: refactor match template for liftwing [deployment-charts] - 10https://gerrit.wikimedia.org/r/885991 (https://phabricator.wikimedia.org/T327302) (owner: 10Elukey)
[11:19:28] <claime>	 mvolz: you're deploying directly via helmfile on deploy1002 right?
[11:19:45] <mvolz>	 Lucas_WMDE: I think you should go ahead (did these windows always overlap? I feel like this is new)
[11:20:17] <Lucas_WMDE>	 what I’m doing doesn’t belong to either window, I just thought both windows might be inactive
[11:20:19] <claime>	 The MW infra window is recent
[11:20:24] <mvolz>	 claime: yes
[11:20:25] <wikibugs>	 (03PS1) 10DCausse: [WIP] rdf-streaming-updater: add a test job using the k8s operator... [deployment-charts] - 10https://gerrit.wikimedia.org/r/886005
[11:20:29] <claime>	 And it's for mw-on-k8s deployments
[11:20:45] <claime>	 i.e for now mostly _joe_ and I
[11:21:03] <mvolz>	 Ah okay
[11:21:16] <claime>	 So don't worry about overlap for now :)
[11:21:33] <claime>	 Lucas_WMDE: go ahead I'm not touching anything on mw-on-k8s rn
[11:21:50] <_joe_>	 mvolz: what are you trying to deploy?
[11:21:54] <Lucas_WMDE>	 alright, I’ll run the script and hope it works
[11:22:00] <Lucas_WMDE>	 shouldn’t affect citoid anyway
[11:22:04] <mvolz>	 yup
[11:22:04] <Lucas_WMDE>	 good luck with your parts :)
[11:22:06] <_joe_>	 citoid?
[11:22:12] <mvolz>	 I'm trying to deploy zotero
[11:22:16] <_joe_>	 mvolz: where were you deploying it?
[11:22:18] <_joe_>	 staging?
[11:22:20] <Lucas_WMDE>	 *zotero
[11:22:24] <mvolz>	 yes, staging
[11:22:26] * Lucas_WMDE was confused
[11:22:28] <mvolz>	 and it just timed out
[11:22:29] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: sync
[11:22:30] <_joe_>	 jayme: ^^
[11:22:40] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: sync
[11:22:41] <mvolz>	 _joe_: https://pastebin.com/m953ABNj
[11:22:45] <_joe_>	 issues in staging bringing up zotero
[11:23:00] <mvolz>	 not terribly helpful message  but that's what I got back after about 11 minutes
[11:23:02] <_joe_>	 mvolz: yeah I'm redirecting the debugging to jayme sorry, I have my hands full with other stuff
[11:23:12] <mvolz>	 ok
[11:23:21] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] [WIP] rdf-streaming-updater: add a test job using the k8s operator... [deployment-charts] - 10https://gerrit.wikimedia.org/r/886005 (owner: 10DCausse)
[11:23:28] <_joe_>	 mvolz: yes that's what you get when k8s is unable to deploy something within some of its timeouts
[11:23:44] <_joe_>	 mvolz: out of curiosity, were you deploying a new version of the image?
[11:23:52] <Lucas_WMDE>	 !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript namespaceDupes.php shnwikibooks --fix | tee T328634-namespaceDupes.out # T328634 – failed quickly, details in task
[11:23:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:23:55] <stashbot>	 T328634: Lost pages after deployed addtional namespaces on  shn.wikibooks - https://phabricator.wikimedia.org/T328634
[11:23:59] <_joe_>	 if so, was it significantly larger than the last one?
[11:24:22] <mvolz>	 _joe_: yes, a new version but I don't think much bigger
[11:24:36] <_joe_>	 ok, uhm
[11:24:54] <wikibugs>	 (03CR) 10DCausse: [WIP] rdf-streaming-updater: add a test job using the k8s operator... (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/886005 (owner: 10DCausse)
[11:26:24] <mvolz>	 I could try to deploy a very minor change to citoid (package update), if you like, and see whether all services are being hinky or it's just a zotero-specific issue
[11:26:55] <mvolz>	 that I was going to do next
[11:27:01] <wikibugs>	 (03PS1) 10Jbond: puppet-merge: try to decode with erros=ignore on failure [puppet] - 10https://gerrit.wikimedia.org/r/886006
[11:27:06] <wikibugs>	 (03CR) 10Mvolz: [C: 03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/865198 (owner: 10PipelineBot)
[11:28:09] <claime>	 15m         Warning   BackOff             pod/zotero-staging-9cb5fcb5d-z68zr    Back-off restarting failed container
[11:28:10] <Lucas_WMDE>	 !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript namespaceDupes.php shnwikibooks --fix --add-prefix=T328634/ | tee T328634-namespaceDupes-2.out # T328634 – another error but made more progress
[11:28:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:13] <claime>	 Container fails to start
[11:28:46] <mvolz>	 :(
[11:29:03] <mvolz>	 should we revert and see if that works? 
[11:29:25] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] varnish: Generate a DP subkey daily [puppet] - 10https://gerrit.wikimedia.org/r/857748 (https://phabricator.wikimedia.org/T315676) (owner: 10Vgutierrez)
[11:29:33] <claime>	 The service should still be up, but on the former version
[11:32:04] <wikibugs>	 (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/865198 (owner: 10PipelineBot)
[11:32:25] <Lucas_WMDE>	 !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript namespaceDupes.php shnwikibooks --fix --add-prefix=T328634/ | tee T328634-namespaceDupes-3.out # T328634 – seemed to finish the first 20 pages and then go into an infinite loop, I Ctrl+Ced it
[11:32:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:28] <stashbot>	 T328634: Lost pages after deployed addtional namespaces on  shn.wikibooks - https://phabricator.wikimedia.org/T328634
[11:33:00] <mvolz>	 yeah, it is. 
[11:34:31] <mvolz>	 claime: would you mind if I tried the citoid deploy, or would that be disruptive 
[11:36:15] <wikibugs>	 (03CR) 10DCausse: flink-app: add preliminary H/A support (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/885832 (owner: 10DCausse)
[11:36:24] <wikibugs>	 (03Abandoned) 10DCausse: flink-app: add preliminary H/A support [deployment-charts] - 10https://gerrit.wikimedia.org/r/885832 (owner: 10DCausse)
[11:37:04] <claime>	 mvolz: Nah go
[11:37:19] <Lucas_WMDE>	 !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript namespaceDupes.php shnwikibooks --fix | tee T328634-namespaceDupes-4.out # T328634 – made some progress then errored out again
[11:37:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:37:55] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/citoid: apply
[11:38:27] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply
[11:39:22] <logmsgbot>	 !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/citoid: apply
[11:39:25] <wikibugs>	 (03PS1) 10Vgutierrez: varnish: Fix python3-nacl dependency order issue [puppet] - 10https://gerrit.wikimedia.org/r/886008 (https://phabricator.wikimedia.org/T315676)
[11:40:05] <logmsgbot>	 !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/citoid: apply
[11:40:44] <Lucas_WMDE>	 I think I’m done with my maintenance script runs for now
[11:40:53] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39365/console" [puppet] - 10https://gerrit.wikimedia.org/r/886008 (https://phabricator.wikimedia.org/T315676) (owner: 10Vgutierrez)
[11:41:13] <logmsgbot>	 !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/citoid: apply
[11:41:34] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/886006 (owner: 10Jbond)
[11:41:50] <logmsgbot>	 !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply
[11:42:01] <logmsgbot>	 !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/citoid: apply
[11:42:39] <claime>	 Lucas_WMDE: ack
[11:42:52] <logmsgbot>	 !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply
[11:43:02] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] varnish: Fix python3-nacl dependency order issue [puppet] - 10https://gerrit.wikimedia.org/r/886008 (https://phabricator.wikimedia.org/T315676) (owner: 10Vgutierrez)
[11:44:23] <mvolz>	 well citoid deploy went fine
[11:46:24] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[11:49:44] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Add DP cookie for pageview filtering - https://phabricator.wikimedia.org/T315676 (10Vgutierrez) Initial sanity checks confirms that the daily key generated on two different hosts is the same: ` vgutierrez@cumin1001:~$ sudo -i cumin 'cp[6015,6016].*' 'sha512sum /etc/varni...
[11:54:01] <wikibugs>	 (03PS1) 10Slyngshede: Switch to CAS OIDC for login button. [software/bitu] - 10https://gerrit.wikimedia.org/r/886010
[11:57:44] <wikibugs>	 (03PS2) 10Slyngshede: Switch to CAS OIDC for login button. [software/bitu] - 10https://gerrit.wikimedia.org/r/886010
[12:00:06] <icinga-wm>	 RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:01:11] <wikibugs>	 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10Clement_Goubert) >>! In T328287#8579336, @Trizek-WMF wrote: > @Clement_Goubert Has anything major changed in your p...
[12:05:22] <icinga-wm>	 PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: planet_sync_tile_generation-gis.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:08:05] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[12:08:35] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[12:09:05] <wikibugs>	 (03PS1) 10Stevemunene: Bump up mediawiki_history_snapshot to 2023-01 [puppet] - 10https://gerrit.wikimedia.org/r/886013
[12:13:50] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[12:14:36] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1020 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[12:18:23] <claime>	 marostegui: Amir1: doing something on db1117:3323 ?
[12:18:24] <wikibugs>	 (03CR) 10Jelto: [C: 03+1] jenkins: add hieradata config for Scap3-based deployments (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/883913 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche)
[12:18:34] <Amir1>	 let me see
[12:18:51] <Amir1>	 3323? that should be m3 I think
[12:19:38] <Amir1>	 there is another too
[12:21:59] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host aqs2003.codfw.wmnet
[12:22:18] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1013 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[12:22:23] <Amir1>	 the mysql has crashed
[12:22:30] <Amir1>	 debugging 
[12:23:11] <kostajh>	 Is this an OK time to deploy a security patch?
[12:23:14] <claime>	 Amir1: Yeah, other port on 1117 is down
[12:23:19] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp1076.eqiad.wmnet with OS bullseye
[12:23:21] <claime>	 (3322)
[12:23:26] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp1076.eqiad.wmnet with OS bullseye
[12:23:37] <claime>	 jouncebot: nowandnext
[12:23:37] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 36 minute(s)
[12:23:37] <jouncebot>	 In 1 hour(s) and 36 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230202T1400)
[12:23:38] <jouncebot>	 In 1 hour(s) and 36 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230202T1400)
[12:24:04] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1013 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[12:24:05] <wikibugs>	 (03CR) 10Jaime Nuche: jenkins: use Scap3 deployment for releases instances (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/884887 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche)
[12:24:24] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1016 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[12:25:14] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1020 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[12:25:16] <Amir1>	 The unit mariadb@m3.service has successfully entered the 'dead' state.
[12:25:45] <jynus>	 ?
[12:25:49] <claime>	 kostajh: I think you can go ahead
[12:26:06] <marostegui>	 Amir1: that is me
[12:26:07] <Amir1>	 jynus: m3 only on db1117 died out of nowhere
[12:26:16] <Amir1>	 ah
[12:26:24] <Amir1>	 marostegui: I brought it back, shall I kill it?
[12:26:24] <marostegui>	 oh you started it???
[12:26:34] <marostegui>	 I need to start again :(
[12:26:37] <claime>	 Well yeah, it started making haproxy alert marostegui ...
[12:26:59] <Amir1>	 stopped m3
[12:27:19] <marostegui>	 Amir1: please leave it with me, I don't need it stopped yet
[12:27:45] <Amir1>	 okay, brought it back
[12:27:55] <marostegui>	 Amir1: please stop touching it
[12:28:11] * Amir1 hands off now
[12:28:57] <Amir1>	 db1164 is also out, it's m2. Is that you too?
[12:29:04] <logmsgbot>	 !log btullis@deploy1002 Started deploy [analytics/superset/deploy@5175ad7]: Production deployment for numpy downgrade
[12:29:10] <marostegui>	 yes
[12:29:17] <Amir1>	 noted
[12:29:26] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aqs2003.codfw.wmnet
[12:29:27] <claime>	 !log Work ongoing on m2 and m3
[12:29:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:29:45] <kostajh>	 claime: thanks
[12:29:46] <logmsgbot>	 !log btullis@deploy1002 Finished deploy [analytics/superset/deploy@5175ad7]: Production deployment for numpy downgrade (duration: 00m 42s)
[12:30:14] <kostajh>	 next question, is anyone able to help me with the security patch deployment, as I have not done that before?
[12:30:38] <Amir1>	 kostajh: https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Security_patches this should be good
[12:31:51] <kostajh>	 Amir1: thanks, I've seen that... I guess I'll try the deployment via script approach and hope for the best
[12:32:50] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[12:33:10] <claime>	 ^ expected
[12:33:10] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1020 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[12:33:24] <marostegui>	 yes, all proxies are expected to irc alert
[12:34:56] <claime>	 marostegui: No problem, tell me when you're done so I know I can start worrying again ;)
[12:34:56] <kostajh>	 Amir1: I'm stuck at step 0. "No ED25519 host key is known for deployment.eqiad.wmnet". I've run `scripts/wmf-update-known-hosts-production` from the `wmf-sre-laptop` repo and other servers work... am I missing something obvious?
[12:35:15] <claime>	 kostajh: deploy1002.eqiad.wmnet
[12:35:21] <Amir1>	 yup
[12:35:22] <marostegui>	 claime: I will
[12:35:34] <claime>	 marostegui: thanks mate <3
[12:35:35] <kostajh>	 aha
[12:35:36] <kostajh>	 thanks
[12:37:49] <kostajh>	 What am I missing in my SSH config? https://wikitech.wikimedia.org/wiki/Deploy1002 says to use `deployment.eqiad.wmnet`
[12:38:09] <kostajh>	 My SSH config looks like https://wikitech.wikimedia.org/wiki/SRE/Production_access#Setting_up_your_SSH_config
[12:39:29] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[12:39:40] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+2] Rotate aphlict logs either daily, or when they reach 1G [puppet] - 10https://gerrit.wikimedia.org/r/885858 (https://phabricator.wikimedia.org/T325246) (owner: 10EoghanGaffney)
[12:39:44] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[12:40:13] <wikibugs>	 (03CR) 10Jelto: [C: 04-1] "comment in line" [puppet] - 10https://gerrit.wikimedia.org/r/884887 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche)
[12:40:16] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[12:41:04] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[12:41:06] <wikibugs>	 (03PS1) 10Marostegui: db1164: Move it from m2 to m3 [puppet] - 10https://gerrit.wikimedia.org/r/886030 (https://phabricator.wikimedia.org/T328402)
[12:41:36] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db1164: Move it from m2 to m3 [puppet] - 10https://gerrit.wikimedia.org/r/886030 (https://phabricator.wikimedia.org/T328402) (owner: 10Marostegui)
[12:41:43] <claime>	 kostajh: Yeah so, deployment.eqiad.wmnet points to deploy1002.eqiad.wmnet, but apparently we don't create the known_hosts entry for it.
[12:41:45] <wikibugs>	 (03PS1) 10Muehlenhoff: Move webproxy in ulsfo to install4002 [dns] - 10https://gerrit.wikimedia.org/r/886031 (https://phabricator.wikimedia.org/T327867)
[12:41:54] <claime>	 kostajh: just ssh deploy1002.eqiad.wmnet
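The host-key problem in this exchange (a service alias that resolves to a host whose known_hosts entry exists only under its canonical name) can be illustrated with a small sketch. It is not the `wmf-update-known-hosts-production` script: the function names are hypothetical, the key string is a placeholder, and hashed known_hosts entries are simply ignored.

```python
def hosts_in_known_hosts(text):
    """Return the set of plain (unhashed) host names found in known_hosts text."""
    hosts = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        # Skip an optional marker such as @cert-authority or @revoked.
        if fields[0].startswith("@"):
            fields = fields[1:]
        # A usable entry has at least: host list, key type, key.
        if len(fields) < 3:
            continue
        for name in fields[0].split(","):
            # Hashed names start with "|" and cannot be matched by plain text.
            if not name.startswith("|"):
                hosts.add(name)
    return hosts


def pick_ssh_target(alias, canonical, known_hosts_text):
    """Prefer the alias if it has a host key entry, else fall back to the canonical name."""
    known = hosts_in_known_hosts(known_hosts_text)
    if alias in known:
        return alias
    if canonical in known:
        return canonical
    return None


# Placeholder content for illustration only; the key is not a real key.
SAMPLE_KNOWN_HOSTS = """\
# sample entry
deploy1002.eqiad.wmnet ssh-ed25519 AAAAPLACEHOLDERKEY
"""

if __name__ == "__main__":
    print(pick_ssh_target("deployment.eqiad.wmnet",
                          "deploy1002.eqiad.wmnet",
                          SAMPLE_KNOWN_HOSTS))
```

With the sample content, only the canonical name has an entry, which matches the advice to ssh to deploy1002.eqiad.wmnet directly.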
[12:41:55] <marostegui>	 eoghan: can i merge your changes?
[12:42:01] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[12:42:14] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1013 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[12:42:16] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1012 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[12:42:18] <eoghan>	 marostegui: I was just about to but if I'm blocking you from doing it then go right ahead!
[12:42:25] <marostegui>	 eoghan: doing it
[12:42:36] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[12:42:44] <eoghan>	 marostegui: Wonderful, thank you!
[12:43:34] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1013 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[12:43:36] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1012 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[12:43:42] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[12:44:12] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[12:44:41] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1076.eqiad.wmnet with reason: host reimage
[12:45:11] <kostajh>	 claime: ack
[12:45:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:46:10] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[12:46:48] <wikibugs>	 (03PS28) 10Stevemunene: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580)
[12:47:07] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[12:47:44] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1076.eqiad.wmnet with reason: host reimage
[12:51:10] <wikibugs>	 (03CR) 10Stevemunene: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39366/console" [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene)
[12:52:04] <kostajh>	 Amir1 / claime: about to run the script. There's no verification step via WikimediaDebug, AIUI. Is there a command to revert the deployment in case of unanticipated problems?
[12:52:30] <Amir1>	 I don't remember 
[12:53:50] <wikibugs>	 (03PS1) 10Muehlenhoff: Setup install4002 as install server [puppet] - 10https://gerrit.wikimedia.org/r/886036 (https://phabricator.wikimedia.org/T327867)
[12:54:47] <wikibugs>	 (03PS1) 10Muehlenhoff: Update DHCP config in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/886037 (https://phabricator.wikimedia.org/T327867)
[12:55:04] <wikibugs>	 (03PS20) 10Jaime Nuche: jenkins: add hieradata config for Scap3-based deployments [puppet] - 10https://gerrit.wikimedia.org/r/883913 (https://phabricator.wikimedia.org/T323909)
[12:55:04] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host aqs2004.codfw.wmnet
[12:55:06] <wikibugs>	 (03PS8) 10Jaime Nuche: jenkins: use Scap3 deployment for releases instances [puppet] - 10https://gerrit.wikimedia.org/r/884887 (https://phabricator.wikimedia.org/T323909)
[12:55:08] <wikibugs>	 (03PS7) 10Jaime Nuche: jenkins: enable Scap3 deployment for active releases instance [puppet] - 10https://gerrit.wikimedia.org/r/884891 (https://phabricator.wikimedia.org/T323909)
[12:55:51] <taavi>	 kostajh: which script?
[12:55:52] <claime>	 kostajh: I don't know. jnuche, do you or someone from your team know?
[12:55:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:56:09] <claime>	 taavi: deploy_security.py I suppose
[12:56:12] <kostajh>	 I'm running it now. taavi: this one https://gitlab.wikimedia.org/repos/releng/release/-/blob/master/deploy_security.py
[12:57:17] <taavi>	 scap sync-file --help says there is an argument called --pause-after-testserver-sync, which could probably be used by the script to implement an mwdebug step
[12:59:46] <kostajh>	 ack
[12:59:55] <kostajh>	 I'm on the "When you run it, sometimes it might look like it's stuck. Don't worry, it's doing stuff." step now...
[13:00:16] <jnuche>	 claime, kostajh: if you're running `sync-file` manually, then the flag mentioned by taavi should pause the sync so you can verify
[13:00:36] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: [WIP] Add sre.discovery.datacenter-route [cookbooks] - 10https://gerrit.wikimedia.org/r/886038
[13:00:49] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[13:00:53] <claime>	 jnuche: kostajh is running the procedure at https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Deployment:_via_script
[13:01:40] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aqs2004.codfw.wmnet
[13:01:58] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on ml-serve1008:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=ml-serve1008 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[13:02:06] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[13:03:11] <logmsgbot>	 !log kharlan: Deployed security patch for T328643
[13:03:52] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1016 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[13:03:55] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[13:04:15] <jnuche>	 claime: ack, we don't seem to have a flag for that in that script unfortunately
[13:04:30] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[13:04:44] <marostegui>	 claime: all proxies are back, maintenance is finished on m2 and m3
[13:05:09] <claime>	 marostegui: Thanks, and sorry for the earlier disruption
[13:06:58] <jinxer-wm>	 (KubernetesRsyslogDown) resolved: rsyslog on ml-serve1008:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=ml-serve1008 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[13:06:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:08:21] <kostajh>	 taavi: I filed your idea about `--pause-after-testserver-sync` as T328667
[13:08:22] <stashbot>	 T328667: Add --pause-after-testserver-sync option to deploy_security.py - https://phabricator.wikimedia.org/T328667
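The idea filed as T328667 could look roughly like the sketch below. It is not the actual deploy_security.py: `build_sync_command` and `parse_args` are hypothetical helpers, the file path is a placeholder, and the positional file and message arguments to `scap sync-file` are an assumption. Only the `--pause-after-testserver-sync` flag name comes from the discussion. The sketch only prints the command and does not run scap.

```python
import argparse
import shlex


def build_sync_command(path, message, pause_after_testservers=False):
    """Build the argument list for a sync, optionally pausing after the test servers."""
    cmd = ["scap", "sync-file"]
    if pause_after_testservers:
        # Flag name as mentioned in the channel; pass it through unchanged.
        cmd.append("--pause-after-testserver-sync")
    cmd += [path, message]
    return cmd


def parse_args(argv):
    """Parse the options of this hypothetical wrapper."""
    parser = argparse.ArgumentParser(description="Hypothetical wrapper sketch")
    parser.add_argument("path", help="file to sync")
    parser.add_argument("message", help="log message for the sync")
    parser.add_argument(
        "--pause-after-testserver-sync",
        action="store_true",
        help="stop after the test servers so the change can be verified",
    )
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args([
        "--pause-after-testserver-sync",
        "path/to/patched/file.php",  # placeholder path
        "Security patch for T328643",
    ])
    cmd = build_sync_command(args.path, args.message,
                             args.pause_after_testserver_sync)
    # Print instead of executing, so the sketch is safe to run anywhere.
    print(shlex.join(cmd))
```

Keeping the command construction in its own function makes the pass-through easy to check without a deployment host.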
[13:08:23] <wikibugs>	 (03PS1) 10Slyngshede: C:IDM Add timers and background workers. [puppet] - 10https://gerrit.wikimedia.org/r/886039
[13:08:31] <wikibugs>	 (03CR) 10Jaime Nuche: "Latest PCC: https://puppet-compiler.wmflabs.org/output/884887/39367/" [puppet] - 10https://gerrit.wikimedia.org/r/884887 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche)
[13:08:44] <taavi>	 jnuche: and I wonder if that script should be integrated into scap entirely. Having 'run this script you just downloaded' as the official workflow isn't ideal, especially for security stuff
[13:08:54] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] C:IDM Add timers and background workers. [puppet] - 10https://gerrit.wikimedia.org/r/886039 (owner: 10Slyngshede)
[13:09:08] <kostajh>	 yeah, would be nice if this was in scap
[13:09:36] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1076.eqiad.wmnet with OS bullseye
[13:09:42] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp1076.eqiad.wmnet with OS bullseye completed: - cp1076 (**PASS**)   - Downtimed on Icinga/Alertmanager   - Disabled Pu...
[13:10:01] <logmsgbot>	 !log kharlan: Deployed security patch for T328643
[13:11:05] <wikibugs>	 (03PS2) 10Slyngshede: C:IDM Add timers and background workers. [puppet] - 10https://gerrit.wikimedia.org/r/886039
[13:11:12] <kostajh>	 seems to be done
[13:11:13] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:11:31] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Setup install4002 as install server [puppet] - 10https://gerrit.wikimedia.org/r/886036 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff)
[13:11:39] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] C:IDM Add timers and background workers. [puppet] - 10https://gerrit.wikimedia.org/r/886039 (owner: 10Slyngshede)
[13:12:18] <jnuche>	 taavi, kostajh: agreed
[13:13:09] <wikibugs>	 (03PS3) 10Slyngshede: C:IDM Add timers and background workers. [puppet] - 10https://gerrit.wikimedia.org/r/886039
[13:13:30] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] C:IDM Add timers and background workers. [puppet] - 10https://gerrit.wikimedia.org/r/886039 (owner: 10Slyngshede)
[13:14:23] <wikibugs>	 (03PS4) 10Slyngshede: C:IDM Add timers and background workers. [puppet] - 10https://gerrit.wikimedia.org/r/886039
[13:14:25] <kostajh>	 jnuche: I'm following https://wikitech.wikimedia.org/wiki/How_to_deploy_code#After and it says "You may have to ask a releng/SRE person to check that the build worked correctly if you don’t have the necessary access yourself.". I don't have access, could you please look?
[13:14:44] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] C:IDM Add timers and background workers. [puppet] - 10https://gerrit.wikimedia.org/r/886039 (owner: 10Slyngshede)
[13:15:29] <claime>	 kostajh: I'll do it
[13:16:07] <taavi>	 that's not really up to date? scap is doing that by itself these days instead of leaving it to the jenkins build
[13:16:34] <claime>	 yeah
[13:16:39] <jnuche>	 kostajh: I can see your patch for both active branches in the deployment server
[13:16:50] <kostajh>	 ok, thanks
[13:17:25] <kostajh>	 (I updated the security section for How_to_deploy_code to make it a bit more readable for those who haven't done it before, if there are other things to update please change them!)
[13:17:50] <kostajh>	 then, I think I'm done with this for now. I left a follow-up comment in the phab task for Security to clarify next steps
[13:18:32] <wikibugs>	 (03PS5) 10Slyngshede: C:IDM Add timers and background workers. [puppet] - 10https://gerrit.wikimedia.org/r/886039
[13:18:53] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] C:IDM Add timers and background workers. [puppet] - 10https://gerrit.wikimedia.org/r/886039 (owner: 10Slyngshede)
[13:19:45] <claime>	 All mw-on-k8s deployments have been updated to latest images
[13:19:58] <kostajh>	 claime: thx
[13:20:07] <wikibugs>	 (03PS6) 10Slyngshede: C:IDM Add timers and background workers. [puppet] - 10https://gerrit.wikimedia.org/r/886039
[13:20:28] <wikibugs>	 (03CR) 10jenkins-bot: C:IDM Add timers and background workers. [puppet] - 10https://gerrit.wikimedia.org/r/886039 (owner: 10Slyngshede)
[13:23:06] <wikibugs>	 (03PS7) 10Slyngshede: C:IDM Add timers and background workers. [puppet] - 10https://gerrit.wikimedia.org/r/886039
[13:23:27] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] C:IDM Add timers and background workers. [puppet] - 10https://gerrit.wikimedia.org/r/886039 (owner: 10Slyngshede)
[13:24:34] <mvolz>	 claime: any ideas how to debug the zotero failure? Because it works locally and builds fine, I'm not sure what my next steps would be. Should I open a ticket?
[13:25:24] <wikibugs>	 (03PS8) 10Slyngshede: C:IDM Add timers and background workers. [puppet] - 10https://gerrit.wikimedia.org/r/886039
[13:29:13] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1020 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
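To see which proxies are still alerting at a given point in a log like this one, the PROBLEM and RECOVERY lines can be replayed in order. The following is a minimal sketch, assuming the icinga-wm message wording seen in this channel; the regular expression and the `failing_proxies` helper are illustrative and not part of any Wikimedia tooling.

```python
import re

# Matches the haproxy failover alerts as they appear in this log.
ALERT_RE = re.compile(
    r"(?P<kind>PROBLEM|RECOVERY) - haproxy failover on (?P<host>\S+) is "
    r"(?P<state>\w+): .*servers up (?P<up>\d+) down (?P<down>\d+)"
)


def failing_proxies(lines):
    """Replay alert lines in order; return {host: servers_down} for unrecovered hosts."""
    failing = {}
    for line in lines:
        match = ALERT_RE.search(line)
        if not match:
            continue  # not a haproxy failover alert
        if match["kind"] == "PROBLEM":
            failing[match["host"]] = int(match["down"])
        else:
            failing.pop(match["host"], None)
    return failing


# Three alert lines copied from the log (URLs trimmed).
SAMPLE = [
    "[12:32:50] <icinga-wm> PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 1:",
    "[12:33:10] <icinga-wm> PROBLEM - haproxy failover on dbproxy1020 is CRITICAL: CRITICAL check_failover servers up 1 down 1:",
    "[13:03:52] <icinga-wm> RECOVERY - haproxy failover on dbproxy1016 is OK: OK check_failover servers up 2 down 0:",
]

if __name__ == "__main__":
    print(failing_proxies(SAMPLE))
```

Replaying only the sample lines leaves dbproxy1020 outstanding; adding its 13:29:13 recovery line empties the result.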
[13:32:02] <wikibugs>	 (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/884887 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche)
[13:34:41] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1076.eqiad.wmnet,service=cdn
[13:34:41] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1076.eqiad.wmnet,service=ats-be
[13:35:01] <wikibugs>	 10SRE, 10Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh)
[13:38:47] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Move webproxy in ulsfo to install4002 [dns] - 10https://gerrit.wikimedia.org/r/886031 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff)
[13:42:31] <wikibugs>	 (03PS9) 10Slyngshede: C:IDM Add timers and background workers. [puppet] - 10https://gerrit.wikimedia.org/r/886039
[13:42:33] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Point DHCP server in ulsfo to install4002 [homer/public] - 10https://gerrit.wikimedia.org/r/885806 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff)
[13:45:53] <wikibugs>	 (03PS3) 10Ilias Sarantopoulos: httpbb: add tests for liftwing (prod/staging) [puppet] - 10https://gerrit.wikimedia.org/r/885990 (https://phabricator.wikimedia.org/T327787)
[13:50:24] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] jenkins: add hieradata config for Scap3-based deployments [puppet] - 10https://gerrit.wikimedia.org/r/883913 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche)
[13:55:32] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Update DHCP config in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/886037 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff)
[13:59:18] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10Jhancock.wm)
[13:59:35] <wikibugs>	 (03PS10) 10Slyngshede: C:IDM Add timers and background workers. [puppet] - 10https://gerrit.wikimedia.org/r/886039
[14:00:04] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230202T1400)
[14:00:04] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230202T1400).
[14:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[14:01:05] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39370/console" [puppet] - 10https://gerrit.wikimedia.org/r/886039 (owner: 10Slyngshede)
[14:01:55] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39371/console" [puppet] - 10https://gerrit.wikimedia.org/r/885990 (https://phabricator.wikimedia.org/T327787) (owner: 10Ilias Sarantopoulos)
[14:02:40] <wikibugs>	 (03CR) 10Elukey: [V: 03+1 C: 03+2] httpbb: add tests for liftwing (prod/staging) [puppet] - 10https://gerrit.wikimedia.org/r/885990 (https://phabricator.wikimedia.org/T327787) (owner: 10Ilias Sarantopoulos)
[14:02:42] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove installserver role from install4001 [puppet] - 10https://gerrit.wikimedia.org/r/886049 (https://phabricator.wikimedia.org/T327867)
[14:06:01] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove installserver role from install4001 [puppet] - 10https://gerrit.wikimedia.org/r/886049 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff)
[14:08:22] <wikibugs>	 (03PS1) 10Elukey: profile::httpbb: fix liftwing paths [puppet] - 10https://gerrit.wikimedia.org/r/886050
[14:10:35] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39372/console" [puppet] - 10https://gerrit.wikimedia.org/r/886050 (owner: 10Elukey)
[14:10:40] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] jenkins: use Scap3 deployment for releases instances [puppet] - 10https://gerrit.wikimedia.org/r/884887 (https://phabricator.wikimedia.org/T323909) (owner: 10Jaime Nuche)
[14:10:47] <wikibugs>	 (03CR) 10Herron: [C: 03+1] rsyslog: Add centrallog1002 as eqiad TLS rsyslog destination [puppet] - 10https://gerrit.wikimedia.org/r/882761 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse)
[14:11:20] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C: 03+1] "my bad :)" [puppet] - 10https://gerrit.wikimedia.org/r/886050 (owner: 10Elukey)
[14:12:12] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate the install servers to Bullseye - https://phabricator.wikimedia.org/T327867 (10MoritzMuehlenhoff)
[14:13:02] <wikibugs>	 (03CR) 10Elukey: [V: 03+1 C: 03+2] profile::httpbb: fix liftwing paths [puppet] - 10https://gerrit.wikimedia.org/r/886050 (owner: 10Elukey)
[14:13:57] <wikibugs>	 (03CR) 10Volans: "I did a very quick first pass" [cookbooks] - 10https://gerrit.wikimedia.org/r/886038 (owner: 10Giuseppe Lavagetto)
[14:15:34] <wikibugs>	 (03PS1) 10Muehlenhoff: Point DHCP server in eqsin to install5002 [homer/public] - 10https://gerrit.wikimedia.org/r/886053 (https://phabricator.wikimedia.org/T327867)
[14:21:44] <wikibugs>	 (03CR) 10Filippo Giunchedi: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/881839 (https://phabricator.wikimedia.org/T320702) (owner: 10Filippo Giunchedi)
[14:24:50] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host aqs2005.codfw.wmnet
[14:25:42] <moritzm>	 !log installing containerd security updates on codfw k8s nodes
[14:25:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:28:42] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10Papaul) a:03Jhancock.wm
[14:29:13] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [staging] START helmfile.d/services/zotero: apply
[14:31:12] <wikibugs>	 10SRE, 10Discovery-Search: Revise elastic/open search and its /run + tmpfiles creation - https://phabricator.wikimedia.org/T328674 (10fgiunchedi)
[14:31:58] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aqs2005.codfw.wmnet
[14:34:33] <wikibugs>	 10SRE, 10Discovery-Search: Revise elastic/open search and its /run + tmpfiles creation - https://phabricator.wikimedia.org/T328674 (10fgiunchedi)
[14:37:35] <wikibugs>	 10SRE, 10DNS, 10Infrastructure-Foundations, 10Mail, and 3 others: Add SPF records for gitlab.wikimedia.org - https://phabricator.wikimedia.org/T328642 (10eoghan) p:05Triage→03Medium a:03eoghan
[14:37:43] <wikibugs>	 (03PS1) 10Filippo Giunchedi: elasticsearch: move to /run [puppet] - 10https://gerrit.wikimedia.org/r/886055 (https://phabricator.wikimedia.org/T328674)
[14:39:16] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [staging] DONE helmfile.d/services/zotero: apply
[14:42:05] <wikibugs>	 (03PS1) 10Filippo Giunchedi: elasticsearch: service depends on tmpfile [puppet] - 10https://gerrit.wikimedia.org/r/886059 (https://phabricator.wikimedia.org/T328674)
[14:42:23] <wikibugs>	 (03PS34) 10Vgutierrez: Varnish analytics: support differential privacy [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson)
[14:43:41] <wikibugs>	 (03PS1) 10Volans: CHANGELOG: add changelogs for release v1.2.1 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/886061
[14:43:52] <wikibugs>	 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 2 others: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544 (10cmooney) >>! In T316544#8575464, @Andrew wrote: > We have a ton of rebalancing to do for each of these switches....
[14:45:43] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [staging] START helmfile.d/services/zotero: apply
[14:47:46] <wikibugs>	 (03CR) 10Phedenskog: [C: 04-1] "Hi Filippo, I don't have the privileges to abandon the patch, when you have time could you please do it for me? This is something we will " [puppet] - 10https://gerrit.wikimedia.org/r/633202 (https://phabricator.wikimedia.org/T262962) (owner: 10Dave Pifke)
[14:49:27] <claime>	 mvolz: Your container goes into CrashLoopBackOff
[14:49:33] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Hi Peter, for sure! Easy enough" [puppet] - 10https://gerrit.wikimedia.org/r/633202 (https://phabricator.wikimedia.org/T262962) (owner: 10Dave Pifke)
[14:49:37] <wikibugs>	 (03PS35) 10Vgutierrez: varnish: support differential privacy [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson)
[14:49:39] <wikibugs>	 (03Abandoned) 10Filippo Giunchedi: [WIP] Start puppetizing WebPageTest [puppet] - 10https://gerrit.wikimedia.org/r/633202 (https://phabricator.wikimedia.org/T262962) (owner: 10Dave Pifke)
[14:50:16] <claime>	 mvolz: https://phabricator.wikimedia.org/P43581
[14:51:19] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39374/console" [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson)
[14:51:51] <wikibugs>	 (03CR) 10Vgutierrez: "varnish tests are happy as well:" [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson)
[14:54:27] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: profile::httpbb: fix liftwing hosts [puppet] - 10https://gerrit.wikimedia.org/r/886063 (https://phabricator.wikimedia.org/T327787)
[14:54:37] <wikibugs>	 (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v1.2.1 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/886061 (owner: 10Volans)
[14:55:04] <mvolz>	 claime: thanks!
[14:55:27] <claime>	 mvolz: np :)
[14:55:46] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [staging] DONE helmfile.d/services/zotero: apply
[14:59:13] <wikibugs>	 (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v1.2.1 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/886061 (owner: 10Volans)
[14:59:26] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host aqs2006.codfw.wmnet
[14:59:59] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[15:00:55] <vgutierrez>	 !log rolling restart of varnish in cache::text - T315676
[15:00:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:00:59] <stashbot>	 T315676: Add DP cookie for pageview filtering - https://phabricator.wikimedia.org/T315676
[15:01:52] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] profile::httpbb: fix liftwing hosts [puppet] - 10https://gerrit.wikimedia.org/r/886063 (https://phabricator.wikimedia.org/T327787) (owner: 10Ilias Sarantopoulos)
[15:02:23] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast3004 was renamed as ganeti4004 - jmm@cumin2002"
[15:02:42] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Repurpose bast3004 as ganeti node - https://phabricator.wikimedia.org/T325361 (10MoritzMuehlenhoff)
[15:03:22] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast3004 was renamed as ganeti4004 - jmm@cumin2002"
[15:03:22] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:03:54] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/886010 (owner: 10Slyngshede)
[15:05:46] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] Point DHCP server in eqsin to install5002 [homer/public] - 10https://gerrit.wikimedia.org/r/886053 (https://phabricator.wikimedia.org/T327867) (owner: 10Muehlenhoff)
[15:06:08] <wikibugs>	 (03PS1) 10Milimetric: Bump up mediawiki_history_snapshot to 2023-01 [puppet] - 10https://gerrit.wikimedia.org/r/886065
[15:06:56] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aqs2006.codfw.wmnet
[15:07:02] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10MoritzMuehlenhoff) >>! In T321309#8581111, @ssingh wrote: > Steps to follow for manual upgrade of the iDRAC firmwares for the cp hosts in eqiad for us and in case someone else stumbles on th...
[15:12:13] <wikibugs>	 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T321719 (10Jclark-ctr) Looks like all of those are the second connection for these servers     Racking ticket T313983 cloudvirt1054 E4 U29 Port 36/37 Cableid 20220045 / 20220041 cloudvirt1055 E4 U30 Port 38/39 Cableid 20220...
[15:17:22] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti3004
[15:17:38] <wikibugs>	 (03Abandoned) 10Milimetric: Bump up mediawiki_history_snapshot to 2023-01 [puppet] - 10https://gerrit.wikimedia.org/r/886065 (owner: 10Milimetric)
[15:20:22] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) >>! In T321309#8582443, @MoritzMuehlenhoff wrote: >>>! In T321309#8581111, @ssingh wrote: >> Steps to follow for manual upgrade of the iDRAC firmwares for the cp hosts in eqiad for u...
[15:21:19] <wikibugs>	 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T321719 (10Papaul) @Jclark-ctr thank you I will fix it in Netbox
[15:24:39] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host ganeti3004
[15:25:08] <wikibugs>	 (03PS1) 10Ssingh: Release 3.8.0-1~wmf2 [debs/gdnsd] - 10https://gerrit.wikimedia.org/r/886068 (https://phabricator.wikimedia.org/T321309)
[15:27:33] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: RAID controller battery for an-worker1087.eqiad.wmnet - https://phabricator.wikimedia.org/T328119 (10Jclark-ctr) RAID controller battery for an-worker1087 Replaced @BTullis
[15:30:08] <icinga-wm>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 46 probes of 794 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[15:30:09] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1] "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/886069 (https://phabricator.wikimedia.org/T327663) (owner: 10Clément Goubert)
[15:30:33] <wikibugs>	 (03PS2) 10Bking: [WIP] wdqs: switch to using NFS for dump files [cookbooks] - 10https://gerrit.wikimedia.org/r/868465 (owner: 10Ryan Kemper)
[15:30:46] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] [WIP] wdqs: switch to using NFS for dump files [cookbooks] - 10https://gerrit.wikimedia.org/r/868465 (owner: 10Ryan Kemper)
[15:33:15] <wikibugs>	 (03PS1) 10Volans: Upstream release v1.2.1 [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/886074
[15:33:32] <logmsgbot>	 !log jnuche@deploy1002 Started deploy [releng/jenkins-deploy@e38efa6] (releasing): (no justification provided)
[15:34:44] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab Replica gitlab2002 to 15.7.6-ce.0
[15:35:29] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Security Release
[15:35:47] <logmsgbot>	 !log aokoth@cumin1001 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab2002.wikimedia.org with reason: Security Release
[15:35:54] <icinga-wm>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 4 probes of 794 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[15:37:52] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Security Release
[15:37:54] <wikibugs>	 (03CR) 10Volans: [C: 03+2] Upstream release v1.2.1 [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/886074 (owner: 10Volans)
[15:38:10] <logmsgbot>	 !log aokoth@cumin1001 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab2002.wikimedia.org with reason: Security Release
[15:40:34] <logmsgbot>	 !log jnuche@deploy1002 Finished deploy [releng/jenkins-deploy@e38efa6] (releasing): (no justification provided) (duration: 07m 01s)
[15:41:58] <wikibugs>	 (03Merged) 10jenkins-bot: Upstream release v1.2.1 [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/886074 (owner: 10Volans)
[15:42:03] <wikibugs>	 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T327373 (10Jclark-ctr) Rebalanced pdu ports will monitor for a little bit before closing ticket
[15:43:44] <wikibugs>	 (03PS1) 10Mvolz: Revert "Update zotero to 2023-02-01-144124-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/885945
[15:43:59] <wikibugs>	 (03CR) 10Mvolz: [C: 03+2] Revert "Update zotero to 2023-02-01-144124-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/885945 (owner: 10Mvolz)
[15:47:25] <wikibugs>	 (03PS11) 10Slyngshede: C:IDM Add timers and background workers. [puppet] - 10https://gerrit.wikimedia.org/r/886039
[15:48:48] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Update zotero to 2023-02-01-144124-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/885945 (owner: 10Mvolz)
[15:52:41] <wikibugs>	 (03PS1) 10Muehlenhoff: Add bookworm to pbuilder setup [puppet] - 10https://gerrit.wikimedia.org/r/886078
[15:53:35] <wikibugs>	 (03PS1) 10Hnowlan: Revert "changeprop: remove remaining blocklist entries" [deployment-charts] - 10https://gerrit.wikimedia.org/r/886086
[15:53:43] <wikibugs>	 (03PS2) 10Hnowlan: Revert "changeprop: remove remaining blocklist entries" [deployment-charts] - 10https://gerrit.wikimedia.org/r/886086
[15:54:49] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/886078 (owner: 10Muehlenhoff)
[15:55:12] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add bookworm to pbuilder setup [puppet] - 10https://gerrit.wikimedia.org/r/886078 (owner: 10Muehlenhoff)
[15:59:04] <jinxer-wm>	 (ProbeDown) firing: Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:59:46] <mvolz>	 claime: I'm an idiot -> https://gerrit.wikimedia.org/r/c/mediawiki/services/zotero/+/886080/1/config/production.json thanks for your help again
[16:00:03] <mvolz>	 jouncebot: now
[16:00:03] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 59 minute(s)
[16:00:31] <wikibugs>	 (03CR) 10Ssingh: "recheck" [debs/gdnsd] - 10https://gerrit.wikimedia.org/r/886068 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh)
[16:00:43] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Release 3.8.0-1~wmf2 [debs/gdnsd] - 10https://gerrit.wikimedia.org/r/886068 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh)
[16:01:11] <claime>	 mvolz: I have a list of 10 "fix json" commits in a rsyslog config somewhere, so we're on the same idiot-level then :p
[16:01:21] <wikibugs>	 (03PS12) 10Slyngshede: C:IDM Add timers and background workers. [puppet] - 10https://gerrit.wikimedia.org/r/886039
[16:01:29] <mvolz>	 hehe
[16:02:54] <wikibugs>	 (03PS2) 10Ssingh: Release 3.8.0-1~wmf2 [debs/gdnsd] - 10https://gerrit.wikimedia.org/r/886068 (https://phabricator.wikimedia.org/T321309)
[16:03:02] <claime>	 mvolz: I've added a quick troubleshooting 101 to the kubernetes docs https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting
[16:03:13] <claime>	 It should be helpful in the future
[16:03:59] <wikibugs>	 (03PS13) 10Slyngshede: C:IDM Add timers and background workers. [puppet] - 10https://gerrit.wikimedia.org/r/886039
[16:04:04] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:04:27] <mvolz>	 🎉
[16:06:43] <wikibugs>	 (03CR) 10Ssingh: "Ready for review but please note: I modified the gbp.conf as in Debian proper, to better suit our environment, so please check!" [debs/gdnsd] - 10https://gerrit.wikimedia.org/r/886068 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh)
[16:07:09] <wikibugs>	 (03PS1) 10Mvolz: Update zotero to 2023-02-02-155709-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/886083
[16:07:52] <wikibugs>	 (03CR) 10Mvolz: [C: 03+2] Update zotero to 2023-02-02-155709-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/886083 (owner: 10Mvolz)
[16:10:32] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab Replica gitlab2002 to 15.7.6-ce.0
[16:10:34] <volans>	 !log uploaded python3-wmflib_1.2.1 to apt.wikimedia.org buster-wikimedia,bullseye-wikimedia
[16:10:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:13:22] <wikibugs>	 (03Merged) 10jenkins-bot: Update zotero to 2023-02-02-155709-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/886083 (owner: 10Mvolz)
[16:15:36] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/zotero: apply
[16:16:11] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/zotero: apply
[16:16:46] <logmsgbot>	 !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/zotero: apply
[16:17:26] <logmsgbot>	 !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/zotero: apply
[16:17:34] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host aqs2007.codfw.wmnet
[16:17:46] <logmsgbot>	 !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/zotero: apply
[16:18:27] <logmsgbot>	 !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/zotero: apply
[16:25:21] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aqs2007.codfw.wmnet
[16:29:06] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Repurpose bast3004 as ganeti node - https://phabricator.wikimedia.org/T325361 (10MoritzMuehlenhoff)
[16:32:22] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] Revert "changeprop: remove remaining blocklist entries" [deployment-charts] - 10https://gerrit.wikimedia.org/r/886086 (owner: 10Hnowlan)
[16:35:46] <wikibugs>	 (03PS3) 10Hnowlan: Revert "changeprop: remove remaining blocklist entries" [deployment-charts] - 10https://gerrit.wikimedia.org/r/886086
[16:36:41] <wikibugs>	 (03PS2) 10DCausse: [WIP] rdf-streaming-updater: add a test job using the k8s operator... [deployment-charts] - 10https://gerrit.wikimedia.org/r/886005 (https://phabricator.wikimedia.org/T328675)
[16:39:53] <wikibugs>	 10SRE, 10API Platform, 10GrowthExperiments-ImpactModule, 10Growth-Team (Current Sprint), 10MW-1.40-notes (1.40.0-wmf.21; 2023-01-30): UserImpact: Fetch information for more articles when calculating most-viewed-articles data point - https://phabricator.wikimedia.org/T324675 (10Aklapper) Would it be worth...
[16:42:13] <wikibugs>	 (03PS1) 10Mvolz: Update zotero to node 14 [deployment-charts] - 10https://gerrit.wikimedia.org/r/886107
[16:42:42] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Revert "changeprop: remove remaining blocklist entries" [deployment-charts] - 10https://gerrit.wikimedia.org/r/886086 (owner: 10Hnowlan)
[16:43:04] <mvolz>	 jouncebot: now
[16:43:04] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 16 minute(s)
[16:43:59] <wikibugs>	 (03CR) 10Mvolz: [C: 03+2] Update zotero to node 14 [deployment-charts] - 10https://gerrit.wikimedia.org/r/886107 (owner: 10Mvolz)
[16:46:45] <logmsgbot>	 !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop: sync
[16:47:16] <logmsgbot>	 !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync
[16:48:11] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/zotero: apply
[16:48:13] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/zotero: apply
[16:49:40] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/zotero: apply
[16:50:07] <logmsgbot>	 !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/zotero: apply
[16:50:17] <logmsgbot>	 !log dancy@deploy1002 Installing scap version "4.34.0" for 561 hosts
[16:50:39] <logmsgbot>	 !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/zotero: apply
[16:50:45] <logmsgbot>	 !log dancy@deploy1002 Installation of scap version "4.34.0" completed for 561 hosts
[16:51:12] <wikibugs>	 (03PS1) 10Cwhite: profile: pass haproxy silent-drop logs [puppet] - 10https://gerrit.wikimedia.org/r/885477
[16:51:28] <logmsgbot>	 !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/zotero: apply
[16:52:11] <logmsgbot>	 !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/zotero: apply
[16:53:46] <logmsgbot>	 !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/zotero: apply
[16:55:20] <wikibugs>	 (03CR) 10Ottomata: flink-app: add preliminary H/A support (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/885832 (owner: 10DCausse)
[17:00:04] <jouncebot>	 jbond and rzl: OwO what's this, a deployment window?? Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230202T1700). nyaa~
[17:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[17:08:31] <wikibugs>	 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10RZamora-WMF) a:05Trizek-WMF→03None
[17:08:37] <wikibugs>	 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10RZamora-WMF) a:03Trizek-WMF
[17:12:09] <logmsgbot>	 !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: sync
[17:12:40] <logmsgbot>	 !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync
[17:13:57] <wikibugs>	 (03CR) 10Ottomata: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene)
[17:18:49] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [WIP] Add sre.discovery.datacenter-route (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/886038 (owner: 10Giuseppe Lavagetto)
[17:19:04] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: [WIP] Add sre.discovery.datacenter-route [cookbooks] - 10https://gerrit.wikimedia.org/r/886038
[17:20:45] <wikibugs>	 (03CR) 10BBlack: [C: 03+1] Release 3.8.0-1~wmf2 [debs/gdnsd] - 10https://gerrit.wikimedia.org/r/886068 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh)
[17:23:42] <wikibugs>	 (03CR) 10Ottomata: [WIP] rdf-streaming-updater: add a test job using the k8s operator... (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/886005 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse)
[17:29:34] <wikibugs>	 (03PS1) 10Nray: Enable client preferences everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886118 (https://phabricator.wikimedia.org/T327979)
[17:29:58] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab Production (gitlab1004) to 15.7.6-ce.0
[17:30:54] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: [WIP] Add sre.discovery.datacenter-route [cookbooks] - 10https://gerrit.wikimedia.org/r/886038
[17:31:39] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: Add sre.discovery.datacenter-route [cookbooks] - 10https://gerrit.wikimedia.org/r/886038
[17:32:19] <wikibugs>	 (03PS1) 10Phuedx: Revert "Request high-entropy Sec-CH-UA* client hints" [puppet] - 10https://gerrit.wikimedia.org/r/886119
[17:32:45] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc1037.eqiad.wmnet with OS bullseye
[17:33:12] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc2043.codfw.wmnet with OS bullseye
[17:34:26] <icinga-wm>	 PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:45:05] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1037.eqiad.wmnet with reason: host reimage
[17:47:41] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1037.eqiad.wmnet with reason: host reimage
[17:49:22] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2043.codfw.wmnet with reason: host reimage
[17:51:46] <wikibugs>	 (03CR) 10DCausse: [WIP] rdf-streaming-updater: add a test job using the k8s operator... (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/886005 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse)
[17:52:02] <wikibugs>	 (03PS3) 10DCausse: [WIP] rdf-streaming-updater: add a test job using the k8s operator... [deployment-charts] - 10https://gerrit.wikimedia.org/r/886005 (https://phabricator.wikimedia.org/T328675)
[17:52:16] <wikibugs>	 (03PS1) 10BryanDavis: developer-portal: Bump container to 2023-01-30-121726-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/886122
[17:52:29] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2043.codfw.wmnet with reason: host reimage
[17:53:15] <wikibugs>	 (03PS4) 10DCausse: [WIP] rdf-streaming-updater: add a test job using the k8s operator... [deployment-charts] - 10https://gerrit.wikimedia.org/r/886005 (https://phabricator.wikimedia.org/T328675)
[17:59:02] <wikibugs>	 (03CR) 10BryanDavis: [C: 03+2] developer-portal: Bump container to 2023-01-30-121726-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/886122 (owner: 10BryanDavis)
[18:00:04] <jouncebot>	 bd808: Time to snap out of that daydream and deploy Technical Engagement weekly deploy (Toolhub, Developer portal, Striker). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230202T1800).
[18:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230202T1800)
[18:02:57] * bd808 twiddles thumbs waiting on the merge
[18:03:20] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1037.eqiad.wmnet with OS bullseye
[18:04:09] <wikibugs>	 (03Merged) 10jenkins-bot: developer-portal: Bump container to 2023-01-30-121726-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/886122 (owner: 10BryanDavis)
[18:05:25] <logmsgbot>	 !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/developer-portal: apply
[18:05:44] <logmsgbot>	 !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply
[18:06:14] <wikibugs>	 (03CR) 10Ottomata: [WIP] rdf-streaming-updater: add a test job using the k8s operator... (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/886005 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse)
[18:06:22] <logmsgbot>	 !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/developer-portal: apply
[18:07:00] <logmsgbot>	 !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply
[18:08:24] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2043.codfw.wmnet with OS bullseye
[18:08:24] <logmsgbot>	 !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply
[18:08:54] <logmsgbot>	 !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply
[18:08:58] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab Production (gitlab1004) to 15.7.6-ce.0
[18:16:16] <mutante>	 looking good, arnoldokoth:)
[18:19:34] <arnoldokoth>	 Yeah. :D
[18:21:11] <wikibugs>	 (03CR) 10Stevemunene: [V: 03+1] Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene)
[18:26:12] <wikibugs>	 (03PS1) 10Zabe: Stop writing to cuc_comment in group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886127 (https://phabricator.wikimedia.org/T233004)
[18:26:40] <wikibugs>	 (03PS14) 10Slyngshede: C:IDM Add timers and background workers. [puppet] - 10https://gerrit.wikimedia.org/r/886039
[18:27:44] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39379/console" [puppet] - 10https://gerrit.wikimedia.org/r/886039 (owner: 10Slyngshede)
[18:28:33] <wikibugs>	 (03PS1) 10Brennen Bearnes: gitlab shared runners: add dependabot-gitlab [puppet] - 10https://gerrit.wikimedia.org/r/886128 (https://phabricator.wikimedia.org/T326507)
[18:33:12] <wikibugs>	 (03CR) 10Zabe: [C: 03+2] Stop writing to cuc_comment in group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886127 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe)
[18:33:23] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by zabe@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886127 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe)
[18:34:02] <wikibugs>	 (03Merged) 10jenkins-bot: Stop writing to cuc_comment in group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886127 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe)
[18:34:26] <logmsgbot>	 !log zabe@deploy1002 Started scap: Backport for [[gerrit:886127|Stop writing to cuc_comment in group1 wikis (T233004)]]
[18:34:29] <stashbot>	 T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004
[18:36:23] <logmsgbot>	 !log zabe@deploy1002 zabe: Backport for [[gerrit:886127|Stop writing to cuc_comment in group1 wikis (T233004)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet
[18:42:45] <logmsgbot>	 !log zabe@deploy1002 Finished scap: Backport for [[gerrit:886127|Stop writing to cuc_comment in group1 wikis (T233004)]] (duration: 08m 19s)
[18:42:48] <stashbot>	 T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004
[18:49:59] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/881839 (https://phabricator.wikimedia.org/T320702) (owner: 10Filippo Giunchedi)
[18:51:54] <wikibugs>	 (03PS1) 10Zabe: Stop writing to cuc_user and cuc_user_text everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886135 (https://phabricator.wikimedia.org/T233004)
[19:00:05] <jouncebot>	 dancy and brennen: Your horoscope predicts another unfortunate MediaWiki train - Utc-7 Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230202T1900).
[19:00:10] <brennen>	 o/
[19:01:26] <wikibugs>	 (03PS1) 10EoghanGaffney: Add spf record for gitlab.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/886137 (https://phabricator.wikimedia.org/T328642)
[19:02:10] <wikibugs>	 (03PS2) 10EoghanGaffney: Add spf record for gitlab.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/886137 (https://phabricator.wikimedia.org/T328642)
[19:02:15] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add spf record for gitlab.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/886137 (https://phabricator.wikimedia.org/T328642) (owner: 10EoghanGaffney)
[19:02:27] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+2] rsyslog: Add centrallog1002 as eqiad TLS rsyslog destination [puppet] - 10https://gerrit.wikimedia.org/r/882761 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse)
[19:02:59] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add spf record for gitlab.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/886137 (https://phabricator.wikimedia.org/T328642) (owner: 10EoghanGaffney)
[19:05:33] <wikibugs>	 (03PS3) 10EoghanGaffney: Add spf record for gitlab.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/886137 (https://phabricator.wikimedia.org/T328642)
[19:08:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[19:10:01] <wikibugs>	 (03PS4) 10EoghanGaffney: Add spf record for gitlab.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/886137 (https://phabricator.wikimedia.org/T328642)
[19:13:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[19:19:20] <wikibugs>	 (03PS2) 10Brennen Bearnes: gitlab runners: add dependabot-gitlab & elasticsearch to allowed_images [puppet] - 10https://gerrit.wikimedia.org/r/886128 (https://phabricator.wikimedia.org/T326507)
[19:21:32] <dancy>	 o/ Sorry got distracted. Rolling the train!
[19:21:56] <wikibugs>	 (03PS1) 10TrainBranchBot: group2 wikis to 1.40.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886142 (https://phabricator.wikimedia.org/T325584)
[19:21:58] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.40.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886142 (https://phabricator.wikimedia.org/T325584) (owner: 10TrainBranchBot)
[19:22:41] <wikibugs>	 (03Merged) 10jenkins-bot: group2 wikis to 1.40.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886142 (https://phabricator.wikimedia.org/T325584) (owner: 10TrainBranchBot)
[19:24:03] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+1] "Looks good. ty for the clear explanation!" [puppet] - 10https://gerrit.wikimedia.org/r/886055 (https://phabricator.wikimedia.org/T328674) (owner: 10Filippo Giunchedi)
[19:27:33] <wikibugs>	 (03PS2) 10Jcrespo: Add unit tests & coverage report [software/mediabackups] - 10https://gerrit.wikimedia.org/r/885428
[19:28:29] <logmsgbot>	 !log zabe@deploy1002 say aborted:  (duration: 00m 03s)
[19:30:26] <logmsgbot>	 !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.40.0-wmf.21  refs T325584
[19:30:29] <stashbot>	 T325584: 1.40.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T325584
[19:48:28] <wikibugs>	 (03PS2) 10Ryan Kemper: elasticsearch: service depends on tmpfile [puppet] - 10https://gerrit.wikimedia.org/r/886059 (https://phabricator.wikimedia.org/T328674) (owner: 10Filippo Giunchedi)
[19:48:36] <wikibugs>	 (03PS3) 10Ryan Kemper: elasticsearch: service depends on tmpfile [puppet] - 10https://gerrit.wikimedia.org/r/886059 (https://phabricator.wikimedia.org/T328674) (owner: 10Filippo Giunchedi)
[19:49:07] <wikibugs>	 (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/886059 (https://phabricator.wikimedia.org/T328674) (owner: 10Filippo Giunchedi)
[19:49:21] <wikibugs>	 (03CR) 10Bking: [C: 03+1] elasticsearch: move to /run [puppet] - 10https://gerrit.wikimedia.org/r/886055 (https://phabricator.wikimedia.org/T328674) (owner: 10Filippo Giunchedi)
[19:49:39] <wikibugs>	 (03CR) 10Bking: [C: 03+1] elasticsearch: service depends on tmpfile [puppet] - 10https://gerrit.wikimedia.org/r/886059 (https://phabricator.wikimedia.org/T328674) (owner: 10Filippo Giunchedi)
[19:49:54] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+2] elasticsearch: move to /run [puppet] - 10https://gerrit.wikimedia.org/r/886055 (https://phabricator.wikimedia.org/T328674) (owner: 10Filippo Giunchedi)
[19:52:14] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic-Icebox, 10netbox: Make Netbox Active/Active - https://phabricator.wikimedia.org/T234997 (10BCornwall)
[19:54:05] <ryankemper>	 !log T328674 [Elastic] With puppet disabled on elastic* fleet, `ryankemper@elastic2037:~$ sudo run-puppet-agent --force` to verify changes in https://gerrit.wikimedia.org/r/886055
[19:54:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:54:08] <stashbot>	 T328674: Revise elastic/open search and its /run + tmpfiles creation - https://phabricator.wikimedia.org/T328674
[19:55:26] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10User-Elukey: Investigate janitor, maintenance emails parser - https://phabricator.wikimedia.org/T230835 (10ayounsi) New cool tool on the block: https://github.com/jasonyates/netbox-circuitmaintenance
[19:55:30] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reboot-single for host elastic2037.codfw.wmnet
[19:59:29] <zabe>	 dancy, is it ok if I deploy a config patch?
[19:59:37] <dancy>	 OK w/ me
[20:00:01] <zabe>	 thanks :)
[20:00:05] <wikibugs>	 (03CR) 10Zabe: [C: 03+2] Stop writing to cuc_user and cuc_user_text everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886135 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe)
[20:01:13] <wikibugs>	 (03Merged) 10jenkins-bot: Stop writing to cuc_user and cuc_user_text everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886135 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe)
[20:01:29] <logmsgbot>	 !log zabe@deploy1002 Started scap: Backport for [[gerrit:886135|Stop writing to cuc_user and cuc_user_text everywhere (T233004)]]
[20:01:44] <stashbot>	 T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004
[20:02:09] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+1] "PCC looks reasonable" [puppet] - 10https://gerrit.wikimedia.org/r/886059 (https://phabricator.wikimedia.org/T328674) (owner: 10Filippo Giunchedi)
[20:02:25] <wikibugs>	 (03PS4) 10Ryan Kemper: elasticsearch: service depends on tmpfile [puppet] - 10https://gerrit.wikimedia.org/r/886059 (https://phabricator.wikimedia.org/T328674) (owner: 10Filippo Giunchedi)
[20:02:37] <wikibugs>	 (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] elasticsearch: service depends on tmpfile [puppet] - 10https://gerrit.wikimedia.org/r/886059 (https://phabricator.wikimedia.org/T328674) (owner: 10Filippo Giunchedi)
[20:02:59] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host elastic2037.codfw.wmnet
[20:03:18] <logmsgbot>	 !log zabe@deploy1002 zabe: Backport for [[gerrit:886135|Stop writing to cuc_user and cuc_user_text everywhere (T233004)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
[20:04:15] <wikibugs>	 (03PS1) 10RLazarus: Release v0.0.3. [software/httpbb] - 10https://gerrit.wikimedia.org/r/886148 (https://phabricator.wikimedia.org/T328280)
[20:06:18] <wikibugs>	 (03CR) 10RLazarus: [C: 03+2] Release v0.0.3. [software/httpbb] - 10https://gerrit.wikimedia.org/r/886148 (https://phabricator.wikimedia.org/T328280) (owner: 10RLazarus)
[20:11:09] <logmsgbot>	 !log zabe@deploy1002 Finished scap: Backport for [[gerrit:886135|Stop writing to cuc_user and cuc_user_text everywhere (T233004)]] (duration: 09m 39s)
[20:11:11] <wikibugs>	 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T321719 (10Papaul) 05Open→03Resolved Disable all the second interfaces after talking with @Andrew on IRC ` papaul: sorry, was in a meeting. We are trying to transition to a single-NIC                        connection...
[20:11:12] <stashbot>	 T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004
[20:12:40] <wikibugs>	 (03PS1) 10Zabe: Stop writing to cuc_comment everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/886149 (https://phabricator.wikimedia.org/T233004)
[20:21:03] <rzl>	 !log rzl@apt1001:~$ sudo -i reprepro -C main include buster-wikimedia ${HOME}/httpbb/buster/httpbb_${VERSION?}-1_amd64.changes  # T328280
[20:21:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:21:06] <stashbot>	 T328280: httpbb with HTTP POSTs and json payload - https://phabricator.wikimedia.org/T328280
[20:21:31] <rzl>	 ^ that version should read 0.0.3-1, will edit the SAL
[20:23:03] <rzl>	 !log rzl@apt1001:~$ sudo -i reprepro -C main include bullseye-wikimedia /home/rzl/httpbb/bullseye/httpbb_0.0.3-1+deb11u1_amd64.changes  # T328280
[20:23:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:28:25] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1077.eqiad.wmnet with OS bullseye
[20:28:35] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp1077.eqiad.wmnet with OS bullseye
[20:28:58] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1078.eqiad.wmnet with OS bullseye
[20:29:07] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp1078.eqiad.wmnet with OS bullseye
[20:30:42] <wikibugs>	 SRE, vm-requests: eqiad: 1 VMs requested for airflow on behalf of the Search Platform Team - https://phabricator.wikimedia.org/T328702 (bking)
[20:33:50] <wikibugs>	 SRE-tools, Infrastructure-Foundations, Machine-Learning-Team: httpbb with HTTP POSTs and json payload - https://phabricator.wikimedia.org/T328280 (RLazarus) Open→Resolved This is deployed! Thanks again for the patch, let me know if you need anything else.
[20:49:52] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1077.eqiad.wmnet with reason: host reimage
[20:52:39] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1077.eqiad.wmnet with reason: host reimage
[20:59:06] <wikibugs>	 (PS2) Dreamy Jazz: Disable write old for CheckUserLog reason everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/885359 (https://phabricator.wikimedia.org/T233004)
[20:59:08] <logmsgbot>	 !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1078.eqiad.wmnet with OS bullseye
[20:59:17] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp1078.eqiad.wmnet with OS bullseye executed with errors: - cp1078 (**FAIL**)   - Downtimed on Ic...
[20:59:20] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1078.eqiad.wmnet with OS bullseye
[20:59:29] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp1078.eqiad.wmnet with OS bullseye
[21:00:04] <jouncebot>	 brennen and TheresNoTime: Your horoscope predicts another unfortunate UTC late backport and config training deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230202T2100).
[21:00:04] <jouncebot>	 Dreamy_Jazz and nray: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:08] <Dreamy_Jazz>	 \o
[21:00:12] <nray>	 o/
[21:01:02] <brennen>	 o/
[21:02:57] <wikibugs>	 SRE, Traffic-Icebox, Sustainability (Incident Followup): Investigate varnishd child crashes when multiple nodes get depooled/pooled concurrently - https://phabricator.wikimedia.org/T154801 (BCornwall) @Vgutierrez and @BBlack: Is this still an issue? 6 years is a long time. :)
[21:03:05] <brennen>	 Dreamy_Jazz: starting with yours
[21:03:13] <Dreamy_Jazz>	 Thanks. I can self test.
[21:04:15] <Dreamy_Jazz>	 But will need someone to inspect the DB row for the check I make
[21:04:56] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by brennen@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/885359 (https://phabricator.wikimedia.org/T233004) (owner: Dreamy Jazz)
[21:05:48] <wikibugs>	 (Merged) jenkins-bot: Disable write old for CheckUserLog reason everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/885359 (https://phabricator.wikimedia.org/T233004) (owner: Dreamy Jazz)
[21:06:04] <logmsgbot>	 !log brennen@deploy1002 Started scap: Backport for [[gerrit:885359|Disable write old for CheckUserLog reason everywhere (T233004)]]
[21:06:07] <stashbot>	 T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004
[21:07:48] <logmsgbot>	 !log brennen@deploy1002 brennen and dreamyjazz: Backport for [[gerrit:885359|Disable write old for CheckUserLog reason everywhere (T233004)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet
[21:08:13] <brennen>	 Dreamy_Jazz: on test servers.  i can query, but you'll have to tell me _what_ to query. :)
[21:08:53] <Dreamy_Jazz>	 The SQL query that needs to be run on enwiki when I say is "SELECT cul_reason FROM `cu_log` JOIN `actor` `cu_log_actor` ON ((actor_id = cul_actor)) WHERE actor_name = 'Dreamy Jazz' ORDER BY cul_timestamp DESC LIMIT 1"
[21:09:02] <Dreamy_Jazz>	 But let me check first
[21:09:04] <brennen>	 ack, thx
[21:09:28] <wikibugs>	 SRE, DNS, Traffic-Icebox: DNS domains registered to WMF no longer redirecting - https://phabricator.wikimedia.org/T146619 (BCornwall) Open→Resolved a:BCornwall Thanks for bringing this ticket to our attention, Nick! It's been quite a long time since you brought this up. As the actionable...
[21:10:41] <wikibugs>	 SRE, vm-requests: eqiad: 1 VMs requested for airflow on behalf of the Search Platform Team - https://phabricator.wikimedia.org/T328702 (EBernhardson) This will be replacing an-airflow1001 which is also a ganeti VM, but it may take a month or two after provisioning for the old instance to be shut down.
[21:10:46] <Dreamy_Jazz>	 Okay. Ran a check on mwdebug1001
[21:10:49] <Dreamy_Jazz>	 Please run that query
[21:11:02] <Dreamy_Jazz>	 If everything is as expected cul_reason should be the empty string
[21:12:10] <brennen>	 Dreamy_Jazz: yep, as expected.  proceeding with sync.
[21:12:20] <Dreamy_Jazz>	 Thanks :) Good news.
[21:16:30] <wikibugs>	 (PS2) Brennen Bearnes: Enable client preferences everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/886118 (https://phabricator.wikimedia.org/T327979) (owner: Nray)
[21:16:52] <brennen>	 nray: you're up next, soon as this sync finishes.
[21:17:03] <nray>	 @brennen sounds good!
[21:18:07] <logmsgbot>	 !log brennen@deploy1002 Finished scap: Backport for [[gerrit:885359|Disable write old for CheckUserLog reason everywhere (T233004)]] (duration: 12m 02s)
[21:18:10] <stashbot>	 T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004
[21:18:40] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by brennen@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/886118 (https://phabricator.wikimedia.org/T327979) (owner: Nray)
[21:19:24] <wikibugs>	 (Merged) jenkins-bot: Enable client preferences everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/886118 (https://phabricator.wikimedia.org/T327979) (owner: Nray)
[21:19:38] <logmsgbot>	 !log brennen@deploy1002 Started scap: Backport for [[gerrit:886118|Enable client preferences everywhere (T327979)]]
[21:19:41] <stashbot>	 T327979: Enable persistent fixed width setting for anonymous users - https://phabricator.wikimedia.org/T327979
[21:20:31] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1078.eqiad.wmnet with reason: host reimage
[21:21:23] <logmsgbot>	 !log brennen@deploy1002 brennen and nray: Backport for [[gerrit:886118|Enable client preferences everywhere (T327979)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet
[21:21:52] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1077.eqiad.wmnet with OS bullseye
[21:22:01] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp1077.eqiad.wmnet with OS bullseye completed: - cp1077 (**PASS**)   - Downtimed on Icinga/Alertm...
[21:22:18] <logmsgbot>	 !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=cp1077.eqiad.wmnet
[21:22:38] <nray>	 thank you @brennen , I'm checking now
[21:22:38] <brennen>	 nray: await your test
[21:22:44] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (BCornwall)
[21:22:45] <brennen>	 cool cool
[21:22:49] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1079.eqiad.wmnet with OS bullseye
[21:22:58] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp1079.eqiad.wmnet with OS bullseye
[21:23:58] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1078.eqiad.wmnet with reason: host reimage
[21:24:55] <nray>	 @brennen things look good, you can proceed!
[21:25:01] <brennen>	 ack, going ahead.
[21:30:53] <logmsgbot>	 !log brennen@deploy1002 Finished scap: Backport for [[gerrit:886118|Enable client preferences everywhere (T327979)]] (duration: 11m 14s)
[21:30:56] <stashbot>	 T327979: Enable persistent fixed width setting for anonymous users - https://phabricator.wikimedia.org/T327979
[21:30:59] <brennen>	 !log end of utc late backport & config window
[21:31:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:31:14] <nray>	 thanks for your help @brennen !
[21:31:20] <brennen>	 sure thing
[21:44:20] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1079.eqiad.wmnet with reason: host reimage
[21:47:29] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1079.eqiad.wmnet with reason: host reimage
[21:47:44] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1078.eqiad.wmnet with OS bullseye
[21:47:53] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp1078.eqiad.wmnet with OS bullseye completed: - cp1078 (**WARN**)   - Removed from Puppet and Pu...
[21:49:34] <wikibugs>	 (PS2) Zabe: Stop writing to cuc_comment everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/886149 (https://phabricator.wikimedia.org/T233004)
[21:49:41] <wikibugs>	 (CR) Zabe: [C: +2] Stop writing to cuc_comment everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/886149 (https://phabricator.wikimedia.org/T233004) (owner: Zabe)
[21:49:57] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by zabe@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/886149 (https://phabricator.wikimedia.org/T233004) (owner: Zabe)
[21:50:25] <wikibugs>	 (Merged) jenkins-bot: Stop writing to cuc_comment everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/886149 (https://phabricator.wikimedia.org/T233004) (owner: Zabe)
[21:50:39] <logmsgbot>	 !log zabe@deploy1002 Started scap: Backport for [[gerrit:886149|Stop writing to cuc_comment everywhere (T233004)]]
[21:50:42] <stashbot>	 T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004
[21:52:26] <logmsgbot>	 !log zabe@deploy1002 zabe: Backport for [[gerrit:886149|Stop writing to cuc_comment everywhere (T233004)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[21:58:38] <logmsgbot>	 !log zabe@deploy1002 Finished scap: Backport for [[gerrit:886149|Stop writing to cuc_comment everywhere (T233004)]] (duration: 07m 58s)
[21:58:41] <stashbot>	 T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004
[22:00:51] <logmsgbot>	 !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=cp1078.eqiad.wmnet
[22:01:18] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (BCornwall)
[22:01:48] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp1080.eqiad.wmnet with OS bullseye
[22:01:58] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp1080.eqiad.wmnet with OS bullseye
[22:12:13] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1079.eqiad.wmnet with OS bullseye
[22:12:22] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp1079.eqiad.wmnet with OS bullseye completed: - cp1079 (**PASS**)   - Downtimed on Icinga/Alertm...
[22:15:50] <logmsgbot>	 !log brett@cumin2002 conftool action : set/pooled=yes; selector: name=cp1079.eqiad.wmnet
[22:16:18] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (BCornwall)
[22:58:20] <logmsgbot>	 !log brett@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp1080.eqiad.wmnet with OS bullseye
[22:58:31] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp1080.eqiad.wmnet with OS bullseye executed with errors: - cp1080 (**FAIL**)   - Downtimed on Ic...
[23:09:12] <wikibugs>	 (PS1) Bking: wdqs/data-reload.py: validate dump date (WIP) [cookbooks] - https://gerrit.wikimedia.org/r/886173 (https://phabricator.wikimedia.org/T325114)
[23:10:13] <wikibugs>	 (Abandoned) Bking: [WIP] wdqs: switch to using NFS for dump files [cookbooks] - https://gerrit.wikimedia.org/r/868465 (owner: Ryan Kemper)
[23:11:03] <wikibugs>	 (CR) CI reject: [V: -1] wdqs/data-reload.py: validate dump date (WIP) [cookbooks] - https://gerrit.wikimedia.org/r/886173 (https://phabricator.wikimedia.org/T325114) (owner: Bking)
[23:12:38] <wikibugs>	 (PS2) Bking: wdqs/data-reload.py: validate dump date (WIP) [cookbooks] - https://gerrit.wikimedia.org/r/886173 (https://phabricator.wikimedia.org/T325114)
[23:14:23] <wikibugs>	 (CR) CI reject: [V: -1] wdqs/data-reload.py: validate dump date (WIP) [cookbooks] - https://gerrit.wikimedia.org/r/886173 (https://phabricator.wikimedia.org/T325114) (owner: Bking)
[23:15:52] <wikibugs>	 (CR) Herron: opensearch: reverse-proxy access to opensearch API (3 comments) [puppet] - https://gerrit.wikimedia.org/r/881839 (https://phabricator.wikimedia.org/T320702) (owner: Filippo Giunchedi)
[23:30:45] <icinga-wm>	 RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state