[00:38:59] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/997844
[00:39:05] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/997844 (owner: TrainBranchBot)
[00:51:27] <tzatziki>	 !log removing 21 files for legal compliance
[00:51:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:03:11] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/997844 (owner: TrainBranchBot)
[01:14:39] <wikibugs>	 (PS1) BryanDavis: striker: Bump container version to 2024-02-07-005708-production [puppet] - https://gerrit.wikimedia.org/r/997990
[01:14:56] <zabe>	 jouncebot: nowandnext
[01:14:56] <jouncebot>	 No deployments scheduled for the next 5 hour(s) and 45 minute(s)
[01:14:56] <jouncebot>	 In 5 hour(s) and 45 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240207T0700)
[01:19:48] <wikibugs>	 (PS4) Zabe: Update mediawiki/mediawiki-codesniffer to 43.0.0 [mediawiki-config] - https://gerrit.wikimedia.org/r/996404
[01:20:53] <wikibugs>	 (CR) Zabe: [C: +2] Update mediawiki/mediawiki-codesniffer to 43.0.0 [mediawiki-config] - https://gerrit.wikimedia.org/r/996404 (owner: Zabe)
[01:21:37] <wikibugs>	 (Merged) jenkins-bot: Update mediawiki/mediawiki-codesniffer to 43.0.0 [mediawiki-config] - https://gerrit.wikimedia.org/r/996404 (owner: Zabe)
[01:22:20] <logmsgbot>	 !log zabe@deploy2002 Started scap: Backport for [[gerrit:996404|Update mediawiki/mediawiki-codesniffer to 43.0.0]]
[01:23:52] <logmsgbot>	 !log zabe@deploy2002 zabe: Backport for [[gerrit:996404|Update mediawiki/mediawiki-codesniffer to 43.0.0]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[01:24:12] <logmsgbot>	 !log zabe@deploy2002 zabe: Continuing with sync
[01:25:25] <jinxer-wm>	 (SystemdUnitFailed) firing: prometheus-phpfpm-statustext-textfile.service Failed on mwdebug2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:30:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (11) prometheus-phpfpm-statustext-textfile.service Failed on mw1364:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:30:46] <logmsgbot>	 !log zabe@deploy2002 Finished scap: Backport for [[gerrit:996404|Update mediawiki/mediawiki-codesniffer to 43.0.0]] (duration: 08m 25s)
[01:31:45] <wikibugs>	 (PS1) Eevans: sessionstore: provision sessionstore2004 (new) [puppet] - https://gerrit.wikimedia.org/r/997991 (https://phabricator.wikimedia.org/T356829)
[01:31:47] <wikibugs>	 (PS1) Eevans: sessionstore: provision sessionstore2005 (new) [puppet] - https://gerrit.wikimedia.org/r/997992 (https://phabricator.wikimedia.org/T356829)
[01:31:49] <wikibugs>	 (PS1) Eevans: sessionstore: provision sessionstore2006 (new) [puppet] - https://gerrit.wikimedia.org/r/997993 (https://phabricator.wikimedia.org/T356829)
[01:35:25] <jinxer-wm>	 (SystemdUnitFailed) resolved: (35) prometheus-phpfpm-statustext-textfile.service Failed on mw1354:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:57:08] <wikibugs>	 (PS5) Zabe: Deleting Ns:104 in itwikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/952817 (https://phabricator.wikimedia.org/T298315) (owner: Caenus)
[01:58:26] <wikibugs>	 (PS1) Zabe: throttle: Remove expired throttle [mediawiki-config] - https://gerrit.wikimedia.org/r/997996
[02:00:36] <wikibugs>	 (CR) Zabe: [C: +2] Deleting Ns:104 in itwikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/952817 (https://phabricator.wikimedia.org/T298315) (owner: Caenus)
[02:00:48] <wikibugs>	 (PS2) Zabe: throttle: Remove expired throttle [mediawiki-config] - https://gerrit.wikimedia.org/r/997996
[02:00:48] <jinxer-wm>	 (ProbeDown) firing: Service build2001:873 has failed probes (tcp_package_builder_rsync_ip6) - https://wikitech.wikimedia.org/wiki/Debian_Packaging#Upload_to_Wikimedia_Repo - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:00:51] <wikibugs>	 (CR) Zabe: [C: +2] throttle: Remove expired throttle [mediawiki-config] - https://gerrit.wikimedia.org/r/997996 (owner: Zabe)
[02:01:24] <wikibugs>	 (Merged) jenkins-bot: Deleting Ns:104 in itwikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/952817 (https://phabricator.wikimedia.org/T298315) (owner: Caenus)
[02:01:44] <wikibugs>	 (Merged) jenkins-bot: throttle: Remove expired throttle [mediawiki-config] - https://gerrit.wikimedia.org/r/997996 (owner: Zabe)
[02:02:24] <logmsgbot>	 !log zabe@deploy2002 Started scap: Backport for [[gerrit:952817|Deleting Ns:104 in itwikivoyage]], [[gerrit:997996|throttle: Remove expired throttle]]
[02:03:52] <logmsgbot>	 !log zabe@deploy2002 caenus and zabe: Backport for [[gerrit:952817|Deleting Ns:104 in itwikivoyage]], [[gerrit:997996|throttle: Remove expired throttle]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[02:04:15] <logmsgbot>	 !log zabe@deploy2002 caenus and zabe: Continuing with sync
[02:05:25] <jinxer-wm>	 (SystemdUnitFailed) firing: prometheus-phpfpm-statustext-textfile.service Failed on mwdebug1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:10:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (17) prometheus-phpfpm-statustext-textfile.service Failed on mw1352:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:10:46] <logmsgbot>	 !log zabe@deploy2002 Finished scap: Backport for [[gerrit:952817|Deleting Ns:104 in itwikivoyage]], [[gerrit:997996|throttle: Remove expired throttle]] (duration: 08m 22s)
[02:11:07] <zabe>	 !log zabe@mwmaint2002:~$ mwscript extensions/Translate/scripts/moveTranslatableBundle.php --wiki mediawikiwiki "Wikimedia Apps/Suggested edits" "Wikimedia Apps/Android Suggested edits" "Zabe" --reason "per request [[:phab:T348875|T348875]]"
[02:11:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:11:11] <stashbot>	 T348875: Move [Wikimedia Apps/Suggested edits] to [Wikimedia Apps/Android Suggested edits] on MediaWiki.org - https://phabricator.wikimedia.org/T348875
[02:15:26] <jinxer-wm>	 (SystemdUnitFailed) resolved: (44) prometheus-phpfpm-statustext-textfile.service Failed on mw1352:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:17:42] <jinxer-wm>	 (SystemdUnitFailed) firing: generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:36:40] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:39:33] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:39:38] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:10:48] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:29:32] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:31:00] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:36:25] <jinxer-wm>	 (SystemdUnitFailed) firing: docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:38:40] <icinga-wm>	 PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-base-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:41:22] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:42:52] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:17:06] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:18:36] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:51:51] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance
[05:52:04] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance
[05:52:11] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2105 (T355609)', diff saved to https://phabricator.wikimedia.org/P56381 and previous config saved to /var/cache/conftool/dbconfig/20240207-055210-marostegui.json
[05:52:15] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[05:53:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2030', diff saved to https://phabricator.wikimedia.org/P56382 and previous config saved to /var/cache/conftool/dbconfig/20240207-055301-root.json
[05:53:50] <wikibugs>	 (PS1) Marostegui: es2030: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/998008 (https://phabricator.wikimedia.org/T351916)
[05:55:59] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host es2030.codfw.wmnet with OS bookworm
[05:56:05] <wikibugs>	 (CR) Marostegui: [C: +2] es2030: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/998008 (https://phabricator.wikimedia.org/T351916) (owner: Marostegui)
[05:59:47] <wikibugs>	 (PS5) Vgutierrez: fifo-log-demux: Decouple service from nginx/ats [puppet] - https://gerrit.wikimedia.org/r/993804 (https://phabricator.wikimedia.org/T355905) (owner: BCornwall)
[06:00:35] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: remove nrpe::monitor_systemd_unit_state [puppet] - https://gerrit.wikimedia.org/r/997821 (https://phabricator.wikimedia.org/T337831) (owner: Filippo Giunchedi)
[06:00:48] <jinxer-wm>	 (ProbeDown) firing: Service build2001:873 has failed probes (tcp_package_builder_rsync_ip6) - https://wikitech.wikimedia.org/wiki/Debian_Packaging#Upload_to_Wikimedia_Repo - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:02:02] <wikibugs>	 (CR) Marostegui: "Filippo, I have tried to submit this change+merge but I cannot submit, as after the +2, there seem to be some other changes that need to b" [puppet] - https://gerrit.wikimedia.org/r/997821 (https://phabricator.wikimedia.org/T337831) (owner: Filippo Giunchedi)
[06:02:56] <wikibugs>	 (CR) CI reject: [V: -1] fifo-log-demux: Decouple service from nginx/ats [puppet] - https://gerrit.wikimedia.org/r/993804 (https://phabricator.wikimedia.org/T355905) (owner: BCornwall)
[06:05:30] <wikibugs>	 (CR) Vgutierrez: [C: -1] fifo-log-demux: Decouple service from nginx/ats (2 comments) [puppet] - https://gerrit.wikimedia.org/r/993804 (https://phabricator.wikimedia.org/T355905) (owner: BCornwall)
[06:14:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T355609)', diff saved to https://phabricator.wikimedia.org/P56383 and previous config saved to /var/cache/conftool/dbconfig/20240207-061424-marostegui.json
[06:14:29] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[06:14:47] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es2030.codfw.wmnet with reason: host reimage
[06:17:40] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2030.codfw.wmnet with reason: host reimage
[06:17:42] <jinxer-wm>	 (SystemdUnitFailed) firing: generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:29:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P56384 and previous config saved to /var/cache/conftool/dbconfig/20240207-062931-marostegui.json
[06:34:09] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2030.codfw.wmnet with OS bookworm
[06:34:30] <wikibugs>	 (PS1) Marostegui: Revert "es2030: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/998028
[06:35:20] <wikibugs>	 (CR) Marostegui: "recheck" [puppet] - https://gerrit.wikimedia.org/r/998028 (owner: Marostegui)
[06:36:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Switch es1 master T351916', diff saved to https://phabricator.wikimedia.org/P56385 and previous config saved to /var/cache/conftool/dbconfig/20240207-063659-marostegui.json
[06:37:04] <stashbot>	 T351916: Migrate es1 to Bookworm and MariaDB 10.6 - https://phabricator.wikimedia.org/T351916
[06:37:17] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "es2030: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/998028 (owner: Marostegui)
[06:39:53] <wikibugs>	 (PS1) Marostegui: es1032: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/998145 (https://phabricator.wikimedia.org/T351916)
[06:39:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1032', diff saved to https://phabricator.wikimedia.org/P56386 and previous config saved to /var/cache/conftool/dbconfig/20240207-063957-root.json
[06:41:00] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host es1032.eqiad.wmnet with OS bookworm
[06:41:17] <wikibugs>	 (CR) Marostegui: [C: +2] es1032: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/998145 (https://phabricator.wikimedia.org/T351916) (owner: Marostegui)
[06:41:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2030 (re)pooling @ 1%: After reimage', diff saved to https://phabricator.wikimedia.org/P56387 and previous config saved to /var/cache/conftool/dbconfig/20240207-064142-root.json
[06:44:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P56388 and previous config saved to /var/cache/conftool/dbconfig/20240207-064438-marostegui.json
[06:52:04] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:52:18] <vgutierrez>	 uh?
[06:52:45] <wikibugs>	 SRE, Traffic: A poor internet connection should not result in a HTTP 503 error - https://phabricator.wikimedia.org/T356025 (Bugreporter)
[06:53:26] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:53:36] <wikibugs>	 SRE, Traffic: Cannot edit wikipedia from my work computer - https://phabricator.wikimedia.org/T356799 (Bugreporter)
[06:54:27] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es1032.eqiad.wmnet with reason: host reimage
[06:56:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2030 (re)pooling @ 5%: After reimage', diff saved to https://phabricator.wikimedia.org/P56389 and previous config saved to /var/cache/conftool/dbconfig/20240207-065647-root.json
[06:57:02] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1032.eqiad.wmnet with reason: host reimage
[06:59:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T355609)', diff saved to https://phabricator.wikimedia.org/P56390 and previous config saved to /var/cache/conftool/dbconfig/20240207-065944-marostegui.json
[06:59:47] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2109.codfw.wmnet with reason: Maintenance
[06:59:50] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[07:00:01] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2109.codfw.wmnet with reason: Maintenance
[07:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240207T0700)
[07:00:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2109 (T355609)', diff saved to https://phabricator.wikimedia.org/P56391 and previous config saved to /var/cache/conftool/dbconfig/20240207-070007-marostegui.json
[07:03:58] <wikibugs>	 (PS1) Marostegui: Revert "es1032: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/998032
[07:05:41] <wikibugs>	 SRE, Traffic: A poor internet connection should not result in a HTTP 503 error - https://phabricator.wikimedia.org/T356025 (Vgutierrez) sadly varnish is not able to tell between a client that goes away earlier than expected (by poor Internet access) triggering a backend fetch error from an actual backend...
[07:05:50] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:08:08] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:11:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2030 (re)pooling @ 10%: After reimage', diff saved to https://phabricator.wikimedia.org/P56392 and previous config saved to /var/cache/conftool/dbconfig/20240207-071152-root.json
[07:16:14] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "es1032: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/998032 (owner: Marostegui)
[07:16:41] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1032.eqiad.wmnet with OS bookworm
[07:17:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1032 (re)pooling @ 1%: After reimage', diff saved to https://phabricator.wikimedia.org/P56393 and previous config saved to /var/cache/conftool/dbconfig/20240207-071707-root.json
[07:25:58] <wikibugs>	 (CR) Majavah: [C: +2] P:toolforge::mailrelay: don't blindly reject any bounces [puppet] - https://gerrit.wikimedia.org/r/994250 (owner: Majavah)
[07:26:47] <_joe_>	 jouncebot: next
[07:26:47] <jouncebot>	 In 0 hour(s) and 33 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240207T0800)
[07:26:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2030 (re)pooling @ 25%: After reimage', diff saved to https://phabricator.wikimedia.org/P56394 and previous config saved to /var/cache/conftool/dbconfig/20240207-072657-root.json
[07:27:30] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by oblivian@deploy2002 using scap backport" [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.17) - https://gerrit.wikimedia.org/r/997879 (https://phabricator.wikimedia.org/T356780) (owner: Jforrester)
[07:28:03] <_joe_>	 I am saving time in the backport window as this is a branch backport for a train blocker
[07:28:51] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T355609)', diff saved to https://phabricator.wikimedia.org/P56395 and previous config saved to /var/cache/conftool/dbconfig/20240207-072851-marostegui.json
[07:28:55] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[07:32:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1032 (re)pooling @ 5%: After reimage', diff saved to https://phabricator.wikimedia.org/P56396 and previous config saved to /var/cache/conftool/dbconfig/20240207-073212-root.json
[07:42:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2030 (re)pooling @ 50%: After reimage', diff saved to https://phabricator.wikimedia.org/P56397 and previous config saved to /var/cache/conftool/dbconfig/20240207-074203-root.json
[07:43:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P56398 and previous config saved to /var/cache/conftool/dbconfig/20240207-074357-marostegui.json
[07:46:50] <wikibugs>	 (Merged) jenkins-bot: Set the memory limit in bytes. [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.17) - https://gerrit.wikimedia.org/r/997879 (https://phabricator.wikimedia.org/T356780) (owner: Jforrester)
[07:47:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1032 (re)pooling @ 10%: After reimage', diff saved to https://phabricator.wikimedia.org/P56399 and previous config saved to /var/cache/conftool/dbconfig/20240207-074717-root.json
[07:47:34] <logmsgbot>	 !log oblivian@deploy2002 Started scap: Backport for [[gerrit:997879|Set the memory limit in bytes. (T356780)]]
[07:47:38] <stashbot>	 T356780: Video transcoding fails when firejail is enabled - https://phabricator.wikimedia.org/T356780
[07:49:15] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Extend config-master Cumin aliases [puppet] - https://gerrit.wikimedia.org/r/997888 (owner: Muehlenhoff)
[07:49:20] <logmsgbot>	 !log oblivian@deploy2002 oblivian and jforrester: Backport for [[gerrit:997879|Set the memory limit in bytes. (T356780)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:50:05] <logmsgbot>	 !log oblivian@deploy2002 oblivian and jforrester: Continuing with sync
[07:51:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:51:49] <moritzm>	 !log rebalance ganeti codfw/row B following completed switch maintenance T355860
[07:51:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:53] <stashbot>	 T355860: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860
[07:56:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (9) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:57:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2030 (re)pooling @ 75%: After reimage', diff saved to https://phabricator.wikimedia.org/P56400 and previous config saved to /var/cache/conftool/dbconfig/20240207-075708-root.json
[07:57:11] <logmsgbot>	 !log oblivian@deploy2002 Finished scap: Backport for [[gerrit:997879|Set the memory limit in bytes. (T356780)]] (duration: 09m 36s)
[07:57:14] <stashbot>	 T356780: Video transcoding fails when firejail is enabled - https://phabricator.wikimedia.org/T356780
[07:58:08] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet
[07:59:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P56401 and previous config saved to /var/cache/conftool/dbconfig/20240207-075904-marostegui.json
[07:59:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet
[08:00:04] <jouncebot>	 Amir1 and Urbanecm: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240207T0800). nyaa~
[08:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[08:01:26] <jinxer-wm>	 (SystemdUnitFailed) firing: (34) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:01:46] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet
[08:02:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1032 (re)pooling @ 25%: After reimage', diff saved to https://phabricator.wikimedia.org/P56402 and previous config saved to /var/cache/conftool/dbconfig/20240207-080222-root.json
[08:03:57] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2030.codfw.wmnet
[08:06:26] <jinxer-wm>	 (SystemdUnitFailed) firing: (35) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:09:23] <wikibugs>	 (PS2) Slyngshede: Provide context for account creation. [software/bitu] - https://gerrit.wikimedia.org/r/997811 (https://phabricator.wikimedia.org/T353584)
[08:09:39] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:11:26] <jinxer-wm>	 (SystemdUnitFailed) firing: (34) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:11:35] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 38 hosts with reason: Primary switchover s4 T356649
[08:11:41] <stashbot>	 T356649: Switchover s4 master (db1160 -> db1238) - https://phabricator.wikimedia.org/T356649
[08:11:57] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:12:07] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 38 hosts with reason: Primary switchover s4 T356649
[08:12:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2030 (re)pooling @ 100%: After reimage', diff saved to https://phabricator.wikimedia.org/P56403 and previous config saved to /var/cache/conftool/dbconfig/20240207-081213-root.json
[08:12:21] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set db1238 with weight 0 T356649', diff saved to https://phabricator.wikimedia.org/P56404 and previous config saved to /var/cache/conftool/dbconfig/20240207-081220-arnaudb.json
[08:12:43] <wikibugs>	 (CR) Slyngshede: Provide context for account creation. (1 comment) [software/bitu] - https://gerrit.wikimedia.org/r/997811 (https://phabricator.wikimedia.org/T353584) (owner: Slyngshede)
[08:12:45] <_joe_>	 jouncebot: now
[08:12:45] <jouncebot>	 For the next 0 hour(s) and 47 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240207T0800)
[08:14:11] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T355609)', diff saved to https://phabricator.wikimedia.org/P56405 and previous config saved to /var/cache/conftool/dbconfig/20240207-081410-marostegui.json
[08:14:13] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2149.codfw.wmnet with reason: Maintenance
[08:14:15] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[08:14:27] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2149.codfw.wmnet with reason: Maintenance
[08:14:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2149 (T355609)', diff saved to https://phabricator.wikimedia.org/P56406 and previous config saved to /var/cache/conftool/dbconfig/20240207-081433-marostegui.json
[08:15:21] <wikibugs>	 (CR) Majavah: [C: +2] striker: Bump container version to 2024-02-07-005708-production [puppet] - https://gerrit.wikimedia.org/r/997990 (owner: BryanDavis)
[08:17:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1032 (re)pooling @ 50%: After reimage', diff saved to https://phabricator.wikimedia.org/P56407 and previous config saved to /var/cache/conftool/dbconfig/20240207-081727-root.json
[08:18:39] <wikibugs>	 (03PS2) 10Slyngshede: P:docker::builder clean docker image cache regularly. [puppet] - 10https://gerrit.wikimedia.org/r/997796
[08:21:28] <wikibugs>	 (03PS2) 10Muehlenhoff: debmonitor: Remove legacy cert handling [puppet] - 10https://gerrit.wikimedia.org/r/995183
[08:24:32] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove obsolete Hiera entries [puppet] - 10https://gerrit.wikimedia.org/r/997480 (https://phabricator.wikimedia.org/T354959) (owner: 10Muehlenhoff)
[08:26:16] <wikibugs>	 (03CR) 10Slyngshede: P:docker::builder clean docker image cache regularly. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/997796 (owner: 10Slyngshede)
[08:32:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1032 (re)pooling @ 75%: After reimage', diff saved to https://phabricator.wikimedia.org/P56408 and previous config saved to /var/cache/conftool/dbconfig/20240207-083233-root.json
[08:33:46] <wikibugs>	 (03PS2) 10Slyngshede: Allow users to view the entire SSH key [software/bitu] - 10https://gerrit.wikimedia.org/r/997852 (https://phabricator.wikimedia.org/T351140)
[08:34:02] <wikibugs>	 (03CR) 10Slyngshede: Allow users to view the entire SSH key (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/997852 (https://phabricator.wikimedia.org/T351140) (owner: 10Slyngshede)
[08:36:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T355609)', diff saved to https://phabricator.wikimedia.org/P56409 and previous config saved to /var/cache/conftool/dbconfig/20240207-083650-marostegui.json
[08:36:54] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[08:44:28] <wikibugs>	 (03CR) 10Arnaudb: [C: 03+2] mariadb: Promote db1238 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/997487 (https://phabricator.wikimedia.org/T356649) (owner: 10Gerrit maintenance bot)
[08:45:09] <arnaudb>	 !log Starting s4 eqiad failover from db1160 to db1238 - T356649
[08:45:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:13] <stashbot>	 T356649: Switchover s4 master (db1160 -> db1238) - https://phabricator.wikimedia.org/T356649
[08:46:49] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:46:55] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote db1238 to s4 primary T356649', diff saved to https://phabricator.wikimedia.org/P56410 and previous config saved to /var/cache/conftool/dbconfig/20240207-084654-arnaudb.json
[08:47:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1032 (re)pooling @ 100%: After reimage', diff saved to https://phabricator.wikimedia.org/P56411 and previous config saved to /var/cache/conftool/dbconfig/20240207-084738-root.json
[08:48:09] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51452 bytes in 0.124 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:51:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P56412 and previous config saved to /var/cache/conftool/dbconfig/20240207-085157-marostegui.json
[08:57:43] <wikibugs>	 (03PS2) 10Filippo Giunchedi: envoy: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997818 (https://phabricator.wikimedia.org/T337831)
[08:58:18] <wikibugs>	 (03PS1) 10Majavah: network: allow passing 'cloud' as realm to slice_network_constants [puppet] - 10https://gerrit.wikimedia.org/r/998259
[08:58:20] <wikibugs>	 (03PS1) 10Majavah: network::constants: use 'cloud' where possible [puppet] - 10https://gerrit.wikimedia.org/r/998260
[08:58:22] <wikibugs>	 (03PS1) 10Majavah: P:wmcs: cloud_private_subnet: add route to private instance networks [puppet] - 10https://gerrit.wikimedia.org/r/998261 (https://phabricator.wikimedia.org/T356850)
[08:58:52] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] envoy: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997818 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi)
[08:58:56] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/997852 (https://phabricator.wikimedia.org/T351140) (owner: 10Slyngshede)
[08:59:40] <wikibugs>	 (03PS3) 10Filippo Giunchedi: graphite: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997806 (https://phabricator.wikimedia.org/T337831)
[09:00:28] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] graphite: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997806 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi)
[09:01:46] <wikibugs>	 (03PS2) 10Majavah: P:wmcs: cloud_private_subnet: add route to private instance networks [puppet] - 10https://gerrit.wikimedia.org/r/998261 (https://phabricator.wikimedia.org/T356850)
[09:02:18] <wikibugs>	 (03PS3) 10Filippo Giunchedi: profile: remove absented statsd hosts entry [puppet] - 10https://gerrit.wikimedia.org/r/997803 (https://phabricator.wikimedia.org/T239862)
[09:02:27] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] profile: remove absented statsd hosts entry [puppet] - 10https://gerrit.wikimedia.org/r/997803 (https://phabricator.wikimedia.org/T239862) (owner: 10Filippo Giunchedi)
[09:03:16] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'matching old db1238 weight https://phabricator.wikimedia.org/P56404', diff saved to https://phabricator.wikimedia.org/P56413 and previous config saved to /var/cache/conftool/dbconfig/20240207-090316-arnaudb.json
[09:04:52] <wikibugs>	 (03PS2) 10Filippo Giunchedi: mariadb: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997821 (https://phabricator.wikimedia.org/T337831)
[09:04:55] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 13): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1292/c" [puppet] - 10https://gerrit.wikimedia.org/r/998261 (https://phabricator.wikimedia.org/T356850) (owner: 10Majavah)
[09:06:12] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:wmcs: cloud_private_subnet: add route to private instance networks [puppet] - 10https://gerrit.wikimedia.org/r/998261 (https://phabricator.wikimedia.org/T356850) (owner: 10Majavah)
[09:06:14] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Thank you for following up, indeed the patch was part of a chain of related and independent patches. I've now moved the patch to be stand " [puppet] - 10https://gerrit.wikimedia.org/r/997821 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi)
[09:07:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P56414 and previous config saved to /var/cache/conftool/dbconfig/20240207-090703-marostegui.json
[09:07:50] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:wmcs: cloud_private_subnet: add route to private instance networks [puppet] - 10https://gerrit.wikimedia.org/r/998261 (https://phabricator.wikimedia.org/T356850) (owner: 10Majavah)
[09:08:05] <wikibugs>	 (03CR) 10Volans: "small missing nit, LGTM otherwise" [puppet] - 10https://gerrit.wikimedia.org/r/995183 (owner: 10Muehlenhoff)
[09:17:57] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] "Please note this is good, but *not enough for the ticket scope*- there needs to be a change on the job defaults config (see netbox)." [puppet] - 10https://gerrit.wikimedia.org/r/997935 (https://phabricator.wikimedia.org/T316655) (owner: 10Btullis)
[09:19:46] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] Temporarily enable Dockerfile frontend on trusted runners (part 2, rev 2) [puppet] - 10https://gerrit.wikimedia.org/r/997516 (https://phabricator.wikimedia.org/T356418) (owner: 10Ahmon Dancy)
[09:19:55] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: mariadb::core_test
[09:20:14] <wikibugs>	 (03CR) 10JMeybohm: New cookbook to reboot/restart config-master hosts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/997887 (owner: 10Muehlenhoff)
[09:21:48] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch mariadb::core_test to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/998294 (https://phabricator.wikimedia.org/T349619)
[09:22:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T355609)', diff saved to https://phabricator.wikimedia.org/P56415 and previous config saved to /var/cache/conftool/dbconfig/20240207-092210-marostegui.json
[09:22:13] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2156.codfw.wmnet with reason: Maintenance
[09:22:14] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[09:22:27] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2156.codfw.wmnet with reason: Maintenance
[09:22:28] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2186.codfw.wmnet with reason: Maintenance
[09:22:42] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2186.codfw.wmnet with reason: Maintenance
[09:22:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2156 (T355609)', diff saved to https://phabricator.wikimedia.org/P56416 and previous config saved to /var/cache/conftool/dbconfig/20240207-092248-marostegui.json
[09:23:18] <wikibugs>	 (03CR) 10Muehlenhoff: New cookbook to reboot/restart config-master hosts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/997887 (owner: 10Muehlenhoff)
[09:23:56] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch mariadb::core_test to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/998294 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[09:24:35] <Dreamy_Jazz>	 !log Doing security deploy for T356183
[09:24:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:45] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/998259 (owner: 10Majavah)
[09:25:58] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Allow users to view the entire SSH key [software/bitu] - 10https://gerrit.wikimedia.org/r/997852 (https://phabricator.wikimedia.org/T351140) (owner: 10Slyngshede)
[09:26:07] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/998260 (owner: 10Majavah)
[09:26:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1160', diff saved to https://phabricator.wikimedia.org/P56417 and previous config saved to /var/cache/conftool/dbconfig/20240207-092614-root.json
[09:30:36] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: mariadb::core_test
[09:30:49] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "LGTM, but since you are already posting https://gerrit.wikimedia.org/r/c/operations/puppet/+/998260/1 as a followup, I suppose you intend " [puppet] - 10https://gerrit.wikimedia.org/r/998259 (owner: 10Majavah)
[09:31:20] <jayme>	 !log removing a bunch of old kernel versions from chartmuseum* to free ~3.5GB disk space
[09:31:21] <wikibugs>	 (03CR) 10Vgutierrez: [C: 04-1] "looking good, it needs some work though (check the inline comments)" [puppet] - 10https://gerrit.wikimedia.org/r/991438 (https://phabricator.wikimedia.org/T355190) (owner: 10BCornwall)
[09:31:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:17] <wikibugs>	 (03CR) 10Filippo Giunchedi: "See inline" [puppet] - 10https://gerrit.wikimedia.org/r/997820 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi)
[09:36:20] <wikibugs>	 (03PS2) 10Filippo Giunchedi: confd: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997815 (https://phabricator.wikimedia.org/T337831)
[09:36:22] <wikibugs>	 (03PS2) 10Filippo Giunchedi: chartmuseum: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997816 (https://phabricator.wikimedia.org/T337831)
[09:36:24] <wikibugs>	 (03PS2) 10Filippo Giunchedi: docker_registry: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997817 (https://phabricator.wikimedia.org/T337831)
[09:36:26] <wikibugs>	 (03PS2) 10Filippo Giunchedi: mediawiki: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997819 (https://phabricator.wikimedia.org/T337831)
[09:36:28] <wikibugs>	 (03PS2) 10Filippo Giunchedi: etcd: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997820 (https://phabricator.wikimedia.org/T337831)
[09:38:25] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] "I'm going to try, but I fear as long as `$::realm` is `'labs'` it's going to be somewhat difficult to do that cleanly." [puppet] - 10https://gerrit.wikimedia.org/r/998259 (owner: 10Majavah)
[09:38:39] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] network::constants: use 'cloud' where possible [puppet] - 10https://gerrit.wikimedia.org/r/998260 (owner: 10Majavah)
[09:39:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T355609)', diff saved to https://phabricator.wikimedia.org/P56418 and previous config saved to /var/cache/conftool/dbconfig/20240207-093953-marostegui.json
[09:39:56] <Amir1>	 arnaudb: are you done with the old s4 replica? I need it for some schema changes (yes, I'm a vulture)
[09:39:58] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[09:40:02] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[09:40:10] <wikibugs>	 (03PS3) 10Majavah: P:wmcs: cloud_private_subnet: add route to private instance networks [puppet] - 10https://gerrit.wikimedia.org/r/998261 (https://phabricator.wikimedia.org/T356850)
[09:41:09] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host elastic2107.codfw.wmnet
[09:41:55] <wikibugs>	 (03PS2) 10Muehlenhoff: New cookbook to reboot/restart config-master hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/997887
[09:42:44] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:wmcs: cloud_private_subnet: add route to private instance networks [puppet] - 10https://gerrit.wikimedia.org/r/998261 (https://phabricator.wikimedia.org/T356850) (owner: 10Majavah)
[09:42:46] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/997887 (owner: 10Muehlenhoff)
[09:43:17] <arnaudb>	 Amir1: not yet, I am afk for a few moments and will be performing a schema update after
[09:43:28] <Amir1>	 ping me once done :D
[09:43:52] <arnaudb>	 sure
[09:44:28] <wikibugs>	 (03PS4) 10Majavah: P:wmcs: cloud_private_subnet: add route to private instance networks [puppet] - 10https://gerrit.wikimedia.org/r/998261 (https://phabricator.wikimedia.org/T356850)
[09:45:49] <logmsgbot>	 !log dreamyjazz Deployed security patch for T356183
[09:45:52] <wikibugs>	 (03PS1) 10Majavah: network: rename 'labs' in data to 'cloud' [puppet] - 10https://gerrit.wikimedia.org/r/998299
[09:46:18] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] service: move thumbor from thumbor pool to kubesvc [puppet] - 10https://gerrit.wikimedia.org/r/951545 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan)
[09:46:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (29) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:46:41] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] conftool: clean up thumbor pools [puppet] - 10https://gerrit.wikimedia.org/r/951546 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan)
[09:46:59] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] network: rename 'labs' in data to 'cloud' [puppet] - 10https://gerrit.wikimedia.org/r/998299 (owner: 10Majavah)
[09:47:42] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] New cookbook to reboot/restart config-master hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/997887 (owner: 10Muehlenhoff)
[09:48:16] <wikibugs>	 (03PS1) 10Brouberol: superset: fix database connection test for our mysql DBs [puppet] - 10https://gerrit.wikimedia.org/r/998300 (https://phabricator.wikimedia.org/T335356)
[09:48:23] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host elastic2107.codfw.wmnet
[09:50:12] <wikibugs>	 (03PS1) 10Slyngshede: Add links with information to footer. [software/bitu] - 10https://gerrit.wikimedia.org/r/998301 (https://phabricator.wikimedia.org/T351137)
[09:50:48] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1299/co" [puppet] - 10https://gerrit.wikimedia.org/r/998261 (https://phabricator.wikimedia.org/T356850) (owner: 10Majavah)
[09:51:11] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/998299 (owner: 10Majavah)
[09:51:26] <jinxer-wm>	 (SystemdUnitFailed) firing: (38) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:51:31] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Great! Thanks for looking into this." [puppet] - 10https://gerrit.wikimedia.org/r/998300 (https://phabricator.wikimedia.org/T335356) (owner: 10Brouberol)
[09:52:05] <wikibugs>	 (03CR) 10Brouberol: [C: 03+2] superset: fix database connection test for our mysql DBs [puppet] - 10https://gerrit.wikimedia.org/r/998300 (https://phabricator.wikimedia.org/T335356) (owner: 10Brouberol)
[09:53:07] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] conftool: clean up thumbor pools [puppet] - 10https://gerrit.wikimedia.org/r/951546 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan)
[09:53:20] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] service: move thumbor from thumbor pool to kubesvc [puppet] - 10https://gerrit.wikimedia.org/r/951545 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan)
[09:55:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P56419 and previous config saved to /var/cache/conftool/dbconfig/20240207-095500-marostegui.json
[09:55:42] <wikibugs>	 10SRE, 10Wikimedia-Etherpad, 10collaboration-services: Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421 (10LSobanski)
[09:56:20] <wikibugs>	 (03PS22) 10Brouberol: Add a deployment chart for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987785 (https://phabricator.wikimedia.org/T352166) (owner: 10Btullis)
[09:56:26] <wikibugs>	 10SRE, 10Wikimedia-Etherpad, 10collaboration-services: Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421 (10LSobanski) I updated the description to reflect the new Etherpad release (1.9.7). See below for a list of changes: * Notable enhancements and fixes ** Added Live Plug...
[09:57:08] <wikibugs>	 (03PS2) 10Majavah: network: rename 'labs' in data to 'cloud' [puppet] - 10https://gerrit.wikimedia.org/r/998299
[09:59:26] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/998299 (owner: 10Majavah)
[10:00:10] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "The change to ferm config looks ok, only ordering and comments are changed." [puppet] - 10https://gerrit.wikimedia.org/r/998299 (owner: 10Majavah)
[10:00:41] <wikibugs>	 (03CR) 10Brouberol: Add a deployment chart for Superset (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/987785 (https://phabricator.wikimedia.org/T352166) (owner: 10Btullis)
[10:00:48] <jinxer-wm>	 (ProbeDown) firing: Service build2001:873 has failed probes (tcp_package_builder_rsync_ip6) - https://wikitech.wikimedia.org/wiki/Debian_Packaging#Upload_to_Wikimedia_Repo - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:10:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P56420 and previous config saved to /var/cache/conftool/dbconfig/20240207-101006-marostegui.json
[10:11:56] <wikibugs>	 (03PS1) 10MVernon: swift: removed drained ms-be10[44-50] from the rings [puppet] - 10https://gerrit.wikimedia.org/r/998305 (https://phabricator.wikimedia.org/T353149)
[10:12:28] <Dreamy_Jazz>	 !log Continuing security deploy for T356183
[10:12:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:09] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] sessionstore: provision sessionstore2006 (new) [puppet] - 10https://gerrit.wikimedia.org/r/997993 (https://phabricator.wikimedia.org/T356829) (owner: 10Eevans)
[10:15:34] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] sessionstore: provision sessionstore2005 (new) [puppet] - 10https://gerrit.wikimedia.org/r/997992 (https://phabricator.wikimedia.org/T356829) (owner: 10Eevans)
[10:16:17] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] sessionstore: provision sessionstore2004 (new) [puppet] - 10https://gerrit.wikimedia.org/r/997991 (https://phabricator.wikimedia.org/T356829) (owner: 10Eevans)
[10:17:42] <jinxer-wm>	 (SystemdUnitFailed) firing: generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:19:09] <logmsgbot>	 !log dreamyjazz Deployed security patch for T356183
[10:21:26] <jinxer-wm>	 (SystemdUnitFailed) firing: (31) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:22:40] <wikibugs>	 (03CR) 10MVernon: [C: 03+1] sessionstore: provision sessionstore2004 (new) [puppet] - 10https://gerrit.wikimedia.org/r/997991 (https://phabricator.wikimedia.org/T356829) (owner: 10Eevans)
[10:23:06] <wikibugs>	 (03CR) 10MVernon: [C: 03+1] sessionstore: provision sessionstore2005 (new) [puppet] - 10https://gerrit.wikimedia.org/r/997992 (https://phabricator.wikimedia.org/T356829) (owner: 10Eevans)
[10:23:21] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps1009.eqiad.wmnet
[10:23:28] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps2009.codfw.wmnet
[10:23:36] <wikibugs>	 (03CR) 10MVernon: [C: 03+1] sessionstore: provision sessionstore2006 (new) [puppet] - 10https://gerrit.wikimedia.org/r/997993 (https://phabricator.wikimedia.org/T356829) (owner: 10Eevans)
[10:24:12] <wikibugs>	 (03PS1) 10Volans: setup.py: actually use install_requires [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/998326
[10:24:36] <Dreamy_Jazz>	 !log Finished security deploys for T356183
[10:24:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:24:51] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] swift: removed drained ms-be10[44-50] from the rings [puppet] - 10https://gerrit.wikimedia.org/r/998305 (https://phabricator.wikimedia.org/T353149) (owner: 10MVernon)
[10:25:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T355609)', diff saved to https://phabricator.wikimedia.org/P56421 and previous config saved to /var/cache/conftool/dbconfig/20240207-102513-marostegui.json
[10:25:15] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2177.codfw.wmnet with reason: Maintenance
[10:25:18] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[10:25:29] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2177.codfw.wmnet with reason: Maintenance
[10:25:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2177 (T355609)', diff saved to https://phabricator.wikimedia.org/P56422 and previous config saved to /var/cache/conftool/dbconfig/20240207-102535-marostegui.json
[10:26:02] <wikibugs>	 (03CR) 10Slyngshede: setup.py: actually use install_requires (031 comment) [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/998326 (owner: 10Volans)
[10:26:08] <wikibugs>	 (03CR) 10MVernon: [C: 03+2] swift: removed drained ms-be10[44-50] from the rings [puppet] - 10https://gerrit.wikimedia.org/r/998305 (https://phabricator.wikimedia.org/T353149) (owner: 10MVernon)
[10:26:26] <jinxer-wm>	 (SystemdUnitFailed) firing: (31) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:26:46] <wikibugs>	 (03PS2) 10Volans: setup.py: actually use install_requires [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/998326
[10:26:56] <wikibugs>	 (03CR) 10Volans: setup.py: actually use install_requires (031 comment) [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/998326 (owner: 10Volans)
[10:27:44] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/993797 (owner: 10JHathaway)
[10:28:16] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] P:kerberos::kadminserver absent Icinga check [puppet] - 10https://gerrit.wikimedia.org/r/995181 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede)
[10:28:51] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm, may want to check nothing in cloud uses the old method" [puppet] - 10https://gerrit.wikimedia.org/r/995183 (owner: 10Muehlenhoff)
[10:29:28] <wikibugs>	 (03PS3) 10Volans: setup.py: actually use install_requires [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/998326
[10:29:58] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/995211 (https://phabricator.wikimedia.org/T356174) (owner: 10Muehlenhoff)
[10:30:18] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] setup.py: actually use install_requires [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/998326 (owner: 10Volans)
[10:35:50] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm see comment inline" [puppet] - 10https://gerrit.wikimedia.org/r/995213 (https://phabricator.wikimedia.org/T356174) (owner: 10Muehlenhoff)
[10:36:52] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[10:37:08] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[10:37:55] <wikibugs>	 (03PS4) 10Volans: setup.py: actually use install_requires [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/998326
[10:39:19] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[10:39:42] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM may want to add cole or kieth" [puppet] - 10https://gerrit.wikimedia.org/r/997555 (owner: 10JHathaway)
[10:40:31] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+1] "LGTM" [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/998326 (owner: 10Volans)
[10:41:40] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/998299 (owner: 10Majavah)
[10:41:50] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] confd: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997815 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi)
[10:42:21] <wikibugs>	 (03PS1) 10Klausman: admin_ng: drop version on apiGroups perms for exp NS in LiftWing [deployment-charts] - 10https://gerrit.wikimedia.org/r/998330 (https://phabricator.wikimedia.org/T354516)
[10:43:26] <wikibugs>	 (03CR) 10Arnaudb: [C: 03+1] mariadb: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997821 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi)
[10:43:42] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:44:13] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host elastic2108.codfw.wmnet
[10:44:22] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[10:45:59] <wikibugs>	 (03CR) 10Volans: [C: 03+2] setup.py: actually use install_requires [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/998326 (owner: 10Volans)
[10:47:11] <wikibugs>	 (03CR) 10Clément Goubert: confd: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997815 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi)
[10:47:55] <wikibugs>	 (03Merged) 10jenkins-bot: setup.py: actually use install_requires [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/998326 (owner: 10Volans)
[10:47:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T355609)', diff saved to https://phabricator.wikimedia.org/P56423 and previous config saved to /var/cache/conftool/dbconfig/20240207-104757-marostegui.json
[10:48:02] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[10:48:43] <wikibugs>	 (03PS2) 10Slyngshede: Add links with information to footer. [software/bitu] - 10https://gerrit.wikimedia.org/r/998301 (https://phabricator.wikimedia.org/T351137)
[10:51:14] <wikibugs>	 (03PS1) 10Volans: CHANGELOG: add changelogs for release v0.3.5 [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/998335
[10:51:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host elastic2108.codfw.wmnet
[10:51:43] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: mariadb::parsercache
[10:53:02] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch mariadb::parsercache to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/998336 (https://phabricator.wikimedia.org/T349619)
[10:53:20] <wikibugs>	 (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v0.3.5 [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/998335 (owner: 10Volans)
[10:53:47] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host elastic2109.codfw.wmnet
[10:53:58] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997821 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi)
[10:54:51] <wikibugs>	 (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.3.5 [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/998335 (owner: 10Volans)
[10:57:11] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch mariadb::parsercache to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/998336 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[10:58:56] <wikibugs>	 (03PS1) 10Btullis: Update data-engineering-postgresql bacula job defaults [puppet] - 10https://gerrit.wikimedia.org/r/998337 (https://phabricator.wikimedia.org/T316655)
[11:00:04] <wikibugs>	 (03CR) 10Btullis: "Thanks Jaime. I hadn't spotted that. I have added that change in a second patch in this chain." [puppet] - 10https://gerrit.wikimedia.org/r/997935 (https://phabricator.wikimedia.org/T316655) (owner: 10Btullis)
[11:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240207T1100)
[11:00:07] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] [DPE Postgres] Only backup the latest postgres dump file [puppet] - 10https://gerrit.wikimedia.org/r/997935 (https://phabricator.wikimedia.org/T316655) (owner: 10Btullis)
[11:00:12] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host elastic2109.codfw.wmnet
[11:00:13] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Update data-engineering-postgresql bacula job defaults [puppet] - 10https://gerrit.wikimedia.org/r/998337 (https://phabricator.wikimedia.org/T316655) (owner: 10Btullis)
[11:00:45] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: eventrouter: Add port 8080 to containerPorts [deployment-charts] - 10https://gerrit.wikimedia.org/r/992740 (https://phabricator.wikimedia.org/T355167)
[11:02:00] <wikibugs>	 (03PS1) 10Slyngshede: Improve unix username auto-fill [software/bitu] - 10https://gerrit.wikimedia.org/r/998338 (https://phabricator.wikimedia.org/T347634)
[11:03:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P56424 and previous config saved to /var/cache/conftool/dbconfig/20240207-110304-marostegui.json
[11:04:18] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C: 03+1] admin_ng: drop version on apiGroups perms for exp NS in LiftWing [deployment-charts] - 10https://gerrit.wikimedia.org/r/998330 (https://phabricator.wikimedia.org/T354516) (owner: 10Klausman)
[11:06:02] <wikibugs>	 (03CR) 10Clément Goubert: [C: 04-2] "Mediawiki nodes are buster (https://phabricator.wikimedia.org/T356787) and will be progressively re-imaged to be kubernetes worker nodes o" [puppet] - 10https://gerrit.wikimedia.org/r/997819 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi)
[11:06:27] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: mariadb::parsercache
[11:07:57] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: WebVideoTranscodeJob: also add time limits [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/998318 (https://phabricator.wikimedia.org/T356780)
[11:08:40] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance
[11:08:43] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance
[11:09:43] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] eventrouter: Add port 8080 to containerPorts [deployment-charts] - 10https://gerrit.wikimedia.org/r/992740 (https://phabricator.wikimedia.org/T355167) (owner: 10Alexandros Kosiaris)
[11:10:51] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] "The crashloop detection SystemdUnitCrashLoop provides was not part of the nrpe check. So even if the SystemdUnitCrashLoop alert does not w" [puppet] - 10https://gerrit.wikimedia.org/r/997817 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi)
[11:11:03] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] "The crashloop detection SystemdUnitCrashLoop provides was not part of the nrpe check. So even if the SystemdUnitCrashLoop alert does not w" [puppet] - 10https://gerrit.wikimedia.org/r/997816 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi)
[11:11:51] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] etcd: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997820 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi)
[11:14:20] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[11:14:42] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1] "alert[1001,2001].wikimedia.org,deploy2002.codfw.wmnet,deploy1002.eqiad.wmnet,mwmaint2002.codfw.wmnet,mwmaint1002.eqiad.wmnet,puppetmaster[" [puppet] - 10https://gerrit.wikimedia.org/r/997815 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi)
[11:16:34] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] "The crashloop detection SystemdUnitCrashLoop provides was not part of the nrpe check. So even if the SystemdUnitCrashLoop alert does not w" [puppet] - 10https://gerrit.wikimedia.org/r/997819 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi)
[11:18:11] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P56425 and previous config saved to /var/cache/conftool/dbconfig/20240207-111810-marostegui.json
[11:18:20] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/998301 (https://phabricator.wikimedia.org/T351137) (owner: 10Slyngshede)
[11:20:31] <wikibugs>	 (03PS1) 10Btullis: Configure analytics.wikimedia.org to support large downloads [puppet] - 10https://gerrit.wikimedia.org/r/998345 (https://phabricator.wikimedia.org/T356792)
[11:21:50] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1303/co" [puppet] - 10https://gerrit.wikimedia.org/r/998345 (https://phabricator.wikimedia.org/T356792) (owner: 10Btullis)
[11:26:07] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] "LGTM, thanks for following up!" [puppet] - 10https://gerrit.wikimedia.org/r/998299 (owner: 10Majavah)
[11:26:24] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] eventrouter: Add port 8080 to containerPorts [deployment-charts] - 10https://gerrit.wikimedia.org/r/992740 (https://phabricator.wikimedia.org/T355167) (owner: 10Alexandros Kosiaris)
[11:27:20] <wikibugs>	 (03CR) 10Majavah: [V: 03+1 C: 03+2] network: rename 'labs' in data to 'cloud' [puppet] - 10https://gerrit.wikimedia.org/r/998299 (owner: 10Majavah)
[11:29:13] <wikibugs>	 (03Merged) 10jenkins-bot: eventrouter: Add port 8080 to containerPorts [deployment-charts] - 10https://gerrit.wikimedia.org/r/992740 (https://phabricator.wikimedia.org/T355167) (owner: 10Alexandros Kosiaris)
[11:29:29] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1 C: 03+1] "alert[1001,2001].wikimedia.org,deploy2002.codfw.wmnet,deploy1002.eqiad.wmnet,mwmaint2002.codfw.wmnet,mwmaint1002.eqiad.wmnet,puppetmaster[" [puppet] - 10https://gerrit.wikimedia.org/r/997815 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi)
[11:29:55] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] "The nodes being buster only means they will not benefit from the crashloop detection." [puppet] - 10https://gerrit.wikimedia.org/r/997819 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi)
[11:31:58] <wikibugs>	 10SRE, 10observability, 10Sustainability (Incident Followup): thanos-query probedown due to OOM of both eqiad titan frontends - https://phabricator.wikimedia.org/T356788 (10fgiunchedi) Thank you for the report and investigation, I took this chance to update https://wikitech.wikimedia.org/wiki/Thanos and make...
[11:33:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T355609)', diff saved to https://phabricator.wikimedia.org/P56426 and previous config saved to /var/cache/conftool/dbconfig/20240207-113317-marostegui.json
[11:33:19] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2190.codfw.wmnet with reason: Maintenance
[11:33:28] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[11:33:33] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2190.codfw.wmnet with reason: Maintenance
[11:33:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2190 (T355609)', diff saved to https://phabricator.wikimedia.org/P56427 and previous config saved to /var/cache/conftool/dbconfig/20240207-113339-marostegui.json
[11:34:48] <icinga-wm>	 PROBLEM - Check systemd state on mw1363 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:35:33] <wikibugs>	 (03CR) 10Jcrespo: [C: 04-1] "It is missing the if ending bracket, otherwise it looks good. :-)" [puppet] - 10https://gerrit.wikimedia.org/r/998337 (https://phabricator.wikimedia.org/T316655) (owner: 10Btullis)
[11:36:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:36:44] <wikibugs>	 (03PS2) 10Btullis: Update data-engineering-postgresql bacula job defaults [puppet] - 10https://gerrit.wikimedia.org/r/998337 (https://phabricator.wikimedia.org/T316655)
[11:37:35] <claime>	 moritzm: still interested in k8s hosts failing ferm or should I just go ahead and manually restart that one?
[11:41:18] <wikibugs>	 (03PS1) 10Volans: Upstream release v0.3.5 [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/998355
[11:41:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:42:01] <wikibugs>	 (03CR) 10Volans: Upstream release v0.3.5 (1 comment) [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/998355 (owner: 10Volans)
[11:42:04] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] Update data-engineering-postgresql bacula job defaults [puppet] - 10https://gerrit.wikimedia.org/r/998337 (https://phabricator.wikimedia.org/T316655) (owner: 10Btullis)
[11:44:18] <wikibugs>	 (03PS5) 10Gmodena: WIP - add webrequest.frontend stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983905 (https://phabricator.wikimedia.org/T314956) (owner: 10Ottomata)
[11:45:04] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] Add a deployment chart for Superset (3 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/987785 (https://phabricator.wikimedia.org/T352166) (owner: 10Btullis)
[11:45:13] <wikibugs>	 (03PS1) 10Majavah: wikireplicas: maintain-views: try depooling host on lock failure [puppet] - 10https://gerrit.wikimedia.org/r/998356 (https://phabricator.wikimedia.org/T300427)
[11:45:29] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/998261 (https://phabricator.wikimedia.org/T356850) (owner: 10Majavah)
[11:45:42] <moritzm>	 claime: just restart them, when I find some time I'll add a toil class for it, but not this week
[11:45:52] <wikibugs>	 (03CR) 10Majavah: "This is untested for now." [puppet] - 10https://gerrit.wikimedia.org/r/998356 (https://phabricator.wikimedia.org/T300427) (owner: 10Majavah)
[11:46:05] <claime>	 ack
[11:47:20] <icinga-wm>	 RECOVERY - Check systemd state on mw1363 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:47:33] <wikibugs>	 (03CR) 10Majavah: [V: 03+1 C: 03+2] P:wmcs: cloud_private_subnet: add route to private instance networks [puppet] - 10https://gerrit.wikimedia.org/r/998261 (https://phabricator.wikimedia.org/T356850) (owner: 10Majavah)
[11:48:55] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master-codfw
[11:49:19] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors
[11:49:22] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors
[11:50:17] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+1] "Looks good, agree on keeping the apt dependency." [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/998355 (owner: 10Volans)
[11:51:14] <wikibugs>	 (03CR) 10Volans: [C: 03+2] Upstream release v0.3.5 [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/998355 (owner: 10Volans)
[11:51:18] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Update data-engineering-postgresql bacula job defaults [puppet] - 10https://gerrit.wikimedia.org/r/998337 (https://phabricator.wikimedia.org/T316655) (owner: 10Btullis)
[11:51:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (4) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:51:49] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/998355 (owner: 10Volans)
[11:52:05] <wikibugs>	 (03PS1) 10Clément Goubert: mw-web, mw-api-ext: Raise replicas for 40% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/998358 (https://phabricator.wikimedia.org/T355532)
[11:52:49] <wikibugs>	 (03Merged) 10jenkins-bot: Upstream release v0.3.5 [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/998355 (owner: 10Volans)
[11:52:52] <wikibugs>	 (03PS1) 10Clément Goubert: trafficserver: move 40% of traffic to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/998359 (https://phabricator.wikimedia.org/T355532)
[11:53:38] <icinga-wm>	 PROBLEM - Check systemd state on mw1473 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:53:46] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master-codfw
[11:54:58] <icinga-wm>	 RECOVERY - Check systemd state on mw1473 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:55:10] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1381 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[11:56:24] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master-eqiad
[11:56:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (4) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:56:44] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors
[11:56:47] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors
[11:56:54] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host lvs1013.eqiad.wmnet
[11:58:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T355609)', diff saved to https://phabricator.wikimedia.org/P56428 and previous config saved to /var/cache/conftool/dbconfig/20240207-115849-marostegui.json
[11:58:54] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[12:01:14] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master-eqiad
[12:01:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (4) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:02:10] <volans>	 !log uploaded debmonitor-client_0.3.5 to apt.wikimedia.org buster-wikimedia,bullseye-wikimedia,bookworm-wikimedia
[12:02:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:02:35] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1013.eqiad.wmnet
[12:02:35] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] trafficserver: move 40% of traffic to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/998359 (https://phabricator.wikimedia.org/T355532) (owner: 10Clément Goubert)
[12:02:55] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] mw-web, mw-api-ext: Raise replicas for 40% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/998358 (https://phabricator.wikimedia.org/T355532) (owner: 10Clément Goubert)
[12:03:42] <icinga-wm>	 RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:06:25] <jinxer-wm>	 (SystemdUnitFailed) resolved: (4) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:09:10] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] mw-web, mw-api-ext: Raise replicas for 40% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/998358 (https://phabricator.wikimedia.org/T355532) (owner: 10Clément Goubert)
[12:10:14] <wikibugs>	 (03Merged) 10jenkins-bot: mw-web, mw-api-ext: Raise replicas for 40% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/998358 (https://phabricator.wikimedia.org/T355532) (owner: 10Clément Goubert)
[12:10:44] <wikibugs>	 (03CR) 10Btullis: "Looking great! A few questions on the networkpolicies, but all good." [deployment-charts] - 10https://gerrit.wikimedia.org/r/987786 (https://phabricator.wikimedia.org/T353791) (owner: 10Btullis)
[12:12:00] <claime>	 !log mw-web, mw-api-ext: Raise replicas for 40% traffic - T355532
[12:12:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:12:04] <stashbot>	 T355532: Move 40% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T355532
[12:12:22] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[12:12:35] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[12:12:41] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[12:13:19] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[12:13:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P56429 and previous config saved to /var/cache/conftool/dbconfig/20240207-121356-marostegui.json
[12:14:03] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[12:14:06] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host lvs1014.eqiad.wmnet
[12:14:14] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[12:14:24] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[12:14:32] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[12:16:36] <wikibugs>	 (03PS23) 10Brouberol: Add a deployment chart for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987785 (https://phabricator.wikimedia.org/T352166) (owner: 10Btullis)
[12:17:06] <claime>	 arnaudb: Emperor: Heads up, raising mw-on-k8s traffic to 40% external traffic
[12:17:39] <wikibugs>	 (03PS1) 10Slyngshede: Add informative titles to all pages. [software/bitu] - 10https://gerrit.wikimedia.org/r/998365 (https://phabricator.wikimedia.org/T351136)
[12:17:59] <claime>	 !log trafficserver: move 40% of traffic to mw on k8s - T355532
[12:18:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:18:04] <stashbot>	 T355532: Move 40% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T355532
[12:18:05] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] trafficserver: move 40% of traffic to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/998359 (https://phabricator.wikimedia.org/T355532) (owner: 10Clément Goubert)
[12:18:15] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1014.eqiad.wmnet
[12:19:10] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host lvs1015.eqiad.wmnet
[12:21:18] <icinga-wm>	 PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-base-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:21:41] <jinxer-wm>	 (SystemdUnitFailed) firing: docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:24:46] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Add links with information to footer. [software/bitu] - 10https://gerrit.wikimedia.org/r/998301 (https://phabricator.wikimedia.org/T351137) (owner: 10Slyngshede)
[12:25:02] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1015.eqiad.wmnet
[12:25:24] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host lvs1016.eqiad.wmnet
[12:25:38] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1381 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[12:28:45] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] service: move thumbor from thumbor pool to kubesvc [puppet] - 10https://gerrit.wikimedia.org/r/951545 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan)
[12:29:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P56430 and previous config saved to /var/cache/conftool/dbconfig/20240207-122903-marostegui.json
[12:30:28] <wikibugs>	 (03CR) 10Brouberol: Add a deployment chart for Superset (2 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/987785 (https://phabricator.wikimedia.org/T352166) (owner: 10Btullis)
[12:31:13] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1016.eqiad.wmnet
[12:31:25] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs1020*,lvs2014*} and A:lvs (T334488)
[12:31:29] <stashbot>	 T334488: Final steps for fully-Kubernetes Thumbor - https://phabricator.wikimedia.org/T334488
[12:32:34] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs1020*,lvs2014*} and A:lvs (T334488)
[12:33:51] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir
[12:33:55] <wikibugs>	 (03PS1) 10Clément Goubert: docker-registry: Raise nginx timeouts to 240s [puppet] - 10https://gerrit.wikimedia.org/r/998392
[12:34:56] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs1019*,lvs2013*} and A:lvs (T334488)
[12:35:53] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs1019*,lvs2013*} and A:lvs (T334488)
[12:41:36] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] Configure analytics.wikimedia.org to support large downloads [puppet] - 10https://gerrit.wikimedia.org/r/998345 (https://phabricator.wikimedia.org/T356792) (owner: 10Btullis)
[12:44:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T355609)', diff saved to https://phabricator.wikimedia.org/P56431 and previous config saved to /var/cache/conftool/dbconfig/20240207-124409-marostegui.json
[12:44:15] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[12:45:25] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db2105.codfw.wmnet with reason: T344589 - kernel upgrade
[12:45:39] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2105.codfw.wmnet with reason: T344589 - kernel upgrade
[12:46:06] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'T344589 - depool db2105', diff saved to https://phabricator.wikimedia.org/P56432 and previous config saved to /var/cache/conftool/dbconfig/20240207-124605-arnaudb.json
[12:54:34] <icinga-wm>	 PROBLEM - Check systemd state on ncredir2001 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ipip0.service,ifup@ipip60.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:57:25] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert)
[12:57:51] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): Move 40% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T355532 (10Clement_Goubert) Open→Resolved
[12:58:44] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert)
[12:59:10] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:59:58] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:05:02] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:07:30] <icinga-wm>	 PROBLEM - Check systemd state on an-master1003 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-namenode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:07:32] <icinga-wm>	 PROBLEM - Hadoop Namenode - Primary on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Namenode_process
[13:09:24] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:09:38] <wikibugs>	 (03PS1) 10Majavah: openstack: overhaul the floating IP updater [puppet] - 10https://gerrit.wikimedia.org/r/998401 (https://phabricator.wikimedia.org/T341338)
[13:10:14] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:10:26] <icinga-wm>	 RECOVERY - Check systemd state on an-master1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:10:28] <icinga-wm>	 RECOVERY - Hadoop Namenode - Primary on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Namenode_process
[13:11:30] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1304/co" [puppet] - 10https://gerrit.wikimedia.org/r/998401 (https://phabricator.wikimedia.org/T341338) (owner: 10Majavah)
[13:12:20] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:13:00] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] openstack: overhaul the floating IP updater [puppet] - 10https://gerrit.wikimedia.org/r/998401 (https://phabricator.wikimedia.org/T341338) (owner: 10Majavah)
[13:15:18] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] docker_registry: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997817 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi)
[13:15:22] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] chartmuseum: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997816 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi)
[13:15:26] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] etcd: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997820 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi)
[13:15:30] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] confd: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997815 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi)
[13:15:33] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] mediawiki: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997819 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi)
[13:23:28] <wikibugs>	 (03PS2) 10Majavah: openstack: overhaul the floating IP updater [puppet] - 10https://gerrit.wikimedia.org/r/998401 (https://phabricator.wikimedia.org/T341338)
[13:24:03] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2105 (re)pooling @ 15%: kernel upgrade done', diff saved to https://phabricator.wikimedia.org/P56433 and previous config saved to /var/cache/conftool/dbconfig/20240207-132402-arnaudb.json
[13:24:57] <wikibugs>	 (03PS1) 10Hnowlan: kubernetes: make 5 appservers k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/998403 (https://phabricator.wikimedia.org/T351074)
[13:25:10] <icinga-wm>	 PROBLEM - Check systemd state on an-master1003 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-namenode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:25:10] <icinga-wm>	 PROBLEM - Hadoop Namenode - Primary on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Namenode_process
[13:25:34] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on es2024.codfw.wmnet with reason: T344589 - kernel upgrade
[13:25:38] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1305/co" [puppet] - 10https://gerrit.wikimedia.org/r/998401 (https://phabricator.wikimedia.org/T341338) (owner: 10Majavah)
[13:25:48] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2024.codfw.wmnet with reason: T344589 - kernel upgrade
[13:26:00] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'T344589 - depool es2024', diff saved to https://phabricator.wikimedia.org/P56434 and previous config saved to /var/cache/conftool/dbconfig/20240207-132559-arnaudb.json
[13:26:56] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] openstack: overhaul the floating IP updater [puppet] - 10https://gerrit.wikimedia.org/r/998401 (https://phabricator.wikimedia.org/T341338) (owner: 10Majavah)
[13:28:37] <wikibugs>	 (03PS1) 10Majavah: templates/56.15.185.in-addr.arpa: add missing includes [dns] - 10https://gerrit.wikimedia.org/r/998404 (https://phabricator.wikimedia.org/T341338)
[13:29:05] <wikibugs>	 (03PS3) 10Majavah: openstack: overhaul the floating IP updater [puppet] - 10https://gerrit.wikimedia.org/r/998401 (https://phabricator.wikimedia.org/T341338)
[13:29:36] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: openstack: overhaul the floating IP updater (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/998401 (https://phabricator.wikimedia.org/T341338) (owner: 10Majavah)
[13:32:40] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=1) rolling reboot on A:ncredir
[13:34:40] <wikibugs>	 (03CR) 10Slyngshede: "LGTM, this does roll out the nrpe script to all hosts though, but we're removing it in a little while, so I think it's fine." [puppet] - 10https://gerrit.wikimedia.org/r/997801 (https://phabricator.wikimedia.org/T332764) (owner: 10Filippo Giunchedi)
[13:35:24] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: ParserObserver: Limit the size of cache of previous parse traces [core] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/998384 (https://phabricator.wikimedia.org/T351732)
[13:35:29] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: ParserObserver: Limit the size of cache of previous parse traces [core] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/998385 (https://phabricator.wikimedia.org/T351732)
[13:36:41] <wikibugs>	 (03PS4) 10Majavah: openstack: overhaul the floating IP updater [puppet] - 10https://gerrit.wikimedia.org/r/998401 (https://phabricator.wikimedia.org/T341338)
[13:38:12] <wikibugs>	 (03PS5) 10Majavah: openstack: overhaul the floating IP updater [puppet] - 10https://gerrit.wikimedia.org/r/998401 (https://phabricator.wikimedia.org/T341338)
[13:39:08] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2105 (re)pooling @ 30%: kernel upgrade done', diff saved to https://phabricator.wikimedia.org/P56435 and previous config saved to /var/cache/conftool/dbconfig/20240207-133907-arnaudb.json
[13:39:27] <icinga-wm>	 RECOVERY - Hadoop Namenode - Primary on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Namenode_process
[13:39:30] <wikibugs>	 (03CR) 10Majavah: openstack: overhaul the floating IP updater (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/998401 (https://phabricator.wikimedia.org/T341338) (owner: 10Majavah)
[13:39:59] <icinga-wm>	 RECOVERY - Check systemd state on an-master1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:40:50] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1307/co" [puppet] - 10https://gerrit.wikimedia.org/r/998401 (https://phabricator.wikimedia.org/T341338) (owner: 10Majavah)
[13:48:02] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2024 (re)pooling @ 15%: kernel upgrade done', diff saved to https://phabricator.wikimedia.org/P56436 and previous config saved to /var/cache/conftool/dbconfig/20240207-134801-arnaudb.json
[13:48:56] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] admin_ng: drop version on apiGroups perms for exp NS in LiftWing [deployment-charts] - 10https://gerrit.wikimedia.org/r/998330 (https://phabricator.wikimedia.org/T354516) (owner: 10Klausman)
[13:51:35] <wikibugs>	 (03Merged) 10jenkins-bot: admin_ng: drop version on apiGroups perms for exp NS in LiftWing [deployment-charts] - 10https://gerrit.wikimedia.org/r/998330 (https://phabricator.wikimedia.org/T354516) (owner: 10Klausman)
[13:52:22] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[13:52:35] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[13:52:43] <jynus>	 confd maintenance?
[13:52:48] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[13:53:31] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[13:53:38] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[13:54:12] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[13:54:13] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2105 (re)pooling @ 60%: kernel upgrade done', diff saved to https://phabricator.wikimedia.org/P56437 and previous config saved to /var/cache/conftool/dbconfig/20240207-135412-arnaudb.json
[13:56:19] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:57:00] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] docker-registry: Raise nginx timeouts to 240s [puppet] - 10https://gerrit.wikimedia.org/r/998392 (owner: 10Clément Goubert)
[13:57:26] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/998338 (https://phabricator.wikimedia.org/T347634) (owner: 10Slyngshede)
[13:57:47] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:59:03] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.652 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:59:46] <wikibugs>	 (03CR) 10Vgutierrez: "any chance of increasing the timeout rather than disabling it?" [puppet] - 10https://gerrit.wikimedia.org/r/998345 (https://phabricator.wikimedia.org/T356792) (owner: 10Btullis)
[13:59:48] <wikibugs>	 10SRE, 10Machine-Learning-Team, 10Patch-For-Review: Requesting write access to ml-staging-codfw for ML team - https://phabricator.wikimedia.org/T354516 (10klausman) After dropping the version specifiers (`/v...`) at the end of the `apiGroups` directives, this is now working properly.
[14:00:00] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240207T1400).
[14:00:05] <jouncebot>	 MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:49] <jinxer-wm>	 (ProbeDown) firing: Service build2001:873 has failed probes (tcp_package_builder_rsync_ip6) - https://wikitech.wikimedia.org/wiki/Debian_Packaging#Upload_to_Wikimedia_Repo - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:00:51] <MatmaRex>	 hi
[14:00:59] <Lucas_WMDE>	 o/
[14:01:00] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:01:08] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:01:15] <Lucas_WMDE>	 I can deploy, I guess ^^
[14:01:35] <wikibugs>	 (03PS3) 10Muehlenhoff: debmonitor: Remove legacy cert handling [puppet] - 10https://gerrit.wikimedia.org/r/995183
[14:02:12] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/Flow] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/997878 (https://phabricator.wikimedia.org/T356223) (owner: 10Jforrester)
[14:02:20] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/Flow] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/997877 (https://phabricator.wikimedia.org/T356223) (owner: 10Jforrester)
[14:02:39] <MatmaRex>	 none of my changes are really testable on mwdebug
[14:02:46] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] debmonitor: Remove legacy cert handling [puppet] - 10https://gerrit.wikimedia.org/r/995183 (owner: 10Muehlenhoff)
[14:02:47] <MatmaRex>	 the Flow fix will hopefully show up in the logs
[14:02:58] <MatmaRex>	 the core fix is in preparation for a maintenance script i want to run
[14:03:08] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2024 (re)pooling @ 30%: kernel upgrade done', diff saved to https://phabricator.wikimedia.org/P56438 and previous config saved to /var/cache/conftool/dbconfig/20240207-140306-arnaudb.json
[14:03:24] <Lucas_WMDE>	 aha, I see you found the cause of the memory leak \o/
[14:03:24] <MatmaRex>	 (actually, if you're feeling bored, you could start that maintenance script for me ;) it's https://phabricator.wikimedia.org/T315510)
[14:03:27] <Lucas_WMDE>	 yeah, that seems hardly fixable
[14:03:31] <Lucas_WMDE>	 *testable
[14:03:43] <wikibugs>	 (03CR) 10Muehlenhoff: "Cloud uses "puppet", so that's fine as well." [puppet] - 10https://gerrit.wikimedia.org/r/995183 (owner: 10Muehlenhoff)
[14:03:46] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:04:08] <Lucas_WMDE>	 MatmaRex: only after the core backport is done, I assume?
[14:04:29] <MatmaRex>	 yeah
[14:04:46] <MatmaRex>	 the commands are listed here: https://phabricator.wikimedia.org/T315510#9312431
[14:07:33] <wikibugs>	 (03PS4) 10Muehlenhoff: debmonitor: Remove legacy cert handling [puppet] - 10https://gerrit.wikimedia.org/r/995183
[14:09:15] <wikibugs>	 (03Merged) 10jenkins-bot: Fix PermissionException being logged [extensions/Flow] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/997878 (https://phabricator.wikimedia.org/T356223) (owner: 10Jforrester)
[14:09:18] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2105 (re)pooling @ 75%: kernel upgrade done', diff saved to https://phabricator.wikimedia.org/P56439 and previous config saved to /var/cache/conftool/dbconfig/20240207-140918-arnaudb.json
[14:09:23] <wikibugs>	 (03Merged) 10jenkins-bot: Fix PermissionException being logged [extensions/Flow] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/997877 (https://phabricator.wikimedia.org/T356223) (owner: 10Jforrester)
[14:09:47] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:997878|Fix PermissionException being logged (T356223)]], [[gerrit:997877|Fix PermissionException being logged (T356223)]]
[14:09:51] <stashbot>	 T356223: Flow errors - Insufficient permissions to see userlinks for rev_id and InvalidTopicUuidException - https://phabricator.wikimedia.org/T356223
[14:10:25] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "starting gate-and-submit already" [core] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/998384 (https://phabricator.wikimedia.org/T351732) (owner: 10Bartosz Dziewoński)
[14:10:32] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "starting gate-and-submit already" [core] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/998385 (https://phabricator.wikimedia.org/T351732) (owner: 10Bartosz Dziewoński)
[14:11:14] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:11:20] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 jforrester and lucaswerkmeister-wmde: Backport for [[gerrit:997878|Fix PermissionException being logged (T356223)]], [[gerrit:997877|Fix PermissionException being logged (T356223)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:11:32] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 jforrester and lucaswerkmeister-wmde: Continuing with sync
[14:11:47] <Lucas_WMDE>	 (MatmaRex: ^ fyi)
[14:11:56] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/995183 (owner: 10Muehlenhoff)
[14:12:04] <MatmaRex>	 thanks
[14:12:14] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:12:40] <wikibugs>	 (03PS1) 10Majavah: network: make cloud_private_networks per_site [puppet] - 10https://gerrit.wikimedia.org/r/998411
[14:12:42] <wikibugs>	 (03PS1) 10Majavah: P:wmcs::cloudgw: do not traffic to cloud-internal networks [puppet] - 10https://gerrit.wikimedia.org/r/998412
[14:13:23] <wikibugs>	 (03PS2) 10Majavah: P:wmcs::cloudgw: do not traffic to cloud-internal networks [puppet] - 10https://gerrit.wikimedia.org/r/998412 (https://phabricator.wikimedia.org/T356850)
[14:14:56] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:15:11] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1308/co" [puppet] - 10https://gerrit.wikimedia.org/r/998412 (https://phabricator.wikimedia.org/T356850) (owner: 10Majavah)
[14:16:17] <_joe_>	 jouncebot: nowandnext
[14:16:17] <jouncebot>	 For the next 0 hour(s) and 43 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240207T1400)
[14:16:17] <jouncebot>	 In 0 hour(s) and 43 minute(s): Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240207T1500)
[14:16:20] <wikibugs>	 (03PS3) 10Majavah: P:wmcs::cloudgw: do not traffic to cloud-internal networks [puppet] - 10https://gerrit.wikimedia.org/r/998412 (https://phabricator.wikimedia.org/T356850)
[14:16:22] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1309/console" [puppet] - 10https://gerrit.wikimedia.org/r/998411 (owner: 10Majavah)
[14:16:36] <_joe_>	 Lucas_WMDE: when you're done, I have a backport for a train blocker :)
[14:16:43] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] docker-registry: Raise nginx timeouts to 240s [puppet] - 10https://gerrit.wikimedia.org/r/998392 (owner: 10Clément Goubert)
[14:16:55] <jinxer-wm>	 (SystemdUnitFailed) firing: (7) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:17:24] <wikibugs>	 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10Jclark-ctr) @dcaro  The same response was sent for each case  please advise how you would like me to proceed.  De...
[14:17:30] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1310/co" [puppet] - 10https://gerrit.wikimedia.org/r/998412 (https://phabricator.wikimedia.org/T356850) (owner: 10Majavah)
[14:17:42] <jinxer-wm>	 (SystemdUnitFailed) firing: generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:17:56] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:997878|Fix PermissionException being logged (T356223)]], [[gerrit:997877|Fix PermissionException being logged (T356223)]] (duration: 08m 08s)
[14:17:59] <Lucas_WMDE>	 _joe_: hm, I already started the gate-and-submit for the next backports :S
[14:18:00] <stashbot>	 T356223: Flow errors - Insufficient permissions to see userlinks for rev_id and InvalidTopicUuidException - https://phabricator.wikimedia.org/T356223
[14:18:06] <Lucas_WMDE>	 is it okay to wait until after that?
[14:18:13] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2024 (re)pooling @ 60%: kernel upgrade done', diff saved to https://phabricator.wikimedia.org/P56440 and previous config saved to /var/cache/conftool/dbconfig/20240207-141812-arnaudb.json
[14:18:13] <Lucas_WMDE>	 (Zuul predicts 12/13 minutes ETA for those)
[14:18:19] <_joe_>	 Lucas_WMDE: yeah I meant when you're done with the rest
[14:18:23] <Lucas_WMDE>	 ok
[14:18:28] <wikibugs>	 (03PS4) 10Majavah: P:wmcs::cloudgw: do not traffic to cloud-internal networks [puppet] - 10https://gerrit.wikimedia.org/r/998412 (https://phabricator.wikimedia.org/T356850)
[14:18:30] <wikibugs>	 (03PS1) 10Majavah: network: add cloud-codfw-bgp-private-vips [puppet] - 10https://gerrit.wikimedia.org/r/998415
[14:18:32] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:18:33] <Lucas_WMDE>	 (otherwise we could’ve squeezed in the backport, I meant)
[14:19:02] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [core] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/998384 (https://phabricator.wikimedia.org/T351732) (owner: 10Bartosz Dziewoński)
[14:19:05] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [core] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/998385 (https://phabricator.wikimedia.org/T351732) (owner: 10Bartosz Dziewoński)
[14:19:22] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] network: make cloud_private_networks per_site [puppet] - 10https://gerrit.wikimedia.org/r/998411 (owner: 10Majavah)
[14:19:31] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/995183 (owner: 10Muehlenhoff)
[14:19:40] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:19:43] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1311/co" [puppet] - 10https://gerrit.wikimedia.org/r/998412 (https://phabricator.wikimedia.org/T356850) (owner: 10Majavah)
[14:20:01] <wikibugs>	 (03PS5) 10Majavah: P:wmcs::cloudgw: do not traffic to cloud-internal networks [puppet] - 10https://gerrit.wikimedia.org/r/998412 (https://phabricator.wikimedia.org/T356850)
[14:20:11] <wikibugs>	 (03CR) 10Majavah: [V: 03+1 C: 03+2] network: make cloud_private_networks per_site [puppet] - 10https://gerrit.wikimedia.org/r/998411 (owner: 10Majavah)
[14:20:14] <_joe_>	 Lucas_WMDE: ah yeah, it's ok :)
[14:20:31] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2139.codfw.wmnet with reason: Maintenance
[14:20:55] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2139.codfw.wmnet with reason: Maintenance
[14:21:14] <Lucas_WMDE>	 _joe_: okay, then I’ll ping you :)
[14:21:16] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1312/co" [puppet] - 10https://gerrit.wikimedia.org/r/998412 (https://phabricator.wikimedia.org/T356850) (owner: 10Majavah)
[14:21:30] <Lucas_WMDE>	 ugh, one of them failed in selenium already
[14:21:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (36) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:22:47] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "let’s try that again, a selenium job randomly failed" [core] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/998385 (https://phabricator.wikimedia.org/T351732) (owner: 10Bartosz Dziewoński)
[14:22:56] <Lucas_WMDE>	 not sure that actually works
[14:23:10] <Lucas_WMDE>	 might have to wait for the first backport to finish gate-and-submit
[14:23:31] * Lucas_WMDE ignores the little shoulder demon that suggests submitting the change manually
[14:24:23] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2105 (re)pooling @ 100%: kernel upgrade done', diff saved to https://phabricator.wikimedia.org/P56441 and previous config saved to /var/cache/conftool/dbconfig/20240207-142423-arnaudb.json
[14:24:36] <wikibugs>	 (03PS1) 10Tsevener: Add edit_interaction stream config for iOS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998416 (https://phabricator.wikimedia.org/T355265)
[14:24:55] <wikibugs>	 (03PS1) 10Filippo Giunchedi: icinga: use systemd::timer::job for 'update-etcd-mw-config-lastindex' [puppet] - 10https://gerrit.wikimedia.org/r/998417 (https://phabricator.wikimedia.org/T337831)
[14:25:16] <icinga-wm>	 PROBLEM - Host sretest1001 is DOWN: PING CRITICAL - Packet loss = 100%
[14:25:17] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] "Also changed their type in the switch migration planning sheet" [puppet] - 10https://gerrit.wikimedia.org/r/998403 (https://phabricator.wikimedia.org/T351074) (owner: 10Hnowlan)
[14:25:32] <icinga-wm>	 RECOVERY - Host sretest1001 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms
[14:25:56] <MatmaRex>	 i think it only works after jenkins reports the failure
[14:26:15] <Lucas_WMDE>	 jenkins reported the failure alright, I can see it in zuul
[14:26:31] <Lucas_WMDE>	 but I think it’ll only repeat the build when the other change in the gate-and-submit-wmf pipeline finishes
[14:26:32] <MatmaRex>	 not on the gerrit change though
[14:26:36] <wikibugs>	 (03CR) 10Effie Mouzeli: mw-debug: set MCROUTER_SERVER variable (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/994789 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli)
[14:26:44] <Lucas_WMDE>	 so it can decide whether to base it on that (if it merges) or not (if it also fails)
[14:26:53] <Lucas_WMDE>	 hm, true
[14:26:59] <Lucas_WMDE>	 I think that might also be waiting for the same reason
[14:27:23] <Lucas_WMDE>	 if the previous change in the chain was bad, zuul would automatically retry the next change without it and not report an error on gerrit, probably
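The gating behaviour Lucas_WMDE describes above can be sketched as a toy model. This is only an illustration of the idea as stated in the chat (a change is first built on top of the changes ahead of it, and rebuilt without them if one of those fails); it is not Zuul's actual implementation, and `run_gate`, `passes`, and the change names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Base = Tuple[str, ...]  # the changes a build is stacked on top of


def run_gate(
    queue: Sequence[str],
    passes: Callable[[str, Base], bool],
) -> Tuple[List[str], List[Tuple[str, Base]]]:
    """Toy model of a dependent gate pipeline (hypothetical, simplified).

    Each change is first built speculatively on top of every change ahead
    of it in the queue. If some change ahead failed, that speculative
    result is discarded and the change is rebuilt on what actually merged.
    Returns the merged changes and a log of every build that was run.
    """
    merged: List[str] = []
    builds: List[Tuple[str, Base]] = []
    for index, change in enumerate(queue):
        speculative_base: Base = tuple(queue[:index])
        actual_base: Base = tuple(merged)
        builds.append((change, speculative_base))
        ok = passes(change, speculative_base)
        if speculative_base != actual_base:
            # A change ahead did not merge: rebuild without it.
            builds.append((change, actual_base))
            ok = passes(change, actual_base)
        if ok:
            merged.append(change)
    return merged, builds


# Hypothetical example: change "A" fails its tests, change "B" is fine.
merged, builds = run_gate(["A", "B"], lambda change, base: change != "A")
```

In this example only "B" merges, and it is built twice: once on top of "A" and once on its own after "A" fails. That is why a retry may have to wait for the change ahead of it to finish.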
[14:27:46] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1313/co" [puppet] - 10https://gerrit.wikimedia.org/r/998417 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi)
[14:28:50] <icinga-wm>	 PROBLEM - Check systemd state on sretest1001 is CRITICAL: CRITICAL - degraded: The following units failed: networking.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:29:29] <volans>	 !log deploying debmonitor-client_0.3.5 fleet-wide
[14:29:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:29:44] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1] "Basically remove a bunch of legacy, and unblocks the nrpe::monitor_systemd_unit_state removal task" [puppet] - 10https://gerrit.wikimedia.org/r/998417 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi)
[14:30:26] <wikibugs>	 (03PS1) 10Slyngshede: LDAPBackend: Implement limit checks for UID [software/bitu] - 10https://gerrit.wikimedia.org/r/998418
[14:30:43] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Improve unix username auto-fill [software/bitu] - 10https://gerrit.wikimedia.org/r/998338 (https://phabricator.wikimedia.org/T347634) (owner: 10Slyngshede)
[14:30:59] <wikibugs>	 (03Merged) 10jenkins-bot: ParserObserver: Limit the size of cache of previous parse traces [core] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/998384 (https://phabricator.wikimedia.org/T351732) (owner: 10Bartosz Dziewoński)
[14:31:04] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ParserObserver: Limit the size of cache of previous parse traces [core] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/998385 (https://phabricator.wikimedia.org/T351732) (owner: 10Bartosz Dziewoński)
[14:31:18] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [core] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/998385 (https://phabricator.wikimedia.org/T351732) (owner: 10Bartosz Dziewoński)
[14:31:26] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "*now* try again" [core] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/998385 (https://phabricator.wikimedia.org/T351732) (owner: 10Bartosz Dziewoński)
[14:31:33] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [core] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/998385 (https://phabricator.wikimedia.org/T351732) (owner: 10Bartosz Dziewoński)
[14:31:48] <Lucas_WMDE>	 20 more minutes of waiting probably, sorry :S
[14:31:50] <wikibugs>	 (03PS1) 10Majavah: P:openstack: rabbitmq: cleanup rabbitmq firewall [puppet] - 10https://gerrit.wikimedia.org/r/998419 (https://phabricator.wikimedia.org/T345610)
[14:32:04] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet
[14:32:29] <wikibugs>	 (03PS2) 10Slyngshede: LDAPBackend: Implement limit checks for UID [software/bitu] - 10https://gerrit.wikimedia.org/r/998418
[14:32:36] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ms-be2045.codfw.wmnet
[14:32:46] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ms-be2045.codfw.wmnet
[14:32:46] <logmsgbot>	 !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet
[14:32:59] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:openstack: rabbitmq: cleanup rabbitmq firewall [puppet] - 10https://gerrit.wikimedia.org/r/998419 (https://phabricator.wikimedia.org/T345610) (owner: 10Majavah)
[14:33:11] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1314/co" [puppet] - 10https://gerrit.wikimedia.org/r/998419 (https://phabricator.wikimedia.org/T345610) (owner: 10Majavah)
[14:33:15] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: elasticsearch::cirrus
[14:33:18] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2024 (re)pooling @ 75%: kernel upgrade done', diff saved to https://phabricator.wikimedia.org/P56442 and previous config saved to /var/cache/conftool/dbconfig/20240207-143317-arnaudb.json
[14:33:28] <_joe_>	 Lucas_WMDE: that's ok, I just want https://gerrit.wikimedia.org/r/c/mediawiki/extensions/TimedMediaHandler/+/998318 to go out so we might be able to unblock the train
[14:33:32] <wikibugs>	 (03CR) 10Effie Mouzeli: "As I discussed with Claime in previous comments, this variable is different than the others we set (which are all server related configur" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/994764 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli)
[14:34:54] <wikibugs>	 (03PS2) 10Majavah: P:openstack: rabbitmq: cleanup rabbitmq firewall [puppet] - 10https://gerrit.wikimedia.org/r/998419 (https://phabricator.wikimedia.org/T345610)
[14:36:02] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:openstack: rabbitmq: cleanup rabbitmq firewall [puppet] - 10https://gerrit.wikimedia.org/r/998419 (https://phabricator.wikimedia.org/T345610) (owner: 10Majavah)
[14:36:05] <wikibugs>	 10SRE, 10Wikimedia-Etherpad, 10collaboration-services: Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421 (10akosiaris) >>! In T316421#9520979, @Jelto wrote: > @akosiaris do you have any information or experience what upgrade path etherpad-lite has? I was not able to find an...
[14:36:35] <Lucas_WMDE>	 _joe_: hm, was ffmpegEncode() missing the * 1024 factor before? (I can see that in midiToAudioEncode() it just moved around)
[14:36:49] <_joe_>	 both :D
[14:36:53] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch elasticsearch::cirrus to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/998421 (https://phabricator.wikimedia.org/T349619)
[14:37:01] <_joe_>	 and also adding a limit to wall clock time, I misread our configs
[14:37:18] <wikibugs>	 (PS3) Majavah: P:openstack: rabbitmq: cleanup rabbitmq firewall [puppet] - https://gerrit.wikimedia.org/r/998419 (https://phabricator.wikimedia.org/T345610)
[14:37:31] <Lucas_WMDE>	 yeah, but the time limit is explained in the commit message and the memory bit isn’t :P
[14:37:48] <Lucas_WMDE>	 anyway, looks fine to backport :)
[14:37:53] <_joe_>	 Additionally, add the same configuration to
[14:37:54] <_joe_>	 ffmpegEncode() as well as midiEncode().
[14:37:56] <_joe_>	 :)
[14:38:06] <_joe_>	 I just forgot to add the conf there, completely
[14:38:14] <_joe_>	 that's what you get for writing patches during outages
[14:38:28] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Switch elasticsearch::cirrus to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/998421 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[14:38:45] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1316/co" [puppet] - https://gerrit.wikimedia.org/r/998419 (https://phabricator.wikimedia.org/T345610) (owner: Majavah)
[14:39:29] <Lucas_WMDE>	 limits that are silently in units other than “one” without mentioning it in the name are evil anyway ^^
[14:39:33] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:39:36] <Lucas_WMDE>	 if it was $wgTranscodeBackgroundMemoryLimitInKiB it would’ve been more obvious
[14:39:50] <_joe_>	 Lucas_WMDE: actually I want to make it in bytes
[14:39:58] <_joe_>	 but not while unblocking the train
[14:40:02] <Lucas_WMDE>	 yeah, fair
[14:40:12] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.decommission for hosts ms-be[2044-2050].codfw.wmnet
[14:41:34] <wikibugs>	 (PS1) Btullis: Configure analytics.wikimedia.org to support large downloads [puppet] - https://gerrit.wikimedia.org/r/998422 (https://phabricator.wikimedia.org/T356792)
[14:41:58] <_joe_>	 Lucas_WMDE: looks like CI will fail for https://gerrit.wikimedia.org/r/c/mediawiki/core/+/998385?tab=checks
[14:42:15] <Lucas_WMDE>	 I think that’s the old check?
[14:42:23] <Lucas_WMDE>	 in https://integration.wikimedia.org/zuul/ it looks all green or blue at the moment
[14:42:48] <wikibugs>	 (CR) CI reject: [V: -1] Configure analytics.wikimedia.org to support large downloads [puppet] - https://gerrit.wikimedia.org/r/998422 (https://phabricator.wikimedia.org/T356792) (owner: Btullis)
[14:43:00] <_joe_>	 Lucas_WMDE: ah yeah gerrit's UI is confusing
[14:43:21] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, Infrastructure-Foundations, netops: Migrate servers in codfw rack A2 from asw-a2-codfw to lsw1-a2-codfw - https://phabricator.wikimedia.org/T355861 (klausman)
[14:43:32] <wikibugs>	 (PS1) Majavah: P:openstack: radosgw: move to firewall::service [puppet] - https://gerrit.wikimedia.org/r/998423
[14:43:36] <_joe_>	 I'm a bit worried by how excruciatingly slow CI is
[14:44:06] <_joe_>	 in case we're in an emergency we'll have to cherry-pick a patch to the deployment server
[14:44:25] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.hosts.downtime for 3:00:00 on ml-cache2001.codfw.wmnet with reason: Machine network link move (T355861)
[14:44:29] <stashbot>	 T355861: Migrate servers in codfw rack A2 from asw-a2-codfw to lsw1-a2-codfw - https://phabricator.wikimedia.org/T355861
[14:44:30] <jinxer-wm>	 (ProbeDown) firing: (3) Service build2001:873 has failed probes (tcp_package_builder_rsync_ip6)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:44:42] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ml-cache2001.codfw.wmnet with reason: Machine network link move (T355861)
[14:45:14] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, Infrastructure-Foundations, netops: Migrate servers in codfw rack A2 from asw-a2-codfw to lsw1-a2-codfw - https://phabricator.wikimedia.org/T355861 (klausman)
[14:45:22] <wikibugs>	 (PS2) Btullis: Configure analytics.wikimedia.org to support large downloads [puppet] - https://gerrit.wikimedia.org/r/998345 (https://phabricator.wikimedia.org/T356792)
[14:45:42] <wikibugs>	 (Abandoned) Btullis: Configure analytics.wikimedia.org to support large downloads [puppet] - https://gerrit.wikimedia.org/r/998422 (https://phabricator.wikimedia.org/T356792) (owner: Btullis)
[14:46:06] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1317/co" [puppet] - https://gerrit.wikimedia.org/r/998423 (owner: Majavah)
[14:46:21] <wikibugs>	 (CR) Majavah: P:openstack: radosgw: move to firewall::service [puppet] - https://gerrit.wikimedia.org/r/998423 (owner: Majavah)
[14:46:55] <wikibugs>	 (CR) Btullis: "OK, can do. I set for 5 minutes in the latest patchset." [puppet] - https://gerrit.wikimedia.org/r/998345 (https://phabricator.wikimedia.org/T356792) (owner: Btullis)
[14:47:31] <wikibugs>	 (CR) Hnowlan: [C: +2] kubernetes: make 5 appservers k8s workers [puppet] - https://gerrit.wikimedia.org/r/998403 (https://phabricator.wikimedia.org/T351074) (owner: Hnowlan)
[14:47:44] <wikibugs>	 (CR) Klausman: [C: +1] "As a pessimist, I suspect 300s (5m) will bite us again in the future, but at least we'll know where to look 😉" [puppet] - https://gerrit.wikimedia.org/r/998345 (https://phabricator.wikimedia.org/T356792) (owner: Btullis)
[14:48:18] <wikibugs>	 (PS1) Filippo Giunchedi: nrpe: remove monitor_systemd_unit_state [puppet] - https://gerrit.wikimedia.org/r/998424 (https://phabricator.wikimedia.org/T337831)
[14:48:23] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2024 (re)pooling @ 100%: kernel upgrade done', diff saved to https://phabricator.wikimedia.org/P56443 and previous config saved to /var/cache/conftool/dbconfig/20240207-144822-arnaudb.json
[14:50:08] <wikibugs>	 (CR) Brouberol: Add helmfile deployments for Superset (5 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/987786 (https://phabricator.wikimedia.org/T353791) (owner: Btullis)
[14:50:10] <wikibugs>	 (PS14) Brouberol: Add helmfile deployments for Superset [deployment-charts] - https://gerrit.wikimedia.org/r/987786 (https://phabricator.wikimedia.org/T353791) (owner: Btullis)
[14:50:19] <wikibugs>	 (Merged) jenkins-bot: ParserObserver: Limit the size of cache of previous parse traces [core] (wmf/1.42.0-wmf.17) - https://gerrit.wikimedia.org/r/998385 (https://phabricator.wikimedia.org/T351732) (owner: Bartosz Dziewoński)
[14:50:45] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:998384|ParserObserver: Limit the size of cache of previous parse traces (T351732)]], [[gerrit:998385|ParserObserver: Limit the size of cache of previous parse traces (T351732)]]
[14:50:48] <stashbot>	 T351732: Debug memory leak in maintenance script - https://phabricator.wikimedia.org/T351732
[14:50:50] <vgutierrez>	 !log reboot ncredir2001
[14:50:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:26] <icinga-wm>	 RECOVERY - Check systemd state on ncredir2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:51:50] <vgutierrez>	 topranks: ^^
[14:52:15] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and matmarex: Backport for [[gerrit:998384|ParserObserver: Limit the size of cache of previous parse traces (T351732)]], [[gerrit:998385|ParserObserver: Limit the size of cache of previous parse traces (T351732)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:52:34] <Lucas_WMDE>	 nothing to test, or so I heard
[14:52:36] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and matmarex: Continuing with sync
[14:52:41] <MatmaRex>	 yeah
[14:53:30] <icinga-wm>	 RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:55:39] <wikibugs>	 (PS1) Btullis: Use the analytics-presto CNAME for workers and clients [puppet] - https://gerrit.wikimedia.org/r/998425 (https://phabricator.wikimedia.org/T336045)
[14:56:23] <Lucas_WMDE>	 MatmaRex: how long is that maintenance script expected to take?
[14:56:41] <jinxer-wm>	 (SystemdUnitFailed) resolved: (2) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:57:04] <MatmaRex>	 Lucas_WMDE: the enwiki one probably a couple of weeks. the other ones a couple of days
[14:57:18] <wikibugs>	 (PS2) Btullis: Use the analytics-presto CNAME for workers and clients [puppet] - https://gerrit.wikimedia.org/r/998425 (https://phabricator.wikimedia.org/T336045)
[14:57:29] <vgutierrez>	 !log reboot ncredir2001
[14:57:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:57:51] <Lucas_WMDE>	 ok
[14:58:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (8) prometheus-phpfpm-statustext-textfile.service Failed on mw1401:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:58:54] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:998384|ParserObserver: Limit the size of cache of previous parse traces (T351732)]], [[gerrit:998385|ParserObserver: Limit the size of cache of previous parse traces (T351732)]] (duration: 08m 08s)
[14:58:57] <stashbot>	 T351732: Debug memory leak in maintenance script - https://phabricator.wikimedia.org/T351732
[14:58:58] <wikibugs>	 (CR) Majavah: "recheck" [software/bitu] - https://gerrit.wikimedia.org/r/998418 (owner: Slyngshede)
[14:59:04] <Lucas_WMDE>	 _joe_: you can backport now, I think
[14:59:28] <wikibugs>	 (CR) CI reject: [V: -1] LDAPBackend: Implement limit checks for UID [software/bitu] - https://gerrit.wikimedia.org/r/998418 (owner: Slyngshede)
[14:59:33] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:59:42] <Lucas_WMDE>	 MatmaRex: and all of the commands in that comment are still needed? (asking because it’s a few months old and urbanecm had some comments afterwards)
[14:59:50] <Lucas_WMDE>	 (“that comment” = https://phabricator.wikimedia.org/T315510#9312431)
[14:59:59] * urbanecm was summoned
[15:00:06] <jouncebot>	 Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240207T1500)
[15:00:20] <MatmaRex>	 Lucas_WMDE: i'm not sure how far the scripts made it, so not all may be needed, but they won't hurt
[15:00:26] <Lucas_WMDE>	 ok
[15:00:31] * Lucas_WMDE fires up a tmux
[15:00:33] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[15:00:40] <MatmaRex>	 it seems easier to re-run them than to figure it out
[15:00:57] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[15:00:59] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[15:01:12] <wikibugs>	 (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/998425 (https://phabricator.wikimedia.org/T336045) (owner: Btullis)
[15:01:15] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[15:01:19] <MatmaRex>	 thank you :)
[15:01:21] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:01:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1158 (T355609)', diff saved to https://phabricator.wikimedia.org/P56444 and previous config saved to /var/cache/conftool/dbconfig/20240207-150121-marostegui.json
[15:01:34] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[15:01:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (30) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:01:45] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.puppet.migrate-role (exit_code=99) for role: elasticsearch::cirrus
[15:02:05] <Lucas_WMDE>	 !log START lucaswerkmeister-wmde@mwmaint2002:~$ mwscript extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --wiki frwiki --current --all --touched-after=20230613000000 --start '["7544396"]' # T315510, in tmux
[15:02:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:02:14] <stashbot>	 T315510: Start maintenance script to backfill talk page comment database - https://phabricator.wikimedia.org/T315510
[15:03:16] <Lucas_WMDE>	 I’ll wait until it prints the next --start line before starting rowiki
[15:03:25] <jinxer-wm>	 (SystemdUnitFailed) resolved: (30) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:03:26] <_joe_>	 Lucas_WMDE: can we start merging my change? it will take 20 minutes anyways
[15:03:33] <Lucas_WMDE>	 _joe_: you can go ahead, I’m done
[15:03:38] <Lucas_WMDE>	 or should I deploy it?
[15:03:51] <_joe_>	 Lucas_WMDE: if you already have a console :)
[15:03:57] <Lucas_WMDE>	 ok sure ^^
[15:04:05] <_joe_>	 <3
[15:04:05] <wikibugs>	 (PS1) Slyngshede: Use the ManifestStaticFilesStorage in production [software/bitu] - https://gerrit.wikimedia.org/r/998426
[15:04:09] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.17) - https://gerrit.wikimedia.org/r/998318 (https://phabricator.wikimedia.org/T356780) (owner: Giuseppe Lavagetto)
[15:04:23] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:04:28] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2377.codfw.wmnet with OS bullseye
[15:04:43] <wikibugs>	 (CR) CI reject: [V: -1] Use the ManifestStaticFilesStorage in production [software/bitu] - https://gerrit.wikimedia.org/r/998426 (owner: Slyngshede)
[15:04:47] <Lucas_WMDE>	 !log STOP script for T315510, forgot to tee it somewhere useful
[15:04:49] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2378.codfw.wmnet with OS bullseye
[15:04:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:05:29] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2406.codfw.wmnet with OS bullseye
[15:05:30] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2301.codfw.wmnet with OS bullseye
[15:05:32] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2310.codfw.wmnet with OS bullseye
[15:05:52] <Lucas_WMDE>	 !log START lucaswerkmeister-wmde@mwmaint2002:~$ mwscript extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --wiki frwiki --current --all --touched-after=20230613000000 --start '["7544396"]' | tee ~/T315510-frwiki # in tmux
[15:05:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:06:33] <wikibugs>	 SRE-swift-storage: Q3 ms backend refresh work - https://phabricator.wikimedia.org/T353149 (MatthewVernon)
[15:06:44] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T355609)', diff saved to https://phabricator.wikimedia.org/P56445 and previous config saved to /var/cache/conftool/dbconfig/20240207-150643-marostegui.json
[15:06:48] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[15:07:25] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:07:36] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.dns.netbox
[15:08:47] <wikibugs>	 (PS3) Slyngshede: LDAPBackend: Implement limit checks for UID [software/bitu] - https://gerrit.wikimedia.org/r/998418
[15:08:58] <Lucas_WMDE>	 MatmaRex: how long should it usually take to start seeing some more output from the script?
[15:09:08] <Lucas_WMDE>	 the frwiki one has just printed “Processing” and the first --start so far
[15:09:13] <Lucas_WMDE>	 none of the “Processed” messages yet
[15:09:35] <Lucas_WMDE>	 (haven’t started the script for other wikis yet)
[15:09:40] <wikibugs>	 (CR) CI reject: [V: -1] LDAPBackend: Implement limit checks for UID [software/bitu] - https://gerrit.wikimedia.org/r/998418 (owner: Slyngshede)
[15:10:25] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ms-be[2044-2050].codfw.wmnet decommissioned, removing all IPs except the asset tag one - mvernon@cumin2002"
[15:10:27] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:11:04] <wikibugs>	 (PS4) Slyngshede: LDAPBackend: Implement limit checks for UID [software/bitu] - https://gerrit.wikimedia.org/r/998418
[15:11:07] <Lucas_WMDE>	 I guess it’s going through a lot of rows again that were already processed, and not printing output until it finds the point where it really has to resume?
[15:11:26] <MatmaRex>	 Lucas_WMDE: hmm, not sure
[15:11:49] <MatmaRex>	 or it could be stuck on some page that causes parsoid to hang
[15:11:57] <wikibugs>	 (CR) CI reject: [V: -1] LDAPBackend: Implement limit checks for UID [software/bitu] - https://gerrit.wikimedia.org/r/998418 (owner: Slyngshede)
[15:11:59] <Lucas_WMDE>	 I’ll start the rowiki and see how it behaves
[15:12:09] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:12:12] <Lucas_WMDE>	 !log START lucaswerkmeister-wmde@mwmaint2002:~$ mwscript extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --wiki rowiki --current --all --touched-after=20230613000000 --start '["2041962"]' | tee ~/T315510-rowiki # in tmux
[15:12:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:19] <Lucas_WMDE>	 ok, that one is printing output directly
[15:12:27] <Lucas_WMDE>	 processed 100/200/300 (updated 0)
[15:12:30] <MatmaRex>	 i can schedule this for another time if you want to be done for today
[15:12:37] <Lucas_WMDE>	 so it sounds like it should print something even when it has nothing to do 🤔
[15:12:43] <MatmaRex>	 hm
[15:13:13] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:13:35] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:13:37] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ms-be[2044-2050].codfw.wmnet decommissioned, removing all IPs except the asset tag one - mvernon@cumin2002"
[15:13:38] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:13:39] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ms-be[2044-2050].codfw.wmnet
[15:13:45] <wikibugs>	 SRE-swift-storage: Q3 ms backend refresh work - https://phabricator.wikimedia.org/T353149 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by mvernon@cumin2002 for hosts: `ms-be[2044-2050].codfw.wmnet` - ms-be2044.codfw.wmnet (**PASS**)   - Downtimed host on Icinga/Alertmanager   - Found physic...
[15:14:25] <Lucas_WMDE>	 _joe_: I might as well ask this now – is it possible to test the WebVideoTranscodeJob backport on mwdebug?
[15:14:33] <Lucas_WMDE>	 (still ETA 9min in zuul btw)
[15:14:36] <wikibugs>	 (PS5) Slyngshede: LDAPBackend: Implement limit checks for UID [software/bitu] - https://gerrit.wikimedia.org/r/998418
[15:14:47] <_joe_>	 Lucas_WMDE: no I don't think you can, given jobs go to the jobqueue
[15:14:55] <Lucas_WMDE>	 yeah, makes sense
[15:14:59] <_joe_>	 and this change is contained to a job
[15:15:31] <wikibugs>	 (CR) CI reject: [V: -1] LDAPBackend: Implement limit checks for UID [software/bitu] - https://gerrit.wikimedia.org/r/998418 (owner: Slyngshede)
[15:16:27] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:16:34] <wikibugs>	 (PS6) Slyngshede: LDAPBackend: Implement limit checks for UID [software/bitu] - https://gerrit.wikimedia.org/r/998418
[15:18:46] <wikibugs>	 (CR) CI reject: [V: -1] LDAPBackend: Implement limit checks for UID [software/bitu] - https://gerrit.wikimedia.org/r/998418 (owner: Slyngshede)
[15:18:50] <wikibugs>	 SRE, observability, Sustainability (Incident Followup): thanos-query probedown due to OOM of both eqiad titan frontends - https://phabricator.wikimedia.org/T356788 (lmata) Thanks for the report; we'll continue to investigate and discuss.
[15:18:50] <_joe_>	 also, 9 minutes, wow. Jenkins is fast
[15:19:07] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:19:39] <wikibugs>	 SRE: sre - https://phabricator.wikimedia.org/T356881 (Vecna-the-whispered)
[15:19:49] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:20:11] <wikibugs>	 (CR) Ssingh: slo_definitions: Use trafficserver_backend_sli_bad (1 comment) [grafana-grizzly] - https://gerrit.wikimedia.org/r/973872 (https://phabricator.wikimedia.org/T341606) (owner: BCornwall)
[15:20:21] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2377.codfw.wmnet with reason: host reimage
[15:20:29] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2378.codfw.wmnet with reason: host reimage
[15:21:04] <Lucas_WMDE>	 MatmaRex: rough estimate for rowiki based on its current processing rate: a bit over 2 days
[15:21:19] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2406.codfw.wmnet with reason: host reimage
[15:21:20] <Lucas_WMDE>	 though so far it’s still “updated 0” all around, so who knows how much it’ll slow down once it actually has something to do :'D
[15:21:43] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2301.codfw.wmnet with reason: host reimage
[15:21:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P56446 and previous config saved to /var/cache/conftool/dbconfig/20240207-152150-marostegui.json
[15:21:59] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2310.codfw.wmnet with reason: host reimage
[15:22:02] <Lucas_WMDE>	 still no output from frwiki btw o_O
[15:22:07] * Lucas_WMDE peeks at htop
[15:22:21] <Lucas_WMDE>	 well, it’s barely eating any CPU
[15:22:24] <wikibugs>	 (CR) Brouberol: [C: +1] "Yes please" [puppet] - https://gerrit.wikimedia.org/r/998425 (https://phabricator.wikimedia.org/T336045) (owner: Btullis)
[15:22:50] <Lucas_WMDE>	 wait, wrong process
[15:22:55] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2377.codfw.wmnet with reason: host reimage
[15:23:14] <Lucas_WMDE>	 okay, the frwiki process *is* eating 100% of one CPU
[15:23:24] <Lucas_WMDE>	 17 minutes of CPU time so far
[15:23:36] <wikibugs>	 SRE, ops-codfw, DBA, Infrastructure-Foundations, netops: Migrate servers in codfw rack A3 from asw-a3-codfw to lsw1-a3-codfw - https://phabricator.wikimedia.org/T355862 (Jhancock.wm) This rack is physically ready
[15:23:45] <Lucas_WMDE>	 rowiki is more like 60% CPU (and making visible progress for it, of course)
[15:23:55] <MatmaRex>	 well… something is eating memory though: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=mwmaint2002&viewPanel=4
[15:23:59] <Lucas_WMDE>	 frwiki is also at 10.9G resident memory already
[15:24:19] <wikibugs>	 SRE: sre - https://phabricator.wikimedia.org/T356881 (Bugreporter) Open→Invalid
[15:24:21] <MatmaRex>	 i think it's stuck on some specific page. there's a parsoid bug where some pages take infinite memory
[15:24:29] <wikibugs>	 (Merged) jenkins-bot: WebVideoTranscodeJob: also add time limits [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.17) - https://gerrit.wikimedia.org/r/998318 (https://phabricator.wikimedia.org/T356780) (owner: Giuseppe Lavagetto)
[15:24:31] <Lucas_WMDE>	 yikes
[15:24:33] <Lucas_WMDE>	 that’s a lot of memory
[15:24:38] <_joe_>	 yeah
[15:24:42] <Lucas_WMDE>	 yeah I’ll probably kill it in a few minutes
[15:24:46] <MatmaRex>	 (this is not the memory leak that i hope i fixed)
[15:24:51] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:998318|WebVideoTranscodeJob: also add time limits (T356780)]]
[15:24:55] <stashbot>	 T356780: Video transcoding fails when firejail is enabled - https://phabricator.wikimedia.org/T356780
[15:25:09] <MatmaRex>	 i think you can stop it, yeah, and i'll need to find what page is that, because there isn't enough logging
[15:25:31] <Lucas_WMDE>	 I guess it must be in the first 100 rows after the start ID 7544396?
[15:25:33] <Lucas_WMDE>	 whatever table that refers to
[15:25:38] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2310.codfw.wmnet with reason: host reimage
[15:25:39] <Lucas_WMDE>	 (I think I guessed it wrong once already, weeks ago ^^)
[15:25:44] <MatmaRex>	 yes
[15:25:58] <Lucas_WMDE>	 if the limit is configurable I can try to narrow it down, maybe
[15:26:05] <Lucas_WMDE>	 but let’s backport poor _joe_’s change first ^^
[15:26:16] <_joe_>	 ahah
[15:26:18] <MatmaRex>	 (see https://phabricator.wikimedia.org/T254522 and https://phabricator.wikimedia.org/T353874 for one specific case)
[15:26:25] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and oblivian: Backport for [[gerrit:998318|WebVideoTranscodeJob: also add time limits (T356780)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[15:26:26] <_joe_>	 thanks <3
[15:26:27] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and oblivian: Continuing with sync
[15:27:07] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] "LGTM." [puppet] - https://gerrit.wikimedia.org/r/998415 (owner: Majavah)
[15:27:27] <wikibugs>	 (PS6) Arturo Borrero Gonzalez: P:wmcs::cloudgw: do not NAT traffic to cloud-internal networks [puppet] - https://gerrit.wikimedia.org/r/998412 (https://phabricator.wikimedia.org/T356850) (owner: Majavah)
[15:27:40] <wikibugs>	 (CR) Majavah: [C: +2] network: add cloud-codfw-bgp-private-vips [puppet] - https://gerrit.wikimedia.org/r/998415 (owner: Majavah)
[15:28:02] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2301.codfw.wmnet with reason: host reimage
[15:28:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) prometheus-phpfpm-statustext-textfile.service Failed on mwdebug1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:30:14] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2406.codfw.wmnet with reason: host reimage
[15:30:26] <wikibugs>	 (PS7) Slyngshede: LDAPBackend: Implement limit checks for UID [software/bitu] - https://gerrit.wikimedia.org/r/998418
[15:30:37] <wikibugs>	 (CR) Btullis: [C: +2] Configure analytics.wikimedia.org to support large downloads [puppet] - https://gerrit.wikimedia.org/r/998345 (https://phabricator.wikimedia.org/T356792) (owner: Btullis)
[15:31:07] <Lucas_WMDE>	 !log STOP persistRevisionThreadItems on frwiki for T315510 – 100% CPU usage, 15G RAM and counting, no progress output: clearly stuck on something
[15:31:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:13] <stashbot>	 T315510: Start maintenance script to backfill talk page comment database - https://phabricator.wikimedia.org/T315510
[15:31:23] <wikibugs>	 (CR) Btullis: [C: +2] Use the analytics-presto CNAME for workers and clients [puppet] - https://gerrit.wikimedia.org/r/998425 (https://phabricator.wikimedia.org/T336045) (owner: Btullis)
[15:31:27] <wikibugs>	 (CR) CI reject: [V: -1] LDAPBackend: Implement limit checks for UID [software/bitu] - https://gerrit.wikimedia.org/r/998418 (owner: Slyngshede)
[15:32:24] <_joe_>	 Lucas_WMDE: it might make sense to launch the script with strace at this point
[15:32:37] <_joe_>	 or if you launch it, I can attach with strace myself
[15:32:40] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:998318|WebVideoTranscodeJob: also add time limits (T356780)]] (duration: 07m 48s)
[15:32:44] <stashbot>	 T356780: Video transcoding fails when firejail is enabled - https://phabricator.wikimedia.org/T356780
[15:32:44] <_joe_>	 to try and see what's going on
[15:33:02] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2378.codfw.wmnet with reason: host reimage
[15:33:04] <_joe_>	 oh ok, let me see if I finally fixed something, or if I need to propose a rollback
[15:33:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (32) prometheus-phpfpm-statustext-textfile.service Failed on mw1349:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:33:30] <Lucas_WMDE>	 👍
[15:33:39] * Lucas_WMDE hasn’t straced mwscript before
[15:33:49] <Lucas_WMDE>	 stracing the php process based on its PID might work better, yeah
[15:34:09] <Lucas_WMDE>	 but I have a meeting now, so I’ll leave it alone for a bit if that’s okay
[15:34:24] <Lucas_WMDE>	 !log backport+config window done
[15:34:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:35:17] <wikibugs>	 (PS2) Clément Goubert: codfw lvs::balancer: Switch config_host to conf2006 [puppet] - https://gerrit.wikimedia.org/r/998431 (https://phabricator.wikimedia.org/T355870)
[15:35:19] <wikibugs>	 (CR) Clément Goubert: [V: +1] "In preparation for B3 migration on 2024-02-28 where conf2004 will go offline for a brief period. I've presumed we don't want to use conf20" [puppet] - https://gerrit.wikimedia.org/r/998431 (https://phabricator.wikimedia.org/T355870) (owner: Clément Goubert)
[15:35:21] <wikibugs>	 (CR) Volans: [C: +1] "LGTM, thx" [cookbooks] - https://gerrit.wikimedia.org/r/956082 (https://phabricator.wikimedia.org/T345778) (owner: Bking)
[15:35:23] <MatmaRex>	 yeah, thanks. i will find the problem page
[15:35:42] <MatmaRex>	 _joe_: it's 100% stuck in parsoid trying to parse some degenerate wikitext
[15:36:00] <_joe_>	 I love "degenerate wikitext"
[15:36:21] <claime>	 Isn't that just wikitext?
[15:36:26] <Lucas_WMDE>	 ayyyyyy
[15:36:26] <MatmaRex>	 heh
[15:36:35] <MatmaRex>	 i actually found the page, not sure if i should paste the link here
[15:36:42] <Lucas_WMDE>	 MatmaRex: is it okay if I already start ukwiki and/or viwiki? or should they wait until rowiki is done? (since they’re combined by && in your comment)
[15:36:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P56447 and previous config saved to /var/cache/conftool/dbconfig/20240207-153656-marostegui.json
[15:37:27] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:37:29] <MatmaRex>	 Lucas_WMDE: probably better to wait, amir didn't want me to run multiple of those scripts on the same db group
[15:37:38] <Lucas_WMDE>	 alright
[15:37:51] <MatmaRex>	 (i think that's too safe, but better safe than sorry)
[15:38:03] <Lucas_WMDE>	 should I start enwiki then?
[15:38:05] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, Infrastructure-Foundations, netops: Migrate servers in codfw rack A7 from asw-a7-codfw to lsw1-a7-codfw - https://phabricator.wikimedia.org/T355867 (Andrew) There's no need to coordinate with us for cloudbackup2001, it might cause us to get a transient alert...
[15:38:25] <jinxer-wm>	 (SystemdUnitFailed) resolved: (38) prometheus-phpfpm-statustext-textfile.service Failed on mw1349:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:38:33] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:38:57] <Lucas_WMDE>	 probably with the last --start from https://phabricator.wikimedia.org/T315510#9328399
[15:39:11] <MatmaRex>	 yeah, you can
[15:39:34] <MatmaRex>	 and yeah, you're right, we can start from that point
[15:39:49] <Lucas_WMDE>	 ok
[15:40:08] <Lucas_WMDE>	 !log START lucaswerkmeister-wmde@mwmaint2002:~$ mwscript extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --wiki enwiki --current --all --start '["67578461"]' | tee ~/T315510-enwiki # in tmux
[15:40:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:40:14] <MatmaRex>	 oh, i guess that also says the rowiki and ukwiki runs finished?
[15:40:28] <Lucas_WMDE>	 oh, hm
[15:40:36] <Lucas_WMDE>	 (enwiki is making progress btw and already updated 1)
[15:40:38] <Lucas_WMDE>	 (yay)
[15:42:06] <wikibugs>	 (CR) Ssingh: [C: +1] "Thanks for the patch!" [puppet] - https://gerrit.wikimedia.org/r/998431 (https://phabricator.wikimedia.org/T355870) (owner: Clément Goubert)
[15:42:15] <Lucas_WMDE>	 MatmaRex: but doesn’t that comment only mean that rowiki finished, and ukwiki was still in progress?
[15:42:22] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2377.codfw.wmnet with OS bullseye
[15:42:40] <MatmaRex>	 Lucas_WMDE: oops, yes
[15:42:58] <MatmaRex>	 i misread
[15:43:17] <Lucas_WMDE>	 ok, so I can kill rowiki and instead start ukwiki with the --start from there
[15:43:39] <jelto>	 !log import etherpad-lite 1.9.7-1 on apt1001 host - T316421
[15:44:02] <MatmaRex>	 yeah
[15:44:59] <icinga-wm>	 PROBLEM - Check systemd state on kubemaster1002 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:45:00] <Lucas_WMDE>	 !log STOP persistRevisionThreadItems on rowiki for T315510 – according to T315510#9328399, it should be done already (it was at --start '["2075226"]' and had processed 31000, updated 0)
[15:45:19] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2310.codfw.wmnet with OS bullseye
[15:45:22] <hnowlan>	 Lucas_WMDE: are you rolling out changes related to jobqueue/jobrunners atm? 
[15:45:31] <Emperor>	 !log depool codfw dnsdisc T355861
[15:45:38] <Lucas_WMDE>	 not as far as I’m aware
[15:45:44] <logmsgbot>	 !log mvernon@cumin2002 conftool action : set/pooled=false; selector: dnsdisc=swift,name=codfw
[15:45:47] <Lucas_WMDE>	 I deployed backports, those are done
[15:45:50] <Lucas_WMDE>	 and am running some maintenance scripts
[15:45:51] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on asw-a-codfw,cr[1-2]-codfw,lsw1-a2-codfw.mgmt with reason: prepping for server uplink migration codfw rack a2
[15:45:58] <claime>	 hnowlan: There was joe's change to TMH
[15:46:00] <Lucas_WMDE>	 hnowlan: is anything wrong?
[15:46:07] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on asw-a-codfw,cr[1-2]-codfw,lsw1-a2-codfw.mgmt with reason: prepping for server uplink migration codfw rack a2
[15:46:11] <icinga-wm>	 RECOVERY - Check systemd state on kubemaster1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:46:14] <Lucas_WMDE>	 the last backport (joe’s change to TMH) was https://gerrit.wikimedia.org/r/c/mediawiki/extensions/TimedMediaHandler/+/998318
[15:46:15] <Emperor>	 !log depool thanos-fe2001 T355861
[15:46:20] <_joe_>	 hnowlan: talk to me
[15:46:28] <_joe_>	 what is the problem you're seeing?
[15:46:32] <topranks>	 !log moving Netbox server uplinks from asw-a2-codfw to lsw1-a2-codfw to prep config for server moves T355861
[15:46:34] <hnowlan>	 there's been a spike in errors for jobqueue since 14:55  https://logstash.wikimedia.org/goto/684a454f5135b7b7fdb695a19b0ec98d
[15:46:56] <_joe_>	 so well before my change went out, which was changing webVideoTranscode on group0
[15:47:42] <_joe_>	 hnowlan: those are errors *enqueueing* jobs
[15:47:50] <_joe_>	 so the problem seems to be eventgate-main maybe?
[15:47:57] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2301.codfw.wmnet with OS bullseye
[15:47:57] <Lucas_WMDE>	 the backport of https://gerrit.wikimedia.org/r/c/mediawiki/core/+/998384 would be closer to that time, but I don’t see how it could be related
[15:49:11] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2406.codfw.wmnet with OS bullseye
[15:49:29] <_joe_>	 I can't make much of that logstash
[15:49:52] <MatmaRex>	 (i filed https://phabricator.wikimedia.org/T356884 about the fr.wp page that i think is hanging my maintenance script)
[15:50:00] <_joe_>	 hnowlan: also seems limited to k8s, wth
[15:50:01] <hnowlan>	 yeah not a lot of detail in the errors
[15:50:28] <hnowlan>	 _joe_: could be a side effect of the specific type of job 
[15:50:41] <_joe_>	 hnowlan: it's not just the jobrunners
[15:50:57] <_joe_>	 but did you find it's a specific type of job?
[15:51:01] <hnowlan>	 no
[15:51:01] <_joe_>	 did I miss something?
[15:52:01] <_joe_>	 no it's all over the place
[15:52:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T355609)', diff saved to https://phabricator.wikimedia.org/P56448 and previous config saved to /var/cache/conftool/dbconfig/20240207-155203-marostegui.json
[15:52:05] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[15:52:10] <_joe_>	 hnowlan: I'd take a look at eventgate-main
[15:52:17] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2378.codfw.wmnet with OS bullseye
[15:52:19] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[15:52:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1170:3317 (T355609)', diff saved to https://phabricator.wikimedia.org/P56449 and previous config saved to /var/cache/conftool/dbconfig/20240207-155225-marostegui.json
[15:54:02] <claime>	 Seeing some heap limit exceeded logs for eventgate-main but not much more
[15:54:08] <claime>	 and it's like 2 errors
[15:54:12] <claime>	 well warnings
[15:59:12] <Lucas_WMDE>	 I also just noticed stashbot is gone
[15:59:18] <Lucas_WMDE>	 so some SAL messages got lost already
[15:59:24] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on 22 hosts with reason: Migrating servers in codfw rack A2 to lsw1-a2-codfw
[15:59:45] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 22 hosts with reason: Migrating servers in codfw rack A2 to lsw1-a2-codfw
[16:00:49] <jinxer-wm>	 (ProbeDown) firing: (3) Service build2001:873 has failed probes (tcp_package_builder_rsync_ip6)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:00:58] <jinxer-wm>	 (ProbeDown) firing: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:01:16] <herron>	 hello thanos my old friend
[16:01:19] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1001.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[16:01:27] <icinga-wm>	 PROBLEM - SSH on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[16:01:27] <icinga-wm>	 PROBLEM - SSH on titan1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[16:01:44] <cdanis>	 herron: isn't this about when it happened yesterday?
[16:01:50] <cdanis>	 16:00 UTC exactly 🤔
[16:02:03] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1001.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[16:02:14] <herron>	 cdanis: yeah sounds right
[16:02:19] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T355609)', diff saved to https://phabricator.wikimedia.org/P56450 and previous config saved to /var/cache/conftool/dbconfig/20240207-160218-marostegui.json
[16:02:28] <topranks>	 !log Commencing server uplink moves from old switch  to new in codfw rack A2 T355861
[16:03:34] <Lucas_WMDE>	 !log STOP persistRevisionThreadItems on rowiki for T315510 – according to T315510#9328399, it should be done already (it was at --start '["2075226"]' and had processed 31000, updated 0) [relog from 15:45, stashbot was down]
[16:03:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:03:44] <Lucas_WMDE>	 that’s better
[16:03:46] <stashbot>	 T315510: Start maintenance script to backfill talk page comment database - https://phabricator.wikimedia.org/T315510
[16:03:58] <Lucas_WMDE>	 (lots of other log messages from the last 20 minutes are presumably also missing)
[16:04:30] <jinxer-wm>	 (ProbeDown) firing: (5) Service build2001:873 has failed probes (tcp_package_builder_rsync_ip6)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:04:33] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job pint in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:04:51] <vgutierrez>	 !log <topranks> Commencing server uplink moves from old switch  to new in codfw rack A2 T355861
[16:04:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:05:01] <jelto>	 !log import etherpad-lite 1.9.7-1 on apt1001 host - T316421
[16:05:01] <stashbot>	 T355861: Migrate servers in codfw rack A2 from asw-a2-codfw to lsw1-a2-codfw - https://phabricator.wikimedia.org/T355861
[16:05:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:05:14] <stashbot>	 T316421: Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421
[16:05:28] <topranks>	 vgutierrez: thanks!  I'd forget my own damn head you know :)
[16:05:37] <vgutierrez>	 np :)
[16:05:58] <vgutierrez>	 I owe you some brain cells for that /etc/network/interfaces thingie
[16:07:14] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.presto.roll-restart-workers for Presto analytics cluster: Roll restart of all Presto's jvm daemons.
[16:10:20] <Lucas_WMDE>	 it looks like the “could not enqueue jobs” errors went away again?
[16:10:33] <herron>	 !log hard reboot titan1002
[16:10:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:10:37] <Lucas_WMDE>	 ah, at 16:00 UTC as cdanis wrote above
[16:10:44] <Lucas_WMDE>	 (I wasn’t sure if that had referred to the same thing or something else ^^)
[16:11:04] <cdanis>	 Lucas_WMDE: for 16:00 UTC I was referring to the titan1* crashes
[16:11:11] <Lucas_WMDE>	 hm, ok
[16:11:42] <Lucas_WMDE>	 still, the last error in logstash was at 16:00:00.874…
[16:11:54] <Lucas_WMDE>	 that’s extremely close to the full hour
[16:12:20] <hnowlan>	 yeeeeah, two messages that are ms over the second but otherwise a dead stop 
[16:14:13] <icinga-wm>	 RECOVERY - SSH on titan1002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[16:15:19] <Lucas_WMDE>	 !log START lucaswerkmeister-wmde@mwmaint2002:~$ mwscript extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --wiki ukwiki --current --all --touched-after=20230613000000 --start '["1685316"]' | tee ~/T315510-ukwiki # in tmux
[16:15:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:15:26] <Lucas_WMDE>	 MatmaRex: ^ fyi
[16:15:42] <Lucas_WMDE>	 okay, and it’s printing “processed” messages, so it’s not stuck it seems
[16:15:48] <Lucas_WMDE>	 (though it feels slower than the enwiki one?)
[16:15:49] <jinxer-wm>	 (ProbeDown) firing: (5) Service build2001:873 has failed probes (tcp_package_builder_rsync_ip6)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:15:50] <MatmaRex>	 thanks Lucas_WMDE
[16:16:02] <Emperor>	 !log repool thanos-fe2001 T355861
[16:16:04] <MatmaRex>	 Lucas_WMDE: btw, i am finding out why it's broken: https://phabricator.wikimedia.org/T356884
[16:16:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:16:08] <stashbot>	 T355861: Migrate servers in codfw rack A2 from asw-a2-codfw to lsw1-a2-codfw - https://phabricator.wikimedia.org/T355861
[16:16:10] <Lucas_WMDE>	 actually, scratch that, I think they’re about equally slow and I just forgot the speed
[16:16:14] <Lucas_WMDE>	 nice
[16:16:19] <logmsgbot>	 !log mvernon@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=swift,name=codfw
[16:16:26] <_joe_>	 MatmaRex: is that script enqueuing jobs, by any chance?
[16:16:27] <Emperor>	 !log repool codfw dnsdisc T355861
[16:16:29] <MatmaRex>	 we seem to have an infinite loop in DiscussionTools actually, not the parser
[16:16:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:16:38] <MatmaRex>	 _joe_: it shouldn't
[16:16:43] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[16:16:49] <_joe_>	 MatmaRex: oh interesting :)
[16:16:58] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.hosts.remove-downtime for ml-cache2001.codfw.wmnet
[16:16:59] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ml-cache2001.codfw.wmnet
[16:17:04] <Lucas_WMDE>	 _joe_: the enwiki script has also been running in the background the whole time btw
[16:17:09] <MatmaRex>	 but it parses pages, and who knows what that does
[16:17:09] <Lucas_WMDE>	 even after the logstash errors stopped
[16:17:15] <MatmaRex>	 does the timing match the other issue?
[16:17:16] <_joe_>	 yeah I'm trying to find causes
[16:17:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P56451 and previous config saved to /var/cache/conftool/dbconfig/20240207-161725-marostegui.json
[16:17:31] <_joe_>	 but I think it was just eventgate not being healthy
[16:17:31] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[16:18:05] <icinga-wm>	 RECOVERY - SSH on titan1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[16:19:30] <jinxer-wm>	 (ProbeDown) firing: (8) Service build2001:873 has failed probes (tcp_package_builder_rsync_ip6)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:19:33] <jinxer-wm>	 (JobUnavailable) resolved: (5) Reduced availability for job pint in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:19:43] <Lucas_WMDE>	 MatmaRex: the ukwiki run is still printing “updated 0” despite having made some updates at the end of the script run urbanec.m posted; is there something else that could have “processed” the pages(?) in the meantime, or is this unexpected?
[16:19:58] <MatmaRex>	 Lucas_WMDE: yeah, if they were purged for any reason
[16:20:08] <Lucas_WMDE>	 hm, ok
[16:20:15] <Lucas_WMDE>	 or maybe it was “And started again” at the end of that comment…
[16:20:34] <Lucas_WMDE>	 although enwiki had stuff to do from the get go, so it doesn’t seem like that one had been started again
[16:20:49] <jinxer-wm>	 (ProbeDown) firing: (8) Service build2001:873 has failed probes (tcp_package_builder_rsync_ip6)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:20:57] <jinxer-wm>	 (ProbeDown) resolved: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:24:40] <logmsgbot>	 !log sbailey@deploy2002 helmfile [staging] START helmfile.d/services/proton: apply
[16:25:56] <logmsgbot>	 !log sbailey@deploy2002 helmfile [staging] DONE helmfile.d/services/proton: apply
[16:27:56] <Lucas_WMDE>	 I asked about it on the task now
[16:32:01] <logmsgbot>	 !log sbailey@deploy2002 helmfile [eqiad] START helmfile.d/services/proton: apply
[16:32:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P56452 and previous config saved to /var/cache/conftool/dbconfig/20240207-163231-marostegui.json
[16:33:30] <logmsgbot>	 !log sbailey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/proton: apply
[16:34:36] <logmsgbot>	 !log sbailey@deploy2002 helmfile [codfw] START helmfile.d/services/proton: apply
[16:35:44] <logmsgbot>	 !log sbailey@deploy2002 helmfile [codfw] DONE helmfile.d/services/proton: apply
[16:39:29] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.presto.roll-restart-workers (exit_code=0) for Presto analytics cluster: Roll restart of all Presto's jvm daemons.
[16:46:25] <hnowlan>	 !log homer 'cr*codfw*' commit 'T354791' for 5 new k8s ex-appservers 
[16:46:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:46:29] <stashbot>	 T354791: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791
[16:46:53] <jynus>	 btullis: Bring two new stat servers into service (9596fbf8b5) ok to merge?
[16:47:17] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.remove-downtime for asw-a-codfw,cr[1-2]-codfw,lsw1-a2-codfw.mgmt
[16:47:19] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for asw-a-codfw,cr[1-2]-codfw,lsw1-a2-codfw.mgmt
[16:47:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T355609)', diff saved to https://phabricator.wikimedia.org/P56454 and previous config saved to /var/cache/conftool/dbconfig/20240207-164738-marostegui.json
[16:47:41] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[16:47:42] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[16:47:54] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[16:52:12] <logmsgbot>	 !log hnowlan@cumin2002 conftool action : set/weight=10; selector: name=(mw2377.codfw.wmnet|mw2378.codfw.wmnet|mw2406.codfw.wmnet|mw2301.codfw.wmnet|mw2310.codfw.wmnet),cluster=kubernetes,service=kubesvc
[16:52:21] <logmsgbot>	 !log hnowlan@cumin2002 conftool action : set/pooled=yes; selector: name=(mw2377.codfw.wmnet|mw2378.codfw.wmnet|mw2406.codfw.wmnet|mw2301.codfw.wmnet|mw2310.codfw.wmnet),cluster=kubernetes,service=kubesvc
[16:54:34] <logmsgbot>	 !log sbailey@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply
[16:55:00] <logmsgbot>	 !log sbailey@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply
[16:55:25] <icinga-wm>	 PROBLEM - Check systemd state on grafana2001 is CRITICAL: CRITICAL - degraded: The following units failed: grafana-loki.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:56:43] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[16:56:57] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[16:57:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1174 (T355609)', diff saved to https://phabricator.wikimedia.org/P56455 and previous config saved to /var/cache/conftool/dbconfig/20240207-165703-marostegui.json
[16:57:07] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[16:58:03] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.timer,httpbb_kubernetes_mw-api-ext_hourly.timer,httpbb_kubernetes_mw-api-int_hourly.timer,httpbb_kubernetes_mw-jobrunner_hourly.timer,httpbb_kubernetes_mw-web_hourly.timer,httpbb_kubernetes_mw-wikifunctions_hourly.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:00:58] <swfrench-wmf>	 FYI, I'm in the process of moving those httpbb timers from cumin1001 to cumin1002. They've now been absented on cumin1001 and are coming up on 1002 shortly.
[17:02:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T355609)', diff saved to https://phabricator.wikimedia.org/P56456 and previous config saved to /var/cache/conftool/dbconfig/20240207-170225-marostegui.json
[17:02:47] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[17:03:20] <logmsgbot>	 !log sbailey@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply
[17:03:56] <logmsgbot>	 !log sbailey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply
[17:03:58] <jinxer-wm>	 (ProbeDown) firing: (2) Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:04:21] <logmsgbot>	 !log sbailey@deploy2002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply
[17:04:52] <logmsgbot>	 !log sbailey@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply
[17:05:34] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[17:08:04] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs2013 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[17:10:54] <topranks>	 sukhe: ^^ not sure if this is expected?
[17:11:12] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: service=thumbor
[17:12:07] <hnowlan>	 topranks: wonder if that's related to my thumbor mishap
[17:12:16] <topranks>	 hnowlan: I was just wondering the same 
[17:12:23] <hnowlan>	 it might recover in a few 
[17:12:24] <sukhe>	 yeah :)
[17:12:39] <topranks>	 ok cool, thanks! 
[17:12:52] <sukhe>	 checking still
[17:13:08] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs2013 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[17:13:23] <topranks>	 yay :)
[17:13:24] <sukhe>	 ok :)
[17:13:28] <hnowlan>	 apologies 
[17:13:40] <sukhe>	 np! thanks for the ping topranks 
[17:13:53] <sukhe>	 hnowlan: Traffic is around if we can help
[17:13:59] <jinxer-wm>	 (ProbeDown) resolved: (2) Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:14:05] <sukhe>	 oops
[17:14:30] <jinxer-wm>	 (ProbeDown) firing: (3) Service build2001:873 has failed probes (tcp_package_builder_rsync_ip6)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:15:44] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[17:17:32] <icinga-wm>	 PROBLEM - Check systemd state on stat1011 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:17:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P56457 and previous config saved to /var/cache/conftool/dbconfig/20240207-171732-marostegui.json
[17:18:04] <icinga-wm>	 PROBLEM - Check systemd state on stat1010 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:25:23] <logmsgbot>	 !log jgiannelos@deploy2002 Started deploy [restbase/deploy@1007273]: Disabling storage for jawiki
[17:26:07] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-jumbo-eqiad
[17:32:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P56458 and previous config saved to /var/cache/conftool/dbconfig/20240207-173238-marostegui.json
[17:32:43] <logmsgbot>	 !log jgiannelos@deploy2002 Finished deploy [restbase/deploy@1007273]: Disabling storage for jawiki (duration: 07m 19s)
[17:47:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T355609)', diff saved to https://phabricator.wikimedia.org/P56459 and previous config saved to /var/cache/conftool/dbconfig/20240207-174745-marostegui.json
[17:47:47] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1191.eqiad.wmnet with reason: Maintenance
[17:47:50] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[17:48:01] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1191.eqiad.wmnet with reason: Maintenance
[17:48:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1191 (T355609)', diff saved to https://phabricator.wikimedia.org/P56460 and previous config saved to /var/cache/conftool/dbconfig/20240207-174807-marostegui.json
[17:52:42] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in search_codfw
[17:52:44] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in search_codfw
[17:53:15] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:53:29] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T355609)', diff saved to https://phabricator.wikimedia.org/P56461 and previous config saved to /var/cache/conftool/dbconfig/20240207-175328-marostegui.json
[17:53:32] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[17:56:59] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240207T1800)
[18:08:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P56462 and previous config saved to /var/cache/conftool/dbconfig/20240207-180835-marostegui.json
[18:13:53] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:15:11] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:17:42] <jinxer-wm>	 (SystemdUnitFailed) firing: generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:18:34] <wikibugs>	 (03PS1) 10Dzahn: wikistats: symlink deploy script into PATH [puppet] - 10https://gerrit.wikimedia.org/r/998491
[18:18:45] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10serviceops, 10Patch-For-Review: httpbb needs to be setup on cumin1002 and removed from cumin1001 - https://phabricator.wikimedia.org/T356054 (10Scott_French) 05Open→03Resolved Timers are up and happy on cumin1002 and no longer running on cumin1001.
[18:18:49] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Setup cumin1002 and eventually decom cumin1001 - https://phabricator.wikimedia.org/T353419 (10Scott_French)
[18:19:53] <wikibugs>	 (03CR) 10BryanDavis: [C: 03+1] Provide context for account creation. [software/bitu] - 10https://gerrit.wikimedia.org/r/997811 (https://phabricator.wikimedia.org/T353584) (owner: 10Slyngshede)
[18:20:05] <wikibugs>	 (03CR) 10Dzahn: "deployed config change with deploy-wikistats to change user agent for https://phabricator.wikimedia.org/T354101" [puppet] - 10https://gerrit.wikimedia.org/r/998491 (owner: 10Dzahn)
[18:20:27] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Provide context for account creation. [software/bitu] - 10https://gerrit.wikimedia.org/r/997811 (https://phabricator.wikimedia.org/T353584) (owner: 10Slyngshede)
[18:20:44] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] wikistats: symlink deploy script into PATH [puppet] - 10https://gerrit.wikimedia.org/r/998491 (owner: 10Dzahn)
[18:23:22] <wikibugs>	 (03PS2) 10Majavah: Add a python-bookworm image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/997537
[18:23:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P56463 and previous config saved to /var/cache/conftool/dbconfig/20240207-182342-marostegui.json
[18:24:23] <wikibugs>	 (03PS1) 10Ahmon Dancy: Update buildkitd image references [puppet] - 10https://gerrit.wikimedia.org/r/998493 (https://phabricator.wikimedia.org/T356418)
[18:25:29] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-jumbo-eqiad
[18:27:41] <jinxer-wm>	 (CirrusSearchNodeIndexingNotIncreasing) firing: (5) Elasticsearch instance elastic2037-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[18:28:26] <wikibugs>	 (03PS1) 10Bking: cloudelastic: Begin private IP migration for cloudelastic1008 [puppet] - 10https://gerrit.wikimedia.org/r/998494 (https://phabricator.wikimedia.org/T355617)
[18:30:36] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.kafka.roll-restart-mirror-maker restart MirrorMaker for Kafka A:kafka-mirror-maker-jumbo-eqiad cluster: Roll restart of jvm daemons.
[18:31:54] <wikibugs>	 (03PS1) 10Ahmon Dancy: Revert "Temporarily enable Dockerfile frontend on trusted runners" [puppet] - 10https://gerrit.wikimedia.org/r/998495 (https://phabricator.wikimedia.org/T356418)
[18:32:41] <jinxer-wm>	 (CirrusSearchNodeIndexingNotIncreasing) firing: (4) Elasticsearch instance elastic2037-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[18:33:01] <wikibugs>	 10SRE, 10SRE-Access-Requests: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (10Dzahn) 05Open→03In progress
[18:33:03] <inflatador>	 :eyes on that Elastic alert
[18:33:11] <wikibugs>	 10SRE, 10SRE-Access-Requests: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (10Dzahn) a:03cchen
[18:33:44] <wikibugs>	 10SRE, 10SRE-Access-Requests: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (10Dzahn) p:05Triage→03High
[18:35:04] <wikibugs>	 10SRE-Access-Requests, 10Data-Platform-SRE: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (10Dzahn) Should we close this ticket as "invalid"? It seems the best course of action might be a new ticket like "migrate all WMDE pipelines to airflow" an...
[18:35:47] <wikibugs>	 10SRE-Access-Requests, 10Data-Platform-SRE: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (10Dzahn) 05Open→03In progress
[18:37:41] <jinxer-wm>	 (CirrusSearchNodeIndexingNotIncreasing) resolved: (3) Elasticsearch instance elastic2055-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[18:38:12] <inflatador>	 ah, these are leftovers from the switch maintenance, I guess the suppression just expired
[18:38:17] <inflatador>	 anyway, we're all good
[18:38:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T355609)', diff saved to https://phabricator.wikimedia.org/P56464 and previous config saved to /var/cache/conftool/dbconfig/20240207-183849-marostegui.json
[18:38:52] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1194.eqiad.wmnet with reason: Maintenance
[18:38:56] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[18:39:06] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1194.eqiad.wmnet with reason: Maintenance
[18:39:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1194 (T355609)', diff saved to https://phabricator.wikimedia.org/P56465 and previous config saved to /var/cache/conftool/dbconfig/20240207-183912-marostegui.json
[18:40:24] <wikibugs>	 (03CR) 10Cathal Mooney: "LGTM overall, splitting the dmz_cidr is a good idea.  I think for the purpose of the "no nat" rule it might be easier to just use the clou" [puppet] - 10https://gerrit.wikimedia.org/r/998412 (https://phabricator.wikimedia.org/T356850) (owner: 10Majavah)
[18:43:41] <wikibugs>	 (03PS1) 10Bking: cloudelastic: Complete cloudelastic1008's migration [puppet] - 10https://gerrit.wikimedia.org/r/998498 (https://phabricator.wikimedia.org/T355617)
[18:44:04] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] cloudelastic: Complete cloudelastic1008's migration [puppet] - 10https://gerrit.wikimedia.org/r/998498 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking)
[18:44:19] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/998494 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking)
[18:44:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T355609)', diff saved to https://phabricator.wikimedia.org/P56466 and previous config saved to /var/cache/conftool/dbconfig/20240207-184433-marostegui.json
[18:44:38] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[18:45:28] <wikibugs>	 (03CR) 10Eevans: [C: 03+2] sessionstore: provision sessionstore2004 (new) [puppet] - 10https://gerrit.wikimedia.org/r/997991 (https://phabricator.wikimedia.org/T356829) (owner: 10Eevans)
[18:49:05] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-mirror-maker (exit_code=0) restart MirrorMaker for Kafka A:kafka-mirror-maker-jumbo-eqiad cluster: Roll restart of jvm daemons.
[18:49:07] <wikibugs>	 10SRE-Access-Requests, 10Data-Platform-SRE: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (10mpopov) Not yet. I believe @AndrewTavis_WMDE will be sharing some findings from WMDE side soon.
[18:50:39] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/998494 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking)
[18:53:30] <wikibugs>	 (03CR) 10Btullis: "The commit message is a bit confusing. Are we at risk of removing them before they are defunct?" [puppet] - 10https://gerrit.wikimedia.org/r/995107 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking)
[18:59:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P56467 and previous config saved to /var/cache/conftool/dbconfig/20240207-185940-marostegui.json
[19:00:05] <jouncebot>	 brennen and dancy: That opportune time for a Train log triage with CPT deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240207T1900).
[19:00:05] <jouncebot>	 brennen and dancy: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240207T1900).
[19:00:54] <brennen>	 o/
[19:01:05] <dancy>	 o/
[19:01:35] <wikibugs>	 (03PS1) 10Eevans: (faux) keys & certs for sessionstore200[4-6] [labs/private] - 10https://gerrit.wikimedia.org/r/998504 (https://phabricator.wikimedia.org/T356829)
[19:01:36] <brennen>	 !log train 1.42.0-wmf.17 (T354435): a couple of blockers currently, waiting on resolution before rolling
[19:01:37] <wikibugs>	 (03PS1) 10Eevans: cleanup obsolete keys & certs (hosts decommissioned) [labs/private] - 10https://gerrit.wikimedia.org/r/998505
[19:01:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:01:54] <stashbot>	 T354435: 1.42.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T354435
[19:02:17] <wikibugs>	 (03PS3) 10BCornwall: slo_definitions: Use trafficserver_backend_sli_bad [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/973872 (https://phabricator.wikimedia.org/T341606)
[19:07:07] <wikibugs>	 (03CR) 10BCornwall: slo_definitions: Use trafficserver_backend_sli_bad (031 comment) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/973872 (https://phabricator.wikimedia.org/T341606) (owner: 10BCornwall)
[19:07:17] <wikibugs>	 (03CR) 10Eevans: [V: 03+2 C: 03+2] (faux) keys & certs for sessionstore200[4-6] [labs/private] - 10https://gerrit.wikimedia.org/r/998504 (https://phabricator.wikimedia.org/T356829) (owner: 10Eevans)
[19:08:11] <wikibugs>	 (03CR) 10Eevans: [V: 03+2 C: 03+2] cleanup obsolete keys & certs (hosts decommissioned) [labs/private] - 10https://gerrit.wikimedia.org/r/998505 (owner: 10Eevans)
[19:09:08] <wikibugs>	 10SRE, 10SRE-Access-Requests: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (10cchen) @MoritzMuehlenhoff thank you for restoring my access!  I am trying to log into Superset and Hue, but I cannot access them. I also reset the developer account's passwo...
[19:09:25] <wikibugs>	 (03CR) 10BCornwall: slo_definitions: Use trafficserver_backend_sli_bad (031 comment) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/973872 (https://phabricator.wikimedia.org/T341606) (owner: 10BCornwall)
[19:12:10] <wikibugs>	 (03PS1) 10Dzahn: aptrepo: allow for gitlab versions between 16.5.x and 16.7.x [puppet] - 10https://gerrit.wikimedia.org/r/998510 (https://phabricator.wikimedia.org/T356906)
[19:14:47] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P56468 and previous config saved to /var/cache/conftool/dbconfig/20240207-191446-marostegui.json
[19:16:48] <wikibugs>	 (03PS2) 10Dzahn: aptrepo: allow for gitlab versions from 16.5 to 16.6.x [puppet] - 10https://gerrit.wikimedia.org/r/998510 (https://phabricator.wikimedia.org/T356906)
[19:16:51] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Data-Platform-SRE: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (10mpopov) Tagging DPE SRE in case this is specific to those tools.  @cchen: Can you please verify if you can `ssh` to the stat hosts and also use Jupyte...
[19:19:52] <mutante>	 !log people1004 systemctl stop confd; running puppet; checking to remove confd remnants from people* hosts - T356296
[19:19:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:19:56] <stashbot>	 T356296: confd setup left without configuration doesn't stop confd - https://phabricator.wikimedia.org/T356296
[19:22:14] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Data-Platform-SRE: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (10cchen) I `ssh` the stats machine and `kinit`, and got `Password incorrect while getting initial credentials`, and I also tried JupyterHub, and it also...
[19:25:51] <wikibugs>	 (03CR) 10Eevans: [C: 03+2] sessionstore: provision sessionstore2005 (new) [puppet] - 10https://gerrit.wikimedia.org/r/997992 (https://phabricator.wikimedia.org/T356829) (owner: 10Eevans)
[19:26:19] <wikibugs>	 (03PS2) 10Bking: cloudelastic: remove unnecessary hostnames [puppet] - 10https://gerrit.wikimedia.org/r/995107 (https://phabricator.wikimedia.org/T355617)
[19:27:04] <wikibugs>	 (03PS2) 10Eevans: sessionstore: provision sessionstore2005 (new) [puppet] - 10https://gerrit.wikimedia.org/r/997992 (https://phabricator.wikimedia.org/T356829)
[19:27:06] <wikibugs>	 (03PS2) 10Eevans: sessionstore: provision sessionstore2006 (new) [puppet] - 10https://gerrit.wikimedia.org/r/997993 (https://phabricator.wikimedia.org/T356829)
[19:28:03] <wikibugs>	 (03PS3) 10Bking: cloudelastic: remove unnecessary hostnames [puppet] - 10https://gerrit.wikimedia.org/r/995107 (https://phabricator.wikimedia.org/T355617)
[19:28:53] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Data-Platform-SRE: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (10Dzahn) @cchen When you ran kinit the first time after you logged in, did it ask you to change the password? Did you get a new temporary one by mail?...
[19:28:56] <wikibugs>	 (03CR) 10Bking: "Apologies for the bad commit message. I've updated it to (hopefully) be less confusing. The TLDR is that we never needed those alt names, " [puppet] - 10https://gerrit.wikimedia.org/r/995107 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking)
[19:29:24] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] cloudelastic: remove unnecessary hostnames [puppet] - 10https://gerrit.wikimedia.org/r/995107 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking)
[19:29:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T355609)', diff saved to https://phabricator.wikimedia.org/P56469 and previous config saved to /var/cache/conftool/dbconfig/20240207-192953-marostegui.json
[19:29:56] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[19:29:59] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[19:30:09] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[19:30:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1202 (T355609)', diff saved to https://phabricator.wikimedia.org/P56470 and previous config saved to /var/cache/conftool/dbconfig/20240207-193016-marostegui.json
[19:30:20] <wikibugs>	 (03PS4) 10Bking: cloudelastic: remove unneeded hostnames from cert alt names [puppet] - 10https://gerrit.wikimedia.org/r/995107 (https://phabricator.wikimedia.org/T355617)
[19:32:03] <logmsgbot>	 !log joal@deploy2002 Started deploy [analytics/refinery@80b329b]: Analytics Hotfix [analytics/refinery@80b329b5]
[19:32:59] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/995107 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking)
[19:33:38] <wikibugs>	 (03CR) 10Bking: [C: 03+2] cloudelastic: Begin private IP migration for cloudelastic1008 [puppet] - 10https://gerrit.wikimedia.org/r/998494 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking)
[19:34:09] <wikibugs>	 (03PS1) 10Dzahn: peopleweb: test edit, comment out idp [puppet] - 10https://gerrit.wikimedia.org/r/998532
[19:35:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T355609)', diff saved to https://phabricator.wikimedia.org/P56471 and previous config saved to /var/cache/conftool/dbconfig/20240207-193540-marostegui.json
[19:35:45] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[19:36:02] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Data-Platform-SRE: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (10cchen) @Dzahn Oh, I see. I found the email and reran the kinit with the temporary password, it works now.
[19:40:05] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] slo_definitions: Use trafficserver_backend_sli_bad [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/973872 (https://phabricator.wikimedia.org/T341606) (owner: 10BCornwall)
[19:41:53] <wikibugs>	 (03PS3) 10Eevans: sessionstore: provision sessionstore2006 (new) [puppet] - 10https://gerrit.wikimedia.org/r/997993 (https://phabricator.wikimedia.org/T356829)
[19:42:31] <logmsgbot>	 !log joal@deploy2002 Finished deploy [analytics/refinery@80b329b]: Analytics Hotfix [analytics/refinery@80b329b5] (duration: 10m 28s)
[19:42:49] <logmsgbot>	 !log joal@deploy2002 Started deploy [analytics/refinery@80b329b] (thin): Analytics Hotfix -THIN [analytics/refinery@80b329b5]
[19:42:55] <logmsgbot>	 !log joal@deploy2002 Finished deploy [analytics/refinery@80b329b] (thin): Analytics Hotfix -THIN [analytics/refinery@80b329b5] (duration: 00m 05s)
[19:43:48] <logmsgbot>	 !log joal@deploy2002 Started deploy [analytics/refinery@80b329b] (hadoop-test): Analytics Hotfix - TEST [analytics/refinery@80b329b5]
[19:43:49] <wikibugs>	 (03PS2) 10Dzahn: peopleweb: test edit [puppet] - 10https://gerrit.wikimedia.org/r/998532
[19:44:59] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] peopleweb: test edit [puppet] - 10https://gerrit.wikimedia.org/r/998532 (owner: 10Dzahn)
[19:45:40] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.decommission for hosts cloudelastic1008.wikimedia.org
[19:45:47] <wikibugs>	 (03PS3) 10Dzahn: peopleweb: test edit [puppet] - 10https://gerrit.wikimedia.org/r/998532
[19:47:28] <logmsgbot>	 !log joal@deploy2002 Finished deploy [analytics/refinery@80b329b] (hadoop-test): Analytics Hotfix - TEST [analytics/refinery@80b329b5] (duration: 03m 40s)
[19:50:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P56472 and previous config saved to /var/cache/conftool/dbconfig/20240207-195047-marostegui.json
[19:51:58] <wikibugs>	 (03CR) 10Eevans: [C: 03+2] sessionstore: provision sessionstore2006 (new) [puppet] - 10https://gerrit.wikimedia.org/r/997993 (https://phabricator.wikimedia.org/T356829) (owner: 10Eevans)
[19:52:52] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[19:55:08] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudelastic1008.wikimedia.org decommissioned, removing all IPs except the asset tag one - bking@cumin2002"
[19:56:17] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudelastic1008.wikimedia.org decommissioned, removing all IPs except the asset tag one - bking@cumin2002"
[19:56:17] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:56:18] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudelastic1008.wikimedia.org
[20:00:17] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[20:00:46] <wikibugs>	 (03PS4) 10Dzahn: peopleweb: test edit [puppet] - 10https://gerrit.wikimedia.org/r/998532
[20:02:02] <wikibugs>	 (03PS1) 10Brennen Bearnes: Fix regression in HLS track content type [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/998452 (https://phabricator.wikimedia.org/T356780)
[20:02:45] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+1] aptrepo: allow for gitlab versions from 16.5 to 16.6.x [puppet] - 10https://gerrit.wikimedia.org/r/998510 (https://phabricator.wikimedia.org/T356906) (owner: 10Dzahn)
[20:03:58] <logmsgbot>	 !log joal@deploy2002 Started deploy [airflow-dags/analytics@ea0a3db]: Analytics Hotfix [airflow-dags/analytics@ea0a3db2]
[20:04:27] <wikibugs>	 10SRE, 10serviceops: confd setup left without configuration doesn't stop confd - https://phabricator.wikimedia.org/T356296 (10Dzahn) Seems to me this has to do with the `profile::firewall` migration from iptables to nftables.  What these hosts have in common is `profile::firewall::provider: nftables` in hierad...
[20:04:39] <logmsgbot>	 !log joal@deploy2002 Finished deploy [airflow-dags/analytics@ea0a3db]: Analytics Hotfix [airflow-dags/analytics@ea0a3db2] (duration: 00m 40s)
[20:04:45] <wikibugs>	 10SRE, 10serviceops: confd setup left without configuration doesn't stop confd - https://phabricator.wikimedia.org/T356296 (10Dzahn) cc: @Muehlenhoff
[20:05:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P56473 and previous config saved to /var/cache/conftool/dbconfig/20240207-200555-marostegui.json
[20:06:16] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: migrate cloudelastic1008 to private IPs - bking@cumin2002"
[20:07:08] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: migrate cloudelastic1008 to private IPs - bking@cumin2002"
[20:07:08] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:07:53] <brennen>	 bvibber, James_F: i'll go ahead and deploy that backport here momentarily.
[20:08:02] <James_F>	 Awesome.
[20:08:09] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudelastic1008
[20:08:11] <bvibber>	 Woot
[20:09:07] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] aptrepo: allow for gitlab versions from 16.5 to 16.6.x [puppet] - 10https://gerrit.wikimedia.org/r/998510 (https://phabricator.wikimedia.org/T356906) (owner: 10Dzahn)
[20:09:21] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudelastic1008
[20:10:19] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Data-Platform-SRE: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (10Dzahn) 05In progress→03Resolved Great! Feel free to reopen the ticket if there is anything else missing.
[20:11:08] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy2002 using scap backport" [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/998452 (https://phabricator.wikimedia.org/T356780) (owner: 10Brennen Bearnes)
[20:14:14] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to <ldap/wmf> for <JWheeler-WMF> - https://phabricator.wikimedia.org/T355170 (10Dzahn) 05In progress→03Stalled
[20:15:04] <icinga-wm>	 RECOVERY - Check systemd state on stat1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:15:45] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to <LDAP/WMDE> for <WMDE Cyn> - https://phabricator.wikimedia.org/T355937 (10Dzahn) 05In progress→03Resolved a:03Dzahn @WMDECyn You have been added to the groups 'nda' and 'wmde' just like other WMDE employees.  Things should work as expec...
[20:15:45] <brennen>	 bvibber: this is a "go ahead past test servers, confirm in group0" sort of situation, yeah?
[20:18:18] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1008.eqiad.wmnet with OS bullseye
[20:18:26] <icinga-wm>	 PROBLEM - Check systemd state on stat1010 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:21:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T355609)', diff saved to https://phabricator.wikimedia.org/P56474 and previous config saved to /var/cache/conftool/dbconfig/20240207-202101-marostegui.json
[20:21:04] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1227.eqiad.wmnet with reason: Maintenance
[20:21:07] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[20:21:17] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1227.eqiad.wmnet with reason: Maintenance
[20:21:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1227 (T355609)', diff saved to https://phabricator.wikimedia.org/P56475 and previous config saved to /var/cache/conftool/dbconfig/20240207-202123-marostegui.json
[20:22:23] <bvibber>	 moment
[20:24:45] <bvibber>	 *waits for the files to churn in job queue*
[20:26:46] <brennen>	 bvibber: still waiting on CI here, so not yet deployed
[20:27:05] <brennen>	 (sorry, could have been clearer about that)
[20:27:05] <bvibber>	 no rush then :D
[20:27:37] <brennen>	 will ping when backport's done. :)
[20:28:40] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:29:40] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:31:28] <wikibugs>	 (03PS1) 10Eevans: sessionstore: setup sessionstore200[4-6] (new) [deployment-charts] - 10https://gerrit.wikimedia.org/r/998538 (https://phabricator.wikimedia.org/T356829)
[20:31:30] <wikibugs>	 (03PS1) 10Eevans: sessionstore: remove decommissioned hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/998539 (https://phabricator.wikimedia.org/T356828)
[20:32:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T355609)', diff saved to https://phabricator.wikimedia.org/P56477 and previous config saved to /var/cache/conftool/dbconfig/20240207-203222-marostegui.json
[20:32:27] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[20:32:52] <wikibugs>	 (03Merged) 10jenkins-bot: Fix regression in HLS track content type [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/998452 (https://phabricator.wikimedia.org/T356780) (owner: 10Brennen Bearnes)
[20:33:18] <logmsgbot>	 !log brennen@deploy2002 Started scap: Backport for [[gerrit:998452|Fix regression in HLS track content type (T356780)]]
[20:33:22] <stashbot>	 T356780: Video transcoding fails when firejail is enabled - https://phabricator.wikimedia.org/T356780
[20:36:06] <wikibugs>	 (03CR) 10Eevans: "This changeset is ready to go as-is, but I'm marking it -1 to signal it isn't yet ready to be merged.  We need to first merge r998538, dep" [deployment-charts] - 10https://gerrit.wikimedia.org/r/998539 (https://phabricator.wikimedia.org/T356828) (owner: 10Eevans)
[20:37:02] <logmsgbot>	 !log brennen@deploy2002 brennen: Backport for [[gerrit:998452|Fix regression in HLS track content type (T356780)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:37:09] <brennen>	 going ahead with sync
[20:37:17] <logmsgbot>	 !log brennen@deploy2002 brennen: Continuing with sync
[20:38:25] <jinxer-wm>	 (SystemdUnitFailed) firing: prometheus-phpfpm-statustext-textfile.service Failed on mwdebug1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:40:24] <wikibugs>	 (03CR) 10Krinkle: Configure parser cache filters for parsoid-pcache (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994212 (https://phabricator.wikimedia.org/T346765) (owner: 10Daniel Kinzler)
[20:42:43] <wikibugs>	 (03CR) 10Krinkle: Configure parser cache filters for parsoid-pcache (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994212 (https://phabricator.wikimedia.org/T346765) (owner: 10Daniel Kinzler)
[20:43:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (14) prometheus-phpfpm-statustext-textfile.service Failed on mw1364:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:43:39] <logmsgbot>	 !log brennen@deploy2002 Finished scap: Backport for [[gerrit:998452|Fix regression in HLS track content type (T356780)]] (duration: 10m 20s)
[20:43:43] <stashbot>	 T356780: Video transcoding fails when firejail is enabled - https://phabricator.wikimedia.org/T356780
[20:43:51] <brennen>	 bvibber: that's out
[20:46:40] <bvibber>	 testing...
[20:47:29] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P56478 and previous config saved to /var/cache/conftool/dbconfig/20240207-204728-marostegui.json
[20:48:25] <jinxer-wm>	 (SystemdUnitFailed) resolved: (37) prometheus-phpfpm-statustext-textfile.service Failed on mw1350:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:50:14] <bvibber>	 brennen: confirmed fixed on test :D
[20:52:04] <brennen>	 bvibber: thx!
[21:00:04] <icinga-wm>	 RECOVERY - Check systemd state on stat1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: gettimeofday() says it's time for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240207T2100)
[21:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[21:02:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P56479 and previous config saved to /var/cache/conftool/dbconfig/20240207-210235-marostegui.json
[21:03:58] <icinga-wm>	 PROBLEM - Check systemd state on stat1011 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:08:41] <wikibugs>	 (PS1) Herron: SystemdUnitFailed: remove 'Failed' from alert text [alerts] - https://gerrit.wikimedia.org/r/998545
[21:09:15] <logmsgbot>	 !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1008.eqiad.wmnet with OS bullseye
[21:09:41] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for JTanner - https://phabricator.wikimedia.org/T356917 (JTannerWMF)
[21:14:30] <jinxer-wm>	 (ProbeDown) firing: Service build2001:873 has failed probes (tcp_package_builder_rsync_ip6) - https://wikitech.wikimedia.org/wiki/Debian_Packaging#Upload_to_Wikimedia_Repo - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:16:40] <wikibugs>	 (CR) Herron: "Proposing this since I've misread the failed-resolved-failed text pattern a few times at a quick glance" [alerts] - https://gerrit.wikimedia.org/r/998545 (owner: Herron)
[21:17:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T355609)', diff saved to https://phabricator.wikimedia.org/P56480 and previous config saved to /var/cache/conftool/dbconfig/20240207-211741-marostegui.json
[21:17:43] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1236.eqiad.wmnet with reason: Maintenance
[21:17:48] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[21:17:57] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1236.eqiad.wmnet with reason: Maintenance
[21:18:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1236 (T355609)', diff saved to https://phabricator.wikimedia.org/P56481 and previous config saved to /var/cache/conftool/dbconfig/20240207-211803-marostegui.json
[21:23:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236 (T355609)', diff saved to https://phabricator.wikimedia.org/P56482 and previous config saved to /var/cache/conftool/dbconfig/20240207-212304-marostegui.json
[21:23:08] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[21:28:14] <wikibugs>	 SRE, SRE-Access-Requests, Data-Platform-SRE: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (mpopov) @cchen: How about Superset & Hue?
[21:31:10] <wikibugs>	 ops-eqiad, DC-Ops, Data-Platform-SRE (2024.01.22 - 2024.02.11): Comm Error: Backplane 0 on cloudelastic1008 - https://phabricator.wikimedia.org/T356919 (bking)
[21:38:11] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236', diff saved to https://phabricator.wikimedia.org/P56483 and previous config saved to /var/cache/conftool/dbconfig/20240207-213810-marostegui.json
[21:42:39] <wikibugs>	 SRE, SRE-Access-Requests, Data-Platform-SRE: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (cchen) I still not able to access Superset & Hue, and i tried to reset  my password again, still not working.
[21:42:57] <wikibugs>	 SRE: Access to DMARCIAN - https://phabricator.wikimedia.org/T356920 (DBu-WMF)
[21:46:30] <wikibugs>	 SRE: Access to DMARCIAN - https://phabricator.wikimedia.org/T356920 (Dzahn)
[21:47:06] <wikibugs>	 SRE, Infrastructure-Foundations, Mail: Access to DMARCIAN - https://phabricator.wikimedia.org/T356920 (Dzahn)
[21:48:40] <wikibugs>	 SRE, Infrastructure-Foundations, Mail: Access to DMARCIAN - https://phabricator.wikimedia.org/T356920 (Dzahn) Looks like dmarcian has been replaced by dmarcdigests.com. (details in T330944). Adding some tags for visibility.
[21:49:09] <wikibugs>	 SRE, SRE-Access-Requests, Data-Platform-SRE: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (RhinosF1) You look still to be blocked on wikitech https://wikitech.wikimedia.org/wiki/Special:Contributions/Conniecc1 - not sure if that's related bu...
[21:49:12] <wikibugs>	 SRE, SRE-Access-Requests, Data-Platform-SRE: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (Dzahn) Resolved→Open
[21:49:22] <wikibugs>	 SRE, SRE-Access-Requests, Data-Platform-SRE: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (Dzahn) a:cchen→None
[21:50:53] <wikibugs>	 SRE, SRE-Access-Requests, Data-Platform-SRE: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (RhinosF1)
[21:52:12] <wikibugs>	 SRE, SRE-Access-Requests, Data-Platform-SRE: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (RhinosF1) I've added a checklist based on the private task.  @MoritzMuehlenhoff (or another SRE): please update based on what already works  @cchen: i...
[21:53:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236', diff saved to https://phabricator.wikimedia.org/P56484 and previous config saved to /var/cache/conftool/dbconfig/20240207-215317-marostegui.json
[22:00:05] <jouncebot>	 Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240207T2200)
[22:00:06] <icinga-wm>	 RECOVERY - Check systemd state on stat1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:00:51] <wikibugs>	 (PS1) Ebernhardson: cirrus: Re-enable writes to wikidata on cloudelastic [mediawiki-config] - https://gerrit.wikimedia.org/r/998559 (https://phabricator.wikimedia.org/T352335)
[22:04:00] <icinga-wm>	 PROBLEM - Check systemd state on stat1011 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:05:07] <ebernhardson>	 It's a couple minutes late for backport window, but i'm going to deploy the write re-enable above
[22:05:16] <ebernhardson>	 unless there are concerns
[22:06:19] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by ebernhardson@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/998559 (https://phabricator.wikimedia.org/T352335) (owner: Ebernhardson)
[22:06:19] <brennen>	 ebernhardson: you've got my blessing as releng / train deployer.
[22:06:45] <ebernhardson>	 brennen: awesome, thanks
[22:07:00] <logmsgbot>	 !log dzahn@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: security release
[22:07:07] <wikibugs>	 (Merged) jenkins-bot: cirrus: Re-enable writes to wikidata on cloudelastic [mediawiki-config] - https://gerrit.wikimedia.org/r/998559 (https://phabricator.wikimedia.org/T352335) (owner: Ebernhardson)
[22:07:32] <logmsgbot>	 !log ebernhardson@deploy2002 Started scap: Backport for [[gerrit:998559|cirrus: Re-enable writes to wikidata on cloudelastic (T352335)]]
[22:07:45] <stashbot>	 T352335: Deploy the new Cirrus Updater to update select wikis in cloudelastic - https://phabricator.wikimedia.org/T352335
[22:08:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236 (T355609)', diff saved to https://phabricator.wikimedia.org/P56485 and previous config saved to /var/cache/conftool/dbconfig/20240207-220824-marostegui.json
[22:08:26] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[22:08:39] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[22:08:40] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[22:08:59] <logmsgbot>	 !log ebernhardson@deploy2002 ebernhardson: Backport for [[gerrit:998559|cirrus: Re-enable writes to wikidata on cloudelastic (T352335)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[22:10:24] <logmsgbot>	 !log ebernhardson@deploy2002 ebernhardson: Continuing with sync
[22:11:25] <jinxer-wm>	 (SystemdUnitFailed) firing: prometheus-phpfpm-statustext-textfile.service Failed on mwdebug1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:13:35] <wikibugs>	 (PS1) Bartosz Dziewoński: Parser: Fix the main loop getting stuck on some signatures [extensions/DiscussionTools] (wmf/1.42.0-wmf.16) - https://gerrit.wikimedia.org/r/998453 (https://phabricator.wikimedia.org/T356884)
[22:13:44] <wikibugs>	 (PS1) Bartosz Dziewoński: Parser: Fix the main loop getting stuck on some signatures [extensions/DiscussionTools] (wmf/1.42.0-wmf.17) - https://gerrit.wikimedia.org/r/998454 (https://phabricator.wikimedia.org/T356884)
[22:16:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (13) prometheus-phpfpm-statustext-textfile.service Failed on mw1418:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:16:42] <logmsgbot>	 !log ebernhardson@deploy2002 Finished scap: Backport for [[gerrit:998559|cirrus: Re-enable writes to wikidata on cloudelastic (T352335)]] (duration: 09m 10s)
[22:16:46] <stashbot>	 T352335: Deploy the new Cirrus Updater to update select wikis in cloudelastic - https://phabricator.wikimedia.org/T352335
[22:17:43] <jinxer-wm>	 (SystemdUnitFailed) firing: generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:21:25] <jinxer-wm>	 (SystemdUnitFailed) resolved: (29) prometheus-phpfpm-statustext-textfile.service Failed on mw1409:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:34:27] <wikibugs>	 (PS21) Ryan Kemper: wdqs.data_transfer: refactor spicerack class api [cookbooks] - https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624)
[22:34:38] <wikibugs>	 (CR) Ryan Kemper: wdqs.data_transfer: refactor spicerack class api (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) (owner: Ryan Kemper)
[22:39:06] <wikibugs>	 (CR) CI reject: [V: -1] wdqs.data_transfer: refactor spicerack class api [cookbooks] - https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) (owner: Ryan Kemper)
[22:46:12] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T347624, testing 961878 patch) xfer categories from wdqs2024.codfw.wmnet -> wdqs2025.codfw.wmnet w/ force delete existing files, repooling source-only afterwards
[22:46:17] <stashbot>	 T347624: Refactor sre.wdqs.data-transfer to use new spicerack class api - https://phabricator.wikimedia.org/T347624
[22:46:32] <logmsgbot>	 !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T347624, testing 961878 patch) xfer categories from wdqs2024.codfw.wmnet -> wdqs2025.codfw.wmnet w/ force delete existing files, repooling source-only afterwards
[22:46:55] <wikibugs>	 (CR) Ryan Kemper: wdqs.data_transfer: refactor spicerack class api (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) (owner: Ryan Kemper)
[22:47:58] <wikibugs>	 (PS22) Ryan Kemper: wdqs.data_transfer: refactor spicerack class api [cookbooks] - https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624)
[22:49:34] <brett>	 !log Uploaded ncmonitor 0.0.2 to bookworm-wikimedia archive
[22:49:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:10:26] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:10:42] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:11:20] <icinga-wm>	 PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:11:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[23:21:50] <icinga-wm>	 RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:22:14] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:22:26] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:41:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[23:53:14] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:54:04] <logmsgbot>	 !log dzahn@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: security release
[23:54:34] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring