[00:38:59] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/997844
[00:39:05] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/997844 (owner: TrainBranchBot)
[00:51:27] !log removing 21 files for legal compliance
[00:51:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:03:11] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/997844 (owner: TrainBranchBot)
[01:14:39] (PS1) BryanDavis: striker: Bump container version to 2024-02-07-005708-production [puppet] - https://gerrit.wikimedia.org/r/997990
[01:14:56] jouncebot: nowandnext
[01:14:56] No deployments scheduled for the next 5 hour(s) and 45 minute(s)
[01:14:56] In 5 hour(s) and 45 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240207T0700)
[01:19:48] (PS4) Zabe: Update mediawiki/mediawiki-codesniffer to 43.0.0 [mediawiki-config] - https://gerrit.wikimedia.org/r/996404
[01:20:53] (CR) Zabe: [C: +2] Update mediawiki/mediawiki-codesniffer to 43.0.0 [mediawiki-config] - https://gerrit.wikimedia.org/r/996404 (owner: Zabe)
[01:21:37] (Merged) jenkins-bot: Update mediawiki/mediawiki-codesniffer to 43.0.0 [mediawiki-config] - https://gerrit.wikimedia.org/r/996404 (owner: Zabe)
[01:22:20] !log zabe@deploy2002 Started scap: Backport for [[gerrit:996404|Update mediawiki/mediawiki-codesniffer to 43.0.0]]
[01:23:52] !log zabe@deploy2002 zabe: Backport for [[gerrit:996404|Update mediawiki/mediawiki-codesniffer to 43.0.0]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[01:24:12] !log zabe@deploy2002 zabe: Continuing with sync
[01:25:25] (SystemdUnitFailed) firing: prometheus-phpfpm-statustext-textfile.service
Failed on mwdebug2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:30:25] (SystemdUnitFailed) firing: (11) prometheus-phpfpm-statustext-textfile.service Failed on mw1364:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:30:46] !log zabe@deploy2002 Finished scap: Backport for [[gerrit:996404|Update mediawiki/mediawiki-codesniffer to 43.0.0]] (duration: 08m 25s)
[01:31:45] (PS1) Eevans: sessionstore: provision sessionstore2004 (new) [puppet] - https://gerrit.wikimedia.org/r/997991 (https://phabricator.wikimedia.org/T356829)
[01:31:47] (PS1) Eevans: sessionstore: provision sessionstore2005 (new) [puppet] - https://gerrit.wikimedia.org/r/997992 (https://phabricator.wikimedia.org/T356829)
[01:31:49] (PS1) Eevans: sessionstore: provision sessionstore2006 (new) [puppet] - https://gerrit.wikimedia.org/r/997993 (https://phabricator.wikimedia.org/T356829)
[01:35:25] (SystemdUnitFailed) resolved: (35) prometheus-phpfpm-statustext-textfile.service Failed on mw1354:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:57:08] (PS5) Zabe: Deleting Ns:104 in itwikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/952817 (https://phabricator.wikimedia.org/T298315) (owner: Caenus)
[01:58:26] (PS1) Zabe: throttle: Remove expired throttle [mediawiki-config] - https://gerrit.wikimedia.org/r/997996
[02:00:36] (CR) Zabe: [C: +2] Deleting Ns:104 in itwikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/952817 (https://phabricator.wikimedia.org/T298315) (owner: Caenus)
[02:00:48]
(PS2) Zabe: throttle: Remove expired throttle [mediawiki-config] - https://gerrit.wikimedia.org/r/997996
[02:00:48] (ProbeDown) firing: Service build2001:873 has failed probes (tcp_package_builder_rsync_ip6) - https://wikitech.wikimedia.org/wiki/Debian_Packaging#Upload_to_Wikimedia_Repo - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:00:51] (CR) Zabe: [C: +2] throttle: Remove expired throttle [mediawiki-config] - https://gerrit.wikimedia.org/r/997996 (owner: Zabe)
[02:01:24] (Merged) jenkins-bot: Deleting Ns:104 in itwikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/952817 (https://phabricator.wikimedia.org/T298315) (owner: Caenus)
[02:01:44] (Merged) jenkins-bot: throttle: Remove expired throttle [mediawiki-config] - https://gerrit.wikimedia.org/r/997996 (owner: Zabe)
[02:02:24] !log zabe@deploy2002 Started scap: Backport for [[gerrit:952817|Deleting Ns:104 in itwikivoyage]], [[gerrit:997996|throttle: Remove expired throttle]]
[02:03:52] !log zabe@deploy2002 caenus and zabe: Backport for [[gerrit:952817|Deleting Ns:104 in itwikivoyage]], [[gerrit:997996|throttle: Remove expired throttle]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[02:04:15] !log zabe@deploy2002 caenus and zabe: Continuing with sync
[02:05:25] (SystemdUnitFailed) firing: prometheus-phpfpm-statustext-textfile.service Failed on mwdebug1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:10:25] (SystemdUnitFailed) firing: (17) prometheus-phpfpm-statustext-textfile.service Failed on mw1352:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status -
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:10:46] !log zabe@deploy2002 Finished scap: Backport for [[gerrit:952817|Deleting Ns:104 in itwikivoyage]], [[gerrit:997996|throttle: Remove expired throttle]] (duration: 08m 22s)
[02:11:07] !log zabe@mwmaint2002:~$ mwscript extensions/Translate/scripts/moveTranslatableBundle.php --wiki mediawikiwiki "Wikimedia Apps/Suggested edits" "Wikimedia Apps/Android Suggested edits" "Zabe" --reason "per request [[:phab:T348875|T348875]]"
[02:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:11:11] T348875: Move [Wikimedia Apps/Suggested edits] to [Wikimedia Apps/Android Suggested edits] on MediaWiki.org - https://phabricator.wikimedia.org/T348875
[02:15:26] (SystemdUnitFailed) resolved: (44) prometheus-phpfpm-statustext-textfile.service Failed on mw1352:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:17:42] (SystemdUnitFailed) firing: generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:36:40] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000).
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:39:33] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:39:38] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:10:48] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:29:32] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:31:00] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:36:25] (SystemdUnitFailed) firing: docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:38:40] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-base-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:41:22] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000).
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:42:52] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:17:06] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:18:36] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:51:51] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance
[05:52:04] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance
[05:52:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2105 (T355609)', diff saved to https://phabricator.wikimedia.org/P56381 and previous config saved to /var/cache/conftool/dbconfig/20240207-055210-marostegui.json
[05:52:15] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[05:53:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2030', diff saved to https://phabricator.wikimedia.org/P56382 and previous config saved to /var/cache/conftool/dbconfig/20240207-055301-root.json
[05:53:50] (PS1) Marostegui: es2030: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/998008 (https://phabricator.wikimedia.org/T351916)
[05:55:59] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host es2030.codfw.wmnet with OS bookworm
[05:56:05] (CR) Marostegui: [C: +2] es2030: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/998008 (https://phabricator.wikimedia.org/T351916) (owner:
Marostegui)
[05:59:47] (PS5) Vgutierrez: fifo-log-demux: Decouple service from nginx/ats [puppet] - https://gerrit.wikimedia.org/r/993804 (https://phabricator.wikimedia.org/T355905) (owner: BCornwall)
[06:00:35] (CR) Marostegui: [C: +2] mariadb: remove nrpe::monitor_systemd_unit_state [puppet] - https://gerrit.wikimedia.org/r/997821 (https://phabricator.wikimedia.org/T337831) (owner: Filippo Giunchedi)
[06:00:48] (ProbeDown) firing: Service build2001:873 has failed probes (tcp_package_builder_rsync_ip6) - https://wikitech.wikimedia.org/wiki/Debian_Packaging#Upload_to_Wikimedia_Repo - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:02:02] (CR) Marostegui: "Filippo, I have tried to submit this change+merge but I cannot submit, as after the +2, there seem to be some other changes that need to b" [puppet] - https://gerrit.wikimedia.org/r/997821 (https://phabricator.wikimedia.org/T337831) (owner: Filippo Giunchedi)
[06:02:56] (CR) CI reject: [V: -1] fifo-log-demux: Decouple service from nginx/ats [puppet] - https://gerrit.wikimedia.org/r/993804 (https://phabricator.wikimedia.org/T355905) (owner: BCornwall)
[06:05:30] (CR) Vgutierrez: [C: -1] fifo-log-demux: Decouple service from nginx/ats (2 comments) [puppet] - https://gerrit.wikimedia.org/r/993804 (https://phabricator.wikimedia.org/T355905) (owner: BCornwall)
[06:14:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T355609)', diff saved to https://phabricator.wikimedia.org/P56383 and previous config saved to /var/cache/conftool/dbconfig/20240207-061424-marostegui.json
[06:14:29] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[06:14:47] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es2030.codfw.wmnet with reason: host reimage
[06:17:40] !log
marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2030.codfw.wmnet with reason: host reimage
[06:17:42] (SystemdUnitFailed) firing: generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:29:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P56384 and previous config saved to /var/cache/conftool/dbconfig/20240207-062931-marostegui.json
[06:34:09] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2030.codfw.wmnet with OS bookworm
[06:34:30] (PS1) Marostegui: Revert "es2030: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/998028
[06:35:20] (CR) Marostegui: "recheck" [puppet] - https://gerrit.wikimedia.org/r/998028 (owner: Marostegui)
[06:36:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Switch es1 master T351916', diff saved to https://phabricator.wikimedia.org/P56385 and previous config saved to /var/cache/conftool/dbconfig/20240207-063659-marostegui.json
[06:37:04] T351916: Migrate es1 to Bookworm and MariaDB 10.6 - https://phabricator.wikimedia.org/T351916
[06:37:17] (CR) Marostegui: [C: +2] Revert "es2030: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/998028 (owner: Marostegui)
[06:39:53] (PS1) Marostegui: es1032: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/998145 (https://phabricator.wikimedia.org/T351916)
[06:39:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1032', diff saved to https://phabricator.wikimedia.org/P56386 and previous config saved to /var/cache/conftool/dbconfig/20240207-063957-root.json
[06:41:00] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host
es1032.eqiad.wmnet with OS bookworm
[06:41:17] (CR) Marostegui: [C: +2] es1032: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/998145 (https://phabricator.wikimedia.org/T351916) (owner: Marostegui)
[06:41:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2030 (re)pooling @ 1%: After reimage', diff saved to https://phabricator.wikimedia.org/P56387 and previous config saved to /var/cache/conftool/dbconfig/20240207-064142-root.json
[06:44:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P56388 and previous config saved to /var/cache/conftool/dbconfig/20240207-064438-marostegui.json
[06:52:04] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:52:18] uh?
[06:52:45] SRE, Traffic: A poor internet connection should not result in a HTTP 503 error - https://phabricator.wikimedia.org/T356025 (Bugreporter)
[06:53:26] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000.
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:53:36] SRE, Traffic: Cannot edit wikipedia from my work computer - https://phabricator.wikimedia.org/T356799 (Bugreporter)
[06:54:27] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es1032.eqiad.wmnet with reason: host reimage
[06:56:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2030 (re)pooling @ 5%: After reimage', diff saved to https://phabricator.wikimedia.org/P56389 and previous config saved to /var/cache/conftool/dbconfig/20240207-065647-root.json
[06:57:02] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1032.eqiad.wmnet with reason: host reimage
[06:59:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T355609)', diff saved to https://phabricator.wikimedia.org/P56390 and previous config saved to /var/cache/conftool/dbconfig/20240207-065944-marostegui.json
[06:59:47] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2109.codfw.wmnet with reason: Maintenance
[06:59:50] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[07:00:01] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2109.codfw.wmnet with reason: Maintenance
[07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240207T0700)
[07:00:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2109 (T355609)', diff saved to https://phabricator.wikimedia.org/P56391 and previous config saved to /var/cache/conftool/dbconfig/20240207-070007-marostegui.json
[07:03:58] (PS1) Marostegui: Revert "es1032: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/998032
[07:05:41] SRE, Traffic: A poor internet connection should not result in a HTTP 503 error - https://phabricator.wikimedia.org/T356025 (Vgutierrez) sadly
varnish is not able to tell between a client that goes away earlier than expected (by poor Internet access) triggering a backend fetch error from an actual backend...
[07:05:50] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:08:08] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:11:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2030 (re)pooling @ 10%: After reimage', diff saved to https://phabricator.wikimedia.org/P56392 and previous config saved to /var/cache/conftool/dbconfig/20240207-071152-root.json
[07:16:14] (CR) Marostegui: [C: +2] Revert "es1032: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/998032 (owner: Marostegui)
[07:16:41] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1032.eqiad.wmnet with OS bookworm
[07:17:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1032 (re)pooling @ 1%: After reimage', diff saved to https://phabricator.wikimedia.org/P56393 and previous config saved to /var/cache/conftool/dbconfig/20240207-071707-root.json
[07:25:58] (CR) Majavah: [C: +2] P:toolforge::mailrelay: don't blindly reject any bounces [puppet] - https://gerrit.wikimedia.org/r/994250 (owner: Majavah)
[07:26:47] <_joe_> jouncebot: next
[07:26:47] In 0 hour(s) and 33 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240207T0800)
[07:26:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2030 (re)pooling @ 25%: After reimage', diff saved to https://phabricator.wikimedia.org/P56394 and previous config saved to /var/cache/conftool/dbconfig/20240207-072657-root.json
[07:27:30]
(CR) TrainBranchBot: [C: +2] "Approved by oblivian@deploy2002 using scap backport" [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.17) - https://gerrit.wikimedia.org/r/997879 (https://phabricator.wikimedia.org/T356780) (owner: Jforrester)
[07:28:03] <_joe_> I am saving time in the backport window as this is a branch backport for a train blocker
[07:28:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T355609)', diff saved to https://phabricator.wikimedia.org/P56395 and previous config saved to /var/cache/conftool/dbconfig/20240207-072851-marostegui.json
[07:28:55] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[07:32:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1032 (re)pooling @ 5%: After reimage', diff saved to https://phabricator.wikimedia.org/P56396 and previous config saved to /var/cache/conftool/dbconfig/20240207-073212-root.json
[07:42:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2030 (re)pooling @ 50%: After reimage', diff saved to https://phabricator.wikimedia.org/P56397 and previous config saved to /var/cache/conftool/dbconfig/20240207-074203-root.json
[07:43:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P56398 and previous config saved to /var/cache/conftool/dbconfig/20240207-074357-marostegui.json
[07:46:50] (Merged) jenkins-bot: Set the memory limit in bytes. [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.17) - https://gerrit.wikimedia.org/r/997879 (https://phabricator.wikimedia.org/T356780) (owner: Jforrester)
[07:47:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1032 (re)pooling @ 10%: After reimage', diff saved to https://phabricator.wikimedia.org/P56399 and previous config saved to /var/cache/conftool/dbconfig/20240207-074717-root.json
[07:47:34] !log oblivian@deploy2002 Started scap: Backport for [[gerrit:997879|Set the memory limit in bytes.
(T356780)]]
[07:47:38] T356780: Video transcoding fails when firejail is enabled - https://phabricator.wikimedia.org/T356780
[07:49:15] (CR) Muehlenhoff: [C: +2] Extend config-master Cumin aliases [puppet] - https://gerrit.wikimedia.org/r/997888 (owner: Muehlenhoff)
[07:49:20] !log oblivian@deploy2002 oblivian and jforrester: Backport for [[gerrit:997879|Set the memory limit in bytes. (T356780)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:50:05] !log oblivian@deploy2002 oblivian and jforrester: Continuing with sync
[07:51:25] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:51:49] !log rebalance ganeti codfw/row B following completed switch maintenance T355860
[07:51:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:53] T355860: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860
[07:56:25] (SystemdUnitFailed) firing: (9) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:57:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2030 (re)pooling @ 75%: After reimage', diff saved to https://phabricator.wikimedia.org/P56400 and previous config saved to /var/cache/conftool/dbconfig/20240207-075708-root.json
[07:57:11] !log oblivian@deploy2002 Finished scap: Backport for [[gerrit:997879|Set the memory limit in bytes.
(T356780)]] (duration: 09m 36s)
[07:57:14] T356780: Video transcoding fails when firejail is enabled - https://phabricator.wikimedia.org/T356780
[07:58:08] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet
[07:59:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P56401 and previous config saved to /var/cache/conftool/dbconfig/20240207-075904-marostegui.json
[07:59:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet
[08:00:04] Amir1 and Urbanecm: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240207T0800). nyaa~
[08:00:05] No Gerrit patches in the queue for this window AFAICS.
[08:01:26] (SystemdUnitFailed) firing: (34) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:01:46] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet
[08:02:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1032 (re)pooling @ 25%: After reimage', diff saved to https://phabricator.wikimedia.org/P56402 and previous config saved to /var/cache/conftool/dbconfig/20240207-080222-root.json
[08:03:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2030.codfw.wmnet
[08:06:26] (SystemdUnitFailed) firing: (35) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:09:23] (PS2)
Slyngshede: Provide context for account creation. [software/bitu] - https://gerrit.wikimedia.org/r/997811 (https://phabricator.wikimedia.org/T353584)
[08:09:39] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:11:26] (SystemdUnitFailed) firing: (34) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:11:35] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 38 hosts with reason: Primary switchover s4 T356649
[08:11:41] T356649: Switchover s4 master (db1160 -> db1238) - https://phabricator.wikimedia.org/T356649
[08:11:57] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:12:07] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 38 hosts with reason: Primary switchover s4 T356649
[08:12:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2030 (re)pooling @ 100%: After reimage', diff saved to https://phabricator.wikimedia.org/P56403 and previous config saved to /var/cache/conftool/dbconfig/20240207-081213-root.json
[08:12:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set db1238 with weight 0 T356649', diff saved to https://phabricator.wikimedia.org/P56404 and previous config saved to /var/cache/conftool/dbconfig/20240207-081220-arnaudb.json
[08:12:43] (CR) Slyngshede: Provide context for account creation.
(1 comment) [software/bitu] - https://gerrit.wikimedia.org/r/997811 (https://phabricator.wikimedia.org/T353584) (owner: Slyngshede)
[08:12:45] <_joe_> jouncebot: now
[08:12:45] For the next 0 hour(s) and 47 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240207T0800)
[08:14:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T355609)', diff saved to https://phabricator.wikimedia.org/P56405 and previous config saved to /var/cache/conftool/dbconfig/20240207-081410-marostegui.json
[08:14:13] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2149.codfw.wmnet with reason: Maintenance
[08:14:15] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[08:14:27] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2149.codfw.wmnet with reason: Maintenance
[08:14:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2149 (T355609)', diff saved to https://phabricator.wikimedia.org/P56406 and previous config saved to /var/cache/conftool/dbconfig/20240207-081433-marostegui.json
[08:15:21] (CR) Majavah: [C: +2] striker: Bump container version to 2024-02-07-005708-production [puppet] - https://gerrit.wikimedia.org/r/997990 (owner: BryanDavis)
[08:17:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1032 (re)pooling @ 50%: After reimage', diff saved to https://phabricator.wikimedia.org/P56407 and previous config saved to /var/cache/conftool/dbconfig/20240207-081727-root.json
[08:18:39] (PS2) Slyngshede: P:docker::builder clean docker image cache regularly.
[puppet] - https://gerrit.wikimedia.org/r/997796
[08:21:28] (PS2) Muehlenhoff: debmonitor: Remove legacy cert handling [puppet] - https://gerrit.wikimedia.org/r/995183
[08:24:32] (CR) Muehlenhoff: [C: +2] Remove obsolete Hiera entries [puppet] - https://gerrit.wikimedia.org/r/997480 (https://phabricator.wikimedia.org/T354959) (owner: Muehlenhoff)
[08:26:16] (CR) Slyngshede: P:docker::builder clean docker image cache regularly. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/997796 (owner: Slyngshede)
[08:32:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1032 (re)pooling @ 75%: After reimage', diff saved to https://phabricator.wikimedia.org/P56408 and previous config saved to /var/cache/conftool/dbconfig/20240207-083233-root.json
[08:33:46] (PS2) Slyngshede: Allow users to view the entire SSH key [software/bitu] - https://gerrit.wikimedia.org/r/997852 (https://phabricator.wikimedia.org/T351140)
[08:34:02] (CR) Slyngshede: Allow users to view the entire SSH key (1 comment) [software/bitu] - https://gerrit.wikimedia.org/r/997852 (https://phabricator.wikimedia.org/T351140) (owner: Slyngshede)
[08:36:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T355609)', diff saved to https://phabricator.wikimedia.org/P56409 and previous config saved to /var/cache/conftool/dbconfig/20240207-083650-marostegui.json
[08:36:54] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[08:44:28] (CR) Arnaudb: [C: +2] mariadb: Promote db1238 to s4 master [puppet] - https://gerrit.wikimedia.org/r/997487 (https://phabricator.wikimedia.org/T356649) (owner: Gerrit maintenance bot)
[08:45:09] !log Starting s4 eqiad failover from db1160 to db1238 - T356649
[08:45:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:13] T356649: Switchover s4 master (db1160 -> db1238) - https://phabricator.wikimedia.org/T356649
[08:46:49] PROBLEM -
mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:46:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote db1238 to s4 primary T356649', diff saved to https://phabricator.wikimedia.org/P56410 and previous config saved to /var/cache/conftool/dbconfig/20240207-084654-arnaudb.json [08:47:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1032 (re)pooling @ 100%: After reimage', diff saved to https://phabricator.wikimedia.org/P56411 and previous config saved to /var/cache/conftool/dbconfig/20240207-084738-root.json [08:48:09] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51452 bytes in 0.124 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:51:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P56412 and previous config saved to /var/cache/conftool/dbconfig/20240207-085157-marostegui.json [08:57:43] (03PS2) 10Filippo Giunchedi: envoy: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997818 (https://phabricator.wikimedia.org/T337831) [08:58:18] (03PS1) 10Majavah: network: allow passing 'cloud' as realm to slice_network_constants [puppet] - 10https://gerrit.wikimedia.org/r/998259 [08:58:20] (03PS1) 10Majavah: network::constants: use 'cloud' where possible [puppet] - 10https://gerrit.wikimedia.org/r/998260 [08:58:22] (03PS1) 10Majavah: P:wmcs: cloud_private_subnet: add route to private instance networks [puppet] - 10https://gerrit.wikimedia.org/r/998261 (https://phabricator.wikimedia.org/T356850) [08:58:52] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] envoy: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997818 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi) [08:58:56] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/bitu] - 
10https://gerrit.wikimedia.org/r/997852 (https://phabricator.wikimedia.org/T351140) (owner: 10Slyngshede) [08:59:40] (03PS3) 10Filippo Giunchedi: graphite: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997806 (https://phabricator.wikimedia.org/T337831) [09:00:28] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] graphite: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997806 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi) [09:01:46] (03PS2) 10Majavah: P:wmcs: cloud_private_subnet: add route to private instance networks [puppet] - 10https://gerrit.wikimedia.org/r/998261 (https://phabricator.wikimedia.org/T356850) [09:02:18] (03PS3) 10Filippo Giunchedi: profile: remove absented statsd hosts entry [puppet] - 10https://gerrit.wikimedia.org/r/997803 (https://phabricator.wikimedia.org/T239862) [09:02:27] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] profile: remove absented statsd hosts entry [puppet] - 10https://gerrit.wikimedia.org/r/997803 (https://phabricator.wikimedia.org/T239862) (owner: 10Filippo Giunchedi) [09:03:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'matching old db1238 weight https://phabricator.wikimedia.org/P56404', diff saved to https://phabricator.wikimedia.org/P56413 and previous config saved to /var/cache/conftool/dbconfig/20240207-090316-arnaudb.json [09:04:52] (03PS2) 10Filippo Giunchedi: mariadb: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997821 (https://phabricator.wikimedia.org/T337831) [09:04:55] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 13): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1292/c" [puppet] - 10https://gerrit.wikimedia.org/r/998261 (https://phabricator.wikimedia.org/T356850) (owner: 10Majavah) [09:06:12] (03CR) 10CI reject: [V: 04-1] P:wmcs: cloud_private_subnet: add route to private instance networks [puppet] - 
10https://gerrit.wikimedia.org/r/998261 (https://phabricator.wikimedia.org/T356850) (owner: 10Majavah) [09:06:14] (03CR) 10Filippo Giunchedi: "Thank you for following up, indeed the patch was part of a chain of related and independent patches. I've now moved the patch to be stand " [puppet] - 10https://gerrit.wikimedia.org/r/997821 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi) [09:07:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P56414 and previous config saved to /var/cache/conftool/dbconfig/20240207-090703-marostegui.json [09:07:50] (03CR) 10CI reject: [V: 04-1] P:wmcs: cloud_private_subnet: add route to private instance networks [puppet] - 10https://gerrit.wikimedia.org/r/998261 (https://phabricator.wikimedia.org/T356850) (owner: 10Majavah) [09:08:05] (03CR) 10Volans: "small missing nit, LGTM otherwise" [puppet] - 10https://gerrit.wikimedia.org/r/995183 (owner: 10Muehlenhoff) [09:17:57] (03CR) 10Jcrespo: [C: 03+1] "Please note this is good, but *not enough for the ticket scope*- there needs to be a change on the job defaults config (see netbox)." 
[puppet] - 10https://gerrit.wikimedia.org/r/997935 (https://phabricator.wikimedia.org/T316655) (owner: 10Btullis) [09:19:46] (03CR) 10Jelto: [C: 03+2] Temporarily enable Dockerfile frontend on trusted runners (part 2, rev 2) [puppet] - 10https://gerrit.wikimedia.org/r/997516 (https://phabricator.wikimedia.org/T356418) (owner: 10Ahmon Dancy) [09:19:55] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: mariadb::core_test [09:20:14] (03CR) 10JMeybohm: New cookbook to reboot/restart config-master hosts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/997887 (owner: 10Muehlenhoff) [09:21:48] (03PS1) 10Muehlenhoff: Switch mariadb::core_test to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/998294 (https://phabricator.wikimedia.org/T349619) [09:22:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T355609)', diff saved to https://phabricator.wikimedia.org/P56415 and previous config saved to /var/cache/conftool/dbconfig/20240207-092210-marostegui.json [09:22:13] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2156.codfw.wmnet with reason: Maintenance [09:22:14] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [09:22:27] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2156.codfw.wmnet with reason: Maintenance [09:22:28] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2186.codfw.wmnet with reason: Maintenance [09:22:42] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2186.codfw.wmnet with reason: Maintenance [09:22:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2156 (T355609)', diff saved to https://phabricator.wikimedia.org/P56416 and previous config saved to /var/cache/conftool/dbconfig/20240207-092248-marostegui.json [09:23:18] (03CR) 10Muehlenhoff: New cookbook to reboot/restart 
config-master hosts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/997887 (owner: 10Muehlenhoff) [09:23:56] (03CR) 10Muehlenhoff: [C: 03+2] Switch mariadb::core_test to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/998294 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [09:24:35] !log Doing security deploy for T356183 [09:24:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:45] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/998259 (owner: 10Majavah) [09:25:58] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Allow users to view the entire SSH key [software/bitu] - 10https://gerrit.wikimedia.org/r/997852 (https://phabricator.wikimedia.org/T351140) (owner: 10Slyngshede) [09:26:07] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/998260 (owner: 10Majavah) [09:26:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1160', diff saved to https://phabricator.wikimedia.org/P56417 and previous config saved to /var/cache/conftool/dbconfig/20240207-092614-root.json [09:30:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: mariadb::core_test [09:30:49] (03CR) 10Alexandros Kosiaris: "LGTM, but since you are already posting https://gerrit.wikimedia.org/r/c/operations/puppet/+/998260/1 as a followup, I suppose you intend " [puppet] - 10https://gerrit.wikimedia.org/r/998259 (owner: 10Majavah) [09:31:20] !log removing a bunch of old kernel versions from chartmuseum* to free ~3.5GB disk space [09:31:21] (03CR) 10Vgutierrez: [C: 04-1] "looking good, it needs some work though (check the inline comments)" [puppet] - 10https://gerrit.wikimedia.org/r/991438 (https://phabricator.wikimedia.org/T355190) (owner: 10BCornwall) [09:31:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:17] (03CR) 10Filippo Giunchedi: "See inline" [puppet] - 
10https://gerrit.wikimedia.org/r/997820 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi) [09:36:20] (03PS2) 10Filippo Giunchedi: confd: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997815 (https://phabricator.wikimedia.org/T337831) [09:36:22] (03PS2) 10Filippo Giunchedi: chartmuseum: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997816 (https://phabricator.wikimedia.org/T337831) [09:36:24] (03PS2) 10Filippo Giunchedi: docker_registry: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997817 (https://phabricator.wikimedia.org/T337831) [09:36:26] (03PS2) 10Filippo Giunchedi: mediawiki: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997819 (https://phabricator.wikimedia.org/T337831) [09:36:28] (03PS2) 10Filippo Giunchedi: etcd: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997820 (https://phabricator.wikimedia.org/T337831) [09:38:25] (03CR) 10Majavah: [C: 03+2] "I'm going to try, but I fear as long as `$::realm` is `'labs'` it's going to be somewhat difficult to do that cleanly." 
[puppet] - 10https://gerrit.wikimedia.org/r/998259 (owner: 10Majavah) [09:38:39] (03CR) 10Majavah: [C: 03+2] network::constants: use 'cloud' where possible [puppet] - 10https://gerrit.wikimedia.org/r/998260 (owner: 10Majavah) [09:39:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T355609)', diff saved to https://phabricator.wikimedia.org/P56418 and previous config saved to /var/cache/conftool/dbconfig/20240207-093953-marostegui.json [09:39:56] arnaudb: are you done with the old s4 replica, I need it for some schema changes (yes, I'm a vulture) [09:39:58] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [09:40:02] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [09:40:10] (03PS3) 10Majavah: P:wmcs: cloud_private_subnet: add route to private instance networks [puppet] - 10https://gerrit.wikimedia.org/r/998261 (https://phabricator.wikimedia.org/T356850) [09:41:09] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host elastic2107.codfw.wmnet [09:41:55] (03PS2) 10Muehlenhoff: New cookbook to reboot/restart config-master hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/997887 [09:42:44] (03CR) 10CI reject: [V: 04-1] P:wmcs: cloud_private_subnet: add route to private instance networks [puppet] - 10https://gerrit.wikimedia.org/r/998261 (https://phabricator.wikimedia.org/T356850) (owner: 10Majavah) [09:42:46] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/997887 (owner: 10Muehlenhoff) [09:43:17] Amir1: not yet, I am afk for a few moments and will be performing a schema update after [09:43:28] ping me once done :D [09:43:52] sure [09:44:28] (03PS4) 10Majavah: P:wmcs: cloud_private_subnet: add route to private instance networks [puppet] - 10https://gerrit.wikimedia.org/r/998261 (https://phabricator.wikimedia.org/T356850) [09:45:49] !log 
dreamyjazz Deployed security patch for T356183 [09:45:52] (03PS1) 10Majavah: network: rename 'labs' in data to 'cloud' [puppet] - 10https://gerrit.wikimedia.org/r/998299 [09:46:18] (03CR) 10Clément Goubert: [C: 03+1] service: move thumbor from thumbor pool to kubesvc [puppet] - 10https://gerrit.wikimedia.org/r/951545 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan) [09:46:25] (SystemdUnitFailed) firing: (29) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:46:41] (03CR) 10Clément Goubert: [C: 03+1] conftool: clean up thumbor pools [puppet] - 10https://gerrit.wikimedia.org/r/951546 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan) [09:46:59] (03CR) 10CI reject: [V: 04-1] network: rename 'labs' in data to 'cloud' [puppet] - 10https://gerrit.wikimedia.org/r/998299 (owner: 10Majavah) [09:47:42] (03CR) 10Muehlenhoff: [C: 03+2] New cookbook to reboot/restart config-master hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/997887 (owner: 10Muehlenhoff) [09:48:16] (03PS1) 10Brouberol: superset: fix database connection test for our mysql DBs [puppet] - 10https://gerrit.wikimedia.org/r/998300 (https://phabricator.wikimedia.org/T335356) [09:48:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host elastic2107.codfw.wmnet [09:50:12] (03PS1) 10Slyngshede: Add links with information to footer. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/998301 (https://phabricator.wikimedia.org/T351137) [09:50:48] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1299/co" [puppet] - 10https://gerrit.wikimedia.org/r/998261 (https://phabricator.wikimedia.org/T356850) (owner: 10Majavah) [09:51:11] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/998299 (owner: 10Majavah) [09:51:26] (SystemdUnitFailed) firing: (38) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:51:31] (03CR) 10Btullis: [C: 03+1] "Great! Thanks for looking into this." 
[puppet] - 10https://gerrit.wikimedia.org/r/998300 (https://phabricator.wikimedia.org/T335356) (owner: 10Brouberol) [09:52:05] (03CR) 10Brouberol: [C: 03+2] superset: fix database connection test for our mysql DBs [puppet] - 10https://gerrit.wikimedia.org/r/998300 (https://phabricator.wikimedia.org/T335356) (owner: 10Brouberol) [09:53:07] (03CR) 10Alexandros Kosiaris: [C: 03+1] conftool: clean up thumbor pools [puppet] - 10https://gerrit.wikimedia.org/r/951546 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan) [09:53:20] (03CR) 10Alexandros Kosiaris: [C: 03+1] service: move thumbor from thumbor pool to kubesvc [puppet] - 10https://gerrit.wikimedia.org/r/951545 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan) [09:55:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P56419 and previous config saved to /var/cache/conftool/dbconfig/20240207-095500-marostegui.json [09:55:42] 10SRE, 10Wikimedia-Etherpad, 10collaboration-services: Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421 (10LSobanski) [09:56:20] (03PS22) 10Brouberol: Add a deployment chart for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987785 (https://phabricator.wikimedia.org/T352166) (owner: 10Btullis) [09:56:26] 10SRE, 10Wikimedia-Etherpad, 10collaboration-services: Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421 (10LSobanski) I updated the description to reflect the new Etherpad release (1.9.7). See below for a list of changes: * Notable enhancements and fixes ** Added Live Plug... 
[09:57:08] (03PS2) 10Majavah: network: rename 'labs' in data to 'cloud' [puppet] - 10https://gerrit.wikimedia.org/r/998299 [09:59:26] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/998299 (owner: 10Majavah) [10:00:10] (03CR) 10Majavah: [V: 03+1] "The change to ferm config looks ok, only ordering and comments are changed." [puppet] - 10https://gerrit.wikimedia.org/r/998299 (owner: 10Majavah) [10:00:41] (03CR) 10Brouberol: Add a deployment chart for Superset (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/987785 (https://phabricator.wikimedia.org/T352166) (owner: 10Btullis) [10:00:48] (ProbeDown) firing: Service build2001:873 has failed probes (tcp_package_builder_rsync_ip6) - https://wikitech.wikimedia.org/wiki/Debian_Packaging#Upload_to_Wikimedia_Repo - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:10:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P56420 and previous config saved to /var/cache/conftool/dbconfig/20240207-101006-marostegui.json [10:11:56] (03PS1) 10MVernon: swift: removed drained ms-be10[44-50] from the rings [puppet] - 10https://gerrit.wikimedia.org/r/998305 (https://phabricator.wikimedia.org/T353149) [10:12:28] !log Continuing security deploy for T356183 [10:12:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:09] (03CR) 10Hnowlan: [C: 03+1] sessionstore: provision sessionstore2006 (new) [puppet] - 10https://gerrit.wikimedia.org/r/997993 (https://phabricator.wikimedia.org/T356829) (owner: 10Eevans) [10:15:34] (03CR) 10Hnowlan: [C: 03+1] sessionstore: provision sessionstore2005 (new) [puppet] - 
10https://gerrit.wikimedia.org/r/997992 (https://phabricator.wikimedia.org/T356829) (owner: 10Eevans) [10:16:17] (03CR) 10Hnowlan: [C: 03+1] sessionstore: provision sessionstore2004 (new) [puppet] - 10https://gerrit.wikimedia.org/r/997991 (https://phabricator.wikimedia.org/T356829) (owner: 10Eevans) [10:17:42] (SystemdUnitFailed) firing: generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:19:09] !log dreamyjazz Deployed security patch for T356183 [10:21:26] (SystemdUnitFailed) firing: (31) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:22:40] (03CR) 10MVernon: [C: 03+1] sessionstore: provision sessionstore2004 (new) [puppet] - 10https://gerrit.wikimedia.org/r/997991 (https://phabricator.wikimedia.org/T356829) (owner: 10Eevans) [10:23:06] (03CR) 10MVernon: [C: 03+1] sessionstore: provision sessionstore2005 (new) [puppet] - 10https://gerrit.wikimedia.org/r/997992 (https://phabricator.wikimedia.org/T356829) (owner: 10Eevans) [10:23:21] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps1009.eqiad.wmnet [10:23:28] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps2009.codfw.wmnet [10:23:36] (03CR) 10MVernon: [C: 03+1] sessionstore: provision sessionstore2006 (new) [puppet] - 10https://gerrit.wikimedia.org/r/997993 (https://phabricator.wikimedia.org/T356829) (owner: 10Eevans) [10:24:12] (03PS1) 10Volans: setup.py: actually use install_requires [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/998326 [10:24:36] !log Finished security deploys for T356183 [10:24:38] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:51] (03CR) 10Marostegui: [C: 03+1] swift: removed drained ms-be10[44-50] from the rings [puppet] - 10https://gerrit.wikimedia.org/r/998305 (https://phabricator.wikimedia.org/T353149) (owner: 10MVernon) [10:25:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T355609)', diff saved to https://phabricator.wikimedia.org/P56421 and previous config saved to /var/cache/conftool/dbconfig/20240207-102513-marostegui.json [10:25:15] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2177.codfw.wmnet with reason: Maintenance [10:25:18] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [10:25:29] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2177.codfw.wmnet with reason: Maintenance [10:25:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2177 (T355609)', diff saved to https://phabricator.wikimedia.org/P56422 and previous config saved to /var/cache/conftool/dbconfig/20240207-102535-marostegui.json [10:26:02] (03CR) 10Slyngshede: setup.py: actually use install_requires (031 comment) [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/998326 (owner: 10Volans) [10:26:08] (03CR) 10MVernon: [C: 03+2] swift: removed drained ms-be10[44-50] from the rings [puppet] - 10https://gerrit.wikimedia.org/r/998305 (https://phabricator.wikimedia.org/T353149) (owner: 10MVernon) [10:26:26] (SystemdUnitFailed) firing: (31) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:26:46] (03PS2) 10Volans: setup.py: actually use install_requires [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/998326 [10:26:56] (03CR) 10Volans: setup.py: actually use 
install_requires (031 comment) [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/998326 (owner: 10Volans) [10:27:44] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/993797 (owner: 10JHathaway) [10:28:16] (03CR) 10Jbond: [C: 03+1] P:kerberos::kadminserver absent Icinga check [puppet] - 10https://gerrit.wikimedia.org/r/995181 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [10:28:51] (03CR) 10Jbond: [C: 03+1] "lgtm, may want to check nothing in cloud use the old method" [puppet] - 10https://gerrit.wikimedia.org/r/995183 (owner: 10Muehlenhoff) [10:29:28] (03PS3) 10Volans: setup.py: actually use install_requires [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/998326 [10:29:58] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/995211 (https://phabricator.wikimedia.org/T356174) (owner: 10Muehlenhoff) [10:30:18] (03CR) 10CI reject: [V: 04-1] setup.py: actually use install_requires [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/998326 (owner: 10Volans) [10:35:50] (03CR) 10Jbond: [C: 03+1] "lgtm see comment inline" [puppet] - 10https://gerrit.wikimedia.org/r/995213 (https://phabricator.wikimedia.org/T356174) (owner: 10Muehlenhoff) [10:36:52] !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [10:37:08] !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [10:37:55] (03PS4) 10Volans: setup.py: actually use install_requires [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/998326 [10:39:19] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[10:39:42] (03CR) 10Jbond: [C: 03+1] "LGTM may want to add cole or kieth" [puppet] - 10https://gerrit.wikimedia.org/r/997555 (owner: 10JHathaway) [10:40:31] (03CR) 10Slyngshede: [C: 03+1] "LGTM" [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/998326 (owner: 10Volans) [10:41:40] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/998299 (owner: 10Majavah) [10:41:50] (03CR) 10Clément Goubert: [C: 03+1] confd: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997815 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi) [10:42:21] (03PS1) 10Klausman: admin_ng: drop version on apiGroups perms for exp NS in LiftWing [deployment-charts] - 10https://gerrit.wikimedia.org/r/998330 (https://phabricator.wikimedia.org/T354516) [10:43:26] (03CR) 10Arnaudb: [C: 03+1] mariadb: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997821 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi) [10:43:42] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:44:13] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host elastic2108.codfw.wmnet [10:44:22] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[10:45:59] (03CR) 10Volans: [C: 03+2] setup.py: actually use install_requires [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/998326 (owner: 10Volans) [10:47:11] (03CR) 10Clément Goubert: confd: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997815 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi) [10:47:55] (03Merged) 10jenkins-bot: setup.py: actually use install_requires [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/998326 (owner: 10Volans) [10:47:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T355609)', diff saved to https://phabricator.wikimedia.org/P56423 and previous config saved to /var/cache/conftool/dbconfig/20240207-104757-marostegui.json [10:48:02] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [10:48:43] (03PS2) 10Slyngshede: Add links with information to footer. [software/bitu] - 10https://gerrit.wikimedia.org/r/998301 (https://phabricator.wikimedia.org/T351137) [10:51:14] (03PS1) 10Volans: CHANGELOG: add changelogs for release v0.3.5 [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/998335 [10:51:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host elastic2108.codfw.wmnet [10:51:43] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: mariadb::parsercache [10:53:02] (03PS1) 10Muehlenhoff: Switch mariadb::parsercache to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/998336 (https://phabricator.wikimedia.org/T349619) [10:53:20] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v0.3.5 [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/998335 (owner: 10Volans) [10:53:47] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host elastic2109.codfw.wmnet [10:53:58] (03CR) 10Marostegui: [C: 03+2] mariadb: remove nrpe::monitor_systemd_unit_state [puppet] - 
10https://gerrit.wikimedia.org/r/997821 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi) [10:54:51] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.3.5 [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/998335 (owner: 10Volans) [10:57:11] (03CR) 10Muehlenhoff: [C: 03+2] Switch mariadb::parsercache to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/998336 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:58:56] (03PS1) 10Btullis: Update data-engineering-postgresql bacula job defaults [puppet] - 10https://gerrit.wikimedia.org/r/998337 (https://phabricator.wikimedia.org/T316655) [11:00:04] (03CR) 10Btullis: "Thanks Jaime. I hadn't spotted that. I have added that change in a second patch in this chain." [puppet] - 10https://gerrit.wikimedia.org/r/997935 (https://phabricator.wikimedia.org/T316655) (owner: 10Btullis) [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240207T1100) [11:00:07] (03CR) 10Btullis: [C: 03+2] [DPE Postgres] Only backup the latest postgres dump file [puppet] - 10https://gerrit.wikimedia.org/r/997935 (https://phabricator.wikimedia.org/T316655) (owner: 10Btullis) [11:00:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host elastic2109.codfw.wmnet [11:00:13] (03CR) 10CI reject: [V: 04-1] Update data-engineering-postgresql bacula job defaults [puppet] - 10https://gerrit.wikimedia.org/r/998337 (https://phabricator.wikimedia.org/T316655) (owner: 10Btullis) [11:00:45] (03PS2) 10Alexandros Kosiaris: eventrouter: Add port 8080 to containerPorts [deployment-charts] - 10https://gerrit.wikimedia.org/r/992740 (https://phabricator.wikimedia.org/T355167) [11:02:00] (03PS1) 10Slyngshede: Improve unix username auto-fill [software/bitu] - 10https://gerrit.wikimedia.org/r/998338 (https://phabricator.wikimedia.org/T347634) [11:03:04] !log marostegui@cumin1002 dbctl 
commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P56424 and previous config saved to /var/cache/conftool/dbconfig/20240207-110304-marostegui.json [11:04:18] (03CR) 10Ilias Sarantopoulos: [C: 03+1] admin_ng: drop version on apiGroups perms for exp NS in LiftWing [deployment-charts] - 10https://gerrit.wikimedia.org/r/998330 (https://phabricator.wikimedia.org/T354516) (owner: 10Klausman) [11:06:02] (03CR) 10Clément Goubert: [C: 04-2] "Mediawiki nodes are buster (https://phabricator.wikimedia.org/T356787) and will be progressively re-imaged to be kubernetes worker nodes o" [puppet] - 10https://gerrit.wikimedia.org/r/997819 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi) [11:06:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: mariadb::parsercache [11:07:57] (03PS1) 10Giuseppe Lavagetto: WebVideoTranscodeJob: also add time limits [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/998318 (https://phabricator.wikimedia.org/T356780) [11:08:40] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance [11:08:43] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance [11:09:43] (03CR) 10JMeybohm: [C: 03+1] eventrouter: Add port 8080 to containerPorts [deployment-charts] - 10https://gerrit.wikimedia.org/r/992740 (https://phabricator.wikimedia.org/T355167) (owner: 10Alexandros Kosiaris) [11:10:51] (03CR) 10JMeybohm: [C: 03+1] "The crashloop detection SystemdUnitCrashLoop provides was not part of the nrpe check. 
So even if the SystemdUnitCrashLoop alert does not w" [puppet] - 10https://gerrit.wikimedia.org/r/997817 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi) [11:11:03] (03CR) 10JMeybohm: [C: 03+1] "The crashloop detection SystemdUnitCrashLoop provides was not part of the nrpe check. So even if the SystemdUnitCrashLoop alert does not w" [puppet] - 10https://gerrit.wikimedia.org/r/997816 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi) [11:11:51] (03CR) 10JMeybohm: [C: 03+1] etcd: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997820 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi) [11:14:20] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [11:14:42] (03CR) 10Clément Goubert: [V: 03+1] "alert[1001,2001].wikimedia.org,deploy2002.codfw.wmnet,deploy1002.eqiad.wmnet,mwmaint2002.codfw.wmnet,mwmaint1002.eqiad.wmnet,puppetmaster[" [puppet] - 10https://gerrit.wikimedia.org/r/997815 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi) [11:16:34] (03CR) 10JMeybohm: [C: 03+1] "The crashloop detection SystemdUnitCrashLoop provides was not part of the nrpe check. 
So even if the SystemdUnitCrashLoop alert does not w" [puppet] - 10https://gerrit.wikimedia.org/r/997819 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi) [11:18:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P56425 and previous config saved to /var/cache/conftool/dbconfig/20240207-111810-marostegui.json [11:18:20] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/998301 (https://phabricator.wikimedia.org/T351137) (owner: 10Slyngshede) [11:20:31] (03PS1) 10Btullis: Configure analytics.wikimedia.org to support large downloads [puppet] - 10https://gerrit.wikimedia.org/r/998345 (https://phabricator.wikimedia.org/T356792) [11:21:50] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1303/co" [puppet] - 10https://gerrit.wikimedia.org/r/998345 (https://phabricator.wikimedia.org/T356792) (owner: 10Btullis) [11:26:07] (03CR) 10Alexandros Kosiaris: [C: 03+1] "LGTM, thanks for following up!" 
[puppet] - 10https://gerrit.wikimedia.org/r/998299 (owner: 10Majavah) [11:26:24] (03CR) 10Alexandros Kosiaris: [C: 03+2] eventrouter: Add port 8080 to containerPorts [deployment-charts] - 10https://gerrit.wikimedia.org/r/992740 (https://phabricator.wikimedia.org/T355167) (owner: 10Alexandros Kosiaris) [11:27:20] (03CR) 10Majavah: [V: 03+1 C: 03+2] network: rename 'labs' in data to 'cloud' [puppet] - 10https://gerrit.wikimedia.org/r/998299 (owner: 10Majavah) [11:29:13] (03Merged) 10jenkins-bot: eventrouter: Add port 8080 to containerPorts [deployment-charts] - 10https://gerrit.wikimedia.org/r/992740 (https://phabricator.wikimedia.org/T355167) (owner: 10Alexandros Kosiaris) [11:29:29] (03CR) 10Clément Goubert: [V: 03+1 C: 03+1] "alert[1001,2001].wikimedia.org,deploy2002.codfw.wmnet,deploy1002.eqiad.wmnet,mwmaint2002.codfw.wmnet,mwmaint1002.eqiad.wmnet,puppetmaster[" [puppet] - 10https://gerrit.wikimedia.org/r/997815 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi) [11:29:55] (03CR) 10Clément Goubert: [C: 03+1] "The nodes being buster only means they will not benefit from the crashloop detection." [puppet] - 10https://gerrit.wikimedia.org/r/997819 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi) [11:31:58] 10SRE, 10observability, 10Sustainability (Incident Followup): thanos-query probedown due to OOM of both eqiad titan frontends - https://phabricator.wikimedia.org/T356788 (10fgiunchedi) Thank you for the report and investigation, I took this chance to update https://wikitech.wikimedia.org/wiki/Thanos and make... 
[11:33:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T355609)', diff saved to https://phabricator.wikimedia.org/P56426 and previous config saved to /var/cache/conftool/dbconfig/20240207-113317-marostegui.json [11:33:19] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2190.codfw.wmnet with reason: Maintenance [11:33:28] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [11:33:33] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2190.codfw.wmnet with reason: Maintenance [11:33:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2190 (T355609)', diff saved to https://phabricator.wikimedia.org/P56427 and previous config saved to /var/cache/conftool/dbconfig/20240207-113339-marostegui.json [11:34:48] PROBLEM - Check systemd state on mw1363 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:35:33] (03CR) 10Jcrespo: [C: 04-1] "It is missing the if ending bracket, otherwise it looks good. :-)" [puppet] - 10https://gerrit.wikimedia.org/r/998337 (https://phabricator.wikimedia.org/T316655) (owner: 10Btullis) [11:36:25] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:36:44] (03PS2) 10Btullis: Update data-engineering-postgresql bacula job defaults [puppet] - 10https://gerrit.wikimedia.org/r/998337 (https://phabricator.wikimedia.org/T316655) [11:37:35] moritzm: still interested in k8s hosts failing ferm or should I just go ahead and manually restart that one? 
[11:41:18] (03PS1) 10Volans: Upstream release v0.3.5 [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/998355 [11:41:25] (SystemdUnitFailed) firing: (3) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:42:01] (03CR) 10Volans: Upstream release v0.3.5 (031 comment) [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/998355 (owner: 10Volans) [11:42:04] (03CR) 10Jcrespo: [C: 03+1] Update data-engineering-postgresql bacula job defaults [puppet] - 10https://gerrit.wikimedia.org/r/998337 (https://phabricator.wikimedia.org/T316655) (owner: 10Btullis) [11:44:18] (03PS5) 10Gmodena: WIP - add webrequest.frontend stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983905 (https://phabricator.wikimedia.org/T314956) (owner: 10Ottomata) [11:45:04] (03CR) 10Btullis: [C: 03+1] Add a deployment chart for Superset (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/987785 (https://phabricator.wikimedia.org/T352166) (owner: 10Btullis) [11:45:13] (03PS1) 10Majavah: wikireplicas: maintain-views: try depooling host on lock failure [puppet] - 10https://gerrit.wikimedia.org/r/998356 (https://phabricator.wikimedia.org/T300427) [11:45:29] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/998261 (https://phabricator.wikimedia.org/T356850) (owner: 10Majavah) [11:45:42] claime: just restart them, when I find some time I'll add a toil class for it, but not this week [11:45:52] (03CR) 10Majavah: "This is untested for now." 
[puppet] - 10https://gerrit.wikimedia.org/r/998356 (https://phabricator.wikimedia.org/T300427) (owner: 10Majavah) [11:46:05] ack [11:47:20] RECOVERY - Check systemd state on mw1363 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:47:33] (03CR) 10Majavah: [V: 03+1 C: 03+2] P:wmcs: cloud_private_subnet: add route to private instance networks [puppet] - 10https://gerrit.wikimedia.org/r/998261 (https://phabricator.wikimedia.org/T356850) (owner: 10Majavah) [11:48:55] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master-codfw [11:49:19] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors [11:49:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors [11:50:17] (03CR) 10Slyngshede: [C: 03+1] "Looks good, agree on keeping the apt dependency." 
[software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/998355 (owner: 10Volans) [11:51:14] (03CR) 10Volans: [C: 03+2] Upstream release v0.3.5 [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/998355 (owner: 10Volans) [11:51:18] (03CR) 10Btullis: [C: 03+2] Update data-engineering-postgresql bacula job defaults [puppet] - 10https://gerrit.wikimedia.org/r/998337 (https://phabricator.wikimedia.org/T316655) (owner: 10Btullis) [11:51:25] (SystemdUnitFailed) firing: (4) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:51:49] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/998355 (owner: 10Volans) [11:52:05] (03PS1) 10Clément Goubert: mw-web, mw-api-ext: Raise replicas for 40% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/998358 (https://phabricator.wikimedia.org/T355532) [11:52:49] (03Merged) 10jenkins-bot: Upstream release v0.3.5 [software/debmonitor-client] (debian) - 10https://gerrit.wikimedia.org/r/998355 (owner: 10Volans) [11:52:52] (03PS1) 10Clément Goubert: trafficserver: move 40% of traffic to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/998359 (https://phabricator.wikimedia.org/T355532) [11:53:38] PROBLEM - Check systemd state on mw1473 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:53:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master-codfw [11:54:58] RECOVERY - Check systemd state on mw1473 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:55:10] PROBLEM - Check whether 
ferm is active by checking the default input chain on mw1381 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:56:24] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master-eqiad [11:56:25] (SystemdUnitFailed) firing: (4) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:56:44] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors [11:56:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors [11:56:54] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host lvs1013.eqiad.wmnet [11:58:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T355609)', diff saved to https://phabricator.wikimedia.org/P56428 and previous config saved to /var/cache/conftool/dbconfig/20240207-115849-marostegui.json [11:58:54] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [12:01:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master-eqiad [12:01:25] (SystemdUnitFailed) firing: (4) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:02:10] !log uploaded debmonitor-client_0.3.5 to apt.wikimedia.org buster-wikimedia,bullseye-wikimedia,bookworm-wikimedia [12:02:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[12:02:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1013.eqiad.wmnet [12:02:35] (03CR) 10Hnowlan: [C: 03+1] trafficserver: move 40% of traffic to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/998359 (https://phabricator.wikimedia.org/T355532) (owner: 10Clément Goubert) [12:02:55] (03CR) 10Hnowlan: [C: 03+1] mw-web, mw-api-ext: Raise replicas for 40% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/998358 (https://phabricator.wikimedia.org/T355532) (owner: 10Clément Goubert) [12:03:42] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:06:25] (SystemdUnitFailed) resolved: (4) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:09:10] (03CR) 10Clément Goubert: [C: 03+2] mw-web, mw-api-ext: Raise replicas for 40% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/998358 (https://phabricator.wikimedia.org/T355532) (owner: 10Clément Goubert) [12:10:14] (03Merged) 10jenkins-bot: mw-web, mw-api-ext: Raise replicas for 40% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/998358 (https://phabricator.wikimedia.org/T355532) (owner: 10Clément Goubert) [12:10:44] (03CR) 10Btullis: "Looking great! A few questions on the networkpolicies, but all good." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/987786 (https://phabricator.wikimedia.org/T353791) (owner: 10Btullis) [12:12:00] !log mw-web, mw-api-ext: Raise replicas for 40% traffic - T355532 [12:12:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:04] T355532: Move 40% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T355532 [12:12:22] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [12:12:35] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [12:12:41] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [12:13:19] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [12:13:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P56429 and previous config saved to /var/cache/conftool/dbconfig/20240207-121356-marostegui.json [12:14:03] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [12:14:06] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host lvs1014.eqiad.wmnet [12:14:14] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [12:14:24] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [12:14:32] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [12:16:36] (03PS23) 10Brouberol: Add a deployment chart for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987785 (https://phabricator.wikimedia.org/T352166) (owner: 10Btullis) [12:17:06] arnaudb: Emperor: Heads up, raising mw-on-k8s traffic to 40% external traffic [12:17:39] (03PS1) 10Slyngshede: Add informative titles to all pages. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/998365 (https://phabricator.wikimedia.org/T351136) [12:17:59] !log trafficserver: move 40% of traffic to mw on k8s - T355532 [12:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:04] T355532: Move 40% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T355532 [12:18:05] (03CR) 10Clément Goubert: [C: 03+2] trafficserver: move 40% of traffic to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/998359 (https://phabricator.wikimedia.org/T355532) (owner: 10Clément Goubert) [12:18:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1014.eqiad.wmnet [12:19:10] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host lvs1015.eqiad.wmnet [12:21:18] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-base-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:21:41] (SystemdUnitFailed) firing: docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:24:46] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Add links with information to footer. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/998301 (https://phabricator.wikimedia.org/T351137) (owner: 10Slyngshede) [12:25:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1015.eqiad.wmnet [12:25:24] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host lvs1016.eqiad.wmnet [12:25:38] RECOVERY - Check whether ferm is active by checking the default input chain on mw1381 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:28:45] (03CR) 10Hnowlan: [C: 03+2] service: move thumbor from thumbor pool to kubesvc [puppet] - 10https://gerrit.wikimedia.org/r/951545 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan) [12:29:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P56430 and previous config saved to /var/cache/conftool/dbconfig/20240207-122903-marostegui.json [12:30:28] (03CR) 10Brouberol: Add a deployment chart for Superset (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/987785 (https://phabricator.wikimedia.org/T352166) (owner: 10Btullis) [12:31:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1016.eqiad.wmnet [12:31:25] !log hnowlan@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs1020*,lvs2014*} and A:lvs (T334488) [12:31:29] T334488: Final steps for fully-Kubernetes Thumbor - https://phabricator.wikimedia.org/T334488 [12:32:34] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs1020*,lvs2014*} and A:lvs (T334488) [12:33:51] !log jmm@cumin2002 START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir [12:33:55] (03PS1) 10Clément Goubert: docker-registry: Raise nginx timeouts to 240s [puppet] - 10https://gerrit.wikimedia.org/r/998392 [12:34:56] 
!log hnowlan@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs1019*,lvs2013*} and A:lvs (T334488) [12:35:53] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs1019*,lvs2013*} and A:lvs (T334488) [12:41:36] (03CR) 10Klausman: [C: 03+1] Configure analytics.wikimedia.org to support large downloads [puppet] - 10https://gerrit.wikimedia.org/r/998345 (https://phabricator.wikimedia.org/T356792) (owner: 10Btullis) [12:44:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T355609)', diff saved to https://phabricator.wikimedia.org/P56431 and previous config saved to /var/cache/conftool/dbconfig/20240207-124409-marostegui.json [12:44:15] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [12:45:25] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db2105.codfw.wmnet with reason: T344589 - kernel upgrade [12:45:39] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2105.codfw.wmnet with reason: T344589 - kernel upgrade [12:46:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'T344589 - depool db2105', diff saved to https://phabricator.wikimedia.org/P56432 and previous config saved to /var/cache/conftool/dbconfig/20240207-124605-arnaudb.json [12:54:34] PROBLEM - Check systemd state on ncredir2001 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ipip0.service,ifup@ipip60.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:57:25] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert) [12:57:51] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): Move 40% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T355532 
(10Clement_Goubert) 05Open→03Resolved [12:58:44] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert) [12:59:10] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:59:58] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:05:02] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:07:30] PROBLEM - Check systemd state on an-master1003 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-namenode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:07:32] PROBLEM - Hadoop Namenode - Primary on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Namenode_process [13:09:24] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:09:38] (03PS1) 10Majavah: openstack: overhaul the floating IP updater [puppet] - 10https://gerrit.wikimedia.org/r/998401 (https://phabricator.wikimedia.org/T341338) [13:10:14] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:10:26] RECOVERY - Check systemd state on an-master1003 is OK: OK - running: The system is fully 
operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:10:28] RECOVERY - Hadoop Namenode - Primary on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Namenode_process [13:11:30] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1304/co" [puppet] - 10https://gerrit.wikimedia.org/r/998401 (https://phabricator.wikimedia.org/T341338) (owner: 10Majavah) [13:12:20] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:13:00] (03CR) 10CI reject: [V: 04-1] openstack: overhaul the floating IP updater [puppet] - 10https://gerrit.wikimedia.org/r/998401 (https://phabricator.wikimedia.org/T341338) (owner: 10Majavah) [13:15:18] (03CR) 10Filippo Giunchedi: [C: 03+2] docker_registry: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997817 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi) [13:15:22] (03CR) 10Filippo Giunchedi: [C: 03+2] chartmuseum: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997816 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi) [13:15:26] (03CR) 10Filippo Giunchedi: [C: 03+2] etcd: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997820 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi) [13:15:30] (03CR) 10Filippo Giunchedi: [C: 03+2] confd: remove nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997815 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi) [13:15:33] (03CR) 10Filippo Giunchedi: [C: 03+2] mediawiki: remove 
nrpe::monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/997819 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi) [13:23:28] (03PS2) 10Majavah: openstack: overhaul the floating IP updater [puppet] - 10https://gerrit.wikimedia.org/r/998401 (https://phabricator.wikimedia.org/T341338) [13:24:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2105 (re)pooling @ 15%: kernel upgrade done', diff saved to https://phabricator.wikimedia.org/P56433 and previous config saved to /var/cache/conftool/dbconfig/20240207-132402-arnaudb.json [13:24:57] (03PS1) 10Hnowlan: kubernetes: make 5 appservers k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/998403 (https://phabricator.wikimedia.org/T351074) [13:25:10] PROBLEM - Check systemd state on an-master1003 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-namenode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:25:10] PROBLEM - Hadoop Namenode - Primary on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Namenode_process [13:25:34] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on es2024.codfw.wmnet with reason: T344589 - kernel upgrade [13:25:38] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1305/co" [puppet] - 10https://gerrit.wikimedia.org/r/998401 (https://phabricator.wikimedia.org/T341338) (owner: 10Majavah) [13:25:48] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2024.codfw.wmnet with reason: T344589 - kernel upgrade [13:26:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'T344589 - depool es2024', diff saved to https://phabricator.wikimedia.org/P56434 and previous config 
saved to /var/cache/conftool/dbconfig/20240207-132559-arnaudb.json [13:26:56] (03CR) 10CI reject: [V: 04-1] openstack: overhaul the floating IP updater [puppet] - 10https://gerrit.wikimedia.org/r/998401 (https://phabricator.wikimedia.org/T341338) (owner: 10Majavah) [13:28:37] (03PS1) 10Majavah: templates/56.15.185.in-addr.arpa: add missing includes [dns] - 10https://gerrit.wikimedia.org/r/998404 (https://phabricator.wikimedia.org/T341338) [13:29:05] (03PS3) 10Majavah: openstack: overhaul the floating IP updater [puppet] - 10https://gerrit.wikimedia.org/r/998401 (https://phabricator.wikimedia.org/T341338) [13:29:36] (03CR) 10Arturo Borrero Gonzalez: openstack: overhaul the floating IP updater (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/998401 (https://phabricator.wikimedia.org/T341338) (owner: 10Majavah) [13:32:40] !log jmm@cumin2002 END (FAIL) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=1) rolling reboot on A:ncredir [13:34:40] (03CR) 10Slyngshede: "LGTM, this does roll out the nrpe script to all hosts though, but we're removing it in a little while, so I think it's fine." 
[puppet] - 10https://gerrit.wikimedia.org/r/997801 (https://phabricator.wikimedia.org/T332764) (owner: 10Filippo Giunchedi) [13:35:24] (03PS1) 10Bartosz Dziewoński: ParserObserver: Limit the size of cache of previous parse traces [core] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/998384 (https://phabricator.wikimedia.org/T351732) [13:35:29] (03PS1) 10Bartosz Dziewoński: ParserObserver: Limit the size of cache of previous parse traces [core] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/998385 (https://phabricator.wikimedia.org/T351732) [13:36:41] (03PS4) 10Majavah: openstack: overhaul the floating IP updater [puppet] - 10https://gerrit.wikimedia.org/r/998401 (https://phabricator.wikimedia.org/T341338) [13:38:12] (03PS5) 10Majavah: openstack: overhaul the floating IP updater [puppet] - 10https://gerrit.wikimedia.org/r/998401 (https://phabricator.wikimedia.org/T341338) [13:39:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2105 (re)pooling @ 30%: kernel upgrade done', diff saved to https://phabricator.wikimedia.org/P56435 and previous config saved to /var/cache/conftool/dbconfig/20240207-133907-arnaudb.json [13:39:27] RECOVERY - Hadoop Namenode - Primary on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Namenode_process [13:39:30] (03CR) 10Majavah: openstack: overhaul the floating IP updater (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/998401 (https://phabricator.wikimedia.org/T341338) (owner: 10Majavah) [13:39:59] RECOVERY - Check systemd state on an-master1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:40:50] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1307/co" [puppet] - 
10https://gerrit.wikimedia.org/r/998401 (https://phabricator.wikimedia.org/T341338) (owner: 10Majavah) [13:48:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2024 (re)pooling @ 15%: kernel upgrade done', diff saved to https://phabricator.wikimedia.org/P56436 and previous config saved to /var/cache/conftool/dbconfig/20240207-134801-arnaudb.json [13:48:56] (03CR) 10Klausman: [C: 03+2] admin_ng: drop version on apiGroups perms for exp NS in LiftWing [deployment-charts] - 10https://gerrit.wikimedia.org/r/998330 (https://phabricator.wikimedia.org/T354516) (owner: 10Klausman) [13:51:35] (03Merged) 10jenkins-bot: admin_ng: drop version on apiGroups perms for exp NS in LiftWing [deployment-charts] - 10https://gerrit.wikimedia.org/r/998330 (https://phabricator.wikimedia.org/T354516) (owner: 10Klausman) [13:52:22] !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [13:52:35] !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [13:52:43] confd maintenance? [13:52:48] !log klausman@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [13:53:31] !log klausman@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [13:53:38] !log klausman@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [13:54:12] !log klausman@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. 
[13:54:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2105 (re)pooling @ 60%: kernel upgrade done', diff saved to https://phabricator.wikimedia.org/P56437 and previous config saved to /var/cache/conftool/dbconfig/20240207-135412-arnaudb.json [13:56:19] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:57:00] (03CR) 10Alexandros Kosiaris: [C: 03+1] docker-registry: Raise nginx timeouts to 240s [puppet] - 10https://gerrit.wikimedia.org/r/998392 (owner: 10Clément Goubert) [13:57:26] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/998338 (https://phabricator.wikimedia.org/T347634) (owner: 10Slyngshede) [13:57:47] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:59:03] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.652 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:59:46] (03CR) 10Vgutierrez: "any chance of increasing the timeout rather than disabling it?" [puppet] - 10https://gerrit.wikimedia.org/r/998345 (https://phabricator.wikimedia.org/T356792) (owner: 10Btullis) [13:59:48] 10SRE, 10Machine-Learning-Team, 10Patch-For-Review: Requesting write access to ml-staging-codfw for ML team - https://phabricator.wikimedia.org/T354516 (10klausman) After dropping the version specifiers (`/v...`) at the end of the `apiGroups` directives, this is now working properly. [14:00:00] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor When your hammer is PHP, everything starts looking like a thumb. 
Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240207T1400). [14:00:05] MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:49] (ProbeDown) firing: Service build2001:873 has failed probes (tcp_package_builder_rsync_ip6) - https://wikitech.wikimedia.org/wiki/Debian_Packaging#Upload_to_Wikimedia_Repo - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:00:51] hi [14:00:59] o/ [14:01:00] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:01:08] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:01:15] I can deploy, I guess ^^ [14:01:35] (03PS3) 10Muehlenhoff: debmonitor: Remove legacy cert handling [puppet] - 10https://gerrit.wikimedia.org/r/995183 [14:02:12] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/Flow] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/997878 (https://phabricator.wikimedia.org/T356223) (owner: 10Jforrester) [14:02:20] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/Flow] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/997877 (https://phabricator.wikimedia.org/T356223) (owner: 10Jforrester) [14:02:39] all of my changes aren't really testable on mwdebug [14:02:46] (03CR) 10CI reject: [V: 04-1] debmonitor: Remove legacy cert handling [puppet] - 10https://gerrit.wikimedia.org/r/995183 (owner: 10Muehlenhoff) [14:02:47] the Flow fix will hopefully show up in the logs [14:02:58] the core fix is in preparation for a maintenance script i want to run [14:03:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2024 (re)pooling @ 30%: kernel upgrade done', diff saved to https://phabricator.wikimedia.org/P56438 and previous config saved to /var/cache/conftool/dbconfig/20240207-140306-arnaudb.json [14:03:24] aha, I see you found the cause of the memory leak \o/ [14:03:24] (actually, if you're feeling bored, you could start that maintenance script for me ;) it's https://phabricator.wikimedia.org/T315510) [14:03:27] yeah, that seems hardly fixable [14:03:31] *testable [14:03:43] (03CR) 10Muehlenhoff: "Cloud uses "puppet", so that's fine as well." 
[puppet] - 10https://gerrit.wikimedia.org/r/995183 (owner: 10Muehlenhoff) [14:03:46] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:04:08] MatmaRex: only after the core backport is done, I assume? [14:04:29] yeah [14:04:46] the commands are listed here: https://phabricator.wikimedia.org/T315510#9312431 [14:07:33] (03PS4) 10Muehlenhoff: debmonitor: Remove legacy cert handling [puppet] - 10https://gerrit.wikimedia.org/r/995183 [14:09:15] (03Merged) 10jenkins-bot: Fix PermissionException being logged [extensions/Flow] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/997878 (https://phabricator.wikimedia.org/T356223) (owner: 10Jforrester) [14:09:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2105 (re)pooling @ 75%: kernel upgrade done', diff saved to https://phabricator.wikimedia.org/P56439 and previous config saved to /var/cache/conftool/dbconfig/20240207-140918-arnaudb.json [14:09:23] (03Merged) 10jenkins-bot: Fix PermissionException being logged [extensions/Flow] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/997877 (https://phabricator.wikimedia.org/T356223) (owner: 10Jforrester) [14:09:47] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:997878|Fix PermissionException being logged (T356223)]], [[gerrit:997877|Fix PermissionException being logged (T356223)]] [14:09:51] T356223: Flow errors - Insufficient permissions to see userlinks for rev_id and InvalidTopicUuidException - https://phabricator.wikimedia.org/T356223 [14:10:25] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "starting gate-and-submit already" [core] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/998384 (https://phabricator.wikimedia.org/T351732) (owner: 10Bartosz Dziewoński) [14:10:32] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "starting gate-and-submit already" [core] (wmf/1.42.0-wmf.17) - 
10https://gerrit.wikimedia.org/r/998385 (https://phabricator.wikimedia.org/T351732) (owner: 10Bartosz Dziewoński) [14:11:14] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:11:20] !log lucaswerkmeister-wmde@deploy2002 jforrester and lucaswerkmeister-wmde: Backport for [[gerrit:997878|Fix PermissionException being logged (T356223)]], [[gerrit:997877|Fix PermissionException being logged (T356223)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:11:32] !log lucaswerkmeister-wmde@deploy2002 jforrester and lucaswerkmeister-wmde: Continuing with sync [14:11:47] (MatmaRex: ^ fyi) [14:11:56] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/995183 (owner: 10Muehlenhoff) [14:12:04] thanks [14:12:14] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:12:40] (03PS1) 10Majavah: network: make cloud_private_networks per_site [puppet] - 10https://gerrit.wikimedia.org/r/998411 [14:12:42] (03PS1) 10Majavah: P:wmcs::cloudgw: do not traffic to cloud-internal networks [puppet] - 10https://gerrit.wikimedia.org/r/998412 [14:13:23] (03PS2) 10Majavah: P:wmcs::cloudgw: do not traffic to cloud-internal networks [puppet] - 10https://gerrit.wikimedia.org/r/998412 (https://phabricator.wikimedia.org/T356850) [14:14:56] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:11] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1308/co" [puppet] - 10https://gerrit.wikimedia.org/r/998412 (https://phabricator.wikimedia.org/T356850) (owner: 10Majavah) [14:16:17] <_joe_> jouncebot: nowandnext [14:16:17] For the next 0 hour(s) and 43 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240207T1400) [14:16:17] In 0 hour(s) and 43 minute(s): Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240207T1500) [14:16:20] (03PS3) 10Majavah: P:wmcs::cloudgw: do not traffic to cloud-internal networks [puppet] - 10https://gerrit.wikimedia.org/r/998412 (https://phabricator.wikimedia.org/T356850) [14:16:22] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1309/console" [puppet] - 10https://gerrit.wikimedia.org/r/998411 (owner: 10Majavah) [14:16:36] <_joe_> Lucas_WMDE: when you're done, I have a backport for a train blocker :) [14:16:43] (03CR) 10Clément Goubert: [C: 03+2] docker-registry: Raise nginx timeouts to 240s [puppet] - 
10https://gerrit.wikimedia.org/r/998392 (owner: 10Clément Goubert) [14:16:55] (SystemdUnitFailed) firing: (7) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:17:24] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10Jclark-ctr) @dcaro The same response was sent for each case please advise how you would like me to proceed. De... [14:17:30] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1310/co" [puppet] - 10https://gerrit.wikimedia.org/r/998412 (https://phabricator.wikimedia.org/T356850) (owner: 10Majavah) [14:17:42] (SystemdUnitFailed) firing: generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:17:56] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:997878|Fix PermissionException being logged (T356223)]], [[gerrit:997877|Fix PermissionException being logged (T356223)]] (duration: 08m 08s) [14:17:59] _joe_: hm, I already started the gate-and-submit for the next backports :S [14:18:00] T356223: Flow errors - Insufficient permissions to see userlinks for rev_id and InvalidTopicUuidException - https://phabricator.wikimedia.org/T356223 [14:18:06] is it okay to wait until after that? 
[14:18:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2024 (re)pooling @ 60%: kernel upgrade done', diff saved to https://phabricator.wikimedia.org/P56440 and previous config saved to /var/cache/conftool/dbconfig/20240207-141812-arnaudb.json [14:18:13] (Zuul predicts 12/13 minutes ETA for those) [14:18:19] <_joe_> Lucas_WMDE: yeah I meant when you're done with the rest [14:18:23] ok [14:18:28] (03PS4) 10Majavah: P:wmcs::cloudgw: do not traffic to cloud-internal networks [puppet] - 10https://gerrit.wikimedia.org/r/998412 (https://phabricator.wikimedia.org/T356850) [14:18:30] (03PS1) 10Majavah: network: add cloud-codfw-bgp-private-vips [puppet] - 10https://gerrit.wikimedia.org/r/998415 [14:18:32] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:33] (otherwise we could’ve squeezed in the backport, I meant) [14:19:02] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [core] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/998384 (https://phabricator.wikimedia.org/T351732) (owner: 10Bartosz Dziewoński) [14:19:05] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [core] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/998385 (https://phabricator.wikimedia.org/T351732) (owner: 10Bartosz Dziewoński) [14:19:22] (03CR) 10Alexandros Kosiaris: [C: 03+1] network: make cloud_private_networks per_site [puppet] - 10https://gerrit.wikimedia.org/r/998411 (owner: 10Majavah) [14:19:31] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/995183 (owner: 10Muehlenhoff) [14:19:40] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:43] (03CR) 10Majavah: [V: 
03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1311/co" [puppet] - 10https://gerrit.wikimedia.org/r/998412 (https://phabricator.wikimedia.org/T356850) (owner: 10Majavah) [14:20:01] (03PS5) 10Majavah: P:wmcs::cloudgw: do not traffic to cloud-internal networks [puppet] - 10https://gerrit.wikimedia.org/r/998412 (https://phabricator.wikimedia.org/T356850) [14:20:11] (03CR) 10Majavah: [V: 03+1 C: 03+2] network: make cloud_private_networks per_site [puppet] - 10https://gerrit.wikimedia.org/r/998411 (owner: 10Majavah) [14:20:14] <_joe_> Lucas_WMDE: ah yeah, it's ok :) [14:20:31] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2139.codfw.wmnet with reason: Maintenance [14:20:55] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2139.codfw.wmnet with reason: Maintenance [14:21:14] _joe_: okay, then I’ll ping you :) [14:21:16] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1312/co" [puppet] - 10https://gerrit.wikimedia.org/r/998412 (https://phabricator.wikimedia.org/T356850) (owner: 10Majavah) [14:21:30] ugh, one of them failed in selenium already [14:21:42] (SystemdUnitFailed) firing: (36) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:22:47] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "let’s try that again, a selenium job randomly failed" [core] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/998385 (https://phabricator.wikimedia.org/T351732) (owner: 10Bartosz Dziewoński) [14:22:56] not sure that actually works [14:23:10] might have to wait for the first backport to finish 
gate-and-submit [14:23:31] * Lucas_WMDE ignores the little shoulder demon that suggests submitting the change manually [14:24:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2105 (re)pooling @ 100%: kernel upgrade done', diff saved to https://phabricator.wikimedia.org/P56441 and previous config saved to /var/cache/conftool/dbconfig/20240207-142423-arnaudb.json [14:24:36] (03PS1) 10Tsevener: Add edit_interaction stream config for iOS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998416 (https://phabricator.wikimedia.org/T355265) [14:24:55] (03PS1) 10Filippo Giunchedi: icinga: use systemd::timer::job for 'update-etcd-mw-config-lastindex' [puppet] - 10https://gerrit.wikimedia.org/r/998417 (https://phabricator.wikimedia.org/T337831) [14:25:16] PROBLEM - Host sretest1001 is DOWN: PING CRITICAL - Packet loss = 100% [14:25:17] (03CR) 10Clément Goubert: [C: 03+1] "Also changed their type in the switch migration planning sheet" [puppet] - 10https://gerrit.wikimedia.org/r/998403 (https://phabricator.wikimedia.org/T351074) (owner: 10Hnowlan) [14:25:32] RECOVERY - Host sretest1001 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [14:25:56] i think it only works after jenkins reports the failure [14:26:15] jenkins reported the failure alright, I can see it in zuul [14:26:31] but I think it’ll only repeat the build when the other change in the gate-and-submit-wmf pipeline finishes [14:26:32] not on the gerrit change though [14:26:36] (03CR) 10Effie Mouzeli: mw-debug: set MCROUTER_SERVER variable (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/994789 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [14:26:44] so it can decide whether to base it on that (if it merges) or not (if it also fails) [14:26:53] hm, true [14:26:59] I think that might also be waiting for the same reason [14:27:23] if the previous change in the chain was bad, zuul would automatically retry the next change without it and not report an error on gerrit, 
probably [14:27:46] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1313/co" [puppet] - 10https://gerrit.wikimedia.org/r/998417 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi) [14:28:50] PROBLEM - Check systemd state on sretest1001 is CRITICAL: CRITICAL - degraded: The following units failed: networking.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:29] !log deploying debmonitor-client_0.3.5 fleet-wide [14:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:44] (03CR) 10Filippo Giunchedi: [V: 03+1] "Basically remove a bunch of legacy, and unblocks the nrpe::monitor_systemd_unit_state removal task" [puppet] - 10https://gerrit.wikimedia.org/r/998417 (https://phabricator.wikimedia.org/T337831) (owner: 10Filippo Giunchedi) [14:30:26] (03PS1) 10Slyngshede: LDAPBackend: Implement limit checks for UID [software/bitu] - 10https://gerrit.wikimedia.org/r/998418 [14:30:43] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Improve unix username auto-fill [software/bitu] - 10https://gerrit.wikimedia.org/r/998338 (https://phabricator.wikimedia.org/T347634) (owner: 10Slyngshede) [14:30:59] (03Merged) 10jenkins-bot: ParserObserver: Limit the size of cache of previous parse traces [core] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/998384 (https://phabricator.wikimedia.org/T351732) (owner: 10Bartosz Dziewoński) [14:31:04] (03CR) 10CI reject: [V: 04-1] ParserObserver: Limit the size of cache of previous parse traces [core] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/998385 (https://phabricator.wikimedia.org/T351732) (owner: 10Bartosz Dziewoński) [14:31:18] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [core] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/998385 
(https://phabricator.wikimedia.org/T351732) (owner: 10Bartosz Dziewoński) [14:31:26] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "*now* try again" [core] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/998385 (https://phabricator.wikimedia.org/T351732) (owner: 10Bartosz Dziewoński) [14:31:33] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [core] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/998385 (https://phabricator.wikimedia.org/T351732) (owner: 10Bartosz Dziewoński) [14:31:48] 20 more minutes of waiting probably, sorry :S [14:31:50] (03PS1) 10Majavah: P:openstack: rabbitmq: cleanup rabbitmq firewall [puppet] - 10https://gerrit.wikimedia.org/r/998419 (https://phabricator.wikimedia.org/T345610) [14:32:04] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet [14:32:29] (03PS2) 10Slyngshede: LDAPBackend: Implement limit checks for UID [software/bitu] - 10https://gerrit.wikimedia.org/r/998418 [14:32:36] !log mvernon@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ms-be2045.codfw.wmnet [14:32:46] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ms-be2045.codfw.wmnet [14:32:46] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) Will create a clone of db2169.codfw.wmnet onto db2194.codfw.wmnet [14:32:59] (03CR) 10CI reject: [V: 04-1] P:openstack: rabbitmq: cleanup rabbitmq firewall [puppet] - 10https://gerrit.wikimedia.org/r/998419 (https://phabricator.wikimedia.org/T345610) (owner: 10Majavah) [14:33:11] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1314/co" [puppet] - 10https://gerrit.wikimedia.org/r/998419 (https://phabricator.wikimedia.org/T345610) (owner: 10Majavah) [14:33:15] !log 
jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: elasticsearch::cirrus [14:33:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2024 (re)pooling @ 75%: kernel upgrade done', diff saved to https://phabricator.wikimedia.org/P56442 and previous config saved to /var/cache/conftool/dbconfig/20240207-143317-arnaudb.json [14:33:28] <_joe_> Lucas_WMDE: that's ok, I just want https://gerrit.wikimedia.org/r/c/mediawiki/extensions/TimedMediaHandler/+/998318 to go out so we might be able to unblock the train [14:33:32] (03CR) 10Effie Mouzeli: "As I discussed with Claime in previous commments, this variable is different than the others we set (which are all server related configur" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/994764 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [14:34:54] (03PS2) 10Majavah: P:openstack: rabbitmq: cleanup rabbitmq firewall [puppet] - 10https://gerrit.wikimedia.org/r/998419 (https://phabricator.wikimedia.org/T345610) [14:36:02] (03CR) 10CI reject: [V: 04-1] P:openstack: rabbitmq: cleanup rabbitmq firewall [puppet] - 10https://gerrit.wikimedia.org/r/998419 (https://phabricator.wikimedia.org/T345610) (owner: 10Majavah) [14:36:05] 10SRE, 10Wikimedia-Etherpad, 10collaboration-services: Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421 (10akosiaris) >>! In T316421#9520979, @Jelto wrote: > @akosiaris do you have any information or experience what upgrade path etherpad-lite has? I was not able to find an... [14:36:35] _joe_: hm, was ffmpegEncode() missing the * 1024 factor before? 
(I can see that in midiToAudioEncode() it just moved around) [14:36:49] <_joe_> both :D [14:36:53] (03PS1) 10Muehlenhoff: Switch elasticsearch::cirrus to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/998421 (https://phabricator.wikimedia.org/T349619) [14:37:01] <_joe_> and also adding a limit to wall clock time, I misread our configs [14:37:18] (03PS3) 10Majavah: P:openstack: rabbitmq: cleanup rabbitmq firewall [puppet] - 10https://gerrit.wikimedia.org/r/998419 (https://phabricator.wikimedia.org/T345610) [14:37:31] yeah, but the time limit is explained in the commit message and the memory bit isn’t :P [14:37:48] anyway, looks fine to backport :) [14:37:53] <_joe_> Additionally, add the same configuration to [14:37:54] <_joe_> ffmpegEncode() as well as midiEncode(). [14:37:56] <_joe_> :) [14:38:06] <_joe_> I just forgot to add the conf there, completely [14:38:14] <_joe_> that's what you get for writing patches during outages [14:38:28] (03CR) 10Muehlenhoff: [C: 03+2] Switch elasticsearch::cirrus to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/998421 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [14:38:45] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1316/co" [puppet] - 10https://gerrit.wikimedia.org/r/998419 (https://phabricator.wikimedia.org/T345610) (owner: 10Majavah) [14:39:29] limits that are silently in units other than “one” without mentioning it in the name are evil anyway ^^ [14:39:33] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:36] if it was $wgTranscodeBackgroundMemoryLimitInKiB it would’ve been more obvious [14:39:50] <_joe_> Lucas_WMDE: actually I want to make it in bytes 
[14:39:58] <_joe_> but not while unblocking the train [14:40:02] yeah, fair [14:40:12] !log mvernon@cumin2002 START - Cookbook sre.hosts.decommission for hosts ms-be[2044-2050].codfw.wmnet [14:41:34] (03PS1) 10Btullis: Configure analytics.wikimedia.org to support large downloads [puppet] - 10https://gerrit.wikimedia.org/r/998422 (https://phabricator.wikimedia.org/T356792) [14:41:58] <_joe_> Lucas_WMDE: looks like CI will fail for https://gerrit.wikimedia.org/r/c/mediawiki/core/+/998385?tab=checks [14:42:15] I think that’s the old check? [14:42:23] in https://integration.wikimedia.org/zuul/ it looks all green or blue at the moment [14:42:48] (03CR) 10CI reject: [V: 04-1] Configure analytics.wikimedia.org to support large downloads [puppet] - 10https://gerrit.wikimedia.org/r/998422 (https://phabricator.wikimedia.org/T356792) (owner: 10Btullis) [14:43:00] <_joe_> Lucas_WMDE: ah yeah gerrit's UI is confusing [14:43:21] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A2 from asw-a2-codfw to lsw1-a2-codfw - https://phabricator.wikimedia.org/T355861 (10klausman) [14:43:32] (03PS1) 10Majavah: P:openstack: radosgw: move to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/998423 [14:43:36] <_joe_> I'm a bit worried by how excruciatingly slow CI is [14:44:06] <_joe_> in case we're in an emergency we'll have to cherry-pick a patch to the deployment server [14:44:25] !log klausman@cumin2002 START - Cookbook sre.hosts.downtime for 3:00:00 on ml-cache2001.codfw.wmnet with reason: Machine network link move (T355861) [14:44:29] T355861: Migrate servers in codfw rack A2 from asw-a2-codfw to lsw1-a2-codfw - https://phabricator.wikimedia.org/T355861 [14:44:30] (ProbeDown) firing: (3) Service build2001:873 has failed probes (tcp_package_builder_rsync_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:44:42] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ml-cache2001.codfw.wmnet with reason: Machine network link move (T355861) [14:45:14] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A2 from asw-a2-codfw to lsw1-a2-codfw - https://phabricator.wikimedia.org/T355861 (10klausman) [14:45:22] (03PS2) 10Btullis: Configure analytics.wikimedia.org to support large downloads [puppet] - 10https://gerrit.wikimedia.org/r/998345 (https://phabricator.wikimedia.org/T356792) [14:45:42] (03Abandoned) 10Btullis: Configure analytics.wikimedia.org to support large downloads [puppet] - 10https://gerrit.wikimedia.org/r/998422 (https://phabricator.wikimedia.org/T356792) (owner: 10Btullis) [14:46:06] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1317/co" [puppet] - 10https://gerrit.wikimedia.org/r/998423 (owner: 10Majavah) [14:46:21] (03CR) 10Majavah: P:openstack: radosgw: move to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/998423 (owner: 10Majavah) [14:46:55] (03CR) 10Btullis: "OK, can do. I set for 5 minutes in the latest patchset." 
[puppet] - 10https://gerrit.wikimedia.org/r/998345 (https://phabricator.wikimedia.org/T356792) (owner: 10Btullis) [14:47:31] (03CR) 10Hnowlan: [C: 03+2] kubernetes: make 5 appservers k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/998403 (https://phabricator.wikimedia.org/T351074) (owner: 10Hnowlan) [14:47:44] (03CR) 10Klausman: [C: 03+1] "As a pessimist, I suspect 300s (5m) will bite us again in the future, but at least we'll know where to look 😉" [puppet] - 10https://gerrit.wikimedia.org/r/998345 (https://phabricator.wikimedia.org/T356792) (owner: 10Btullis) [14:48:18] (03PS1) 10Filippo Giunchedi: nrpe: remove monitor_systemd_unit_state [puppet] - 10https://gerrit.wikimedia.org/r/998424 (https://phabricator.wikimedia.org/T337831) [14:48:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es2024 (re)pooling @ 100%: kernel upgrade done', diff saved to https://phabricator.wikimedia.org/P56443 and previous config saved to /var/cache/conftool/dbconfig/20240207-144822-arnaudb.json [14:50:08] (03CR) 10Brouberol: Add helmfile deployments for Superset (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/987786 (https://phabricator.wikimedia.org/T353791) (owner: 10Btullis) [14:50:10] (03PS14) 10Brouberol: Add helmfile deployments for Superset [deployment-charts] - 10https://gerrit.wikimedia.org/r/987786 (https://phabricator.wikimedia.org/T353791) (owner: 10Btullis) [14:50:19] (03Merged) 10jenkins-bot: ParserObserver: Limit the size of cache of previous parse traces [core] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/998385 (https://phabricator.wikimedia.org/T351732) (owner: 10Bartosz Dziewoński) [14:50:45] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:998384|ParserObserver: Limit the size of cache of previous parse traces (T351732)]], [[gerrit:998385|ParserObserver: Limit the size of cache of previous parse traces (T351732)]] [14:50:48] T351732: Debug memory leak in maintenance script - 
https://phabricator.wikimedia.org/T351732 [14:50:50] !log reboot ncredir2001 [14:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:26] RECOVERY - Check systemd state on ncredir2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:51:50] topranks: ^^ [14:52:15] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and matmarex: Backport for [[gerrit:998384|ParserObserver: Limit the size of cache of previous parse traces (T351732)]], [[gerrit:998385|ParserObserver: Limit the size of cache of previous parse traces (T351732)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:52:34] nothing to test, or so I heard [14:52:36] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and matmarex: Continuing with sync [14:52:41] yeah [14:53:30] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:55:39] (03PS1) 10Btullis: Use the analytics-presto CNAME for workers and clients [puppet] - 10https://gerrit.wikimedia.org/r/998425 (https://phabricator.wikimedia.org/T336045) [14:56:23] MatmaRex: how long is that maintenance script expected to take? [14:56:41] (SystemdUnitFailed) resolved: (2) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:57:04] Lucas_WMDE: the enwiki one probably a couple of weeks. 
the other ones a couple of days [14:57:18] (03PS2) 10Btullis: Use the analytics-presto CNAME for workers and clients [puppet] - 10https://gerrit.wikimedia.org/r/998425 (https://phabricator.wikimedia.org/T336045) [14:57:29] !log reboot ncredir2001 [14:57:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:51] ok [14:58:25] (SystemdUnitFailed) firing: (8) prometheus-phpfpm-statustext-textfile.service Failed on mw1401:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:58:54] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:998384|ParserObserver: Limit the size of cache of previous parse traces (T351732)]], [[gerrit:998385|ParserObserver: Limit the size of cache of previous parse traces (T351732)]] (duration: 08m 08s) [14:58:57] T351732: Debug memory leak in maintenance script - https://phabricator.wikimedia.org/T351732 [14:58:58] (03CR) 10Majavah: "recheck" [software/bitu] - 10https://gerrit.wikimedia.org/r/998418 (owner: 10Slyngshede) [14:59:04] _joe_: you can backport now, I think [14:59:28] (03CR) 10CI reject: [V: 04-1] LDAPBackend: Implement limit checks for UID [software/bitu] - 10https://gerrit.wikimedia.org/r/998418 (owner: 10Slyngshede) [14:59:33] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:59:42] MatmaRex: and all of the commands in that comment are still needed? 
(asking because it’s a few months old and urbanecm had some comments afterwards) [14:59:50] (“that comment” = https://phabricator.wikimedia.org/T315510#9312431) [14:59:59] * urbanecm was summoned [15:00:06] Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240207T1500) [15:00:20] Lucas_WMDE: i'm not sure how far the scripts made it, so not all may be needed, but they won't hurt [15:00:26] ok [15:00:31] * Lucas_WMDE fires up a tmux [15:00:33] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [15:00:40] it seems easier to re-run them than to figure it out [15:00:57] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [15:00:59] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [15:01:12] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/998425 (https://phabricator.wikimedia.org/T336045) (owner: 10Btullis) [15:01:15] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [15:01:19] thank you :) [15:01:21] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:01:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1158 (T355609)', diff saved to https://phabricator.wikimedia.org/P56444 and previous config saved to /var/cache/conftool/dbconfig/20240207-150121-marostegui.json [15:01:34] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [15:01:42] (SystemdUnitFailed) firing: (30) 
docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:01:45] !log jmm@cumin2002 END (FAIL) - Cookbook sre.puppet.migrate-role (exit_code=99) for role: elasticsearch::cirrus [15:02:05] !log START lucaswerkmeister-wmde@mwmaint2002:~$ mwscript extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --wiki frwiki --current --all --touched-after=20230613000000 --start '["7544396"]' # T315510, in tmux [15:02:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:14] T315510: Start maintenance script to backfill talk page comment database - https://phabricator.wikimedia.org/T315510 [15:03:16] I’ll wait until it prints the next --start line before starting rowiki [15:03:25] (SystemdUnitFailed) resolved: (30) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:03:26] <_joe_> Lucas_WMDE: can we start merging my change? it will take 20 minutes anyways [15:03:33] _joe_: you can go ahead, I’m done [15:03:38] or should I deploy it? 
[15:03:51] <_joe_> Lucas_WMDE: if you already have a console :) [15:03:57] ok sure ^^ [15:04:05] <_joe_> <3 [15:04:05] (03PS1) 10Slyngshede: Use the ManifestStaticFilesStorage in production [software/bitu] - 10https://gerrit.wikimedia.org/r/998426 [15:04:09] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/998318 (https://phabricator.wikimedia.org/T356780) (owner: 10Giuseppe Lavagetto) [15:04:23] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:04:28] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2377.codfw.wmnet with OS bullseye [15:04:43] (03CR) 10CI reject: [V: 04-1] Use the ManifestStaticFilesStorage in production [software/bitu] - 10https://gerrit.wikimedia.org/r/998426 (owner: 10Slyngshede) [15:04:47] !log STOP script for T315510, forgot to tee it somewhere useful [15:04:49] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2378.codfw.wmnet with OS bullseye [15:04:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:29] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2406.codfw.wmnet with OS bullseye [15:05:30] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2301.codfw.wmnet with OS bullseye [15:05:32] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2310.codfw.wmnet with OS bullseye [15:05:52] !log START lucaswerkmeister-wmde@mwmaint2002:~$ mwscript extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --wiki frwiki --current --all --touched-after=20230613000000 --start '["7544396"]' | tee ~/T315510-frwiki # in tmux [15:05:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:33] 10SRE-swift-storage: Q3 ms backend 
refresh work - https://phabricator.wikimedia.org/T353149 (10MatthewVernon) [15:06:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T355609)', diff saved to https://phabricator.wikimedia.org/P56445 and previous config saved to /var/cache/conftool/dbconfig/20240207-150643-marostegui.json [15:06:48] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [15:07:25] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:07:36] !log mvernon@cumin2002 START - Cookbook sre.dns.netbox [15:08:47] (03PS3) 10Slyngshede: LDAPBackend: Implement limit checks for UID [software/bitu] - 10https://gerrit.wikimedia.org/r/998418 [15:08:58] MatmaRex: how long should it usually take to start seeing some more output from the script? [15:09:08] the frwiki one has just printed “Processing” and the first --start so far [15:09:13] none of the “Processed” messages yet [15:09:35] (haven’t started the script for other wikis yet) [15:09:40] (03CR) 10CI reject: [V: 04-1] LDAPBackend: Implement limit checks for UID [software/bitu] - 10https://gerrit.wikimedia.org/r/998418 (owner: 10Slyngshede) [15:10:25] !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ms-be[2044-2050].codfw.wmnet decommissioned, removing all IPs except the asset tag one - mvernon@cumin2002" [15:10:27] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:11:04] (03PS4) 10Slyngshede: LDAPBackend: Implement limit checks for UID [software/bitu] - 10https://gerrit.wikimedia.org/r/998418 [15:11:07] I guess it’s going through a lot of rows again that were already processed, and not printing output until it finds the point where 
it really has to resume? [15:11:26] Lucas_WMDE: hmm, not sure [15:11:49] or it could be stuck on some page that causes parsoid to hang [15:11:57] (03CR) 10CI reject: [V: 04-1] LDAPBackend: Implement limit checks for UID [software/bitu] - 10https://gerrit.wikimedia.org/r/998418 (owner: 10Slyngshede) [15:11:59] I’ll start the rowiki and see how it behaves [15:12:09] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:12:12] !log START lucaswerkmeister-wmde@mwmaint2002:~$ mwscript extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --wiki rowiki --current --all --touched-after=20230613000000 --start '["2041962"]' | tee ~/T315510-rowiki # in tmux [15:12:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:19] ok, that one is printing output directly [15:12:27] processed 100/200/300 (updated 0) [15:12:30] i can schedule this for another time if you want to be done for today [15:12:37] so it sounds like it should print something even when it has nothing to do 🤔 [15:12:43] hm [15:13:13] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:13:35] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:13:37] !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ms-be[2044-2050].codfw.wmnet decommissioned, removing all IPs except the asset tag one - mvernon@cumin2002" [15:13:38] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:13:39] !log mvernon@cumin2002 END (PASS) - Cookbook 
sre.hosts.decommission (exit_code=0) for hosts ms-be[2044-2050].codfw.wmnet [15:13:45] 10SRE-swift-storage: Q3 ms backend refresh work - https://phabricator.wikimedia.org/T353149 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by mvernon@cumin2002 for hosts: `ms-be[2044-2050].codfw.wmnet` - ms-be2044.codfw.wmnet (**PASS**) - Downtimed host on Icinga/Alertmanager - Found physic... [15:14:25] _joe_: I might as well ask this now – is it possible to test the WebVideoTranscodeJob backport on mwdebug? [15:14:33] (still ETA 9min in zuul btw) [15:14:36] (03PS5) 10Slyngshede: LDAPBackend: Implement limit checks for UID [software/bitu] - 10https://gerrit.wikimedia.org/r/998418 [15:14:47] <_joe_> Lucas_WMDE: no I don't think you can, given jobs go to the jobqueue [15:14:55] yeah, makes sense [15:14:59] <_joe_> and this change is contained to a job [15:15:31] (03CR) 10CI reject: [V: 04-1] LDAPBackend: Implement limit checks for UID [software/bitu] - 10https://gerrit.wikimedia.org/r/998418 (owner: 10Slyngshede) [15:16:27] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:16:34] (03PS6) 10Slyngshede: LDAPBackend: Implement limit checks for UID [software/bitu] - 10https://gerrit.wikimedia.org/r/998418 [15:18:46] (03CR) 10CI reject: [V: 04-1] LDAPBackend: Implement limit checks for UID [software/bitu] - 10https://gerrit.wikimedia.org/r/998418 (owner: 10Slyngshede) [15:18:50] 10SRE, 10observability, 10Sustainability (Incident Followup): thanos-query probedown due to OOM of both eqiad titan frontends - https://phabricator.wikimedia.org/T356788 (10lmata) Thanks for the report; we'll continue to investigate and discuss. [15:18:50] <_joe_> also, 9 minutes, wow. 
Jenkins is fast [15:19:07] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:19:39] 10SRE: sre - https://phabricator.wikimedia.org/T356881 (10Vecna-the-whispered) [15:19:49] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:20:11] (03CR) 10Ssingh: slo_definitions: Use trafficserver_backend_sli_bad (031 comment) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/973872 (https://phabricator.wikimedia.org/T341606) (owner: 10BCornwall) [15:20:21] !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2377.codfw.wmnet with reason: host reimage [15:20:29] !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2378.codfw.wmnet with reason: host reimage [15:21:04] MatmaRex: rough estimate for rowiki based on its current processing rate: a bit over 2 days [15:21:19] !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2406.codfw.wmnet with reason: host reimage [15:21:20] though so far it’s still “updated 0” all around, so who knows how much it’ll slow down once it actually has something to do :'D [15:21:43] !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2301.codfw.wmnet with reason: host reimage [15:21:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P56446 and previous config saved to /var/cache/conftool/dbconfig/20240207-152150-marostegui.json [15:21:59] !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2310.codfw.wmnet with reason: host reimage [15:22:02] still no output from frwiki btw o_O [15:22:07] * Lucas_WMDE peeks at htop [15:22:21] well, it’s barely eating any CPU [15:22:24] (03CR) 10Brouberol: [C: 03+1] "Yes please" 
[puppet] - 10https://gerrit.wikimedia.org/r/998425 (https://phabricator.wikimedia.org/T336045) (owner: 10Btullis) [15:22:50] wait, wrong process [15:22:55] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2377.codfw.wmnet with reason: host reimage [15:23:14] okay, the frwiki process *is* eating 100% of one CPU [15:23:24] 17 minutes of CPU time so far [15:23:36] 10SRE, 10ops-codfw, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A3 from asw-a3-codfw to lsw1-a3-codfw - https://phabricator.wikimedia.org/T355862 (10Jhancock.wm) This rack is physically ready [15:23:45] rowiki is more like 60% CPU (and making visible progress for it, of course) [15:23:55] well… something is eating memory though: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=mwmaint2002&viewPanel=4 [15:23:59] frwiki is also at 10.9G resident memory already [15:24:19] 10SRE: sre - https://phabricator.wikimedia.org/T356881 (10Bugreporter) 05Open→03Invalid [15:24:21] i think it's stuck on some specific page. 
there's a parsoid bug where some pages take infinite memory [15:24:29] (03Merged) 10jenkins-bot: WebVideoTranscodeJob: also add time limits [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/998318 (https://phabricator.wikimedia.org/T356780) (owner: 10Giuseppe Lavagetto) [15:24:31] yikes [15:24:33] that’s a lot of memory [15:24:38] <_joe_> yeah [15:24:42] yeah I’ll probably kill it in a few minutes [15:24:46] (this is not the memory leak that i hope i fixed) [15:24:51] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:998318|WebVideoTranscodeJob: also add time limits (T356780)]] [15:24:55] T356780: Video transcoding fails when firejail is enabled - https://phabricator.wikimedia.org/T356780 [15:25:09] i think you can stop it, yeah, and i'll need to find what page is that, because there isn't enough logging [15:25:31] I guess it must be in the first 100 rows after the start ID 7544396? [15:25:33] whatever table that refers to [15:25:38] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2310.codfw.wmnet with reason: host reimage [15:25:39] (I think I guessed it wrong once already, weeks ago ^^) [15:25:44] yes [15:25:58] if the limit is configurable I can try to narrow it down, maybe [15:26:05] but let’s backport poor _joe_’s change first ^^ [15:26:16] <_joe_> ahah [15:26:18] (see https://phabricator.wikimedia.org/T254522 and https://phabricator.wikimedia.org/T353874 for one specific case) [15:26:25] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and oblivian: Backport for [[gerrit:998318|WebVideoTranscodeJob: also add time limits (T356780)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:26:26] <_joe_> thanks <3 [15:26:27] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and oblivian: Continuing with sync [15:27:07] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/998415 (owner: 10Majavah) [15:27:27] (03PS6) 10Arturo Borrero Gonzalez: P:wmcs::cloudgw: do not NAT traffic to cloud-internal networks [puppet] - 10https://gerrit.wikimedia.org/r/998412 (https://phabricator.wikimedia.org/T356850) (owner: 10Majavah) [15:27:40] (03CR) 10Majavah: [C: 03+2] network: add cloud-codfw-bgp-private-vips [puppet] - 10https://gerrit.wikimedia.org/r/998415 (owner: 10Majavah) [15:28:02] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2301.codfw.wmnet with reason: host reimage [15:28:25] (SystemdUnitFailed) firing: (2) prometheus-phpfpm-statustext-textfile.service Failed on mwdebug1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:30:14] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2406.codfw.wmnet with reason: host reimage [15:30:26] (03PS7) 10Slyngshede: LDAPBackend: Implement limit checks for UID [software/bitu] - 10https://gerrit.wikimedia.org/r/998418 [15:30:37] (03CR) 10Btullis: [C: 03+2] Configure analytics.wikimedia.org to support large downloads [puppet] - 10https://gerrit.wikimedia.org/r/998345 (https://phabricator.wikimedia.org/T356792) (owner: 10Btullis) [15:31:07] !log STOP persistRevisionThreadItems on frwiki for T315510 – 100% CPU usage, 15G RAM and counting, no progress output: clearly stuck on something [15:31:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:13] T315510: Start maintenance script to backfill talk page comment database - https://phabricator.wikimedia.org/T315510 [15:31:23] (03CR) 10Btullis: [C: 03+2] Use the analytics-presto CNAME for workers and clients [puppet] - 10https://gerrit.wikimedia.org/r/998425 (https://phabricator.wikimedia.org/T336045) (owner: 10Btullis) [15:31:27] (03CR) 
10CI reject: [V: 04-1] LDAPBackend: Implement limit checks for UID [software/bitu] - 10https://gerrit.wikimedia.org/r/998418 (owner: 10Slyngshede) [15:32:24] <_joe_> Lucas_WMDE: it might make sense to launch the script with strace at this point [15:32:37] <_joe_> or if you launch it, I can attach with strace myself [15:32:40] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:998318|WebVideoTranscodeJob: also add time limits (T356780)]] (duration: 07m 48s) [15:32:44] T356780: Video transcoding fails when firejail is enabled - https://phabricator.wikimedia.org/T356780 [15:32:44] <_joe_> to try and see what's going on [15:33:02] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2378.codfw.wmnet with reason: host reimage [15:33:04] <_joe_> oh ok, let me see if I finally fixed something, or if I need to propose a rollback [15:33:25] (SystemdUnitFailed) firing: (32) prometheus-phpfpm-statustext-textfile.service Failed on mw1349:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:33:30] 👍 [15:33:39] * Lucas_WMDE hasn’t straced mwscript before [15:33:49] stracing the php process based on its PID might work better, yeah [15:34:09] but I have a meeting now, so I’ll leave it alone for a bit if that’s okay [15:34:24] !log backport+config window done [15:34:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:17] (03PS2) 10Clément Goubert: codfw lvs::balancer: Switch config_host to conf2006 [puppet] - 10https://gerrit.wikimedia.org/r/998431 (https://phabricator.wikimedia.org/T355870) [15:35:19] (03CR) 10Clément Goubert: [V: 03+1] "In preparation for B3 migration on 2024-02-28 where conf2004 will go offline for a brief period. 
I've presumed we don't want to use conf20" [puppet] - 10https://gerrit.wikimedia.org/r/998431 (https://phabricator.wikimedia.org/T355870) (owner: 10Clément Goubert) [15:35:21] (03CR) 10Volans: [C: 03+1] "LGTM, thx" [cookbooks] - 10https://gerrit.wikimedia.org/r/956082 (https://phabricator.wikimedia.org/T345778) (owner: 10Bking) [15:35:23] yeah, thanks. i will find the problem page [15:35:42] _joe_: it's 100% stuck in parsoid trying to parse some degenerate wikitext [15:36:00] <_joe_> I love "degenerate wikitext" [15:36:21] Isn't that just wikitext? [15:36:26] ayyyyyy [15:36:26] heh [15:36:35] i actually found the page, not sure if i should paste the link here [15:36:42] MatmaRex: is it okay if I already start ukwiki and/or viwiki? or should they wait until rowiki is done? (since they’re combined by && in your comment) [15:36:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P56447 and previous config saved to /var/cache/conftool/dbconfig/20240207-153656-marostegui.json [15:37:27] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:37:29] Lucas_WMDE: probably better to wait, amir didn't want me to run multiple of those scripts on the same db group [15:37:38] alright [15:37:51] (i think that's too safe, but better safe than sorry) [15:38:03] should I start enwiki then? [15:38:05] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A7 from asw-a7-codfw to lsw1-a7-codfw - https://phabricator.wikimedia.org/T355867 (10Andrew) There's no need to coordinate with us for cloudbackup2001, it might cause us to get a transient alert... 
[15:38:25] (SystemdUnitFailed) resolved: (38) prometheus-phpfpm-statustext-textfile.service Failed on mw1349:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:38:33] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:38:57] probably with the last --start from https://phabricator.wikimedia.org/T315510#9328399 [15:39:11] yeah, you can [15:39:34] and yeah, you're right, we can start from that point [15:39:49] ok [15:40:08] !log START lucaswerkmeister-wmde@mwmaint2002:~$ mwscript extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --wiki enwiki --current --all --start '["67578461"]' | tee ~/T315510-enwiki # in tmux [15:40:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:14] oh, i guess that also says the rowiki and ukwiki runs finished? [15:40:28] oh, hm [15:40:36] (enwiki is making progress btw and already updated 1) [15:40:38] (yay) [15:42:06] (03CR) 10Ssingh: [C: 03+1] "Thanks for the patch!" [puppet] - 10https://gerrit.wikimedia.org/r/998431 (https://phabricator.wikimedia.org/T355870) (owner: 10Clément Goubert) [15:42:15] MatmaRex: but doesn’t that comment only mean that rowiki finished, and ukwiki was still in progress? 
[15:42:22] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2377.codfw.wmnet with OS bullseye [15:42:40] Lucas_WMDE: oops, yes [15:42:58] i misread [15:43:17] ok, so I can kill rowiki and instead start ukwiki with the --start from there [15:43:39] !log import etherpad-lite 1.9.7-1 on apt1001 host - T316421 [15:44:02] yeah [15:44:59] PROBLEM - Check systemd state on kubemaster1002 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:00] !log STOP persistRevisionThreadItems on rowiki for T315510 – according to T315510#9328399, it should be done already (it was at --start '["2075226"]' and had processed 31000, updated 0) [15:45:19] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2310.codfw.wmnet with OS bullseye [15:45:22] Lucas_WMDE: are you rolling out changes related to jobqueue/jobrunners atm? [15:45:31] !log depool codfw dnsdisc T355861 [15:45:38] not as far as I’m aware [15:45:44] !log mvernon@cumin2002 conftool action : set/pooled=false; selector: dnsdisc=swift,name=codfw [15:45:47] I deployed backports, those are done [15:45:50] and am running some maintenance scripts [15:45:51] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on asw-a-codfw,cr[1-2]-codfw,lsw1-a2-codfw.mgmt with reason: prepping for server uplink migration codfw rack a2 [15:45:58] hnowlan: There was joe's change to TMH [15:46:00] hnowlan: is anything wrong?
[15:46:07] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on asw-a-codfw,cr[1-2]-codfw,lsw1-a2-codfw.mgmt with reason: prepping for server uplink migration codfw rack a2 [15:46:11] RECOVERY - Check systemd state on kubemaster1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:14] the last backport (joe’s change to TMH) was https://gerrit.wikimedia.org/r/c/mediawiki/extensions/TimedMediaHandler/+/998318 [15:46:15] !log depool thanos-fe2001 T355861 [15:46:20] <_joe_> hnowlan: talk to me [15:46:28] <_joe_> what is the problem you're seeing? [15:46:32] !log moving Netbox server uplinks from asw-a2-codfw to lsw1-a2-codfw to prep config for server moves T355861 [15:46:34] there's been a spike in errors for jobqueue since 14:55 https://logstash.wikimedia.org/goto/684a454f5135b7b7fdb695a19b0ec98d [15:46:56] <_joe_> so well before my change went out, which was changing webVideoTranscode on group0 [15:47:42] <_joe_> hnowlan: those are errors *enqueueing* jobs [15:47:50] <_joe_> so the problem seems to be eventgate-main maybe? 
[15:47:57] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2301.codfw.wmnet with OS bullseye [15:47:57] the backport of https://gerrit.wikimedia.org/r/c/mediawiki/core/+/998384 would be closer to that time, but I don’t see how it could be related [15:49:11] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2406.codfw.wmnet with OS bullseye [15:49:29] <_joe_> I can't make much of that logstash [15:49:52] (i filed https://phabricator.wikimedia.org/T356884 about the fr.wp page that i think is hanging my maintenance script) [15:50:00] <_joe_> hnowlan: also seems limited to k8s, wth [15:50:01] yeah not a lot of detail in the errors [15:50:28] _joe_: could be a side effect of the specific type of job [15:50:41] <_joe_> hnowlan: it's not just the jobrunners [15:50:57] <_joe_> but did you find it's a specific type of job? [15:51:01] no [15:51:01] <_joe_> did I miss something? [15:52:01] <_joe_> no it's all over the place [15:52:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T355609)', diff saved to https://phabricator.wikimedia.org/P56448 and previous config saved to /var/cache/conftool/dbconfig/20240207-155203-marostegui.json [15:52:05] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [15:52:10] <_joe_> hnowlan: I'd take a look at eventgate-main [15:52:17] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2378.codfw.wmnet with OS bullseye [15:52:19] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [15:52:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1170:3317 (T355609)', diff saved to https://phabricator.wikimedia.org/P56449 and previous config saved to /var/cache/conftool/dbconfig/20240207-155225-marostegui.json [15:54:02] Seeing some 
heap limit exceeded logs for eventgate-main but not much more [15:54:08] and it's like 2 errors [15:54:12] well warnings [15:59:12] I also just noticed stashbot is gone [15:59:18] so some SAL messages got lost already [15:59:24] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on 22 hosts with reason: Migrating servers in codfw rack A2 to lsw1-a2-codfw [15:59:45] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 22 hosts with reason: Migrating servers in codfw rack A2 to lsw1-a2-codfw [16:00:49] (ProbeDown) firing: (3) Service build2001:873 has failed probes (tcp_package_builder_rsync_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:00:58] (ProbeDown) firing: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:01:16] hello thanos my old friend [16:01:19] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1001.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:01:27] PROBLEM - SSH on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:01:27] PROBLEM - SSH on titan1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:01:44] herron: isn't this about when it happened yesterday?
[16:01:50] 16:00 UTC exactly 🤔 [16:02:03] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1001.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:02:14] cdanis: yeah sounds right [16:02:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T355609)', diff saved to https://phabricator.wikimedia.org/P56450 and previous config saved to /var/cache/conftool/dbconfig/20240207-160218-marostegui.json [16:02:28] !log Commencing server uplink moves from old switch to new in codfw rack A2 T355861 [16:03:34] !log STOP persistRevisionThreadItems on rowiki for T315510 – according to T315510#9328399, it should be done already (it was at --start '["2075226"]' and had processed 31000, updated 0) [relog from 15:45, stashbot was down] [16:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:44] that’s better [16:03:46] T315510: Start maintenance script to backfill talk page comment database - https://phabricator.wikimedia.org/T315510 [16:03:58] (lots of other log messages from the last 20 minutes are presumably also missing) [16:04:30] (ProbeDown) firing: (5) Service build2001:873 has failed probes (tcp_package_builder_rsync_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:04:33] (JobUnavailable) firing: (5) Reduced availability for job pint in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:04:51] !log Commencing server uplink moves from old switch to new in codfw rack A2 T355861 [16:04:54] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:01] !log import etherpad-lite 1.9.7-1 on apt1001 host - T316421 [16:05:01] T355861: Migrate servers in codfw rack A2 from asw-a2-codfw to lsw1-a2-codfw - https://phabricator.wikimedia.org/T355861 [16:05:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:14] T316421: Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421 [16:05:28] vgutierrez: thanks! I'd forget my own damn head you know :) [16:05:37] np :) [16:05:58] I owe you some brain cells for that /etc/network/interfaces thingie [16:07:14] !log btullis@cumin1002 START - Cookbook sre.presto.roll-restart-workers for Presto analytics cluster: Roll restart of all Presto's jvm daemons. [16:10:20] it looks like the “could not enqueue jobs” errors went away again? [16:10:33] !log hard reboot titan1002 [16:10:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:37] ah, at 16:00 UTC as cdanis wrote above [16:10:44] (I wasn’t sure if that had referred to the same thing or something else ^^) [16:11:04] Lucas_WMDE: for 16:00 UTC I was referring to the titan1* crashes [16:11:11] hm, ok [16:11:42] still, the last error in logstash was at 16:00:00.874… [16:11:54] that’s extremely close to the full hour [16:12:20] yeeeeah, two messages that are ms over the second but otherwise a dead stop [16:14:13] RECOVERY - SSH on titan1002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:15:19] !log START lucaswerkmeister-wmde@mwmaint2002:~$ mwscript extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --wiki ukwiki --current --all --touched-after=20230613000000 --start '["1685316"]' | tee ~/T315510-ukwiki # in tmux [16:15:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:26] MatmaRex: ^ fyi [16:15:42] okay, and it’s printing “processed” messages, so it’s not stuck 
it seems [16:15:48] (though it feels slower than the enwiki one?) [16:15:49] (ProbeDown) firing: (5) Service build2001:873 has failed probes (tcp_package_builder_rsync_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:15:50] thanks Lucas_WMDE [16:16:02] !log repool thanos-fe2001 T355861 [16:16:04] Lucas_WMDE: btw, i am finding out why it's broken: https://phabricator.wikimedia.org/T356884 [16:16:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:08] T355861: Migrate servers in codfw rack A2 from asw-a2-codfw to lsw1-a2-codfw - https://phabricator.wikimedia.org/T355861 [16:16:10] actually, scratch that, I think they’re about equally slow and I just forgot the speed [16:16:14] nice [16:16:19] !log mvernon@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=swift,name=codfw [16:16:26] <_joe_> MatmaRex: is that script enqueuing jobs, by any chance? [16:16:27] !log repool codfw dnsdisc T355861 [16:16:29] we seem to have an infinite loop in DiscussionTools actually, not the parser [16:16:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:38] _joe_: it shouldn't [16:16:43] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:16:49] <_joe_> MatmaRex: oh interesting :) [16:16:58] !log klausman@cumin2002 START - Cookbook sre.hosts.remove-downtime for ml-cache2001.codfw.wmnet [16:16:59] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ml-cache2001.codfw.wmnet [16:17:04] _joe_: the enwiki script has also been running in the background the whole time btw [16:17:09] but it parses pages, and who knows what that does [16:17:09] even after the logstash errors stopped [16:17:15] does the timing match the other issue? 
[16:17:16] <_joe_> yeah I'm trying to find causes [16:17:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P56451 and previous config saved to /var/cache/conftool/dbconfig/20240207-161725-marostegui.json [16:17:31] <_joe_> but I think it was just eventgate not being healthy [16:17:31] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:18:05] RECOVERY - SSH on titan1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:19:30] (ProbeDown) firing: (8) Service build2001:873 has failed probes (tcp_package_builder_rsync_ip6) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:19:33] (JobUnavailable) resolved: (5) Reduced availability for job pint in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:19:43] MatmaRex: the ukwiki run is still printing “updated 0” despite having made some updates at the end of the script run urbanec.m posted; is there something else that could have “processed” the pages(?) in the meantime, or is this unexpected? 
[16:19:58] Lucas_WMDE: yeah, if they were purged for any reason [16:20:08] hm, ok [16:20:15] or maybe it was “And started again” at the end of that comment… [16:20:34] although enwiki had stuff to do from the get go, so it doesn’t seem like that one had been started again [16:20:49] (ProbeDown) firing: (8) Service build2001:873 has failed probes (tcp_package_builder_rsync_ip6) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:20:57] (ProbeDown) resolved: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:24:40] !log sbailey@deploy2002 helmfile [staging] START helmfile.d/services/proton: apply [16:25:56] !log sbailey@deploy2002 helmfile [staging] DONE helmfile.d/services/proton: apply [16:27:56] I asked about it on the task now [16:32:01] !log sbailey@deploy2002 helmfile [eqiad] START helmfile.d/services/proton: apply [16:32:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P56452 and previous config saved to /var/cache/conftool/dbconfig/20240207-163231-marostegui.json [16:33:30] !log sbailey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/proton: apply [16:34:36] !log sbailey@deploy2002 helmfile [codfw] START helmfile.d/services/proton: apply [16:35:44] !log sbailey@deploy2002 helmfile [codfw] DONE helmfile.d/services/proton: apply [16:39:29] !log btullis@cumin1002 END (PASS) - Cookbook sre.presto.roll-restart-workers (exit_code=0) for Presto analytics cluster: Roll restart of all Presto's jvm daemons. 
[16:46:25] !log homer 'cr*codfw*' commit 'T354791' for 5 new k8s ex-appservers [16:46:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:29] T354791: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 [16:46:53] btullis: Bring two new stat servers into service (9596fbf8b5) ok to merge? [16:47:17] !log cmooney@cumin1002 START - Cookbook sre.hosts.remove-downtime for asw-a-codfw,cr[1-2]-codfw,lsw1-a2-codfw.mgmt [16:47:19] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for asw-a-codfw,cr[1-2]-codfw,lsw1-a2-codfw.mgmt [16:47:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T355609)', diff saved to https://phabricator.wikimedia.org/P56454 and previous config saved to /var/cache/conftool/dbconfig/20240207-164738-marostegui.json [16:47:41] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [16:47:42] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [16:47:54] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [16:52:12] !log hnowlan@cumin2002 conftool action : set/weight=10; selector: name=(mw2377.codfw.wmnet|mw2378.codfw.wmnet|mw2406.codfw.wmnet|mw2301.codfw.wmnet|mw2310.codfw.wmnet),cluster=kubernetes,service=kubesvc [16:52:21] !log hnowlan@cumin2002 conftool action : set/pooled=yes; selector: name=(mw2377.codfw.wmnet|mw2378.codfw.wmnet|mw2406.codfw.wmnet|mw2301.codfw.wmnet|mw2310.codfw.wmnet),cluster=kubernetes,service=kubesvc [16:54:34] !log sbailey@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [16:55:00] !log sbailey@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [16:55:25] PROBLEM - Check systemd state on grafana2001 is CRITICAL: CRITICAL - degraded: The following units failed: 
grafana-loki.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:56:43] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [16:56:57] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [16:57:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1174 (T355609)', diff saved to https://phabricator.wikimedia.org/P56455 and previous config saved to /var/cache/conftool/dbconfig/20240207-165703-marostegui.json [16:57:07] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [16:58:03] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.timer,httpbb_kubernetes_mw-api-ext_hourly.timer,httpbb_kubernetes_mw-api-int_hourly.timer,httpbb_kubernetes_mw-jobrunner_hourly.timer,httpbb_kubernetes_mw-web_hourly.timer,httpbb_kubernetes_mw-wikifunctions_hourly.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:58] FYI, I'm in the process of moving those httpbb timers from cumin1001 to cumin1002. They've now been absented on cumin1001 and are coming up on 1002 shortly. 
[17:02:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T355609)', diff saved to https://phabricator.wikimedia.org/P56456 and previous config saved to /var/cache/conftool/dbconfig/20240207-170225-marostegui.json [17:02:47] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [17:03:20] !log sbailey@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [17:03:56] !log sbailey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [17:03:58] (ProbeDown) firing: (2) Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:04:21] !log sbailey@deploy2002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [17:04:52] !log sbailey@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [17:05:34] PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [17:08:04] PROBLEM - PyBal IPVS diff check on lvs2013 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [17:10:54] sukhe: ^^ not sure if this is expected? [17:11:12] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: service=thumbor [17:12:07] topranks: wonder if that's related to my thumbor mishap [17:12:16] hnowlan: I was just wondering the same [17:12:23] it might recover in a few [17:12:24] yeah :) [17:12:39] ok cool, thanks! [17:12:52] checking still [17:13:08] RECOVERY - PyBal IPVS diff check on lvs2013 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [17:13:23] yay :) [17:13:24] ok :) [17:13:28] apologies [17:13:40] np! 
thanks for the ping topranks [17:13:53] hnowlan: Traffic is around if we can help [17:13:59] (ProbeDown) resolved: (2) Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:14:05] oops [17:14:30] (ProbeDown) firing: (3) Service build2001:873 has failed probes (tcp_package_builder_rsync_ip6) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:15:44] RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [17:17:32] PROBLEM - Check systemd state on stat1011 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:17:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P56457 and previous config saved to /var/cache/conftool/dbconfig/20240207-171732-marostegui.json [17:18:04] PROBLEM - Check systemd state on stat1010 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:25:23] !log jgiannelos@deploy2002 Started deploy [restbase/deploy@1007273]: Disabling storage for jawiki [17:26:07] !log btullis@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-jumbo-eqiad [17:32:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P56458 and previous config saved to /var/cache/conftool/dbconfig/20240207-173238-marostegui.json [17:32:43] !log jgiannelos@deploy2002 Finished deploy [restbase/deploy@1007273]: Disabling storage for jawiki (duration: 
07m 19s) [17:47:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T355609)', diff saved to https://phabricator.wikimedia.org/P56459 and previous config saved to /var/cache/conftool/dbconfig/20240207-174745-marostegui.json [17:47:47] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1191.eqiad.wmnet with reason: Maintenance [17:47:50] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [17:48:01] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1191.eqiad.wmnet with reason: Maintenance [17:48:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1191 (T355609)', diff saved to https://phabricator.wikimedia.org/P56460 and previous config saved to /var/cache/conftool/dbconfig/20240207-174807-marostegui.json [17:52:42] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in search_codfw [17:52:44] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in search_codfw [17:53:15] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:53:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T355609)', diff saved to https://phabricator.wikimedia.org/P56461 and previous config saved to /var/cache/conftool/dbconfig/20240207-175328-marostegui.json [17:53:32] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [17:56:59] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240207T1800) [18:08:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P56462 and previous config saved to /var/cache/conftool/dbconfig/20240207-180835-marostegui.json [18:13:53] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:15:11] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:17:42] (SystemdUnitFailed) firing: generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:18:34] (03PS1) 10Dzahn: wikistats: symlink deploy script into PATH [puppet] - 10https://gerrit.wikimedia.org/r/998491 [18:18:45] 10SRE, 10Infrastructure-Foundations, 10serviceops, 10Patch-For-Review: httpbb needs to be setup on cumin1002 and removed from cumin1001 - https://phabricator.wikimedia.org/T356054 (10Scott_French) 05Open→03Resolved Timers are up and happy on cumin1002 and no longer running on cumin1001. [18:18:49] 10SRE, 10Infrastructure-Foundations: Setup cumin1002 and eventually decom cumin1001 - https://phabricator.wikimedia.org/T353419 (10Scott_French) [18:19:53] (03CR) 10BryanDavis: [C: 03+1] Provide context for account creation. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/997811 (https://phabricator.wikimedia.org/T353584) (owner: 10Slyngshede) [18:20:05] (03CR) 10Dzahn: "deployed config change with deploy-wikistats to change user agent for https://phabricator.wikimedia.org/T354101" [puppet] - 10https://gerrit.wikimedia.org/r/998491 (owner: 10Dzahn) [18:20:27] (03CR) 10CI reject: [V: 04-1] Provide context for account creation. [software/bitu] - 10https://gerrit.wikimedia.org/r/997811 (https://phabricator.wikimedia.org/T353584) (owner: 10Slyngshede) [18:20:44] (03CR) 10Dzahn: [C: 03+2] wikistats: symlink deploy script into PATH [puppet] - 10https://gerrit.wikimedia.org/r/998491 (owner: 10Dzahn) [18:23:22] (03PS2) 10Majavah: Add a python-bookworm image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/997537 [18:23:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P56463 and previous config saved to /var/cache/conftool/dbconfig/20240207-182342-marostegui.json [18:24:23] (03PS1) 10Ahmon Dancy: Update buildkitd image references [puppet] - 10https://gerrit.wikimedia.org/r/998493 (https://phabricator.wikimedia.org/T356418) [18:25:29] !log btullis@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-jumbo-eqiad [18:27:41] (CirrusSearchNodeIndexingNotIncreasing) firing: (5) Elasticsearch instance elastic2037-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [18:28:26] (03PS1) 10Bking: cloudelastic: Begin private IP migration for cloudelastic1008 [puppet] - 10https://gerrit.wikimedia.org/r/998494 (https://phabricator.wikimedia.org/T355617) [18:30:36] !log 
btullis@cumin1002 START - Cookbook sre.kafka.roll-restart-mirror-maker restart MirrorMaker for Kafka A:kafka-mirror-maker-jumbo-eqiad cluster: Roll restart of jvm daemons. [18:31:54] (03PS1) 10Ahmon Dancy: Revert "Temporarily enable Dockerfile frontend on trusted runners" [puppet] - 10https://gerrit.wikimedia.org/r/998495 (https://phabricator.wikimedia.org/T356418) [18:32:41] (CirrusSearchNodeIndexingNotIncreasing) firing: (4) Elasticsearch instance elastic2037-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [18:33:01] 10SRE, 10SRE-Access-Requests: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (10Dzahn) 05Open→03In progress [18:33:03] :eyes on that Elastic alert [18:33:11] 10SRE, 10SRE-Access-Requests: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (10Dzahn) a:03cchen [18:33:44] 10SRE, 10SRE-Access-Requests: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (10Dzahn) p:05Triage→03High [18:35:04] 10SRE-Access-Requests, 10Data-Platform-SRE: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (10Dzahn) Should we close this ticket as "invalid"? It seems the best course of action might be a new ticket like "migrate all WMDE pipelines to airflow" an...
[18:35:47] 10SRE-Access-Requests, 10Data-Platform-SRE: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (10Dzahn) 05Open→03In progress [18:37:41] (CirrusSearchNodeIndexingNotIncreasing) resolved: (3) Elasticsearch instance elastic2055-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [18:38:12] ah, these are leftovers from the switch maintenance, I guess the suppression just expired [18:38:17] anyway, we're all good [18:38:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T355609)', diff saved to https://phabricator.wikimedia.org/P56464 and previous config saved to /var/cache/conftool/dbconfig/20240207-183849-marostegui.json [18:38:52] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1194.eqiad.wmnet with reason: Maintenance [18:38:56] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [18:39:06] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1194.eqiad.wmnet with reason: Maintenance [18:39:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1194 (T355609)', diff saved to https://phabricator.wikimedia.org/P56465 and previous config saved to /var/cache/conftool/dbconfig/20240207-183912-marostegui.json [18:40:24] (03CR) 10Cathal Mooney: "LGTM overall, splitting the dmz_cidr is a good idea. 
I think for the purpose of the "no nat" rule it might be easier to just use the clou" [puppet] - 10https://gerrit.wikimedia.org/r/998412 (https://phabricator.wikimedia.org/T356850) (owner: 10Majavah) [18:43:41] (03PS1) 10Bking: cloudelastic: Complete cloudelastic1008's migration [puppet] - 10https://gerrit.wikimedia.org/r/998498 (https://phabricator.wikimedia.org/T355617) [18:44:04] (03CR) 10CI reject: [V: 04-1] cloudelastic: Complete cloudelastic1008's migration [puppet] - 10https://gerrit.wikimedia.org/r/998498 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [18:44:19] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/998494 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [18:44:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T355609)', diff saved to https://phabricator.wikimedia.org/P56466 and previous config saved to /var/cache/conftool/dbconfig/20240207-184433-marostegui.json [18:44:38] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [18:45:28] (03CR) 10Eevans: [C: 03+2] sessionstore: provision sessionstore2004 (new) [puppet] - 10https://gerrit.wikimedia.org/r/997991 (https://phabricator.wikimedia.org/T356829) (owner: 10Eevans) [18:49:05] !log btullis@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-mirror-maker (exit_code=0) restart MirrorMaker for Kafka A:kafka-mirror-maker-jumbo-eqiad cluster: Roll restart of jvm daemons. [18:49:07] 10SRE-Access-Requests, 10Data-Platform-SRE: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (10mpopov) Not yet. I believe @AndrewTavis_WMDE will be sharing some findings from WMDE side soon. [18:50:39] (03CR) 10Btullis: [C: 03+1] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/998494 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [18:53:30] (03CR) 10Btullis: "The commit message is a bit confusing. 
Are we at risk of removing them before they are defunct?" [puppet] - 10https://gerrit.wikimedia.org/r/995107 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [18:59:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P56467 and previous config saved to /var/cache/conftool/dbconfig/20240207-185940-marostegui.json [19:00:05] brennen and dancy: That opportune time for a Train log triage with CPT deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240207T1900). [19:00:05] brennen and dancy: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240207T1900). [19:00:54] o/ [19:01:05] o/ [19:01:35] (03PS1) 10Eevans: (faux) keys & certs for sessionstore200[4-6] [labs/private] - 10https://gerrit.wikimedia.org/r/998504 (https://phabricator.wikimedia.org/T356829) [19:01:36] !log train 1.42.0-wmf.17 (T354435): a couple of blockers currently, waiting on resolution before rolling [19:01:37] (03PS1) 10Eevans: cleanup obsolete keys & certs (hosts decommissioned) [labs/private] - 10https://gerrit.wikimedia.org/r/998505 [19:01:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:54] T354435: 1.42.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T354435 [19:02:17] (03PS3) 10BCornwall: slo_definitions: Use trafficserver_backend_sli_bad [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/973872 (https://phabricator.wikimedia.org/T341606) [19:07:07] (03CR) 10BCornwall: slo_definitions: Use trafficserver_backend_sli_bad (031 comment) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/973872 (https://phabricator.wikimedia.org/T341606) (owner: 10BCornwall) [19:07:17] (03CR) 10Eevans: [V: 03+2 C: 03+2] (faux) keys & certs for sessionstore200[4-6] [labs/private] - 
10https://gerrit.wikimedia.org/r/998504 (https://phabricator.wikimedia.org/T356829) (owner: 10Eevans) [19:08:11] (03CR) 10Eevans: [V: 03+2 C: 03+2] cleanup obsolete keys & certs (hosts decommissioned) [labs/private] - 10https://gerrit.wikimedia.org/r/998505 (owner: 10Eevans) [19:09:08] 10SRE, 10SRE-Access-Requests: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (10cchen) @MoritzMuehlenhoff thank you for restoring my access! I am trying to log into Superset and Hue, but I cannot access them. I also reset the developer account's passwo... [19:09:25] (03CR) 10BCornwall: slo_definitions: Use trafficserver_backend_sli_bad (031 comment) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/973872 (https://phabricator.wikimedia.org/T341606) (owner: 10BCornwall) [19:12:10] (03PS1) 10Dzahn: aptrepo: allow for gitlab versions between 16.5.x and 16.7.x [puppet] - 10https://gerrit.wikimedia.org/r/998510 (https://phabricator.wikimedia.org/T356906) [19:14:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P56468 and previous config saved to /var/cache/conftool/dbconfig/20240207-191446-marostegui.json [19:16:48] (03PS2) 10Dzahn: aptrepo: allow for gitlab versions from 16.5 to 16.6.x [puppet] - 10https://gerrit.wikimedia.org/r/998510 (https://phabricator.wikimedia.org/T356906) [19:16:51] 10SRE, 10SRE-Access-Requests, 10Data-Platform-SRE: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (10mpopov) Tagging DPE SRE in case this is specific to those tools. @cchen: Can you please verify if you can `ssh` to the stat hosts and also use Jupyte...
[19:19:52] !log people1004 systemctl stop confd; running puppet; checking to remove confd remnants from people* hosts - T356296 [19:19:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:56] T356296: confd setup left without configuration doesn't stop confd - https://phabricator.wikimedia.org/T356296 [19:22:14] 10SRE, 10SRE-Access-Requests, 10Data-Platform-SRE: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (10cchen) I `ssh` the stats machine and `kinit`, and got `Password incorrect while getting initial credentials. and I also tried JupyterHub, and it also... [19:25:51] (03CR) 10Eevans: [C: 03+2] sessionstore: provision sessionstore2005 (new) [puppet] - 10https://gerrit.wikimedia.org/r/997992 (https://phabricator.wikimedia.org/T356829) (owner: 10Eevans) [19:26:19] (03PS2) 10Bking: cloudelastic: remove unnecessary hostnames [puppet] - 10https://gerrit.wikimedia.org/r/995107 (https://phabricator.wikimedia.org/T355617) [19:27:04] (03PS2) 10Eevans: sessionstore: provision sessionstore2005 (new) [puppet] - 10https://gerrit.wikimedia.org/r/997992 (https://phabricator.wikimedia.org/T356829) [19:27:06] (03PS2) 10Eevans: sessionstore: provision sessionstore2006 (new) [puppet] - 10https://gerrit.wikimedia.org/r/997993 (https://phabricator.wikimedia.org/T356829) [19:28:03] (03PS3) 10Bking: cloudelastic: remove unnecessary hostnames [puppet] - 10https://gerrit.wikimedia.org/r/995107 (https://phabricator.wikimedia.org/T355617) [19:28:53] 10SRE, 10SRE-Access-Requests, 10Data-Platform-SRE: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (10Dzahn) @cchen When you ran kinit the first time after you logged in, did it ask you to change the password? Did you get a new temporary one by mail?... [19:28:56] (03CR) 10Bking: "Apologies for the bad commit message. I've updated it to (hopefully) be less confusing. 
The TLDR is that we never needed those alt names, " [puppet] - 10https://gerrit.wikimedia.org/r/995107 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [19:29:24] (03CR) 10CI reject: [V: 04-1] cloudelastic: remove unnecessary hostnames [puppet] - 10https://gerrit.wikimedia.org/r/995107 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [19:29:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T355609)', diff saved to https://phabricator.wikimedia.org/P56469 and previous config saved to /var/cache/conftool/dbconfig/20240207-192953-marostegui.json [19:29:56] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1202.eqiad.wmnet with reason: Maintenance [19:29:59] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [19:30:09] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1202.eqiad.wmnet with reason: Maintenance [19:30:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1202 (T355609)', diff saved to https://phabricator.wikimedia.org/P56470 and previous config saved to /var/cache/conftool/dbconfig/20240207-193016-marostegui.json [19:30:20] (03PS4) 10Bking: cloudelastic: remove unneeded hostnames from cert alt names [puppet] - 10https://gerrit.wikimedia.org/r/995107 (https://phabricator.wikimedia.org/T355617) [19:32:03] !log joal@deploy2002 Started deploy [analytics/refinery@80b329b]: Analytics Hotfix [analytics/refinery@80b329b5] [19:32:59] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/995107 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [19:33:38] (03CR) 10Bking: [C: 03+2] cloudelastic: Begin private IP migration for cloudelastic1008 [puppet] - 10https://gerrit.wikimedia.org/r/998494 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [19:34:09] (03PS1) 10Dzahn: peopleweb: test edit, comment out idp [puppet] - 
10https://gerrit.wikimedia.org/r/998532 [19:35:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T355609)', diff saved to https://phabricator.wikimedia.org/P56471 and previous config saved to /var/cache/conftool/dbconfig/20240207-193540-marostegui.json [19:35:45] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [19:36:02] 10SRE, 10SRE-Access-Requests, 10Data-Platform-SRE: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (10cchen) @Dzahn Oh, I see. I found the email and reran the kinit with the temporary password, it works now. [19:40:05] (03CR) 10Ssingh: [C: 03+1] slo_definitions: Use trafficserver_backend_sli_bad [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/973872 (https://phabricator.wikimedia.org/T341606) (owner: 10BCornwall) [19:41:53] (03PS3) 10Eevans: sessionstore: provision sessionstore2006 (new) [puppet] - 10https://gerrit.wikimedia.org/r/997993 (https://phabricator.wikimedia.org/T356829) [19:42:31] !log joal@deploy2002 Finished deploy [analytics/refinery@80b329b]: Analytics Hotfix [analytics/refinery@80b329b5] (duration: 10m 28s) [19:42:49] !log joal@deploy2002 Started deploy [analytics/refinery@80b329b] (thin): Analytics Hotfix -THIN [analytics/refinery@80b329b5] [19:42:55] !log joal@deploy2002 Finished deploy [analytics/refinery@80b329b] (thin): Analytics Hotfix -THIN [analytics/refinery@80b329b5] (duration: 00m 05s) [19:43:48] !log joal@deploy2002 Started deploy [analytics/refinery@80b329b] (hadoop-test): Analytics Hotfix - TEST [analytics/refinery@80b329b5] [19:43:49] (03PS2) 10Dzahn: peopleweb: test edit [puppet] - 10https://gerrit.wikimedia.org/r/998532 [19:44:59] (03CR) 10CI reject: [V: 04-1] peopleweb: test edit [puppet] - 10https://gerrit.wikimedia.org/r/998532 (owner: 10Dzahn) [19:45:40] !log bking@cumin2002 START - Cookbook sre.hosts.decommission for hosts cloudelastic1008.wikimedia.org [19:45:47] (03PS3) 10Dzahn: 
peopleweb: test edit [puppet] - 10https://gerrit.wikimedia.org/r/998532 [19:47:28] !log joal@deploy2002 Finished deploy [analytics/refinery@80b329b] (hadoop-test): Analytics Hotfix - TEST [analytics/refinery@80b329b5] (duration: 03m 40s) [19:50:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P56472 and previous config saved to /var/cache/conftool/dbconfig/20240207-195047-marostegui.json [19:51:58] (03CR) 10Eevans: [C: 03+2] sessionstore: provision sessionstore2006 (new) [puppet] - 10https://gerrit.wikimedia.org/r/997993 (https://phabricator.wikimedia.org/T356829) (owner: 10Eevans) [19:52:52] !log bking@cumin2002 START - Cookbook sre.dns.netbox [19:55:08] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudelastic1008.wikimedia.org decommissioned, removing all IPs except the asset tag one - bking@cumin2002" [19:56:17] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudelastic1008.wikimedia.org decommissioned, removing all IPs except the asset tag one - bking@cumin2002" [19:56:17] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:56:18] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudelastic1008.wikimedia.org [20:00:17] !log bking@cumin2002 START - Cookbook sre.dns.netbox [20:00:46] (03PS4) 10Dzahn: peopleweb: test edit [puppet] - 10https://gerrit.wikimedia.org/r/998532 [20:02:02] (03PS1) 10Brennen Bearnes: Fix regression in HLS track content type [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/998452 (https://phabricator.wikimedia.org/T356780) [20:02:45] (03CR) 10EoghanGaffney: [C: 03+1] aptrepo: allow for gitlab versions from 16.5 to 16.6.x [puppet] - 
10https://gerrit.wikimedia.org/r/998510 (https://phabricator.wikimedia.org/T356906) (owner: 10Dzahn) [20:03:58] !log joal@deploy2002 Started deploy [airflow-dags/analytics@ea0a3db]: Analytics Hotfix [airflow-dags/analytics@ea0a3db2] [20:04:27] 10SRE, 10serviceops: confd setup left without configuration doesn't stop confd - https://phabricator.wikimedia.org/T356296 (10Dzahn) Seems to me this has to do with the `profile::firewall` migration from iptables to nftables. What these hosts have in common is `profile::firewall::provider: nftables` in hierad... [20:04:39] !log joal@deploy2002 Finished deploy [airflow-dags/analytics@ea0a3db]: Analytics Hotfix [airflow-dags/analytics@ea0a3db2] (duration: 00m 40s) [20:04:45] 10SRE, 10serviceops: confd setup left without configuration doesn't stop confd - https://phabricator.wikimedia.org/T356296 (10Dzahn) cc: @Muehlenhoff [20:05:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P56473 and previous config saved to /var/cache/conftool/dbconfig/20240207-200555-marostegui.json [20:06:16] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: migrate cloudelastic1008 to private IPs - bking@cumin2002" [20:07:08] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: migrate cloudelastic1008 to private IPs - bking@cumin2002" [20:07:08] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:07:53] bvibber, James_F: i'll go ahead and deploy that backport here momentarily. [20:08:02] Awesome. 
[20:08:09] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudelastic1008 [20:08:11] Woot [20:09:07] (03CR) 10Dzahn: [C: 03+2] aptrepo: allow for gitlab versions from 16.5 to 16.6.x [puppet] - 10https://gerrit.wikimedia.org/r/998510 (https://phabricator.wikimedia.org/T356906) (owner: 10Dzahn) [20:09:21] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudelastic1008 [20:10:19] 10SRE, 10SRE-Access-Requests, 10Data-Platform-SRE: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (10Dzahn) 05In progress→03Resolved Great! Feel free to reopen the ticket if there is anything else missing. [20:11:08] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy2002 using scap backport" [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/998452 (https://phabricator.wikimedia.org/T356780) (owner: 10Brennen Bearnes) [20:14:14] 10SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T355170 (10Dzahn) 05In progress→03Stalled [20:15:04] RECOVERY - Check systemd state on stat1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:15:45] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to for - https://phabricator.wikimedia.org/T355937 (10Dzahn) 05In progress→03Resolved a:03Dzahn @WMDECyn You have been added to the groups 'nda' and 'wmde' just like other WMDE employees. Things should work as expec... [20:15:45] bvibber: this a "go ahead past test servers, confirm in group0" sort of situation, yeah? 
[20:18:18] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1008.eqiad.wmnet with OS bullseye [20:18:26] PROBLEM - Check systemd state on stat1010 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:21:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T355609)', diff saved to https://phabricator.wikimedia.org/P56474 and previous config saved to /var/cache/conftool/dbconfig/20240207-202101-marostegui.json [20:21:04] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1227.eqiad.wmnet with reason: Maintenance [20:21:07] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [20:21:17] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1227.eqiad.wmnet with reason: Maintenance [20:21:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1227 (T355609)', diff saved to https://phabricator.wikimedia.org/P56475 and previous config saved to /var/cache/conftool/dbconfig/20240207-202123-marostegui.json [20:22:23] moment [20:24:45] *waits for the files to churn in job queue* [20:26:46] bvibber: still waiting on CI here, so not yet deployed [20:27:05] (sorry, could have been clearer about that) [20:27:05] no rush then :D [20:27:37] will ping when backport's done. :) [20:28:40] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:29:40] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:31:28] (03PS1) 10Eevans: sessionstore: setup sessionstore200[4-6] (new) [deployment-charts] - 10https://gerrit.wikimedia.org/r/998538 (https://phabricator.wikimedia.org/T356829) [20:31:30] (03PS1) 10Eevans: sessionstore: remove decommissioned hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/998539 (https://phabricator.wikimedia.org/T356828) [20:32:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T355609)', diff saved to https://phabricator.wikimedia.org/P56477 and previous config saved to /var/cache/conftool/dbconfig/20240207-203222-marostegui.json [20:32:27] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [20:32:52] (03Merged) 10jenkins-bot: Fix regression in HLS track content type [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/998452 (https://phabricator.wikimedia.org/T356780) (owner: 10Brennen Bearnes) [20:33:18] !log brennen@deploy2002 Started scap: Backport for [[gerrit:998452|Fix regression in HLS track content type (T356780)]] [20:33:22] T356780: Video transcoding fails when firejail is enabled - https://phabricator.wikimedia.org/T356780 [20:36:06] (03CR) 10Eevans: "This changeset is ready to go as-is, but I'm marking it -1 to signal it isn't yet ready to be merged. 
We need to first merge r998538, dep" [deployment-charts] - 10https://gerrit.wikimedia.org/r/998539 (https://phabricator.wikimedia.org/T356828) (owner: 10Eevans) [20:37:02] !log brennen@deploy2002 brennen: Backport for [[gerrit:998452|Fix regression in HLS track content type (T356780)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:37:09] going ahead with sync [20:37:17] !log brennen@deploy2002 brennen: Continuing with sync [20:38:25] (SystemdUnitFailed) firing: prometheus-phpfpm-statustext-textfile.service Failed on mwdebug1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:40:24] (03CR) 10Krinkle: Configure parser cache filters for parsoid-pcache (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994212 (https://phabricator.wikimedia.org/T346765) (owner: 10Daniel Kinzler) [20:42:43] (03CR) 10Krinkle: Configure parser cache filters for parsoid-pcache (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994212 (https://phabricator.wikimedia.org/T346765) (owner: 10Daniel Kinzler) [20:43:25] (SystemdUnitFailed) firing: (14) prometheus-phpfpm-statustext-textfile.service Failed on mw1364:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:43:39] !log brennen@deploy2002 Finished scap: Backport for [[gerrit:998452|Fix regression in HLS track content type (T356780)]] (duration: 10m 20s) [20:43:43] T356780: Video transcoding fails when firejail is enabled - https://phabricator.wikimedia.org/T356780 [20:43:51] bvibber: that's out [20:46:40] testing... 
[20:47:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P56478 and previous config saved to /var/cache/conftool/dbconfig/20240207-204728-marostegui.json [20:48:25] (SystemdUnitFailed) resolved: (37) prometheus-phpfpm-statustext-textfile.service Failed on mw1350:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:50:14] brennen: confirmed fixed on test :D [20:52:04] bvibber: thx! [21:00:04] RECOVERY - Check systemd state on stat1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: gettimeofday() says it's time for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240207T2100) [21:00:05] No Gerrit patches in the queue for this window AFAICS. 
[21:02:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P56479 and previous config saved to /var/cache/conftool/dbconfig/20240207-210235-marostegui.json [21:03:58] PROBLEM - Check systemd state on stat1011 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:08:41] (03PS1) 10Herron: SystemdUnitFailed: remove 'Failed' from alert text [alerts] - 10https://gerrit.wikimedia.org/r/998545 [21:09:15] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1008.eqiad.wmnet with OS bullseye [21:09:41] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for JTanner - https://phabricator.wikimedia.org/T356917 (10JTannerWMF) [21:14:30] (ProbeDown) firing: Service build2001:873 has failed probes (tcp_package_builder_rsync_ip6) - https://wikitech.wikimedia.org/wiki/Debian_Packaging#Upload_to_Wikimedia_Repo - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:16:40] (03CR) 10Herron: "Proposing this since I've misread the failed-resolved-failed text pattern a few times at a quick glance" [alerts] - 10https://gerrit.wikimedia.org/r/998545 (owner: 10Herron) [21:17:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T355609)', diff saved to https://phabricator.wikimedia.org/P56480 and previous config saved to /var/cache/conftool/dbconfig/20240207-211741-marostegui.json [21:17:43] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1236.eqiad.wmnet with reason: Maintenance [21:17:48] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [21:17:57] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
6:00:00 on db1236.eqiad.wmnet with reason: Maintenance [21:18:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1236 (T355609)', diff saved to https://phabricator.wikimedia.org/P56481 and previous config saved to /var/cache/conftool/dbconfig/20240207-211803-marostegui.json [21:23:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236 (T355609)', diff saved to https://phabricator.wikimedia.org/P56482 and previous config saved to /var/cache/conftool/dbconfig/20240207-212304-marostegui.json [21:23:08] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [21:28:14] 10SRE, 10SRE-Access-Requests, 10Data-Platform-SRE: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (10mpopov) @cchen: How about Superset & Hue? [21:31:10] 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Comm Error: Backplane 0 on cloudelastic1008 - https://phabricator.wikimedia.org/T356919 (10bking) [21:38:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236', diff saved to https://phabricator.wikimedia.org/P56483 and previous config saved to /var/cache/conftool/dbconfig/20240207-213810-marostegui.json [21:42:39] 10SRE, 10SRE-Access-Requests, 10Data-Platform-SRE: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (10cchen) I still not able to access Superset & Hue, and i tried to reset my password again, still not working. 
[21:42:57] 10SRE: Access to DMARCIAN - https://phabricator.wikimedia.org/T356920 (10DBu-WMF) [21:46:30] 10SRE: Access to DMARCIAN - https://phabricator.wikimedia.org/T356920 (10Dzahn) [21:47:06] 10SRE, 10Infrastructure-Foundations, 10Mail: Access to DMARCIAN - https://phabricator.wikimedia.org/T356920 (10Dzahn) [21:48:40] 10SRE, 10Infrastructure-Foundations, 10Mail: Access to DMARCIAN - https://phabricator.wikimedia.org/T356920 (10Dzahn) Looks like dmarcian has been replaced by dmarcdigests.com. (details in T330944). Adding some tags for visibility. [21:49:09] 10SRE, 10SRE-Access-Requests, 10Data-Platform-SRE: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (10RhinosF1) You look still to be blocked on wikitech https://wikitech.wikimedia.org/wiki/Special:Contributions/Conniecc1 - not sure if that's related bu... [21:49:12] 10SRE, 10SRE-Access-Requests, 10Data-Platform-SRE: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (10Dzahn) 05Resolved→03Open [21:49:22] 10SRE, 10SRE-Access-Requests, 10Data-Platform-SRE: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (10Dzahn) a:05cchen→03None [21:50:53] 10SRE, 10SRE-Access-Requests, 10Data-Platform-SRE: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (10RhinosF1) [21:52:12] 10SRE, 10SRE-Access-Requests, 10Data-Platform-SRE: Production data & systems access restoration for Connie Chen - https://phabricator.wikimedia.org/T356645 (10RhinosF1) I've added a checklist based on the private task. @MoritzMuehlenhoff (or another SRE): please update based on what already works @cchen: i... 
[21:53:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236', diff saved to https://phabricator.wikimedia.org/P56484 and previous config saved to /var/cache/conftool/dbconfig/20240207-215317-marostegui.json [22:00:05] Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240207T2200) [22:00:06] RECOVERY - Check systemd state on stat1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:00:51] (03PS1) 10Ebernhardson: cirrus: Re-enable writes to wikidata on cloudelastic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998559 (https://phabricator.wikimedia.org/T352335) [22:04:00] PROBLEM - Check systemd state on stat1011 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:05:07] It's a couple minutes late for backport window, but i'm going to deploy the write re-enable above [22:05:16] unless there are concerns [22:06:19] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ebernhardson@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998559 (https://phabricator.wikimedia.org/T352335) (owner: 10Ebernhardson) [22:06:19] ebernhardson: you've got my blessing as releng / train deployer. 
[22:06:45] brennen: awesome, thanks [22:07:00] !log dzahn@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: security release [22:07:07] (03Merged) 10jenkins-bot: cirrus: Re-enable writes to wikidata on cloudelastic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998559 (https://phabricator.wikimedia.org/T352335) (owner: 10Ebernhardson) [22:07:32] !log ebernhardson@deploy2002 Started scap: Backport for [[gerrit:998559|cirrus: Re-enable writes to wikidata on cloudelastic (T352335)]] [22:07:45] T352335: Deploy the new Cirrus Updater to update select wikis in cloudelastic - https://phabricator.wikimedia.org/T352335 [22:08:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236 (T355609)', diff saved to https://phabricator.wikimedia.org/P56485 and previous config saved to /var/cache/conftool/dbconfig/20240207-220824-marostegui.json [22:08:26] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [22:08:39] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [22:08:40] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [22:08:59] !log ebernhardson@deploy2002 ebernhardson: Backport for [[gerrit:998559|cirrus: Re-enable writes to wikidata on cloudelastic (T352335)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:10:24] !log ebernhardson@deploy2002 ebernhardson: Continuing with sync [22:11:25] (SystemdUnitFailed) firing: prometheus-phpfpm-statustext-textfile.service Failed on mwdebug1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:13:35] (03PS1) 10Bartosz Dziewoński: Parser: Fix the main loop getting stuck on some 
signatures [extensions/DiscussionTools] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/998453 (https://phabricator.wikimedia.org/T356884) [22:13:44] (03PS1) 10Bartosz Dziewoński: Parser: Fix the main loop getting stuck on some signatures [extensions/DiscussionTools] (wmf/1.42.0-wmf.17) - 10https://gerrit.wikimedia.org/r/998454 (https://phabricator.wikimedia.org/T356884) [22:16:25] (SystemdUnitFailed) firing: (13) prometheus-phpfpm-statustext-textfile.service Failed on mw1418:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:16:42] !log ebernhardson@deploy2002 Finished scap: Backport for [[gerrit:998559|cirrus: Re-enable writes to wikidata on cloudelastic (T352335)]] (duration: 09m 10s) [22:16:46] T352335: Deploy the new Cirrus Updater to update select wikis in cloudelastic - https://phabricator.wikimedia.org/T352335 [22:17:43] (SystemdUnitFailed) firing: generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:21:25] (SystemdUnitFailed) resolved: (29) prometheus-phpfpm-statustext-textfile.service Failed on mw1409:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:34:27] (03PS21) 10Ryan Kemper: wdqs.data_transfer: refactor spicerack class api [cookbooks] - 10https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) [22:34:38] (03CR) 10Ryan Kemper: wdqs.data_transfer: refactor spicerack class api (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) (owner: 10Ryan Kemper) [22:39:06] (03CR) 
10CI reject: [V: 04-1] wdqs.data_transfer: refactor spicerack class api [cookbooks] - 10https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) (owner: 10Ryan Kemper) [22:46:12] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T347624, testing 961878 patch) xfer categories from wdqs2024.codfw.wmnet -> wdqs2025.codfw.wmnet w/ force delete existing files, repooling source-only afterwards [22:46:17] T347624: Refactor sre.wdqs.data-transfer to use new spicerack class api - https://phabricator.wikimedia.org/T347624 [22:46:32] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T347624, testing 961878 patch) xfer categories from wdqs2024.codfw.wmnet -> wdqs2025.codfw.wmnet w/ force delete existing files, repooling source-only afterwards [22:46:55] (03CR) 10Ryan Kemper: wdqs.data_transfer: refactor spicerack class api (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) (owner: 10Ryan Kemper) [22:47:58] (03PS22) 10Ryan Kemper: wdqs.data_transfer: refactor spicerack class api [cookbooks] - 10https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) [22:49:34] !log Uploaded ncmonitor 0.0.2 to bookworm-wikimedia archive [22:49:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:26] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:10:42] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:11:20] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:11:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals 
uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [23:21:50] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:22:14] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:22:26] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:41:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [23:53:14] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:54:04] !log dzahn@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: security release [23:54:34] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring