[00:01:06] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:08:52] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:15:08] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:19:52] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:22:58] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:30:50] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:38:42] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:39:35] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/921126
[00:39:37] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/921126 (owner: TrainBranchBot)
[00:46:32] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:52:48] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:55:43] <wikibugs>	 (CR) CI reject: [V: -1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/921126 (owner: TrainBranchBot)
[01:00:34] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:05:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:08:22] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:10:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:16:12] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:24:02] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:30:20] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:38:12] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:46:02] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:53:54] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:00:02] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:00:50] <icinga-wm>	 RECOVERY - Check systemd state on mwlog1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:06:32] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:07:40] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:32] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:15:26] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:23:16] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:26:32] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:31:00] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:38:44] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:46:30] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:48:55] <wikibugs>	 (PS1) Andrew Bogott: mwopenstackclients: use novaobserver creds to determine domain of a project [puppet] - https://gerrit.wikimedia.org/r/921112
[02:49:28] <wikibugs>	 (CR) Andrew Bogott: [C: +2] mwopenstackclients: use novaobserver creds to determine domain of a project [puppet] - https://gerrit.wikimedia.org/r/921112 (owner: Andrew Bogott)
[02:50:01] <jinxer-wm>	 (NodeTextfileStale) firing: (2) Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[02:52:42] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:00:30] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:08:20] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:16:12] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:24:02] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:30:20] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:38:06] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:45:52] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:46:14] <jinxer-wm>	 (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs2022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:53:44] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:01:34] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:07:38] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:15:26] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:23:16] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:31:06] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:38:52] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:42:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:46:32] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:47:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:52:42] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:59:14] <wikibugs>	 SRE, Wikidata, wdwb-tech, Shape Expressions (M2: Linking to EntitySchemas in statements), and 3 others: [ES-M2]: Define canonical URI for EntitySchemas - https://phabricator.wikimedia.org/T225778 (Arian_Bozorg) Open→Resolved Looks good to me! Thanks so much :)
[05:00:32] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:07:34] <icinga-wm>	 RECOVERY - Check systemd state on wdqs2022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:08:22] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:11:14] <jinxer-wm>	 (SystemdUnitFailed) resolved: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs2022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:16:14] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:24:06] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:25:22] <wikibugs>	 (CR) Jdlrobson: Enable zebra ab test in hewiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/920386 (https://phabricator.wikimedia.org/T335972) (owner: Kimberly Sarabia)
[05:26:53] <wikibugs>	 SRE-swift-storage, Commons, Tracking-Neverending: Thumbnail/imagescaler (tracking) - https://phabricator.wikimedia.org/T43371 (Jdforrester-WMF)
[05:30:20] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:36:24] <wikibugs>	 (PS1) Marostegui: db1121: Remove from dbctl [puppet] - https://gerrit.wikimedia.org/r/921116 (https://phabricator.wikimedia.org/T336725)
[05:36:58] <wikibugs>	 (CR) Marostegui: [C: +2] db1121: Remove from dbctl [puppet] - https://gerrit.wikimedia.org/r/921116 (https://phabricator.wikimedia.org/T336725) (owner: Marostegui)
[05:37:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1121 from dbctl T336725', diff saved to https://phabricator.wikimedia.org/P48367 and previous config saved to /var/cache/conftool/dbconfig/20230519-053719-marostegui.json
[05:37:24] <stashbot>	 T336725: decommission db1121.eqiad.wmnet - https://phabricator.wikimedia.org/T336725
[05:38:10] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:43:15] <wikibugs>	 ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T336992 (phaultfinder)
[05:44:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es2032 to es1 master', diff saved to https://phabricator.wikimedia.org/P48368 and previous config saved to /var/cache/conftool/dbconfig/20230519-054403-marostegui.json
[05:45:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es2030', diff saved to https://phabricator.wikimedia.org/P48369 and previous config saved to /var/cache/conftool/dbconfig/20230519-054503-root.json
[05:45:39] <wikibugs>	 (PS1) Marostegui: es2030: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/921117
[05:46:02] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:46:10] <wikibugs>	 (CR) Marostegui: [C: +2] es2030: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/921117 (owner: Marostegui)
[05:47:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:47:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es2033 to es2 master', diff saved to https://phabricator.wikimedia.org/P48370 and previous config saved to /var/cache/conftool/dbconfig/20230519-054737-marostegui.json
[05:47:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es2031', diff saved to https://phabricator.wikimedia.org/P48371 and previous config saved to /var/cache/conftool/dbconfig/20230519-054758-root.json
[05:48:15] <wikibugs>	 ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T336992 (phaultfinder)
[05:48:34] <wikibugs>	 (PS1) Marostegui: es2031: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/921118
[05:48:59] <wikibugs>	 (CR) Marostegui: [C: +2] es2031: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/921118 (owner: Marostegui)
[05:49:12] <wikibugs>	 (PS1) Ayounsi: admin/data.yaml: ayounsi: add ssh-ed25519 key [puppet] - https://gerrit.wikimedia.org/r/921119
[05:49:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es2034 to es3 master', diff saved to https://phabricator.wikimedia.org/P48372 and previous config saved to /var/cache/conftool/dbconfig/20230519-054923-marostegui.json
[05:49:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es2027', diff saved to https://phabricator.wikimedia.org/P48373 and previous config saved to /var/cache/conftool/dbconfig/20230519-054952-root.json
[05:51:00] <wikibugs>	 (PS1) Marostegui: es2027: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/921120
[05:51:27] <wikibugs>	 (CR) Marostegui: [C: +2] es2027: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/921120 (owner: Marostegui)
[05:52:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:53:06] <wikibugs>	 (PS1) Ayounsi: ayounsi: update ssh key to ed25519 [homer/public] - https://gerrit.wikimedia.org/r/921121 (https://phabricator.wikimedia.org/T336769)
[05:53:16] <wikibugs>	 (PS1) Marostegui: Revert "es2030: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/920745
[05:53:43] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "es2030: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/920745 (owner: Marostegui)
[05:53:52] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:54:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2030 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48374 and previous config saved to /var/cache/conftool/dbconfig/20230519-055426-root.json
[05:54:46] <wikibugs>	 (PS1) Marostegui: Revert "es2031: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/921146
[05:55:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2031 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48375 and previous config saved to /var/cache/conftool/dbconfig/20230519-055511-root.json
[05:55:14] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "es2031: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/921146 (owner: Marostegui)
[05:56:20] <wikibugs>	 (PS1) Marostegui: Revert "es2027: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/921147
[05:57:12] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "es2027: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/921147 (owner: Marostegui)
[05:57:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2027 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48376 and previous config saved to /var/cache/conftool/dbconfig/20230519-055723-root.json
[06:00:07] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230519T0600)
[06:00:52] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:03:33] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops, Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (Marostegui) Any ETA on when these will be installed? Thanks!
[06:07:08] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:09:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2030 (re)pooling @ 2%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48377 and previous config saved to /var/cache/conftool/dbconfig/20230519-060931-root.json
[06:10:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2031 (re)pooling @ 2%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48378 and previous config saved to /var/cache/conftool/dbconfig/20230519-061016-root.json
[06:12:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2027 (re)pooling @ 2%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48379 and previous config saved to /var/cache/conftool/dbconfig/20230519-061228-root.json
[06:15:30] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:20:46] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:21:10] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:22:18] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:22:30] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49995 bytes in 6.368 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:23:28] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.335 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:24:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2030 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48380 and previous config saved to /var/cache/conftool/dbconfig/20230519-062435-root.json
[06:25:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2031 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48381 and previous config saved to /var/cache/conftool/dbconfig/20230519-062520-root.json
[06:27:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2027 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48382 and previous config saved to /var/cache/conftool/dbconfig/20230519-062733-root.json
[06:30:27] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:36:47] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:39:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2030 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48383 and previous config saved to /var/cache/conftool/dbconfig/20230519-063940-root.json
[06:40:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2031 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48384 and previous config saved to /var/cache/conftool/dbconfig/20230519-064025-root.json
[06:41:43] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast6002.wikimedia.org
[06:42:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2027 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48385 and previous config saved to /var/cache/conftool/dbconfig/20230519-064237-root.json
[06:45:01] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:47:36] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast6002.wikimedia.org
[06:50:01] <jinxer-wm>	 (NodeTextfileStale) firing: (2) Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:51:46] <wikibugs>	 (PS1) Muehlenhoff: Simplify bastion config in netboot.cfg [puppet] - https://gerrit.wikimedia.org/r/921125
[06:53:05] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:54:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2030 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48386 and previous config saved to /var/cache/conftool/dbconfig/20230519-065445-root.json
[06:55:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2031 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48387 and previous config saved to /var/cache/conftool/dbconfig/20230519-065530-root.json
[06:57:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2027 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48388 and previous config saved to /var/cache/conftool/dbconfig/20230519-065742-root.json
[06:59:45] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reimage for host netflow2003.codfw.wmnet with OS bookworm
[07:00:06] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230519T0700)
[07:00:33] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:08:19] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:09:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2030 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48389 and previous config saved to /var/cache/conftool/dbconfig/20230519-070949-root.json
[07:10:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2031 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48390 and previous config saved to /var/cache/conftool/dbconfig/20230519-071034-root.json
[07:11:50] <moritzm>	 !log installing emacs security updates
[07:11:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2027 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48391 and previous config saved to /var/cache/conftool/dbconfig/20230519-071247-root.json
[07:16:01] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:21:24] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: prometheus4001.ulsfo.wmnet
[07:21:26] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: prometheus4001.ulsfo.wmnet
[07:21:31] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, decommission-hardware, SRE Observability (FY2022/2023-Q4): Decommission prometheus4001 - https://phabricator.wikimedia.org/T335585 (ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: prometheus4001.ulsfo.wmnet
[07:22:47] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 2 (gerrit1001, ...), Fresh: 122 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[07:23:41] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:24:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2030 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48392 and previous config saved to /var/cache/conftool/dbconfig/20230519-072454-root.json
[07:25:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2031 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48393 and previous config saved to /var/cache/conftool/dbconfig/20230519-072539-root.json
[07:27:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2027 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48394 and previous config saved to /var/cache/conftool/dbconfig/20230519-072751-root.json
[07:31:23] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:31:24] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on netflow2003.codfw.wmnet with reason: host reimage
[07:34:37] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow2003.codfw.wmnet with reason: host reimage
[07:37:41] <wikibugs>	 10SRE, 10DBA: db1132 index for table pagetriage_page is corrupt - https://phabricator.wikimedia.org/T335632 (10Marostegui) 05Open→03Resolved I have repooled db1132 - I will investigate db1106 with mariadb (this host is non production)
[07:37:46] <wikibugs>	 (03CR) 10Jaime Nuche: [C: 04-1] doc: add password-protected rsync module for publishing from gitlab (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/920669 (https://phabricator.wikimedia.org/T336168) (owner: 10Jaime Nuche)
[07:39:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2030 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48395 and previous config saved to /var/cache/conftool/dbconfig/20230519-073959-root.json
[07:40:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2031 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48396 and previous config saved to /var/cache/conftool/dbconfig/20230519-074044-root.json
[07:41:35] <wikibugs>	 (03CR) 10Jaime Nuche: [C: 04-1] doc: add password-protected rsync module for publishing from gitlab (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/920669 (https://phabricator.wikimedia.org/T336168) (owner: 10Jaime Nuche)
[07:42:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es2027 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48397 and previous config saved to /var/cache/conftool/dbconfig/20230519-074256-root.json
[07:49:59] <wikibugs>	 (03PS2) 10Ilias Sarantopoulos: ml-services: deploy Bloom-560m model on Lift Wing [deployment-charts] - 10https://gerrit.wikimedia.org/r/919345 (https://phabricator.wikimedia.org/T333861)
[07:52:19] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.hosts.reboot-single for host ml-cache2001.codfw.wmnet
[07:53:41] <wikibugs>	 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T336992 (10fgiunchedi) All of these seem to be for C5 only, maybe some mgmt network problem there?
[07:58:55] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-cache2001.codfw.wmnet
[08:00:21] <wikibugs>	 (03CR) 10Elukey: ml-services: deploy Bloom-560m model on Lift Wing (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/919345 (https://phabricator.wikimedia.org/T333861) (owner: 10Ilias Sarantopoulos)
[08:03:57] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.hosts.reboot-single for host ml-cache2002.codfw.wmnet
[08:07:04] <wikibugs>	 (03CR) 10Jaime Nuche: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/921126 (owner: 10TrainBranchBot)
[08:08:03] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:09:33] <moritzm>	 !log copy samplicator from bullseye-wikimedia to bookworm-wikimedia T330884
[08:09:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:37] <stashbot>	 T330884: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884
[08:10:32] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-cache2002.codfw.wmnet
[08:11:15] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.hosts.reboot-single for host ml-cache2003.codfw.wmnet
[08:13:50] <wikibugs>	 (03PS3) 10Majavah: wmnet: Remove nfs-tools-project.svc.eqiad [dns] - 10https://gerrit.wikimedia.org/r/907136 (https://phabricator.wikimedia.org/T333477)
[08:14:37] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host netflow2003.codfw.wmnet with OS bookworm
[08:14:59] <wikibugs>	 (03CR) 10Majavah: [C: 03+1] perl532: Add libphp-serialization-perl [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/919922 (https://phabricator.wikimedia.org/T323522) (owner: 10BryanDavis)
[08:15:05] <wikibugs>	 (03CR) 10Majavah: [C: 03+1] perl532: Add libmime-lite-perl [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/919920 (https://phabricator.wikimedia.org/T320904) (owner: 10BryanDavis)
[08:15:47] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:18:06] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-cache2003.codfw.wmnet
[08:18:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[08:20:05] <wikibugs>	 (03PS3) 10Ilias Sarantopoulos: ml-services: deploy Bloom-560m model on Lift Wing [deployment-charts] - 10https://gerrit.wikimedia.org/r/919345 (https://phabricator.wikimedia.org/T333861)
[08:22:03] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: ml-services: deploy Bloom-560m model on Lift Wing (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/919345 (https://phabricator.wikimedia.org/T333861) (owner: 10Ilias Sarantopoulos)
[08:23:29] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:23:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[08:24:19] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/921126 (owner: 10TrainBranchBot)
[08:25:15] <wikibugs>	 (03CR) 10David Caro: "LGTM, I'm not sure if 'shared-storage' is the best naming, as I would expect that to be just storage to share stuff with other tools/users" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/920259 (https://phabricator.wikimedia.org/T334081) (owner: 10Majavah)
[08:27:58] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd2003.codfw.wmnet
[08:28:25] <wikibugs>	 (03CR) 10Majavah: Add an option to disable NFS access (032 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/920259 (https://phabricator.wikimedia.org/T334081) (owner: 10Majavah)
[08:31:09] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:31:51] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd2003.codfw.wmnet
[08:32:06] <wikibugs>	 (03PS1) 10Muehlenhoff: Drop Boost packages from legacy package removal list for Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/921174 (https://phabricator.wikimedia.org/T330495)
[08:32:25] <wikibugs>	 (03PS2) 10Muehlenhoff: Drop Boost packages from legacy package removal list for Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/921174 (https://phabricator.wikimedia.org/T330495)
[08:32:53] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: deploy Bloom-560m model on Lift Wing (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/919345 (https://phabricator.wikimedia.org/T333861) (owner: 10Ilias Sarantopoulos)
[08:33:08] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops: Add configuration file support to mw-on-k8s.lua ATS script - https://phabricator.wikimedia.org/T336037 (10Joe) 05In progress→03Resolved
[08:33:18] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Joe)
[08:33:30] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/921128
[08:33:36] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/921128 (owner: 10TrainBranchBot)
[08:34:23] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops: Add traffic sampling support to mw-on-k8s.lua ATS script - https://phabricator.wikimedia.org/T336038 (10Joe) 05In progress→03Resolved
[08:34:32] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Joe)
[08:34:39] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd2002.codfw.wmnet
[08:38:10] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Drop Boost packages from legacy package removal list for Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/921174 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff)
[08:38:21] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[08:38:30] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd2002.codfw.wmnet
[08:38:47] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:39:43] <wikibugs>	 (03CR) 10BryanDavis: [C: 03+2] perl532: Add libmime-lite-perl [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/919920 (https://phabricator.wikimedia.org/T320904) (owner: 10BryanDavis)
[08:39:48] <wikibugs>	 (03CR) 10BryanDavis: [C: 03+2] perl532: Add libphp-serialization-perl [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/919922 (https://phabricator.wikimedia.org/T323522) (owner: 10BryanDavis)
[08:40:26] <wikibugs>	 (03Merged) 10jenkins-bot: perl532: Add libmime-lite-perl [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/919920 (https://phabricator.wikimedia.org/T320904) (owner: 10BryanDavis)
[08:40:30] <wikibugs>	 (03Merged) 10jenkins-bot: perl532: Add libphp-serialization-perl [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/919922 (https://phabricator.wikimedia.org/T323522) (owner: 10BryanDavis)
[08:41:51] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd2001.codfw.wmnet
[08:45:45] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd2001.codfw.wmnet
[08:46:27] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:52:28] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl2001.codfw.wmnet
[08:52:35] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:52:41] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/921128 (owner: 10TrainBranchBot)
[08:53:21] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (10MoritzMuehlenhoff) @ayounsi There's now netflow2003 running Bookworm with FNM 1.2.4. If that works fine, we can reimage the other netflow* VMs in-place once Bookworm is stable.  I copied over s...
[08:55:11] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:58:16] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl2001.codfw.wmnet
[08:59:24] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.decommission for hosts ms-be[2040-2043].codfw.wmnet
[09:00:17] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:02:09] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl2002.codfw.wmnet
[09:02:23] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch kadmin server back to krb1001 [puppet] - 10https://gerrit.wikimedia.org/r/921242 (https://phabricator.wikimedia.org/T331695)
[09:04:27] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:07:57] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:08:51] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl2002.codfw.wmnet
[09:15:00] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.dns.netbox
[09:15:39] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:18:29] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ms-be[2040-2043].codfw.wmnet decommissioned, removing all IPs except the asset tag one - mvernon@cumin2002"
[09:20:12] <wikibugs>	 (03CR) 10David Caro: Add an option to disable NFS access (032 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/920259 (https://phabricator.wikimedia.org/T334081) (owner: 10Majavah)
[09:21:16] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-serve-worker-codfw
[09:21:35] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ms-be[2040-2043].codfw.wmnet decommissioned, removing all IPs except the asset tag one - mvernon@cumin2002"
[09:21:35] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:21:36] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ms-be[2040-2043].codfw.wmnet
[09:21:39] <wikibugs>	 10SRE-swift-storage: Drain and then decommission ms-be20[40-43] - https://phabricator.wikimedia.org/T335280 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by mvernon@cumin2002 for hosts: `ms-be[2040-2043].codfw.wmnet` - ms-be2040.codfw.wmnet (**WARN**)   - Downtimed host on Icinga/Alertmanager...
[09:21:55] <wikibugs>	 (03PS7) 10EoghanGaffney: doc: add password-protected rsync module for publishing from gitlab [puppet] - 10https://gerrit.wikimedia.org/r/920669 (https://phabricator.wikimedia.org/T336168) (owner: 10Jaime Nuche)
[09:23:01] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:23:25] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:23:49] <icinga-wm>	 RECOVERY - Backup freshness on backup1001 is OK: Fresh: 124 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[09:23:51] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Netbox network report failing - timeout error getting connected_endpoint prefix - https://phabricator.wikimedia.org/T321704 (10cmooney) This appears to be happening more often now, and is starting to cause considerable noise in the dc-ops IRC channel.  @volans...
[09:26:57] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:27:30] <wikibugs>	 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission ms-be204[0-3].codfw.wmnet - https://phabricator.wikimedia.org/T337011 (10MatthewVernon)
[09:28:45] <wikibugs>	 10SRE-swift-storage: Q4 ms backend refresh work (KR) - https://phabricator.wikimedia.org/T335270 (10MatthewVernon)
[09:28:48] <wikibugs>	 10SRE-swift-storage: Drain and then decommission ms-be20[40-43] - https://phabricator.wikimedia.org/T335280 (10MatthewVernon) 05Open→03Resolved Hosts off and decom cookbook run; the DC-ops ticket to actually dispose of the hardware is T337011
[09:31:40] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reimage for host testvm2002.codfw.wmnet with OS bullseye
[09:33:08] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:33:33] <wikibugs>	 10SRE, 10Inuka-Team, 10Wikipedia-Preview, 10User-bd808: Add both Wikipedia Preview repos to Packagist - https://phabricator.wikimedia.org/T310938 (10bd808) 05Open→03Resolved a:03bd808 * https://packagist.org/packages/wikimedia/wikipedia-preview * https://packagist.org/packages/wikimedia/wikipediaprev...
[09:33:46] <wikibugs>	 (03PS1) 10EoghanGaffney: Changes from hard-coded list of hosts in doc module [puppet] - 10https://gerrit.wikimedia.org/r/921244
[09:33:57] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Netbox network report failing - timeout errors - https://phabricator.wikimedia.org/T321704 (10cmooney)
[09:37:22] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:39:05] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Looks good. Thanks elukey." [puppet] - 10https://gerrit.wikimedia.org/r/919802 (owner: 10Majavah)
[09:39:14] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:41:01] <wikibugs>	 (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41240/console" [puppet] - 10https://gerrit.wikimedia.org/r/921244 (owner: 10EoghanGaffney)
[09:42:04] <icinga-wm>	 PROBLEM - Check systemd state on ml-serve2001 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:45:15] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on testvm2002.codfw.wmnet with reason: host reimage
[09:45:45] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove bast2002 from bastion hiera entries [puppet] - 10https://gerrit.wikimedia.org/r/921247 (https://phabricator.wikimedia.org/T334287)
[09:45:48] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:45:50] <icinga-wm>	 RECOVERY - Check systemd state on ml-serve2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:48:24] <wikibugs>	 (03PS2) 10Filippo Giunchedi: prometheus: deprecate traffic 'global' rules [puppet] - 10https://gerrit.wikimedia.org/r/861826 (https://phabricator.wikimedia.org/T288196)
[09:48:26] <wikibugs>	 (03PS1) 10Filippo Giunchedi: prometheus: remove global rules [puppet] - 10https://gerrit.wikimedia.org/r/921248 (https://phabricator.wikimedia.org/T288196)
[09:48:32] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove bast2002 from bastion hiera entries [puppet] - 10https://gerrit.wikimedia.org/r/921247 (https://phabricator.wikimedia.org/T334287) (owner: 10Muehlenhoff)
[09:48:39] <wikibugs>	 (03PS2) 10Muehlenhoff: Remove bast2002 from bastion hiera entries [puppet] - 10https://gerrit.wikimedia.org/r/921247 (https://phabricator.wikimedia.org/T334287)
[09:48:51] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testvm2002.codfw.wmnet with reason: host reimage
[09:48:59] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[09:49:09] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[09:49:38] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[09:54:08] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:55:57] <wikibugs>	 (03PS1) 10Muehlenhoff: Update host for blackbox smoke tests [puppet] - 10https://gerrit.wikimedia.org/r/921250 (https://phabricator.wikimedia.org/T336995)
[09:56:56] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:00:06] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host testvm2002.codfw.wmnet with OS bullseye
[10:02:06] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:03:12] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] Update host for blackbox smoke tests [puppet] - 10https://gerrit.wikimedia.org/r/921250 (https://phabricator.wikimedia.org/T336995) (owner: 10Muehlenhoff)
[10:06:43] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Update host for blackbox smoke tests [puppet] - 10https://gerrit.wikimedia.org/r/921250 (https://phabricator.wikimedia.org/T336995) (owner: 10Muehlenhoff)
[10:07:32] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:07:40] <moritzm>	 !log installing ncurses security updates
[10:07:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:10:00] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:10:36] <icinga-wm>	 PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[10:12:00] <icinga-wm>	 RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[10:13:30] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:14:58] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:16:22] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:19:46] <icinga-wm>	 PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[10:22:14] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:22:38] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:22:48] <icinga-wm>	 RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[10:25:58] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:27:32] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:28:14] <wikibugs>	 (03PS1) 10Legoktm: Disable GWToolset from Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921252 (https://phabricator.wikimedia.org/T270911)
[10:30:28] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:31:25] <wikibugs>	 (03PS1) 10Legoktm: Remove GWToolset configuration (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921253 (https://phabricator.wikimedia.org/T270911)
[10:31:27] <wikibugs>	 (03PS1) 10Legoktm: Remove GWToolset configuration (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921254 (https://phabricator.wikimedia.org/T270911)
[10:33:58] <wikibugs>	 (03PS1) 10Hnowlan: imagemagick: update test cases for fixes within libraries [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/921255 (https://phabricator.wikimedia.org/T334863)
[10:35:49] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reimage for host testvm2004.codfw.wmnet with OS bullseye
[10:37:34] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts bast2002
[10:38:11] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab-runner1003.eqiad.wmnet
[10:38:20] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:38:27] <wikibugs>	 (03CR) 10Zabe: [C: 03+1] "trust me, I am a pro in disabling extensions ;)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921252 (https://phabricator.wikimedia.org/T270911) (owner: 10Legoktm)
[10:41:02] <wikibugs>	 (03CR) 10Jaime Nuche: [C: 03+1] Changes from hard-coded list of hosts in doc module [puppet] - 10https://gerrit.wikimedia.org/r/921244 (owner: 10EoghanGaffney)
[10:41:10] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:42:29] <wikibugs>	 10SRE-swift-storage, 10Discovery-Search: Ensure swiftly access for non-SREs - https://phabricator.wikimedia.org/T335144 (10MatthewVernon) Do you need anything from Data Persistence apropos this? I think not, but wanted to check in that you're not waiting for something from me :)
[10:44:41] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[10:45:05] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab-runner1003.eqiad.wmnet
[10:46:08] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:47:16] <icinga-wm>	 PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[10:48:44] <icinga-wm>	 RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[10:50:01] <jinxer-wm>	 (NodeTextfileStale) firing: (2) Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[10:50:10] <wikibugs>	 10SRE, 10ops-eqiad, 10serviceops-collab, 10GitLab (Infrastructure): gitlab-runner1003 is not coming back online - https://phabricator.wikimedia.org/T336737 (10Jelto) 05Open→03Resolved @Jclark-ctr thanks a lot for the quick response! Error is gone! I'm closing this task.
[10:50:16] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on testvm2004.codfw.wmnet with reason: host reimage
[10:51:39] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast2002 decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[10:53:21] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:53:37] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:53:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testvm2004.codfw.wmnet with reason: host reimage
[10:54:51] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:55:29] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast2002 decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[10:55:29] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:55:30] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts bast2002
[10:56:53] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:58:14] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove bast2002 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/921260
[10:59:00] <wikibugs>	 10ops-codfw, 10decommission-hardware: decommission bast2002.wikimedia.org - https://phabricator.wikimedia.org/T336995 (10MoritzMuehlenhoff)
[10:59:20] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove bast2002 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/921260 (owner: 10Muehlenhoff)
[11:00:49] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:03:21] <wikibugs>	 (03CR) 10Btullis: [C: 04-1] "There is still a problem with this change, in that I added the wrong file to conda-analytics. See https://phabricator.wikimedia.org/T33276" [puppet] - 10https://gerrit.wikimedia.org/r/901604 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis)
[11:03:29] <wikibugs>	 (03Abandoned) 10Cathal Mooney: Puppet additions for ssw1-e1-eqiad and ssw1-f1-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/906627 (https://phabricator.wikimedia.org/T322937) (owner: 10Cathal Mooney)
[11:04:30] <wikibugs>	 (03PS1) 10Cathal Mooney: Puppet additions to bring ssw1-f1-eqiad under management [puppet] - 10https://gerrit.wikimedia.org/r/921261 (https://phabricator.wikimedia.org/T322937)
[11:06:02] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host testvm2004.codfw.wmnet with OS bullseye
[11:06:11] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:07:37] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:07:57] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:10:25] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:13:51] <wikibugs>	 (03CR) 10Brian Wolff: [C: 03+1] Disable GWToolset from Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921252 (https://phabricator.wikimedia.org/T270911) (owner: 10Legoktm)
[11:15:11] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:22:09] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:23:55] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:27:29] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:ml-serve-worker-codfw
[11:30:15] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:35:21] <wikibugs>	 (03PS1) 10EoghanGaffney: Allow gitlab-runner hosts to talk rsync to doc1003 [puppet] - 10https://gerrit.wikimedia.org/r/921265 (https://phabricator.wikimedia.org/T336168)
[11:37:59] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:41:58] <wikibugs>	 (03PS8) 10Gmodena: mw-page-content-change-enrich: enable HA [deployment-charts] - 10https://gerrit.wikimedia.org/r/920268 (https://phabricator.wikimedia.org/T336656)
[11:44:34] <wikibugs>	 (03PS9) 10Gmodena: mw-page-content-change-enrich: enable HA [deployment-charts] - 10https://gerrit.wikimedia.org/r/920268 (https://phabricator.wikimedia.org/T336656)
[11:45:37] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:46:27] <wikibugs>	 (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41242/console" [puppet] - 10https://gerrit.wikimedia.org/r/921265 (https://phabricator.wikimedia.org/T336168) (owner: 10EoghanGaffney)
[11:46:55] <wikibugs>	 (03CR) 10Jaime Nuche: [C: 03+1] Allow gitlab-runner hosts to talk rsync to doc1003 [puppet] - 10https://gerrit.wikimedia.org/r/921265 (https://phabricator.wikimedia.org/T336168) (owner: 10EoghanGaffney)
[11:47:26] <wikibugs>	 (03CR) 10EoghanGaffney: [V: 03+1 C: 03+2] Allow gitlab-runner hosts to talk rsync to doc1003 [puppet] - 10https://gerrit.wikimedia.org/r/921265 (https://phabricator.wikimedia.org/T336168) (owner: 10EoghanGaffney)
[11:51:13] <wikibugs>	 (03PS1) 10Majavah: Enable RealMe [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921326 (https://phabricator.wikimedia.org/T324535)
[11:51:56] <wikibugs>	 (03CR) 10Bartosz Dziewoński: [C: 03+1] mwscript: Avoid prepending maintenance/ if >= 2 dots in argument [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920788 (https://phabricator.wikimedia.org/T336819) (owner: 10Ladsgroup)
[11:53:13] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:53:17] <wikibugs>	 10SRE, 10Inuka-Team, 10Wikipedia-Preview, 10User-bd808: Add both Wikipedia Preview repos to Packagist - https://phabricator.wikimedia.org/T310938 (10Varnent) >>! In T310938#8863903, @bd808 wrote: > * https://packagist.org/packages/wikimedia/wikipedia-preview > * https://packagist.org/packages/wikimedia/wik...
[12:00:47] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:08:31] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:11:23] <wikibugs>	 (03CR) 10Legoktm: [C: 04-1] "Let's keep this disabled on private wikis and locked down ones." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921326 (https://phabricator.wikimedia.org/T324535) (owner: 10Majavah)
[12:12:16] <wikibugs>	 (03PS2) 10Majavah: Enable RealMe [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921326 (https://phabricator.wikimedia.org/T324535)
[12:12:42] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] Puppet additions to bring ssw1-f1-eqiad under management [puppet] - 10https://gerrit.wikimedia.org/r/921261 (https://phabricator.wikimedia.org/T322937) (owner: 10Cathal Mooney)
[12:13:06] <wikibugs>	 (03PS3) 10Majavah: Enable RealMe [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921326 (https://phabricator.wikimedia.org/T324535)
[12:14:15] <wikibugs>	 (03CR) 10Majavah: Enable RealMe (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921326 (https://phabricator.wikimedia.org/T324535) (owner: 10Majavah)
[12:14:35] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] ayounsi: update ssh key to ed25519 [homer/public] - 10https://gerrit.wikimedia.org/r/921121 (https://phabricator.wikimedia.org/T336769) (owner: 10Ayounsi)
[12:15:02] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Joe)
[12:15:12] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] admin/data.yaml: ayounsi: add ssh-ed25519 key [puppet] - 10https://gerrit.wikimedia.org/r/921119 (owner: 10Ayounsi)
[12:15:32] <wikibugs>	 (03PS4) 10Majavah: Enable RealMe [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921326 (https://phabricator.wikimedia.org/T324535)
[12:15:34] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/921255 (https://phabricator.wikimedia.org/T334863) (owner: 10Hnowlan)
[12:16:09] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:17:20] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Simplify bastion config in netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/921125 (owner: 10Muehlenhoff)
[12:18:40] <wikibugs>	 (03Merged) 10jenkins-bot: ayounsi: update ssh key to ed25519 [homer/public] - 10https://gerrit.wikimedia.org/r/921121 (https://phabricator.wikimedia.org/T336769) (owner: 10Ayounsi)
[12:18:42] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Sounds good" [puppet] - 10https://gerrit.wikimedia.org/r/920991 (https://phabricator.wikimedia.org/T108027) (owner: 10Filippo Giunchedi)
[12:19:58] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-serve-worker-eqiad
[12:22:19] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (10cmooney) @volans I had a word with @ayounsi on this and we both feel if we can make it work via HTTP to the apt server that's probably best.  I...
[12:23:57] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:27:49] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] users: Update brett's key to ed25519 [homer/public] - 10https://gerrit.wikimedia.org/r/921073 (https://phabricator.wikimedia.org/T336769) (owner: 10BCornwall)
[12:28:33] <wikibugs>	 (03PS1) 10Legoktm: i18n: Add link to help page [extensions/RealMe] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/921150 (https://phabricator.wikimedia.org/T322717)
[12:29:01] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] imagemagick: update test cases for fixes within libraries [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/921255 (https://phabricator.wikimedia.org/T334863) (owner: 10Hnowlan)
[12:30:11] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:30:36] <wikibugs>	 (03Merged) 10jenkins-bot: users: Update brett's key to ed25519 [homer/public] - 10https://gerrit.wikimedia.org/r/921073 (https://phabricator.wikimedia.org/T336769) (owner: 10BCornwall)
[12:34:02] <wikibugs>	 (03Merged) 10jenkins-bot: imagemagick: update test cases for fixes within libraries [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/921255 (https://phabricator.wikimedia.org/T334863) (owner: 10Hnowlan)
[12:37:59] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:39:31] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Update network SSH keys to ssh-ed25519 - https://phabricator.wikimedia.org/T336769 (10ayounsi)
[12:40:20] <wikibugs>	 (03PS10) 10Gmodena: mw-page-content-change-enrich: enable HA [deployment-charts] - 10https://gerrit.wikimedia.org/r/920268 (https://phabricator.wikimedia.org/T336656)
[12:45:47] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:53:29] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:56:56] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] "Formal approval from Releng. That is being done from the Hackathon so if something causes any trouble everyone is around :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921252 (https://phabricator.wikimedia.org/T270911) (owner: 10Legoktm)
[12:57:11] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] Remove GWToolset configuration (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921253 (https://phabricator.wikimedia.org/T270911) (owner: 10Legoktm)
[12:58:40] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] "You might want to rebuild the localization cache, then I don't think having any unused/unreferenced message in the cache is causing any is" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921254 (https://phabricator.wikimedia.org/T270911) (owner: 10Legoktm)
[13:01:09] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:01:55] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] "Given it is done at the Hackathon there is all the expertise required to deploy it even if it is Friday today." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921326 (https://phabricator.wikimedia.org/T324535) (owner: 10Majavah)
[13:05:31] <hashar>	 I am rolling all wikis to 1.41.0-wmf.9
[13:08:02] <wikibugs>	 (03PS1) 10TrainBranchBot: group2 wikis to 1.41.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921342 (https://phabricator.wikimedia.org/T330215)
[13:08:04] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.41.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921342 (https://phabricator.wikimedia.org/T330215) (owner: 10TrainBranchBot)
[13:08:51] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:10:12] <wikibugs>	 (03Merged) 10jenkins-bot: group2 wikis to 1.41.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921342 (https://phabricator.wikimedia.org/T330215) (owner: 10TrainBranchBot)
[13:12:32] <wikibugs>	 (03PS3) 10Filippo Giunchedi: prometheus: deprecate traffic 'global' rules [puppet] - 10https://gerrit.wikimedia.org/r/861826 (https://phabricator.wikimedia.org/T288196)
[13:12:34] <wikibugs>	 (03PS2) 10Filippo Giunchedi: prometheus: remove global rules [puppet] - 10https://gerrit.wikimedia.org/r/921248 (https://phabricator.wikimedia.org/T288196)
[13:12:36] <wikibugs>	 (03PS1) 10Filippo Giunchedi: prometheus: soft-disable 'global' instance [puppet] - 10https://gerrit.wikimedia.org/r/921347 (https://phabricator.wikimedia.org/T288196)
[13:12:38] <wikibugs>	 (03PS1) 10Filippo Giunchedi: prometheus: add 'ensure' for ::server [puppet] - 10https://gerrit.wikimedia.org/r/921348
[13:12:40] <wikibugs>	 (03PS1) 10Filippo Giunchedi: prometheus: de-provision 'global' instance [puppet] - 10https://gerrit.wikimedia.org/r/921349 (https://phabricator.wikimedia.org/T288196)
[13:12:42] <wikibugs>	 (03PS1) 10Filippo Giunchedi: prometheus: remove 'global' instance references [puppet] - 10https://gerrit.wikimedia.org/r/921350 (https://phabricator.wikimedia.org/T288196)
[13:15:03] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:15:10] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] prometheus: de-provision 'global' instance [puppet] - 10https://gerrit.wikimedia.org/r/921349 (https://phabricator.wikimedia.org/T288196) (owner: 10Filippo Giunchedi)
[13:17:30] <logmsgbot>	 !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.41.0-wmf.9  refs T330215
[13:17:34] <stashbot>	 T330215: 1.41.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T330215
[13:21:00] <wikibugs>	 (03CR) 10Muehlenhoff: "I like the approach, but given this all touches fairly critical functionality I think we should rather break it down and merge incremental" [puppet] - 10https://gerrit.wikimedia.org/r/919060 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond)
[13:22:41] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:23:29] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:26:04] <topranks>	 !log Adding vlan config for row e/f vlans on ssw1-f1-eqiad (T322937)
[13:26:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:09] <stashbot>	 T322937: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937
[13:30:19] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:31:00] <wikibugs>	 (03CR) 10Muehlenhoff: firewall: add basic firewall class (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/919061 (owner: 10Jbond)
[13:33:51] <wikibugs>	 (03PS2) 10Filippo Giunchedi: prometheus: de-provision 'global' instance [puppet] - 10https://gerrit.wikimedia.org/r/921349 (https://phabricator.wikimedia.org/T288196)
[13:33:53] <wikibugs>	 (03PS2) 10Filippo Giunchedi: prometheus: remove 'global' instance references [puppet] - 10https://gerrit.wikimedia.org/r/921350 (https://phabricator.wikimedia.org/T288196)
[13:34:37] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on lvs1020.eqiad.wmnet with reason: Move lvs1020 handoff port to row e/f from lsw1-f1 to ssw1-f1
[13:34:45] <logmsgbot>	 !log mforns@deploy1002 Started deploy [airflow-dags/analytics_test@be05071]: (no justification provided)
[13:34:51] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on lvs1020.eqiad.wmnet with reason: Move lvs1020 handoff port to row e/f from lsw1-f1 to ssw1-f1
[13:34:55] <logmsgbot>	 !log mforns@deploy1002 Finished deploy [airflow-dags/analytics_test@be05071]: (no justification provided) (duration: 00m 10s)
[13:34:59] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=c4ef01af-e7d5-458f-ae46-17500f124165) set by cmooney@cumin1001 f...
[13:36:06] <hashar>	 1.41.0-wmf.9 is on all wikis, I am triaging the error logs
[13:36:12] <hashar>	 but it looks quiet so far
[13:37:55] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:40:24] <hashar>	 taavi: legoktm: go ahead with your deployment. 1.41.0-wmf.9 looks stable :]
[13:40:30] <taavi>	 cool
[13:40:44] <hashar>	 you are both at the hackathon aren't you?
[13:40:56] <taavi>	 yep
[13:42:38] <wikibugs>	 (03PS1) 10Jelto: miscweb/annualreport: bump image version to v1.0.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/921354 (https://phabricator.wikimedia.org/T336217)
[13:44:04] <wikibugs>	 10SRE, 10SRE-Access-Requests: Replacing SSH key for Itamar Givon - https://phabricator.wikimedia.org/T337037 (10ItamarWMDE)
[13:45:33] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:47:03] <wikibugs>	 10SRE, 10SRE-Access-Requests: Replacing SSH key for Itamar Givon - https://phabricator.wikimedia.org/T337037 (10taavi) Please note that not all of our servers have `-sk` support yet, it's only on systems running Bullseye or newer.
[13:47:05] <wikibugs>	 (03CR) 10Jelto: miscweb annualreport: use wildcard redirect for 2020th report (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/918424 (https://phabricator.wikimedia.org/T336217) (owner: 10Jelto)
[13:48:59] <wikibugs>	 10SRE, 10SRE-Access-Requests: Replacing SSH key for Itamar Givon - https://phabricator.wikimedia.org/T337037 (10ItamarWMDE) Aha! Yeah I was wondering about that earlier. Will this suffice for running scripts on the maint machines and occasionally performing tasks on stat machines?
[13:49:17] <TheresNoTime>	 T336952 is such an odd bug..
[13:49:17] <stashbot>	 T336952: Wikibase\DataModel\Services\Lookup\ReferencedEntityIdLookupException: Referenced entity id lookup failed. Tried to find a referenced entity out of Q16334295 linked from Q13406463 via P279 - https://phabricator.wikimedia.org/T336952
[13:49:38] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Puppet additions to bring ssw1-f1-eqiad under management [puppet] - 10https://gerrit.wikimedia.org/r/921261 (https://phabricator.wikimedia.org/T322937) (owner: 10Cathal Mooney)
[13:49:46] <hoo>	 TheresNoTime: Why is it odd?
[13:50:14] <TheresNoTime>	 I'll rephrase that to "is odd because I don't understand it" (:
[13:51:02] * TheresNoTime doesn't understand Wikibase in general to be honest..
[13:51:31] <wikibugs>	 (03PS3) 10Jelto: miscweb annualreport: use wildcard redirect for 2020th report [puppet] - 10https://gerrit.wikimedia.org/r/918424 (https://phabricator.wikimedia.org/T336217)
[13:53:09] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:54:03] <wikibugs>	 (03CR) 10Ottomata: mw-page-content-change-enrich: enable HA (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/920268 (https://phabricator.wikimedia.org/T336656) (owner: 10Gmodena)
[13:57:17] <wikibugs>	 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T336949 (10Jhancock.wm) re-secured power cable. alert has cleared on chassis. waiting for it to clear in Grafana
[13:57:22] <wikibugs>	 (03CR) 10Jelto: add 15.wikipedia to cert and gateway hosts for miscweb behind istio ingress (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/766875 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn)
[13:57:56] <RhinosF1>	 TheresNoTime: I'm with you on that
[13:58:00] <RhinosF1>	 WikiBase is confusing
[13:58:01] <hoo>	 TheresNoTime: It's a callback from a Lua function which tries to access items… and we throw an exception if an item is a double redirect
[13:58:56] <wikibugs>	 10SRE-swift-storage, 10serviceops-collab: Investigate object storage for Gitlab - https://phabricator.wikimedia.org/T336234 (10MatthewVernon) I think here we are talking about using the S3 protocol? That is currently only enabled on the thanos cluster (MOSS is a maybe-next-FY sort of thing, but will also do S3...
[13:59:41] <hoo>	 But no surprise… Wikibase is a fairly complex system on its own
[13:59:51] <hoo>	 So there's a lot to it
[14:00:17] <hoo>	 But we're trying our best(tm) to keep it understandable
[14:00:31] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:01:26] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q4: install disks in frmon1002 - https://phabricator.wikimedia.org/T336569 (10Jclark-ctr) a:03Jclark-ctr
[14:01:41] <icinga-wm>	 PROBLEM - Host ssw1-f1-eqiad IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[14:04:48] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q4: install disks in frmon1002 - https://phabricator.wikimedia.org/T336569 (10Jclark-ctr)
[14:05:05] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q4: install disks in frmon1002 - https://phabricator.wikimedia.org/T336569 (10Jclark-ctr) @Dwisehaupt Disks have arrived and been installed
[14:05:36] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q4: install disks in frmon1002 - https://phabricator.wikimedia.org/T336569 (10Jclark-ctr) 05Open→03Resolved
[14:06:00] <wikibugs>	 (03PS1) 10Itamar Givon: Add new key generated with a security key [puppet] - 10https://gerrit.wikimedia.org/r/921356 (https://phabricator.wikimedia.org/T337037)
[14:06:05] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: shellbox: Add service mesh envoy retries [puppet] - 10https://gerrit.wikimedia.org/r/921357 (https://phabricator.wikimedia.org/T292663)
[14:06:32] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:06:43] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:06:51] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Platform-SRE: Degraded RAID on analytics1068 - https://phabricator.wikimedia.org/T336826 (10Jclark-ctr) 05Open→03Resolved Replaced failed drive; icinga alerts have cleared
[14:08:01] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] "Got a verbal +1 from _joe_ during the hackathon" [puppet] - 10https://gerrit.wikimedia.org/r/921357 (https://phabricator.wikimedia.org/T292663) (owner: 10Alexandros Kosiaris)
[14:08:02] <topranks>	 ^^ IPv6 ping alert for ssw1-f1-eqiad above is known and understood: the switch was just brought under management and IPv6 isn't enabled yet
[14:08:16] <wikibugs>	 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T336992 (10Jclark-ctr) a:03Jclark-ctr troubleshooting now on site @fgiunchedi
[14:09:15] <icinga-wm>	 PROBLEM - Check systemd state on ml-serve1005 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:09:36] <wikibugs>	 (03PS1) 10Itamar Givon: [DNM] Remove old ssh key [puppet] - 10https://gerrit.wikimedia.org/r/921359 (https://phabricator.wikimedia.org/T337037)
[14:10:24] <wikibugs>	 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T336949 (10Jhancock.wm) 05Open→03Resolved cleared. resolving
[14:11:32] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:11:33] <icinga-wm>	 RECOVERY - Check systemd state on ml-serve1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:12:07] <icinga-wm>	 RECOVERY - Host ssw1-f1-eqiad IPv6 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms
[14:14:37] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.7 point update - https://phabricator.wikimedia.org/T335575 (10MoritzMuehlenhoff)
[14:16:05] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:16:32] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:16:39] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on wdqs1014.eqiad.wmnet with reason: firmware update
[14:17:04] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on wdqs1014.eqiad.wmnet with reason: firmware update
[14:20:08] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Replacing SSH key for Itamar Givon - https://phabricator.wikimedia.org/T337037 (10ItamarWMDE) Seeing now that the `maint` and `stat` machines are still on buster. Don't mind stalling it until an upgrade to bullseye.
[14:20:39] <sukhe>	 !log disable puppet on A:lvs to roll out CR 910566
[14:20:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:43] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:23:53] <wikibugs>	 (03CR) 10Itamar Givon: "Moving back to WIP as I saw sk support is only available on bullseye and up and I'll require access to machines that run buster" [puppet] - 10https://gerrit.wikimedia.org/r/921356 (https://phabricator.wikimedia.org/T337037) (owner: 10Itamar Givon)
[14:24:24] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1 C: 03+2] pybal/lvs: remove backward compatibility for buster [puppet] - 10https://gerrit.wikimedia.org/r/910566 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh)
[14:24:54] <wikibugs>	 10SRE, 10ops-codfw, 10decommission-hardware: decommission bast2002.wikimedia.org - https://phabricator.wikimedia.org/T336995 (10Jhancock.wm) a:05Papaul→03Jhancock.wm
[14:25:49] <icinga-wm>	 RECOVERY - Host ps1-c5-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.84 ms
[14:26:33] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission ms-be204[0-3].codfw.wmnet - https://phabricator.wikimedia.org/T337011 (10Jhancock.wm) a:03Jhancock.wm
[14:27:53] <icinga-wm>	 PROBLEM - Check systemd state on ml-serve1006 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:29:43] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on ml-serve1006 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[14:30:01] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:30:48] <wikibugs>	 10SRE-swift-storage, 10serviceops-collab: Investigate object storage for Gitlab - https://phabricator.wikimedia.org/T336234 (10jcrespo) > I don't think thanos is currently backed up; @jcrespo is maestro of backups.  I suggested doing it but the answer I got was no, so no current backups.
[14:30:55] <icinga-wm>	 RECOVERY - Check systemd state on ml-serve1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:34:09] <icinga-wm>	 PROBLEM - BGP status on lsw1-f2-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:35:34] <sukhe>	 !log enable puppet on A:lvs, finished rolling out change
[14:35:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:35:41] <icinga-wm>	 RECOVERY - BGP status on lsw1-f2-eqiad.mgmt is OK: BGP OK - up: 6, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:36:11] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: ml-services: upgrade bloom model with newer image [deployment-charts] - 10https://gerrit.wikimedia.org/r/921366 (https://phabricator.wikimedia.org/T333861)
[14:36:18] <sukhe>	 I /win 59
[14:36:26] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on stat1009.eqiad.wmnet with reason: Bringing stat1009 into service
[14:36:40] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on stat1009.eqiad.wmnet with reason: Bringing stat1009 into service
[14:37:51] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] ml-services: upgrade bloom model with newer image [deployment-charts] - 10https://gerrit.wikimedia.org/r/921366 (https://phabricator.wikimedia.org/T333861) (owner: 10Ilias Sarantopoulos)
[14:38:49] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: upgrade bloom model with newer image [deployment-charts] - 10https://gerrit.wikimedia.org/r/921366 (https://phabricator.wikimedia.org/T333861) (owner: 10Ilias Sarantopoulos)
[14:40:31] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[14:43:23] <wikibugs>	 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T336992 (10Jclark-ctr) replaced msw-c5-eqiad
[14:48:45] <icinga-wm>	 PROBLEM - BGP status on lsw1-f3-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64606/IPv6: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:49:33] <taavi>	 legoktm and I are starting to deploy stuff
[14:50:01] <jinxer-wm>	 (NodeTextfileStale) firing: (2) Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[14:50:19] <icinga-wm>	 RECOVERY - BGP status on lsw1-f3-eqiad.mgmt is OK: BGP OK - up: 8, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:50:53] <MatmaRex>	 :O
[14:50:59] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Netbox network report failing - timeout errors - https://phabricator.wikimedia.org/T321704 (10ayounsi) Upgrade to 3.2.9 didn't help, but we were expecting it a bit.  At this point I guess that it's related to the steady increase of Netbox usage and we should loo...
[14:52:20] <wikibugs>	 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T336992 (10Jclark-ctr) 05Open→03Resolved
[14:52:42] <wikibugs>	 (03CR) 10Multichill: [C: 03+1] "It's time" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921252 (https://phabricator.wikimedia.org/T270911) (owner: 10Legoktm)
[14:53:09] <wikibugs>	 10SRE, 10ops-eqiad: Move two GPUs from Hadoop to Lift Wing - https://phabricator.wikimedia.org/T335031 (10Jclark-ctr) @elukey  would like to try to address next week are you available tuesday?
[14:54:57] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by legoktm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921252 (https://phabricator.wikimedia.org/T270911) (owner: 10Legoktm)
[14:55:21] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: ml-services: fix bloom model inference [deployment-charts] - 10https://gerrit.wikimedia.org/r/921371 (https://phabricator.wikimedia.org/T333861)
[14:56:52] <wikibugs>	 (03Merged) 10jenkins-bot: Disable GWToolset from Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921252 (https://phabricator.wikimedia.org/T270911) (owner: 10Legoktm)
[14:57:05] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] ml-services: fix bloom model inference [deployment-charts] - 10https://gerrit.wikimedia.org/r/921371 (https://phabricator.wikimedia.org/T333861) (owner: 10Ilias Sarantopoulos)
[14:57:11] <logmsgbot>	 !log legoktm@deploy1002 Started scap: Backport for [[gerrit:921252|Disable GWToolset from Commons (T270911)]]
[14:57:16] <stashbot>	 T270911: Remove GWToolset extension from Wikimedia Commons - https://phabricator.wikimedia.org/T270911
[14:57:57] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: fix bloom model inference [deployment-charts] - 10https://gerrit.wikimedia.org/r/921371 (https://phabricator.wikimedia.org/T333861) (owner: 10Ilias Sarantopoulos)
[14:58:27] <wikibugs>	 (03PS1) 10Robertsky: change wikimaniawiki logo to 2023 version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921372
[14:58:38] <logmsgbot>	 !log legoktm@deploy1002 legoktm: Backport for [[gerrit:921252|Disable GWToolset from Commons (T270911)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet
[14:59:09] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:ml-serve-worker-eqiad
[15:00:17] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on ml-serve1006 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[15:01:20] <wikibugs>	 (03PS2) 10Robertsky: change wikimaniawiki logo to 2023 version. T337044 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921372
[15:03:18] <wikibugs>	 (03PS1) 10Ayounsi: codfw: use new netflow server [homer/public] - 10https://gerrit.wikimedia.org/r/921375 (https://phabricator.wikimedia.org/T330884)
[15:03:26] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] i18n: Add link to help page [extensions/RealMe] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/921150 (https://phabricator.wikimedia.org/T322717) (owner: 10Legoktm)
[15:04:51] <wikibugs>	 (03PS5) 10Majavah: Enable RealMe [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921326 (https://phabricator.wikimedia.org/T324535)
[15:06:15] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[15:06:57] <logmsgbot>	 !log legoktm@deploy1002 Finished scap: Backport for [[gerrit:921252|Disable GWToolset from Commons (T270911)]] (duration: 09m 46s)
[15:07:01] <stashbot>	 T270911: Remove GWToolset extension from Wikimedia Commons - https://phabricator.wikimedia.org/T270911
[15:07:38] <wikibugs>	 (03Merged) 10jenkins-bot: i18n: Add link to help page [extensions/RealMe] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/921150 (https://phabricator.wikimedia.org/T322717) (owner: 10Legoktm)
[15:08:00] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921326 (https://phabricator.wikimedia.org/T324535) (owner: 10Majavah)
[15:08:12] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921326 (https://phabricator.wikimedia.org/T324535) (owner: 10Majavah)
[15:09:21] <wikibugs>	 (03Merged) 10jenkins-bot: Enable RealMe [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921326 (https://phabricator.wikimedia.org/T324535) (owner: 10Majavah)
[15:09:38] <logmsgbot>	 !log taavi@deploy1002 Started scap: Backport for [[gerrit:921150|i18n: Add link to help page (T322717)]], [[gerrit:921326|Enable RealMe (T324535)]]
[15:09:43] <stashbot>	 T324535: Deploy RealMe to production - https://phabricator.wikimedia.org/T324535
[15:09:43] <stashbot>	 T322717: Allow Wikimedians to verify their Mastodon profile with rel="me" - https://phabricator.wikimedia.org/T322717
[15:14:00] <James_F>	 legoktm: If you're doing fun prod deploys to delete old code… https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/723652
[15:18:29] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[15:21:10] <logmsgbot>	 !log taavi@deploy1002 legoktm and taavi: Backport for [[gerrit:921150|i18n: Add link to help page (T322717)]], [[gerrit:921326|Enable RealMe (T324535)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
[15:21:15] <stashbot>	 T324535: Deploy RealMe to production - https://phabricator.wikimedia.org/T324535
[15:21:16] <stashbot>	 T322717: Allow Wikimedians to verify their Mastodon profile with rel="me" - https://phabricator.wikimedia.org/T322717
[15:24:42] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] httpbb: add simple test for VRTS [puppet] - 10https://gerrit.wikimedia.org/r/921068 (https://phabricator.wikimedia.org/T326891) (owner: 10Dzahn)
[15:24:50] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] httpbb: add simple tests for gitlab.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/921065 (https://phabricator.wikimedia.org/T326891) (owner: 10Dzahn)
[15:28:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:29:18] <wikibugs>	 (03PS1) 10Btullis: Add an extra property 'CollectMode' to each user's jupyter service [puppet] - 10https://gerrit.wikimedia.org/r/921382 (https://phabricator.wikimedia.org/T336951)
[15:30:45] <wikibugs>	 (03CR) 10Btullis: "I plan to test this patch on an-test-client100[1-2] before deploying to the production stats servers." [puppet] - 10https://gerrit.wikimedia.org/r/921382 (https://phabricator.wikimedia.org/T336951) (owner: 10Btullis)
[15:31:04] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] codfw: use new netflow server [homer/public] - 10https://gerrit.wikimedia.org/r/921375 (https://phabricator.wikimedia.org/T330884) (owner: 10Ayounsi)
[15:31:40] <logmsgbot>	 !log taavi@deploy1002 Finished scap: Backport for [[gerrit:921150|i18n: Add link to help page (T322717)]], [[gerrit:921326|Enable RealMe (T324535)]] (duration: 22m 02s)
[15:31:41] <wikibugs>	 (03Merged) 10jenkins-bot: codfw: use new netflow server [homer/public] - 10https://gerrit.wikimedia.org/r/921375 (https://phabricator.wikimedia.org/T330884) (owner: 10Ayounsi)
[15:31:46] <stashbot>	 T324535: Deploy RealMe to production - https://phabricator.wikimedia.org/T324535
[15:31:46] <stashbot>	 T322717: Allow Wikimedians to verify their Mastodon profile with rel="me" - https://phabricator.wikimedia.org/T322717
[15:32:44] <wikibugs>	 (03PS4) 10Dzahn: httpbb: add simple test for VRTS [puppet] - 10https://gerrit.wikimedia.org/r/921068 (https://phabricator.wikimedia.org/T326891)
[15:33:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:36:36] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "now installed on deploy and cumin:" [puppet] - 10https://gerrit.wikimedia.org/r/921065 (https://phabricator.wikimedia.org/T326891) (owner: 10Dzahn)
[15:38:39] <wikibugs>	 (03CR) 10Dzahn: "now deployed on deploy* and cumin*. example:" [puppet] - 10https://gerrit.wikimedia.org/r/921068 (https://phabricator.wikimedia.org/T326891) (owner: 10Dzahn)
[15:41:48] <wikibugs>	 (03PS2) 10Dzahn: httpbb: add tests for gerrit and gerrit-replica [puppet] - 10https://gerrit.wikimedia.org/r/921075 (https://phabricator.wikimedia.org/T326891)
[15:43:07] <wikibugs>	 (03PS3) 10Dzahn: httpbb: add tests for gerrit and gerrit-replica [puppet] - 10https://gerrit.wikimedia.org/r/921075 (https://phabricator.wikimedia.org/T326891)
[15:47:55] <wikibugs>	 (03PS1) 10Ayounsi: Kafka: add netflow2003 to the allowed sources [puppet] - 10https://gerrit.wikimedia.org/r/921384 (https://phabricator.wikimedia.org/T330884)
[15:48:08] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] httpbb: add tests for gerrit and gerrit-replica [puppet] - 10https://gerrit.wikimedia.org/r/921075 (https://phabricator.wikimedia.org/T326891) (owner: 10Dzahn)
[15:48:18] <wikibugs>	 (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/921384 (https://phabricator.wikimedia.org/T330884) (owner: 10Ayounsi)
[15:52:14] <wikibugs>	 (03PS2) 10Ayounsi: Kafka: add netflow2003 to the allowed sources [puppet] - 10https://gerrit.wikimedia.org/r/921384 (https://phabricator.wikimedia.org/T330884)
[15:52:23] <wikibugs>	 (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/921384 (https://phabricator.wikimedia.org/T330884) (owner: 10Ayounsi)
[15:53:04] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "now deployed on deploy* and cumin*" [puppet] - 10https://gerrit.wikimedia.org/r/921075 (https://phabricator.wikimedia.org/T326891) (owner: 10Dzahn)
[15:53:08] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/921384 (https://phabricator.wikimedia.org/T330884) (owner: 10Ayounsi)
[15:55:08] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Kafka: add netflow2003 to the allowed sources [puppet] - 10https://gerrit.wikimedia.org/r/921384 (https://phabricator.wikimedia.org/T330884) (owner: 10Ayounsi)
[16:02:44] <wikibugs>	 (03PS1) 10Dzahn: httpbb: add tests for CI, https://integration.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/921387 (https://phabricator.wikimedia.org/T326891)
[16:03:37] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1] "[deploy1002:~] $ httpbb --hosts integration.wikimedia.org ./test_integration.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/921387 (https://phabricator.wikimedia.org/T326891) (owner: 10Dzahn)
[16:08:39] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:10:10] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (10ayounsi) >>! In T330884#8863775, @MoritzMuehlenhoff wrote: > @ayounsi There's now netflow2003 running Bookworm with FNM 1.2.4. If that works fine, we can reimage the other...
[16:10:39] <wikibugs>	 (03PS13) 10Eevans: (WIP) cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814)
[16:11:03] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] (WIP) cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans)
[16:11:19] <icinga-wm>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:15:01] <wikibugs>	 (03PS2) 10Dzahn: httpbb: add tests for CI, https://integration.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/921387 (https://phabricator.wikimedia.org/T326891)
[16:15:03] <wikibugs>	 (03PS1) 10Dzahn: httpbb: add tests for etherpad.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/921389 (https://phabricator.wikimedia.org/T326891)
[16:16:29] <wikibugs>	 (03CR) 10Ottomata: "Fine with me!  But, won't this make troubleshooting problems harder? If the logs are removed on a failure, how will we know why something " [puppet] - 10https://gerrit.wikimedia.org/r/921382 (https://phabricator.wikimedia.org/T336951) (owner: 10Btullis)
[16:20:46] <wikibugs>	 (03PS1) 10Ayounsi: Fastnetmon: enable Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/921390 (https://phabricator.wikimedia.org/T330884)
[16:22:57] <wikibugs>	 (03PS1) 10Ssingh: WIP: P:bird::anycast_healthchecker: allow binding to multiple services [puppet] - 10https://gerrit.wikimedia.org/r/921391
[16:24:42] <wikibugs>	 (03PS1) 10Dzahn: httpbb: add tests for rt.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/921392 (https://phabricator.wikimedia.org/T326891)
[16:25:08] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] WIP: P:bird::anycast_healthchecker: allow binding to multiple services [puppet] - 10https://gerrit.wikimedia.org/r/921391 (owner: 10Ssingh)
[16:31:20] <wikibugs>	 (03PS1) 10Ayounsi: Prometheus: fetch FastNetMon metrics [puppet] - 10https://gerrit.wikimedia.org/r/921394 (https://phabricator.wikimedia.org/T330884)
[16:36:45] <wikibugs>	 (CR) Btullis: Add an extra property 'CollectMode' to each user's jupyter service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/921382 (https://phabricator.wikimedia.org/T336951) (owner: Btullis)
[16:42:22] <wikibugs>	 (03CR) 10Dzahn: "sounds good to me! thank you" [puppet] - 10https://gerrit.wikimedia.org/r/918424 (https://phabricator.wikimedia.org/T336217) (owner: 10Jelto)
[16:42:53] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] miscweb/annualreport: bump image version to v1.0.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/921354 (https://phabricator.wikimedia.org/T336217) (owner: 10Jelto)
[16:45:06] <wikibugs>	 (03PS2) 10Ssingh: WIP: P:bird::anycast_healthchecker: allow binding to multiple services [puppet] - 10https://gerrit.wikimedia.org/r/921391
[16:46:09] <wikibugs>	 (03PS2) 10Ayounsi: Prometheus: fetch FastNetMon metrics [puppet] - 10https://gerrit.wikimedia.org/r/921394 (https://phabricator.wikimedia.org/T330884)
[16:51:17] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: hw troubleshooting:  CPU error for mw2448.codfw.wmnet - https://phabricator.wikimedia.org/T334429 (10Dzahn) @jijiki @Clement_Goubert  I got reminded of this today via this alert: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=mw2448&service=m...
[16:51:40] <icinga-wm>	 ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2448 is CRITICAL: Host mw2448 is not in mediawiki-installation dsh group daniel_zahn https://phabricator.wikimedia.org/T334429 https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[16:55:32] <mutante>	 !log mw2448 - scap pull - T334429
[16:55:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:56:52] <wikibugs>	 (CR) Ayounsi: [C: +1] Improve logic getting switch port when primary IP is on bridge device (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/921032 (https://phabricator.wikimedia.org/T296832) (owner: Cathal Mooney)
[17:03:02] <wikibugs>	 10SRE, 10ops-codfw, 10decommission-hardware: decommission bast2002.wikimedia.org - https://phabricator.wikimedia.org/T336995 (10Dzahn) This host is still in Icinga.. so not removed from puppet db or something...
[17:04:02] <wikibugs>	 10SRE, 10ops-codfw, 10decommission-hardware: decommission bast2002.wikimedia.org - https://phabricator.wikimedia.org/T336995 (10Dzahn) @MoritzMuehlenhoff https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=bast2002
[17:06:47] <icinga-wm>	 RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:07:42] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Replacing SSH key for Itamar Givon - https://phabricator.wikimedia.org/T337037 (10Dzahn) 05Open→03Stalled
[17:08:54] <wikibugs>	 (03PS4) 10Hnowlan: wip: upgrade container and dependencies for bullseye [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/920760 (https://phabricator.wikimedia.org/T336881)
[17:09:03] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wip: upgrade container and dependencies for bullseye [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/920760 (https://phabricator.wikimedia.org/T336881) (owner: 10Hnowlan)
[17:09:29] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review, 10User-ItamarWMDE: Replacing SSH key for Itamar Givon - https://phabricator.wikimedia.org/T337037 (10Dzahn) >>! In T337037#8864974, @ItamarWMDE wrote: > Seeing now that the `maint` and `stat` machines are still on buster. Don't mind stalling it until an upg...
[17:11:38] <wikibugs>	 (CR) Ayounsi: Validators: improve device name, add interface/outlet (3 comments) [software/netbox-extras] - https://gerrit.wikimedia.org/r/917876 (https://phabricator.wikimedia.org/T310590) (owner: Ayounsi)
[17:11:44] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Create mechanism to disable IPv6 RA generation on irb interfaces when required - https://phabricator.wikimedia.org/T337057 (10cmooney) p:05Triage→03Medium
[17:11:51] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:13:02] <wikibugs>	 (03PS9) 10Ayounsi: Validators: improve device name, add interface/outlet [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/917876 (https://phabricator.wikimedia.org/T310590)
[17:13:43] <wikibugs>	 (03PS1) 10Cathal Mooney: Add disable_ra var to homer config to enable manual disabling of IPv6 RAs [homer/public] - 10https://gerrit.wikimedia.org/r/921400 (https://phabricator.wikimedia.org/T337057)
[17:15:39] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Create mechanism to disable IPv6 RA generation on irb interfaces when required - https://phabricator.wikimedia.org/T337057 (10cmooney)
[17:15:45] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (10cmooney)
[17:30:16] <wikibugs>	 (03Abandoned) 10Ottomata: Revert "Add flink-app default log config and use it in page_content_change" [deployment-charts] - 10https://gerrit.wikimedia.org/r/920576 (owner: 10Gmodena)
[17:30:37] <wikibugs>	 (03PS3) 10Ssingh: WIP: P:bird::anycast_healthchecker: allow binding to multiple services [puppet] - 10https://gerrit.wikimedia.org/r/921391
[17:39:09] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] httpbb: add tests for CI, https://integration.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/921387 (https://phabricator.wikimedia.org/T326891) (owner: 10Dzahn)
[17:41:30] <wikibugs>	 (03PS4) 10Ssingh: WIP: P:bird::anycast_healthchecker: allow binding to multiple services [puppet] - 10https://gerrit.wikimedia.org/r/921391
[17:42:18] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10cmooney) @Jclark-ctr all went well with that today thank you for your help.  For the next phase we need to move the following links:  |No|Ra...
[17:44:02] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (10cmooney) The migration went fine today, very quick move and all came up as expected.  EVPN MAC-move BGP signalling worked flawlessly was nice to see in...
[17:45:33] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "[deploy1002:~] $ httpbb --hosts integration.wikimedia.org /srv/deployment/httpbb-tests/contint/test_integration.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/921387 (https://phabricator.wikimedia.org/T326891) (owner: 10Dzahn)
[17:46:33] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Create mechanism to disable IPv6 RA generation on irb interfaces when required - https://phabricator.wikimedia.org/T337057 (10cmooney) Another option, similar to the above patch, would maybe to make it a global toggle for a device.  So like...
[17:47:23] <wikibugs>	 (CR) Dzahn: [C: +2] httpbb: add tests for CI, https://integration.wikimedia.org (1 comment) [puppet] - https://gerrit.wikimedia.org/r/921387 (https://phabricator.wikimedia.org/T326891) (owner: Dzahn)
[17:52:14] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] httpbb: add tests for etherpad.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/921389 (https://phabricator.wikimedia.org/T326891) (owner: 10Dzahn)
[17:52:32] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10ssingh) >>! In T292095#8865511, @cmooney wrote: > @Jclark-ctr all went well with that today thank you for your help. >  > For the next phase...
[18:22:11] <wikibugs>	 (03PS1) 10Eevans: cassandra: add dummy secrets for services-dev (test env) [labs/private] - 10https://gerrit.wikimedia.org/r/921408 (https://phabricator.wikimedia.org/T313814)
[18:22:48] <wikibugs>	 (03CR) 10Eevans: [C: 03+2] cassandra: add dummy secrets for services-dev (test env) [labs/private] - 10https://gerrit.wikimedia.org/r/921408 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans)
[18:22:58] <wikibugs>	 (03CR) 10Eevans: [V: 03+2 C: 03+2] cassandra: add dummy secrets for services-dev (test env) [labs/private] - 10https://gerrit.wikimedia.org/r/921408 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans)
[18:28:07] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (10Volans) Ok that sounds like a plan, let's try first if the FQDN link works and if not we'll fallback to the IP. Based on the test we might add t...
[18:30:57] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Netbox network report failing - timeout errors - https://phabricator.wikimedia.org/T321704 (10Volans) I'll try to have a look next week, but for now I downtimed the alert so it doesn't spam too much until the end of the month. https://icinga.wikimedia.org/cgi-bi...
[18:33:14] <wikibugs>	 (CR) Volans: sre.{ganeti,hosts}.reimage: Confirm with hostname (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/899772 (https://phabricator.wikimedia.org/T332202) (owner: BCornwall)
[18:35:38] <wikibugs>	 (CR) Volans: sre.ganeti.makevm call reimage after VM creation (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/920203 (https://phabricator.wikimedia.org/T336491) (owner: Slyngshede)
[18:35:41] <wikibugs>	 (CR) BCornwall: sre.{ganeti,hosts}.reimage: Confirm with hostname (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/899772 (https://phabricator.wikimedia.org/T332202) (owner: BCornwall)
[18:41:41] <wikibugs>	 (03PS5) 10BCornwall: Add cookbook to handle restarts of Wikimedia DNS [cookbooks] - 10https://gerrit.wikimedia.org/r/915848 (https://phabricator.wikimedia.org/T335533)
[18:43:05] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[18:47:44] <wikibugs>	 (03PS1) 10BCornwall: Create cookbook to upgrade Apache Traffic Server [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531)
[18:49:59] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Create cookbook to upgrade Apache Traffic Server [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) (owner: 10BCornwall)
[18:50:01] <jinxer-wm>	 (NodeTextfileStale) firing: (2) Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[18:50:49] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "[deploy1002:~] $ httpbb --hosts etherpad.wikimedia.org /srv/deployment/httpbb-tests/etherpad/test_etherpad.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/921389 (https://phabricator.wikimedia.org/T326891) (owner: 10Dzahn)
[18:59:38] <wikibugs>	 (03PS14) 10Eevans: cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814)
[19:00:02] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans)
[19:00:45] <wikibugs>	 (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans)
[19:04:07] <jinxer-wm>	 (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:09:07] <jinxer-wm>	 (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:10:45] <wikibugs>	 (03PS15) 10Eevans: cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814)
[19:11:14] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans)
[19:11:39] <wikibugs>	 (03PS5) 10Ssingh: WIP: P:bird::anycast_healthchecker: allow binding to multiple services [puppet] - 10https://gerrit.wikimedia.org/r/921391
[19:11:57] <wikibugs>	 (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans)
[19:14:30] <wikibugs>	 (03Abandoned) 10Ssingh: WIP: P:bird::anycast_healthchecker: allow binding to multiple services [puppet] - 10https://gerrit.wikimedia.org/r/921391 (owner: 10Ssingh)
[19:17:07] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:24:07] <jinxer-wm>	 (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:27:07] <jinxer-wm>	 (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:27:23] <mutante>	 ^ keeping an eye on this
[19:28:03] <mutante>	 ACKed one of them and taking a look but probably we can just let it finish
[19:28:37] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1469 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[19:29:07] <jinxer-wm>	 (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:29:07] <jinxer-wm>	 (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:30:03] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1469 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 1.232 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[19:30:30] <mutante>	 that host 1469 was just busy running ffmpeg.. scaled a video
[19:34:07] <jinxer-wm>	 (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:35:13] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1469 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[19:36:24] <logmsgbot>	 !log htriedman@deploy1002 Started deploy [airflow-dags/platform_eng@b34c529]: (no justification provided)
[19:36:33] <logmsgbot>	 !log htriedman@deploy1002 Finished deploy [airflow-dags/platform_eng@b34c529]: (no justification provided) (duration: 00m 09s)
[19:36:41] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1469 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 3.280 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[19:36:57] <mutante>	 if it keeps doing that for much longer I am ready to depool mw1469 from videoscaler, but so far I think acceptable
[19:38:15] <mutante>	 the overall jobrunner health (linked from runbook) looks ok to me.. so no action taken 
[19:39:07] <jinxer-wm>	 (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:39:55] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1469 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[19:41:23] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1469 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 4.778 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[19:44:07] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:45:27] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: cluster=videoscaler,name=mw1469.eqiad.wmnet
[19:45:42] <mutante>	 !log depooled mw1469 from videoscaler, dedicating to just jobrunner
[19:45:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:45:58] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: cluster=jobrunner,name=mw1469.eqiad.wmnet
[19:46:11] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1469 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[19:46:50] <mutante>	 !log mw1469 - sudo pkill ffmpeg (per runbook)
[19:46:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:47:35] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1469 is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[19:48:09] <mutante>	 the jobs are more important than the video scaling.. so video scaling got killed.. should be retried later ..per docs
[19:48:20] <mutante>	 on this one host that was overloaded and doing both
[19:49:07] <jinxer-wm>	 (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:08:07] <jinxer-wm>	 (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:09:05] <TheresNoTime>	 really not happy is it (:
[20:13:07] <jinxer-wm>	 (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:15:07] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:20:07] <jinxer-wm>	 (ProbeDown) firing: (4) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:23:07] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1495 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:23:32] <RhinosF1>	 mutante: another sad server
[20:24:33] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1495 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 1.816 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:25:07] <jinxer-wm>	 (ProbeDown) resolved: (4) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:26:07] <jinxer-wm>	 (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:30:22] <jinxer-wm>	 (ProbeDown) firing: (3) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:31:07] <jinxer-wm>	 (ProbeDown) resolved: (3) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:32:51] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1495 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:33:47] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1495 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[20:34:15] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1495 is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.034 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[20:38:25] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1495 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 6.597 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[20:41:07] <jinxer-wm>	 (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:46:07] <jinxer-wm>	 (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:48:07] <jinxer-wm>	 (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:50:51] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1495 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[20:52:13] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1495 is OK: HTTP OK: HTTP/1.1 200 OK - 282 bytes in 0.136 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[20:52:37] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: cluster=jobrunner,name=mw1495.eqiad.wmnet
[20:53:07] <jinxer-wm>	 (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:19:23] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[21:21:19] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add entries for ssw link addresses in eqiad - cmooney@cumin1001"
[21:22:22] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add entries for ssw link addresses in eqiad - cmooney@cumin1001"
[21:22:22] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:29:11] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[21:43:21] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1495 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[21:46:21] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1495 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[21:46:27] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1495 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 7.241 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[21:47:47] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1495 is OK: HTTP OK: HTTP/1.1 200 OK - 280 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[21:51:34] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] httpbb: add tests for rt.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/921392 (https://phabricator.wikimedia.org/T326891) (owner: 10Dzahn)
[21:55:48] <wikibugs>	 (03PS1) 10Dzahn: httpbb: fix path to test file for RT [puppet] - 10https://gerrit.wikimedia.org/r/921428 (https://phabricator.wikimedia.org/T326891)
[21:55:51] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1495 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[21:56:07] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1495 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[21:58:40] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] httpbb: fix path to test file for RT [puppet] - 10https://gerrit.wikimedia.org/r/921428 (https://phabricator.wikimedia.org/T326891) (owner: 10Dzahn)
[22:00:33] <wikibugs>	 (03CR) 10Superpes15: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921372 (owner: 10Robertsky)
[22:02:07] <jinxer-wm>	 (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#jobrunner:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:03:43] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1495 is OK: HTTP OK: HTTP/1.1 200 OK - 282 bytes in 8.706 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[22:03:51] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1495 is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[22:04:54] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "[deploy1002:~] $ httpbb --hosts rt.wikimedia.org /srv/deployment/httpbb-tests/rt/test_rt.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/921428 (https://phabricator.wikimedia.org/T326891) (owner: 10Dzahn)
[22:07:07] <jinxer-wm>	 (ProbeDown) resolved: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#jobrunner:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:08:27] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1495 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[22:08:37] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:09:41] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1495 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[22:12:35] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1495 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 1.283 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[22:13:37] <jinxer-wm>	 (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:15:07] <jinxer-wm>	 (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:16:51] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1495 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[22:17:37] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1495 is OK: HTTP OK: HTTP/1.1 200 OK - 280 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[22:19:49] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1495 is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[22:20:07] <jinxer-wm>	 (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:21:07] <jinxer-wm>	 (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:22:27] <icinga-wm>	 PROBLEM - PHP7 jobrunner on mw1495 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[22:24:03] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1495 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[22:25:22] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:26:07] <jinxer-wm>	 (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:26:37] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:30:22] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:35:22] <jinxer-wm>	 (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:36:37] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:36:52] <jinxer-wm>	 (ProbeDown) resolved: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#jobrunner:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:40:57] <icinga-wm>	 RECOVERY - PHP7 jobrunner on mw1495 is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.027 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[22:41:01] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1495 is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[22:41:52] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:46:52] <jinxer-wm>	 (ProbeDown) resolved: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#jobrunner:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:48:07] <jinxer-wm>	 (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:50:01] <jinxer-wm>	 (NodeTextfileStale) firing: (2) Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[22:53:07] <jinxer-wm>	 (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:58:07] <jinxer-wm>	 (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:59:07] <jinxer-wm>	 (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:03:22] <jinxer-wm>	 (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:04:07] <jinxer-wm>	 (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:08:22] <jinxer-wm>	 (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:30:07] <jinxer-wm>	 (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:40:07] <jinxer-wm>	 (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown