[00:01:06] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:08:52] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:15:08] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:19:52] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:22:58] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:30:50] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:38:42] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:39:35] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/921126 [00:39:37] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/921126 (owner: 10TrainBranchBot) [00:46:32] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:52:48] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units 
failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:55:43] (03CR) 10CI reject: [V: 04-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/921126 (owner: 10TrainBranchBot) [01:00:34] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:05:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:08:22] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:10:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:16:12] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:24:02] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:30:20] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:38:12] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following 
units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:46:02] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:53:54] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:00:02] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:00:50] RECOVERY - Check systemd state on mwlog1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:06:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:07:40] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:11:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:15:26] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:23:16] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[02:26:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:31:00] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:38:44] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:46:30] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:48:55] (03PS1) 10Andrew Bogott: mwopenstackclients: use novaobserver creds to determine domain of a project [puppet] - 10https://gerrit.wikimedia.org/r/921112 [02:49:28] (03CR) 10Andrew Bogott: [C: 03+2] mwopenstackclients: use novaobserver creds to determine domain of a project [puppet] - 10https://gerrit.wikimedia.org/r/921112 (owner: 10Andrew Bogott) [02:50:01] (NodeTextfileStale) firing: (2) Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:52:42] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:00:30] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:08:20] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: 
rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:16:12] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:24:02] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:30:20] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:38:06] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:45:52] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:46:14] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs2022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:53:44] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:01:34] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:07:38] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:15:26] RECOVERY - Check systemd state on stat1009 is OK: 
OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:23:16] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:31:06] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:38:52] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:42:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:46:32] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:47:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:52:42] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:59:14] 10SRE, 10Wikidata, 10wdwb-tech, 10Shape Expressions (M2: Linking to EntitySchemas in statements), and 3 others: [ES-M2]: Define canonical URI for EntitySchemas - https://phabricator.wikimedia.org/T225778 (10Arian_Bozorg) 
05Open→03Resolved Looks good to me! Thanks so much :) [05:00:32] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:07:34] RECOVERY - Check systemd state on wdqs2022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:08:22] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:11:14] (SystemdUnitFailed) resolved: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs2022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:16:14] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:24:06] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:25:22] (03CR) 10Jdlrobson: Enable zebra ab test in hewiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920386 (https://phabricator.wikimedia.org/T335972) (owner: 10Kimberly Sarabia) [05:26:53] 10SRE-swift-storage, 10Commons, 10Tracking-Neverending: Thumbnail/imagescaler (tracking) - https://phabricator.wikimedia.org/T43371 (10Jdforrester-WMF) [05:30:20] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:36:24] (03PS1) 10Marostegui: db1121: Remove from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/921116 
(https://phabricator.wikimedia.org/T336725) [05:36:58] (03CR) 10Marostegui: [C: 03+2] db1121: Remove from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/921116 (https://phabricator.wikimedia.org/T336725) (owner: 10Marostegui) [05:37:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1121 from dbctl T336725', diff saved to https://phabricator.wikimedia.org/P48367 and previous config saved to /var/cache/conftool/dbconfig/20230519-053719-marostegui.json [05:37:24] T336725: decommission db1121.eqiad.wmnet - https://phabricator.wikimedia.org/T336725 [05:38:10] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:43:15] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T336992 (10phaultfinder) [05:44:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es2032 to es1 master', diff saved to https://phabricator.wikimedia.org/P48368 and previous config saved to /var/cache/conftool/dbconfig/20230519-054403-marostegui.json [05:45:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es2030', diff saved to https://phabricator.wikimedia.org/P48369 and previous config saved to /var/cache/conftool/dbconfig/20230519-054503-root.json [05:45:39] (03PS1) 10Marostegui: es2030: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/921117 [05:46:02] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:46:10] (03CR) 10Marostegui: [C: 03+2] es2030: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/921117 (owner: 10Marostegui) [05:47:03] (ProbeDown) firing: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:47:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es2033 to es2 master', diff saved to https://phabricator.wikimedia.org/P48370 and previous config saved to /var/cache/conftool/dbconfig/20230519-054737-marostegui.json [05:47:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es2031', diff saved to https://phabricator.wikimedia.org/P48371 and previous config saved to /var/cache/conftool/dbconfig/20230519-054758-root.json [05:48:15] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T336992 (10phaultfinder) [05:48:34] (03PS1) 10Marostegui: es2031: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/921118 [05:48:59] (03CR) 10Marostegui: [C: 03+2] es2031: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/921118 (owner: 10Marostegui) [05:49:12] (03PS1) 10Ayounsi: admin/data.yaml: ayounsi: add ssh-ed25519 key [puppet] - 10https://gerrit.wikimedia.org/r/921119 [05:49:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es2034 to es3 master', diff saved to https://phabricator.wikimedia.org/P48372 and previous config saved to /var/cache/conftool/dbconfig/20230519-054923-marostegui.json [05:49:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es2027', diff saved to https://phabricator.wikimedia.org/P48373 and previous config saved to /var/cache/conftool/dbconfig/20230519-054952-root.json [05:51:00] (03PS1) 10Marostegui: es2027: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/921120 [05:51:27] (03CR) 10Marostegui: [C: 03+2] es2027: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/921120 (owner: 10Marostegui) [05:52:03] (ProbeDown) resolved: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:53:06] (03PS1) 10Ayounsi: ayounsi: update ssh key to ed25519 [homer/public] - 10https://gerrit.wikimedia.org/r/921121 (https://phabricator.wikimedia.org/T336769) [05:53:16] (03PS1) 10Marostegui: Revert "es2030: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/920745 [05:53:43] (03CR) 10Marostegui: [C: 03+2] Revert "es2030: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/920745 (owner: 10Marostegui) [05:53:52] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:54:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2030 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48374 and previous config saved to /var/cache/conftool/dbconfig/20230519-055426-root.json [05:54:46] (03PS1) 10Marostegui: Revert "es2031: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/921146 [05:55:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2031 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48375 and previous config saved to /var/cache/conftool/dbconfig/20230519-055511-root.json [05:55:14] (03CR) 10Marostegui: [C: 03+2] Revert "es2031: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/921146 (owner: 10Marostegui) [05:56:20] (03PS1) 10Marostegui: Revert "es2027: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/921147 [05:57:12] (03CR) 10Marostegui: [C: 03+2] Revert "es2027: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/921147 (owner: 10Marostegui) [05:57:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2027 (re)pooling @ 1%: Repooling after maintenance', diff saved 
to https://phabricator.wikimedia.org/P48376 and previous config saved to /var/cache/conftool/dbconfig/20230519-055723-root.json [06:00:07] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230519T0600) [06:00:52] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:03:33] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10Data-Persistence: Q4:rack/setup/install dbproxy10[22-27]. - https://phabricator.wikimedia.org/T326346 (10Marostegui) Any ETA on when these will be installed? Thanks! [06:07:08] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:09:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2030 (re)pooling @ 2%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48377 and previous config saved to /var/cache/conftool/dbconfig/20230519-060931-root.json [06:10:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2031 (re)pooling @ 2%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48378 and previous config saved to /var/cache/conftool/dbconfig/20230519-061016-root.json [06:12:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2027 (re)pooling @ 2%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48379 and previous config saved to /var/cache/conftool/dbconfig/20230519-061228-root.json [06:15:30] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:20:46] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:21:10] PROBLEM - mailman 
archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:22:18] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:22:30] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49995 bytes in 6.368 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:23:28] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.335 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:24:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2030 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48380 and previous config saved to /var/cache/conftool/dbconfig/20230519-062435-root.json [06:25:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2031 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48381 and previous config saved to /var/cache/conftool/dbconfig/20230519-062520-root.json [06:27:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2027 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48382 and previous config saved to /var/cache/conftool/dbconfig/20230519-062733-root.json [06:30:27] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:36:47] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:39:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2030 (re)pooling @ 10%: Repooling after maintenance', diff saved to 
https://phabricator.wikimedia.org/P48383 and previous config saved to /var/cache/conftool/dbconfig/20230519-063940-root.json [06:40:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2031 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48384 and previous config saved to /var/cache/conftool/dbconfig/20230519-064025-root.json [06:41:43] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast6002.wikimedia.org [06:42:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2027 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48385 and previous config saved to /var/cache/conftool/dbconfig/20230519-064237-root.json [06:45:01] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:47:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast6002.wikimedia.org [06:50:01] (NodeTextfileStale) firing: (2) Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:51:46] (03PS1) 10Muehlenhoff: Simplify bastion config in netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/921125 [06:53:05] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:54:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2030 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48386 and previous config saved to /var/cache/conftool/dbconfig/20230519-065445-root.json [06:55:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2031 (re)pooling @ 
25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48387 and previous config saved to /var/cache/conftool/dbconfig/20230519-065530-root.json [06:57:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2027 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48388 and previous config saved to /var/cache/conftool/dbconfig/20230519-065742-root.json [06:59:45] !log jmm@cumin2002 START - Cookbook sre.ganeti.reimage for host netflow2003.codfw.wmnet with OS bookworm [07:00:06] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230519T0700) [07:00:33] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:08:19] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:09:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2030 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48389 and previous config saved to /var/cache/conftool/dbconfig/20230519-070949-root.json [07:10:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2031 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48390 and previous config saved to /var/cache/conftool/dbconfig/20230519-071034-root.json [07:11:50] !log installing emacs security updates [07:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2027 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48391 and previous config saved to 
/var/cache/conftool/dbconfig/20230519-071247-root.json [07:16:01] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:21:24] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: prometheus4001.ulsfo.wmnet [07:21:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: prometheus4001.ulsfo.wmnet [07:21:31] 10SRE, 10ops-ulsfo, 10DC-Ops, 10decommission-hardware, 10SRE Observability (FY2022/2023-Q4): Decommission prometheus4001 - https://phabricator.wikimedia.org/T335585 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: prometheus4001.ulsfo.wmnet [07:22:47] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 2 (gerrit1001, ...), Fresh: 122 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [07:23:41] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:24:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2030 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48392 and previous config saved to /var/cache/conftool/dbconfig/20230519-072454-root.json [07:25:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2031 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48393 and previous config saved to /var/cache/conftool/dbconfig/20230519-072539-root.json [07:27:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2027 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48394 and previous config saved to /var/cache/conftool/dbconfig/20230519-072751-root.json [07:31:23] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The 
system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:31:24] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on netflow2003.codfw.wmnet with reason: host reimage [07:34:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow2003.codfw.wmnet with reason: host reimage [07:37:41] 10SRE, 10DBA: db1132 index for table pagetriage_page is corrupt - https://phabricator.wikimedia.org/T335632 (10Marostegui) 05Open→03Resolved I have repooled db1132 - I will investigate db1106 with mariadb (this host is non production) [07:37:46] (03CR) 10Jaime Nuche: [C: 04-1] doc: add password-protected rsync module for publishing from gitlab (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/920669 (https://phabricator.wikimedia.org/T336168) (owner: 10Jaime Nuche) [07:39:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2030 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48395 and previous config saved to /var/cache/conftool/dbconfig/20230519-073959-root.json [07:40:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2031 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48396 and previous config saved to /var/cache/conftool/dbconfig/20230519-074044-root.json [07:41:35] (03CR) 10Jaime Nuche: [C: 04-1] doc: add password-protected rsync module for publishing from gitlab (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/920669 (https://phabricator.wikimedia.org/T336168) (owner: 10Jaime Nuche) [07:42:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2027 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P48397 and previous config saved to /var/cache/conftool/dbconfig/20230519-074256-root.json [07:49:59] (03PS2) 10Ilias Sarantopoulos: ml-services: deploy Bloom-560m model on Lift Wing [deployment-charts] - 
10https://gerrit.wikimedia.org/r/919345 (https://phabricator.wikimedia.org/T333861) [07:52:19] !log klausman@cumin2002 START - Cookbook sre.hosts.reboot-single for host ml-cache2001.codfw.wmnet [07:53:41] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T336992 (10fgiunchedi) All of these seem to be for C5 only, maybe some mgmt network problem there? [07:58:55] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-cache2001.codfw.wmnet [08:00:21] (03CR) 10Elukey: ml-services: deploy Bloom-560m model on Lift Wing (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/919345 (https://phabricator.wikimedia.org/T333861) (owner: 10Ilias Sarantopoulos) [08:03:57] !log klausman@cumin2002 START - Cookbook sre.hosts.reboot-single for host ml-cache2002.codfw.wmnet [08:07:04] (03CR) 10Jaime Nuche: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/921126 (owner: 10TrainBranchBot) [08:08:03] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:09:33] !log copy samplicator from bullseye-wikimedia to bookworm-wikimedia T330884 [08:09:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:37] T330884: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 [08:10:32] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-cache2002.codfw.wmnet [08:11:15] !log klausman@cumin2002 START - Cookbook sre.hosts.reboot-single for host ml-cache2003.codfw.wmnet [08:13:50] (03PS3) 10Majavah: wmnet: Remove nfs-tools-project.svc.eqiad [dns] - 10https://gerrit.wikimedia.org/r/907136 (https://phabricator.wikimedia.org/T333477) [08:14:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host 
netflow2003.codfw.wmnet with OS bookworm [08:14:59] (03CR) 10Majavah: [C: 03+1] perl532: Add libphp-serialization-perl [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/919922 (https://phabricator.wikimedia.org/T323522) (owner: 10BryanDavis) [08:15:05] (03CR) 10Majavah: [C: 03+1] perl532: Add libmime-lite-perl [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/919920 (https://phabricator.wikimedia.org/T320904) (owner: 10BryanDavis) [08:15:47] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:18:06] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-cache2003.codfw.wmnet [08:18:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [08:20:05] (03PS3) 10Ilias Sarantopoulos: ml-services: deploy Bloom-560m model on Lift Wing [deployment-charts] - 10https://gerrit.wikimedia.org/r/919345 (https://phabricator.wikimedia.org/T333861) [08:22:03] (03CR) 10Ilias Sarantopoulos: ml-services: deploy Bloom-560m model on Lift Wing (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/919345 (https://phabricator.wikimedia.org/T333861) (owner: 10Ilias Sarantopoulos) [08:23:29] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:23:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - 
https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [08:24:19] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/921126 (owner: 10TrainBranchBot) [08:25:15] (03CR) 10David Caro: "LGTM, I'm not sure if 'shared-storage' is the best naming, as I would expect that to be just storage to share stuff with other tools/users" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/920259 (https://phabricator.wikimedia.org/T334081) (owner: 10Majavah) [08:27:58] !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd2003.codfw.wmnet [08:28:25] (03CR) 10Majavah: Add an option to disable NFS access (032 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/920259 (https://phabricator.wikimedia.org/T334081) (owner: 10Majavah) [08:31:09] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:31:51] !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd2003.codfw.wmnet [08:32:06] (03PS1) 10Muehlenhoff: Drop Boost packages from legacy package removal list for Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/921174 (https://phabricator.wikimedia.org/T330495) [08:32:25] (03PS2) 10Muehlenhoff: Drop Boost packages from legacy package removal list for Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/921174 (https://phabricator.wikimedia.org/T330495) [08:32:53] (03CR) 10Elukey: [C: 03+2] ml-services: deploy Bloom-560m model on Lift Wing (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/919345 (https://phabricator.wikimedia.org/T333861) (owner: 10Ilias Sarantopoulos) [08:33:08] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops: Add configuration 
file support to mw-on-k8s.lua ATS script - https://phabricator.wikimedia.org/T336037 (10Joe) 05In progress→03Resolved [08:33:18] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Joe) [08:33:30] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/921128 [08:33:36] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/921128 (owner: 10TrainBranchBot) [08:34:23] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops: Add traffic sampling support to mw-on-k8s.lua ATS script - https://phabricator.wikimedia.org/T336038 (10Joe) 05In progress→03Resolved [08:34:32] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Joe) [08:34:39] !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd2002.codfw.wmnet [08:38:10] (03CR) 10Muehlenhoff: [C: 03+2] Drop Boost packages from legacy package removal list for Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/921174 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff) [08:38:21] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[08:38:30] !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd2002.codfw.wmnet [08:38:47] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:39:43] (03CR) 10BryanDavis: [C: 03+2] perl532: Add libmime-lite-perl [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/919920 (https://phabricator.wikimedia.org/T320904) (owner: 10BryanDavis) [08:39:48] (03CR) 10BryanDavis: [C: 03+2] perl532: Add libphp-serialization-perl [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/919922 (https://phabricator.wikimedia.org/T323522) (owner: 10BryanDavis) [08:40:26] (03Merged) 10jenkins-bot: perl532: Add libmime-lite-perl [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/919920 (https://phabricator.wikimedia.org/T320904) (owner: 10BryanDavis) [08:40:30] (03Merged) 10jenkins-bot: perl532: Add libphp-serialization-perl [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/919922 (https://phabricator.wikimedia.org/T323522) (owner: 10BryanDavis) [08:41:51] !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd2001.codfw.wmnet [08:45:45] !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd2001.codfw.wmnet [08:46:27] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:52:28] !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl2001.codfw.wmnet [08:52:35] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:52:41] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] 
(wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/921128 (owner: 10TrainBranchBot) [08:53:21] 10SRE-tools, 10Infrastructure-Foundations: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (10MoritzMuehlenhoff) @ayounsi There's now netflow2003 running Bookworm with FNM 1.2.4. If that works fine, we can reimage the other netflow* VMs in-place once Bookworm is stable. I copied over s... [08:55:11] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:58:16] !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl2001.codfw.wmnet [08:59:24] !log mvernon@cumin2002 START - Cookbook sre.hosts.decommission for hosts ms-be[2040-2043].codfw.wmnet [09:00:17] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:02:09] !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl2002.codfw.wmnet [09:02:23] (03PS1) 10Muehlenhoff: Switch kadmin server back to krb1001 [puppet] - 10https://gerrit.wikimedia.org/r/921242 (https://phabricator.wikimedia.org/T331695) [09:04:27] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:07:57] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:08:51] !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl2002.codfw.wmnet [09:15:00] !log mvernon@cumin2002 START - Cookbook sre.dns.netbox [09:15:39] RECOVERY - Check 
systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:18:29] !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ms-be[2040-2043].codfw.wmnet decommissioned, removing all IPs except the asset tag one - mvernon@cumin2002" [09:20:12] (03CR) 10David Caro: Add an option to disable NFS access (032 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/920259 (https://phabricator.wikimedia.org/T334081) (owner: 10Majavah) [09:21:16] !log klausman@cumin2002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-serve-worker-codfw [09:21:35] !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ms-be[2040-2043].codfw.wmnet decommissioned, removing all IPs except the asset tag one - mvernon@cumin2002" [09:21:35] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:21:36] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ms-be[2040-2043].codfw.wmnet [09:21:39] 10SRE-swift-storage: Drain and then decommission ms-be20[40-43] - https://phabricator.wikimedia.org/T335280 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by mvernon@cumin2002 for hosts: `ms-be[2040-2043].codfw.wmnet` - ms-be2040.codfw.wmnet (**WARN**) - Downtimed host on Icinga/Alertmanager... 
[09:21:55] (03PS7) 10EoghanGaffney: doc: add password-protected rsync module for publishing from gitlab [puppet] - 10https://gerrit.wikimedia.org/r/920669 (https://phabricator.wikimedia.org/T336168) (owner: 10Jaime Nuche) [09:23:01] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:23:25] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:23:49] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 124 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [09:23:51] 10SRE, 10Infrastructure-Foundations, 10netops: Netbox network report failing - timeout error getting connected_endpoint prefix - https://phabricator.wikimedia.org/T321704 (10cmooney) This appears to be happening more often now, and is starting to cause considerable noise in the dc-ops irc channel. @volans... 
[09:26:57] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:27:30] 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission ms-be204[0-3].codfw.wmnet - https://phabricator.wikimedia.org/T337011 (10MatthewVernon) [09:28:45] 10SRE-swift-storage: Q4 ms backend refresh work (KR) - https://phabricator.wikimedia.org/T335270 (10MatthewVernon) [09:28:48] 10SRE-swift-storage: Drain and then decommission ms-be20[40-43] - https://phabricator.wikimedia.org/T335280 (10MatthewVernon) 05Open→03Resolved Hosts off and decom cookbook run; the DC-ops ticket to actually dispose of the hardware is T337011 [09:31:40] !log jmm@cumin2002 START - Cookbook sre.ganeti.reimage for host testvm2002.codfw.wmnet with OS bullseye [09:33:08] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:33:33] 10SRE, 10Inuka-Team, 10Wikipedia-Preview, 10User-bd808: Add both Wikipedia Preview repos to Packagist - https://phabricator.wikimedia.org/T310938 (10bd808) 05Open→03Resolved a:03bd808 * https://packagist.org/packages/wikimedia/wikipedia-preview * https://packagist.org/packages/wikimedia/wikipediaprev... [09:33:46] (03PS1) 10EoghanGaffney: Changes from hard-coded list of hosts in doc module [puppet] - 10https://gerrit.wikimedia.org/r/921244 [09:33:57] 10SRE, 10Infrastructure-Foundations, 10netops: Netbox network report failing - timeout errors - https://phabricator.wikimedia.org/T321704 (10cmooney) [09:37:22] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:39:05] (03CR) 10Btullis: [C: 03+1] "Looks good. Thanks elukey." 
[puppet] - 10https://gerrit.wikimedia.org/r/919802 (owner: 10Majavah) [09:39:14] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:41:01] (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41240/console" [puppet] - 10https://gerrit.wikimedia.org/r/921244 (owner: 10EoghanGaffney) [09:42:04] PROBLEM - Check systemd state on ml-serve2001 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:45:15] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on testvm2002.codfw.wmnet with reason: host reimage [09:45:45] (03PS1) 10Muehlenhoff: Remove bast2002 from bastion hiera entries [puppet] - 10https://gerrit.wikimedia.org/r/921247 (https://phabricator.wikimedia.org/T334287) [09:45:48] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:45:50] RECOVERY - Check systemd state on ml-serve2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:48:24] (03PS2) 10Filippo Giunchedi: prometheus: deprecate traffic 'global' rules [puppet] - 10https://gerrit.wikimedia.org/r/861826 (https://phabricator.wikimedia.org/T288196) [09:48:26] (03PS1) 10Filippo Giunchedi: prometheus: remove global rules [puppet] - 10https://gerrit.wikimedia.org/r/921248 (https://phabricator.wikimedia.org/T288196) [09:48:32] (03CR) 10Muehlenhoff: [C: 03+2] Remove bast2002 from bastion hiera entries [puppet] - 10https://gerrit.wikimedia.org/r/921247 (https://phabricator.wikimedia.org/T334287) (owner: 10Muehlenhoff) [09:48:39] (03PS2) 10Muehlenhoff: Remove bast2002 from bastion hiera entries [puppet] - 10https://gerrit.wikimedia.org/r/921247 
(https://phabricator.wikimedia.org/T334287) [09:48:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testvm2002.codfw.wmnet with reason: host reimage [09:48:59] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [09:49:09] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [09:49:38] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [09:54:08] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:55:57] (03PS1) 10Muehlenhoff: Update host for blackbox smoke tests [puppet] - 10https://gerrit.wikimedia.org/r/921250 (https://phabricator.wikimedia.org/T336995) [09:56:56] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:00:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host testvm2002.codfw.wmnet with OS bullseye [10:02:06] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:03:12] (03CR) 10Filippo Giunchedi: [C: 03+1] Update host for blackbox smoke tests [puppet] - 10https://gerrit.wikimedia.org/r/921250 (https://phabricator.wikimedia.org/T336995) (owner: 10Muehlenhoff) [10:06:43] (03CR) 10Muehlenhoff: [C: 03+2] Update host for blackbox smoke tests [puppet] - 10https://gerrit.wikimedia.org/r/921250 (https://phabricator.wikimedia.org/T336995) (owner: 10Muehlenhoff) [10:07:32] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[10:07:40] !log installing ncurses security updates [10:07:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:00] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:10:36] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [10:12:00] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [10:13:30] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:14:58] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:16:22] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:19:46] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) 
timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [10:22:14] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:22:38] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:22:48] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [10:25:58] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:27:32] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:28:14] (03PS1) 10Legoktm: Disable GWToolset from Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921252 (https://phabricator.wikimedia.org/T270911) [10:30:28] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:31:25] (03PS1) 10Legoktm: Remove GWToolset configuration (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921253 (https://phabricator.wikimedia.org/T270911) [10:31:27] (03PS1) 10Legoktm: Remove GWToolset configuration (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921254 (https://phabricator.wikimedia.org/T270911) [10:33:58] (03PS1) 10Hnowlan: imagemagick: update test cases for fixes within libraries [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/921255 (https://phabricator.wikimedia.org/T334863) [10:35:49] !log jmm@cumin2002 START - Cookbook sre.ganeti.reimage for 
host testvm2004.codfw.wmnet with OS bullseye [10:37:34] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts bast2002 [10:38:11] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab-runner1003.eqiad.wmnet [10:38:20] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:38:27] (03CR) 10Zabe: [C: 03+1] "trust me, I am a pro in disabling extensions ;)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921252 (https://phabricator.wikimedia.org/T270911) (owner: 10Legoktm) [10:41:02] (03CR) 10Jaime Nuche: [C: 03+1] Changes from hard-coded list of hosts in doc module [puppet] - 10https://gerrit.wikimedia.org/r/921244 (owner: 10EoghanGaffney) [10:41:10] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:42:29] 10SRE-swift-storage, 10Discovery-Search: Ensure swiftly access for non-SREs - https://phabricator.wikimedia.org/T335144 (10MatthewVernon) Do you need anything from Data Persistence apropos this? I think not, but wanted to check in that you're not waiting for something from me :) [10:44:41] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [10:45:05] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab-runner1003.eqiad.wmnet [10:46:08] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:47:16] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) 
timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [10:48:44] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [10:50:01] (NodeTextfileStale) firing: (2) Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:50:10] 10SRE, 10ops-eqiad, 10serviceops-collab, 10GitLab (Infrastructure): gitlab-runner1003 is not coming back online - https://phabricator.wikimedia.org/T336737 (10Jelto) 05Open→03Resolved @Jclark-ctr thanks a lot for the quick response! Error is gone! I'm closing this task. [10:50:16] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on testvm2004.codfw.wmnet with reason: host reimage [10:51:39] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast2002 decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [10:53:21] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:53:37] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:53:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testvm2004.codfw.wmnet with reason: host reimage [10:54:51] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:55:29] !log 
jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: bast2002 decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [10:55:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:55:30] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts bast2002 [10:56:53] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:58:14] (03PS1) 10Muehlenhoff: Remove bast2002 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/921260 [10:59:00] 10ops-codfw, 10decommission-hardware: decommission bast2002.wikimedia.org - https://phabricator.wikimedia.org/T336995 (10MoritzMuehlenhoff) [10:59:20] (03CR) 10Muehlenhoff: [C: 03+2] Remove bast2002 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/921260 (owner: 10Muehlenhoff) [11:00:49] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:03:21] (03CR) 10Btullis: [C: 04-1] "There is still a problem with this change, in that I added the wrong file to conda-analytics. 
See https://phabricator.wikimedia.org/T33276" [puppet] - 10https://gerrit.wikimedia.org/r/901604 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [11:03:29] (03Abandoned) 10Cathal Mooney: Puppet additions for ssw1-e1-eqiad and ssw1-f1-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/906627 (https://phabricator.wikimedia.org/T322937) (owner: 10Cathal Mooney) [11:04:30] (03PS1) 10Cathal Mooney: Puppet additions to bring ssw1-f1-eqiad under management [puppet] - 10https://gerrit.wikimedia.org/r/921261 (https://phabricator.wikimedia.org/T322937) [11:06:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host testvm2004.codfw.wmnet with OS bullseye [11:06:11] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:07:37] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:07:57] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:10:25] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:13:51] (03CR) 10Brian Wolff: [C: 03+1] Disable GWToolset from Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921252 (https://phabricator.wikimedia.org/T270911) (owner: 10Legoktm) [11:15:11] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:22:09] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - 
kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:23:55] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:27:29] !log klausman@cumin2002 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:ml-serve-worker-codfw [11:30:15] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:35:21] (03PS1) 10EoghanGaffney: Allow gitlab-runner hosts to talk rsync to doc1003 [puppet] - 10https://gerrit.wikimedia.org/r/921265 (https://phabricator.wikimedia.org/T336168) [11:37:59] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:58] (03PS8) 10Gmodena: mw-page-content-change-enrich: enable HA [deployment-charts] - 10https://gerrit.wikimedia.org/r/920268 (https://phabricator.wikimedia.org/T336656) [11:44:34] (03PS9) 10Gmodena: mw-page-content-change-enrich: enable HA [deployment-charts] - 10https://gerrit.wikimedia.org/r/920268 (https://phabricator.wikimedia.org/T336656) [11:45:37] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:46:27] (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41242/console" [puppet] - 10https://gerrit.wikimedia.org/r/921265 (https://phabricator.wikimedia.org/T336168) (owner: 10EoghanGaffney) [11:46:55] (03CR) 10Jaime Nuche: [C: 03+1] Allow gitlab-runner hosts to talk rsync to doc1003 [puppet] - 10https://gerrit.wikimedia.org/r/921265 
(https://phabricator.wikimedia.org/T336168) (owner: 10EoghanGaffney) [11:47:26] (03CR) 10EoghanGaffney: [V: 03+1 C: 03+2] Allow gitlab-runner hosts to talk rsync to doc1003 [puppet] - 10https://gerrit.wikimedia.org/r/921265 (https://phabricator.wikimedia.org/T336168) (owner: 10EoghanGaffney) [11:51:13] (03PS1) 10Majavah: Enable RealMe [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921326 (https://phabricator.wikimedia.org/T324535) [11:51:56] (03CR) 10Bartosz Dziewoński: [C: 03+1] mwscript: Avoid prepending maintenance/ if >= 2 dots in argument [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920788 (https://phabricator.wikimedia.org/T336819) (owner: 10Ladsgroup) [11:53:13] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:53:17] 10SRE, 10Inuka-Team, 10Wikipedia-Preview, 10User-bd808: Add both Wikipedia Preview repos to Packagist - https://phabricator.wikimedia.org/T310938 (10Varnent) >>! In T310938#8863903, @bd808 wrote: > * https://packagist.org/packages/wikimedia/wikipedia-preview > * https://packagist.org/packages/wikimedia/wik... [12:00:47] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:08:31] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:11:23] (03CR) 10Legoktm: [C: 04-1] "Let's keep this disabled on private wikis and locked down ones." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/921326 (https://phabricator.wikimedia.org/T324535) (owner: 10Majavah) [12:12:16] (03PS2) 10Majavah: Enable RealMe [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921326 (https://phabricator.wikimedia.org/T324535) [12:12:42] (03CR) 10Ayounsi: [C: 03+1] Puppet additions to bring ssw1-f1-eqiad under management [puppet] - 10https://gerrit.wikimedia.org/r/921261 (https://phabricator.wikimedia.org/T322937) (owner: 10Cathal Mooney) [12:13:06] (03PS3) 10Majavah: Enable RealMe [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921326 (https://phabricator.wikimedia.org/T324535) [12:14:15] (03CR) 10Majavah: Enable RealMe (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921326 (https://phabricator.wikimedia.org/T324535) (owner: 10Majavah) [12:14:35] (03CR) 10Ayounsi: [C: 03+2] ayounsi: update ssh key to ed25519 [homer/public] - 10https://gerrit.wikimedia.org/r/921121 (https://phabricator.wikimedia.org/T336769) (owner: 10Ayounsi) [12:15:02] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Joe) [12:15:12] (03CR) 10Ayounsi: [C: 03+2] admin/data.yaml: ayounsi: add ssh-ed25519 key [puppet] - 10https://gerrit.wikimedia.org/r/921119 (owner: 10Ayounsi) [12:15:32] (03PS4) 10Majavah: Enable RealMe [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921326 (https://phabricator.wikimedia.org/T324535) [12:15:34] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/921255 (https://phabricator.wikimedia.org/T334863) (owner: 10Hnowlan) [12:16:09] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:17:20] (03CR) 10Muehlenhoff: [C: 03+2] Simplify bastion config in netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/921125 (owner: 10Muehlenhoff) 
[12:18:40] (03Merged) 10jenkins-bot: ayounsi: update ssh key to ed25519 [homer/public] - 10https://gerrit.wikimedia.org/r/921121 (https://phabricator.wikimedia.org/T336769) (owner: 10Ayounsi) [12:18:42] (03CR) 10Muehlenhoff: [C: 03+1] "Sounds good" [puppet] - 10https://gerrit.wikimedia.org/r/920991 (https://phabricator.wikimedia.org/T108027) (owner: 10Filippo Giunchedi) [12:19:58] !log elukey@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-serve-worker-eqiad [12:22:19] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (10cmooney) @volans I had a word with @ayounsi on this and we both feel if we can make it work via HTTP to the apt server that's probably best. I... [12:23:57] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:27:49] (03CR) 10Ayounsi: [C: 03+2] users: Update brett's key to ed25519 [homer/public] - 10https://gerrit.wikimedia.org/r/921073 (https://phabricator.wikimedia.org/T336769) (owner: 10BCornwall) [12:28:33] (03PS1) 10Legoktm: i18n: Add link to help page [extensions/RealMe] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/921150 (https://phabricator.wikimedia.org/T322717) [12:29:01] (03CR) 10Hnowlan: [C: 03+2] imagemagick: update test cases for fixes within libraries [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/921255 (https://phabricator.wikimedia.org/T334863) (owner: 10Hnowlan) [12:30:11] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:30:36] (03Merged) 10jenkins-bot: users: Update brett's key to ed25519 [homer/public] - 10https://gerrit.wikimedia.org/r/921073 (https://phabricator.wikimedia.org/T336769) (owner: 10BCornwall) 
[12:34:02] (03Merged) 10jenkins-bot: imagemagick: update test cases for fixes within libraries [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/921255 (https://phabricator.wikimedia.org/T334863) (owner: 10Hnowlan) [12:37:59] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:39:31] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Update network SSH keys to ssh-ed25519 - https://phabricator.wikimedia.org/T336769 (10ayounsi) [12:40:20] (03PS10) 10Gmodena: mw-page-content-change-enrich: enable HA [deployment-charts] - 10https://gerrit.wikimedia.org/r/920268 (https://phabricator.wikimedia.org/T336656) [12:45:47] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:53:29] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:56:56] (03CR) 10Hashar: [C: 03+1] "Formal approval from Releng. 
That is being done from the Hackathon so if something causes any trouble everyone is around :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921252 (https://phabricator.wikimedia.org/T270911) (owner: 10Legoktm) [12:57:11] (03CR) 10Hashar: [C: 03+1] Remove GWToolset configuration (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921253 (https://phabricator.wikimedia.org/T270911) (owner: 10Legoktm) [12:58:40] (03CR) 10Hashar: [C: 03+1] "You might want to rebuild the localization cache, then I don't think having any unused/unreferenced message in the cache is causing any is" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921254 (https://phabricator.wikimedia.org/T270911) (owner: 10Legoktm) [13:01:09] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:01:55] (03CR) 10Hashar: [C: 03+1] "Given it is done at the Hackathon there is all the expertise required to deploy it even if it is Friday today." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/921326 (https://phabricator.wikimedia.org/T324535) (owner: 10Majavah) [13:05:31] I am rolling all wikis to 1.41.0-wmf.9 [13:08:02] (03PS1) 10TrainBranchBot: group2 wikis to 1.41.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921342 (https://phabricator.wikimedia.org/T330215) [13:08:04] (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.41.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921342 (https://phabricator.wikimedia.org/T330215) (owner: 10TrainBranchBot) [13:08:51] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:10:12] (03Merged) 10jenkins-bot: group2 wikis to 1.41.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921342 (https://phabricator.wikimedia.org/T330215) (owner: 10TrainBranchBot) [13:12:32] (03PS3) 10Filippo Giunchedi: prometheus: deprecate traffic 'global' rules [puppet] - 10https://gerrit.wikimedia.org/r/861826 (https://phabricator.wikimedia.org/T288196) [13:12:34] (03PS2) 10Filippo Giunchedi: prometheus: remove global rules [puppet] - 10https://gerrit.wikimedia.org/r/921248 (https://phabricator.wikimedia.org/T288196) [13:12:36] (03PS1) 10Filippo Giunchedi: prometheus: soft-disable 'global' instance [puppet] - 10https://gerrit.wikimedia.org/r/921347 (https://phabricator.wikimedia.org/T288196) [13:12:38] (03PS1) 10Filippo Giunchedi: prometheus: add 'ensure' for ::server [puppet] - 10https://gerrit.wikimedia.org/r/921348 [13:12:40] (03PS1) 10Filippo Giunchedi: prometheus: de-provision 'global' instance [puppet] - 10https://gerrit.wikimedia.org/r/921349 (https://phabricator.wikimedia.org/T288196) [13:12:42] (03PS1) 10Filippo Giunchedi: prometheus: remove 'global' instance references [puppet] - 10https://gerrit.wikimedia.org/r/921350 (https://phabricator.wikimedia.org/T288196) [13:15:03] RECOVERY - Check 
systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:15:10] (03CR) 10CI reject: [V: 04-1] prometheus: de-provision 'global' instance [puppet] - 10https://gerrit.wikimedia.org/r/921349 (https://phabricator.wikimedia.org/T288196) (owner: 10Filippo Giunchedi) [13:17:30] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.41.0-wmf.9 refs T330215 [13:17:34] T330215: 1.41.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T330215 [13:21:00] (03CR) 10Muehlenhoff: "I like the approach, but given this all touches fairly critical functionality I think we should rather break it down and merge incremental" [puppet] - 10https://gerrit.wikimedia.org/r/919060 (https://phabricator.wikimedia.org/T279683) (owner: 10Jbond) [13:22:41] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:23:29] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:26:04] !log Adding vlan config for row e/f vlans on ssw1-f1-eqiad (T322937) [13:26:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:09] T322937: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 [13:30:19] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:31:00] (03CR) 10Muehlenhoff: firewall: add basic firewall class (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/919061 (owner: 10Jbond) [13:33:51] (03PS2) 10Filippo Giunchedi: prometheus: de-provision 'global' instance [puppet] - 10https://gerrit.wikimedia.org/r/921349 
(https://phabricator.wikimedia.org/T288196) [13:33:53] (03PS2) 10Filippo Giunchedi: prometheus: remove 'global' instance references [puppet] - 10https://gerrit.wikimedia.org/r/921350 (https://phabricator.wikimedia.org/T288196) [13:34:37] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on lvs1020.eqiad.wmnet with reason: Move lvs1020 handoff port to row e/f from lsw1-f1 to ssw1-f1 [13:34:45] !log mforns@deploy1002 Started deploy [airflow-dags/analytics_test@be05071]: (no justification provided) [13:34:51] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on lvs1020.eqiad.wmnet with reason: Move lvs1020 handoff port to row e/f from lsw1-f1 to ssw1-f1 [13:34:55] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics_test@be05071]: (no justification provided) (duration: 00m 10s) [13:34:59] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=c4ef01af-e7d5-458f-ae46-17500f124165) set by cmooney@cumin1001 f... [13:36:06] 1.41.0-wmf.9 is on all wikis, I am triaging the error logs [13:36:12] but it looks quiet so far [13:37:55] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:40:24] taavi: legoktm: go ahead with your deployment. 1.41.0-wmf.9 looks stable :] [13:40:30] cool [13:40:44] you are both at the hackathon aren't you? 
[13:40:56] yep [13:42:38] (03PS1) 10Jelto: miscweb/annualreport: bump image version to v1.0.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/921354 (https://phabricator.wikimedia.org/T336217) [13:44:04] 10SRE, 10SRE-Access-Requests: Replacing SSH key for Itamar Givon - https://phabricator.wikimedia.org/T337037 (10ItamarWMDE) [13:45:33] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:47:03] 10SRE, 10SRE-Access-Requests: Replacing SSH key for Itamar Givon - https://phabricator.wikimedia.org/T337037 (10taavi) Please note that not all of our servers have `-sk` support yet, it's only on systems running Bullseye or newer. [13:47:05] (03CR) 10Jelto: miscweb annualreport: use wildcard redirect for 2020th report (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/918424 (https://phabricator.wikimedia.org/T336217) (owner: 10Jelto) [13:48:59] 10SRE, 10SRE-Access-Requests: Replacing SSH key for Itamar Givon - https://phabricator.wikimedia.org/T337037 (10ItamarWMDE) Aha! Yeah I was wondering about that earlier. Will this suffice for running scripts on the maint machines and occasionally performing tasks on stat machines? [13:49:17] T336952 is such an odd bug.. [13:49:17] T336952: Wikibase\DataModel\Services\Lookup\ReferencedEntityIdLookupException: Referenced entity id lookup failed. Tried to find a referenced entity out of Q16334295 linked from Q13406463 via P279 - https://phabricator.wikimedia.org/T336952 [13:49:38] (03CR) 10Cathal Mooney: [C: 03+2] Puppet additions to bring ssw1-f1-eqiad under management [puppet] - 10https://gerrit.wikimedia.org/r/921261 (https://phabricator.wikimedia.org/T322937) (owner: 10Cathal Mooney) [13:49:46] TheresNoTime: Why is it odd? [13:50:14] I'll rephrase that to "is odd because I don't understand it" (: [13:51:02] * TheresNoTime doesn't understand Wikibase in general to be honest.. 
[13:51:31] (03PS3) 10Jelto: miscweb annualreport: use wildcard redirect for 2020th report [puppet] - 10https://gerrit.wikimedia.org/r/918424 (https://phabricator.wikimedia.org/T336217) [13:53:09] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:54:03] (03CR) 10Ottomata: mw-page-content-change-enrich: enable HA (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/920268 (https://phabricator.wikimedia.org/T336656) (owner: 10Gmodena) [13:57:17] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T336949 (10Jhancock.wm) re-secured power cable. alert has cleared on chassis. waiting for it to clear in Grafana [13:57:22] (03CR) 10Jelto: add 15.wikipedia to cert and gateway hosts for miscweb behind istio ingress (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/766875 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [13:57:56] TheresNoTime: I'm with you on that [13:58:00] WikiBase is confusing [13:58:01] TheresNoTime: It's a callback from a Lua function which tries to access items… and we throw an exception if an item is a double redirect [13:58:56] 10SRE-swift-storage, 10serviceops-collab: Investigate object storage for Gitlab - https://phabricator.wikimedia.org/T336234 (10MatthewVernon) I think here we are talking about using the S3 protocol? That is currently only enabled on the thanos cluster (MOSS is a maybe-next-FY sort of thing, but will also do S3... 
[13:59:41] But no surprise… Wikibase is a fairly complex system on its own [13:59:51] So there's a lot to it [14:00:17] But we're trying our best(tm) to keep it understandable [14:00:31] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:01:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q4: install disks in frmon1002 - https://phabricator.wikimedia.org/T336569 (10Jclark-ctr) a:03Jclark-ctr [14:01:41] PROBLEM - Host ssw1-f1-eqiad IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [14:04:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q4: install disks in frmon1002 - https://phabricator.wikimedia.org/T336569 (10Jclark-ctr) [14:05:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q4: install disks in frmon1002 - https://phabricator.wikimedia.org/T336569 (10Jclark-ctr) @Dwisehaupt Disk have arrived and been installed [14:05:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q4: install disks in frmon1002 - https://phabricator.wikimedia.org/T336569 (10Jclark-ctr) 05Open→03Resolved [14:06:00] (03PS1) 10Itamar Givon: Add new key generated with a security key [puppet] - 10https://gerrit.wikimedia.org/r/921356 (https://phabricator.wikimedia.org/T337037) [14:06:05] (03PS1) 10Alexandros Kosiaris: shellbox: Add service mesh envoy retries [puppet] - 10https://gerrit.wikimedia.org/r/921357 (https://phabricator.wikimedia.org/T292663) [14:06:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:06:43] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:51] 
10SRE, 10ops-eqiad, 10Data-Platform-SRE: Degraded RAID on analytics1068 - https://phabricator.wikimedia.org/T336826 (10Jclark-ctr) 05Open→03Resolved Replaced failed drive icinga alerts have cleared [14:08:01] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Got a verbal +1 from _joe_ during the hackathon" [puppet] - 10https://gerrit.wikimedia.org/r/921357 (https://phabricator.wikimedia.org/T292663) (owner: 10Alexandros Kosiaris) [14:08:02] ^^ IPv6 ping alert for ssw1-f1-eqiad above is known and understood, just brought under mgmt IPv6 isn't enabled yet [14:08:16] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T336992 (10Jclark-ctr) a:03Jclark-ctr troubleshooting now on site @fgiunchedi [14:09:15] PROBLEM - Check systemd state on ml-serve1005 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:09:36] (03PS1) 10Itamar Givon: [DNM] Remove old ssh key [puppet] - 10https://gerrit.wikimedia.org/r/921359 (https://phabricator.wikimedia.org/T337037) [14:10:24] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T336949 (10Jhancock.wm) 05Open→03Resolved cleared. 
resolving [14:11:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:11:33] RECOVERY - Check systemd state on ml-serve1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:07] RECOVERY - Host ssw1-f1-eqiad IPv6 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms [14:14:37] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.7 point update - https://phabricator.wikimedia.org/T335575 (10MoritzMuehlenhoff) [14:16:05] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:16:39] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on wdqs1014.eqiad.wmnet with reason: firmware update [14:17:04] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on wdqs1014.eqiad.wmnet with reason: firmware update [14:20:08] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Replacing SSH key for Itamar Givon - https://phabricator.wikimedia.org/T337037 (10ItamarWMDE) Seeing now that the `maint` and `stat` machines are still on buster. Don't mind stalling it until an upgrade to bullseye. 
[14:20:39] !log disable puppet on A:lvs to roll out CR 910566 [14:20:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:43] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:23:53] (03CR) 10Itamar Givon: "Moving back to WIP as I saw sk support is only available on bullsye and up and I'll require access to machines that run buster" [puppet] - 10https://gerrit.wikimedia.org/r/921356 (https://phabricator.wikimedia.org/T337037) (owner: 10Itamar Givon) [14:24:24] (03CR) 10Ssingh: [V: 03+1 C: 03+2] pybal/lvs: remove backward compatibility for buster [puppet] - 10https://gerrit.wikimedia.org/r/910566 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [14:24:54] 10SRE, 10ops-codfw, 10decommission-hardware: decommission bast2002.wikimedia.org - https://phabricator.wikimedia.org/T336995 (10Jhancock.wm) a:05Papaul→03Jhancock.wm [14:25:49] RECOVERY - Host ps1-c5-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.84 ms [14:26:33] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission ms-be204[0-3].codfw.wmnet - https://phabricator.wikimedia.org/T337011 (10Jhancock.wm) a:03Jhancock.wm [14:27:53] PROBLEM - Check systemd state on ml-serve1006 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:43] PROBLEM - Check whether ferm is active by checking the default input chain on ml-serve1006 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:30:01] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:48] 10SRE-swift-storage, 
10serviceops-collab: Investigate object storage for Gitlab - https://phabricator.wikimedia.org/T336234 (10jcrespo) > I don't think thanos is currently backed up; @jcrespo is maestro of backups. I suggested doing it but the answer I got was no, so no current backups. [14:30:55] RECOVERY - Check systemd state on ml-serve1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:34:09] PROBLEM - BGP status on lsw1-f2-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:35:34] !log enable puppet on A:lvs, finished rolling out change [14:35:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:41] RECOVERY - BGP status on lsw1-f2-eqiad.mgmt is OK: BGP OK - up: 6, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:36:11] (03PS1) 10Ilias Sarantopoulos: ml-services: upgrade bloom model with newer image [deployment-charts] - 10https://gerrit.wikimedia.org/r/921366 (https://phabricator.wikimedia.org/T333861) [14:36:18] I /win 59 [14:36:26] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on stat1009.eqiad.wmnet with reason: Bringing stat1009 into service [14:36:40] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on stat1009.eqiad.wmnet with reason: Bringing stat1009 into service [14:37:51] (03CR) 10Klausman: [C: 03+2] ml-services: upgrade bloom model with newer image [deployment-charts] - 10https://gerrit.wikimedia.org/r/921366 (https://phabricator.wikimedia.org/T333861) (owner: 10Ilias Sarantopoulos) [14:38:49] (03Merged) 10jenkins-bot: ml-services: upgrade bloom model with newer image [deployment-charts] - 10https://gerrit.wikimedia.org/r/921366 (https://phabricator.wikimedia.org/T333861) (owner: 
10Ilias Sarantopoulos) [14:40:31] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [14:43:23] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T336992 (10Jclark-ctr) replaced msw-c5-eqiad [14:48:45] PROBLEM - BGP status on lsw1-f3-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64606/IPv6: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:49:33] legoktm and I are starting to deploy stuff [14:50:01] (NodeTextfileStale) firing: (2) Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:50:19] RECOVERY - BGP status on lsw1-f3-eqiad.mgmt is OK: BGP OK - up: 8, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:50:53] :O [14:50:59] 10SRE, 10Infrastructure-Foundations, 10netops: Netbox network report failing - timeout errors - https://phabricator.wikimedia.org/T321704 (10ayounsi) Upgrade to 3.2.9 didn't help, but we were expecting it a bit. At this point I guess that it's related to the steady increase of Netbox usage and we should loo... [14:52:20] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T336992 (10Jclark-ctr) 05Open→03Resolved [14:52:42] (03CR) 10Multichill: [C: 03+1] "It's time" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921252 (https://phabricator.wikimedia.org/T270911) (owner: 10Legoktm) [14:53:09] 10SRE, 10ops-eqiad: Move two GPUs from Hadoop to Lift Wing - https://phabricator.wikimedia.org/T335031 (10Jclark-ctr) @elukey would like to try to address next week are you available tuesday? 
[14:54:57] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by legoktm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921252 (https://phabricator.wikimedia.org/T270911) (owner: 10Legoktm) [14:55:21] (03PS1) 10Ilias Sarantopoulos: ml-services: fix bloom model inference [deployment-charts] - 10https://gerrit.wikimedia.org/r/921371 (https://phabricator.wikimedia.org/T333861) [14:56:52] (03Merged) 10jenkins-bot: Disable GWToolset from Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921252 (https://phabricator.wikimedia.org/T270911) (owner: 10Legoktm) [14:57:05] (03CR) 10Klausman: [C: 03+2] ml-services: fix bloom model inference [deployment-charts] - 10https://gerrit.wikimedia.org/r/921371 (https://phabricator.wikimedia.org/T333861) (owner: 10Ilias Sarantopoulos) [14:57:11] !log legoktm@deploy1002 Started scap: Backport for [[gerrit:921252|Disable GWToolset from Commons (T270911)]] [14:57:16] T270911: Remove GWToolset extension from Wikimedia Commons - https://phabricator.wikimedia.org/T270911 [14:57:57] (03Merged) 10jenkins-bot: ml-services: fix bloom model inference [deployment-charts] - 10https://gerrit.wikimedia.org/r/921371 (https://phabricator.wikimedia.org/T333861) (owner: 10Ilias Sarantopoulos) [14:58:27] (03PS1) 10Robertsky: change wikimaniawiki logo to 2023 version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921372 [14:58:38] !log legoktm@deploy1002 legoktm: Backport for [[gerrit:921252|Disable GWToolset from Commons (T270911)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [14:59:09] !log elukey@cumin1001 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:ml-serve-worker-eqiad [15:00:17] RECOVERY - Check whether ferm is active by checking the default input chain on ml-serve1006 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:01:20] (03PS2) 
10Robertsky: change wikimaniawiki logo to 2023 version. T337044 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921372 [15:03:18] (03PS1) 10Ayounsi: codfw: use new netflow server [homer/public] - 10https://gerrit.wikimedia.org/r/921375 (https://phabricator.wikimedia.org/T330884) [15:03:26] (03CR) 10Majavah: [C: 03+2] i18n: Add link to help page [extensions/RealMe] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/921150 (https://phabricator.wikimedia.org/T322717) (owner: 10Legoktm) [15:04:51] (03PS5) 10Majavah: Enable RealMe [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921326 (https://phabricator.wikimedia.org/T324535) [15:06:15] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [15:06:57] !log legoktm@deploy1002 Finished scap: Backport for [[gerrit:921252|Disable GWToolset from Commons (T270911)]] (duration: 09m 46s) [15:07:01] T270911: Remove GWToolset extension from Wikimedia Commons - https://phabricator.wikimedia.org/T270911 [15:07:38] (03Merged) 10jenkins-bot: i18n: Add link to help page [extensions/RealMe] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/921150 (https://phabricator.wikimedia.org/T322717) (owner: 10Legoktm) [15:08:00] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921326 (https://phabricator.wikimedia.org/T324535) (owner: 10Majavah) [15:08:12] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921326 (https://phabricator.wikimedia.org/T324535) (owner: 10Majavah) [15:09:21] (03Merged) 10jenkins-bot: Enable RealMe [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921326 (https://phabricator.wikimedia.org/T324535) (owner: 10Majavah) [15:09:38] !log taavi@deploy1002 Started scap: Backport for [[gerrit:921150|i18n: Add link to help page (T322717)]], [[gerrit:921326|Enable 
RealMe (T324535)]] [15:09:43] T324535: Deploy RealMe to production - https://phabricator.wikimedia.org/T324535 [15:09:43] T322717: Allow Wikimedians to verify their Mastodon profile with rel="me" - https://phabricator.wikimedia.org/T322717 [15:14:00] legoktm: If you're doing fun prod deploys to delete old code… https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/723652 [15:18:29] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:21:10] !log taavi@deploy1002 legoktm and taavi: Backport for [[gerrit:921150|i18n: Add link to help page (T322717)]], [[gerrit:921326|Enable RealMe (T324535)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [15:21:15] T324535: Deploy RealMe to production - https://phabricator.wikimedia.org/T324535 [15:21:16] T322717: Allow Wikimedians to verify their Mastodon profile with rel="me" - https://phabricator.wikimedia.org/T322717 [15:24:42] (03CR) 10Dzahn: [C: 03+2] httpbb: add simple test for VRTS [puppet] - 10https://gerrit.wikimedia.org/r/921068 (https://phabricator.wikimedia.org/T326891) (owner: 10Dzahn) [15:24:50] (03CR) 10Dzahn: [C: 03+2] httpbb: add simple tests for gitlab.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/921065 (https://phabricator.wikimedia.org/T326891) (owner: 10Dzahn) [15:28:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:29:18] (03PS1) 10Btullis: Add an extra property 'CollectMode' to each user's jupyter service [puppet] - 10https://gerrit.wikimedia.org/r/921382 (https://phabricator.wikimedia.org/T336951) [15:30:45] (03CR) 10Btullis: "I plan to test 
this patch on an-test-client100[1-2] before deploying to the production stats servers." [puppet] - 10https://gerrit.wikimedia.org/r/921382 (https://phabricator.wikimedia.org/T336951) (owner: 10Btullis) [15:31:04] (03CR) 10Ayounsi: [C: 03+2] codfw: use new netflow server [homer/public] - 10https://gerrit.wikimedia.org/r/921375 (https://phabricator.wikimedia.org/T330884) (owner: 10Ayounsi) [15:31:40] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:921150|i18n: Add link to help page (T322717)]], [[gerrit:921326|Enable RealMe (T324535)]] (duration: 22m 02s) [15:31:41] (03Merged) 10jenkins-bot: codfw: use new netflow server [homer/public] - 10https://gerrit.wikimedia.org/r/921375 (https://phabricator.wikimedia.org/T330884) (owner: 10Ayounsi) [15:31:46] T324535: Deploy RealMe to production - https://phabricator.wikimedia.org/T324535 [15:31:46] T322717: Allow Wikimedians to verify their Mastodon profile with rel="me" - https://phabricator.wikimedia.org/T322717 [15:32:44] (03PS4) 10Dzahn: httpbb: add simple test for VRTS [puppet] - 10https://gerrit.wikimedia.org/r/921068 (https://phabricator.wikimedia.org/T326891) [15:33:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:36:36] (03CR) 10Dzahn: [C: 03+2] "now installed on deploy and cumin:" [puppet] - 10https://gerrit.wikimedia.org/r/921065 (https://phabricator.wikimedia.org/T326891) (owner: 10Dzahn) [15:38:39] (03CR) 10Dzahn: "now deployed on deploy* and cumin*. 
example:" [puppet] - 10https://gerrit.wikimedia.org/r/921068 (https://phabricator.wikimedia.org/T326891) (owner: 10Dzahn) [15:41:48] (03PS2) 10Dzahn: httpbb: add tests for gerrit and gerrit-replica [puppet] - 10https://gerrit.wikimedia.org/r/921075 (https://phabricator.wikimedia.org/T326891) [15:43:07] (03PS3) 10Dzahn: httpbb: add tests for gerrit and gerrit-replica [puppet] - 10https://gerrit.wikimedia.org/r/921075 (https://phabricator.wikimedia.org/T326891) [15:47:55] (03PS1) 10Ayounsi: Kafka: add netflow2003 to the allowed sources [puppet] - 10https://gerrit.wikimedia.org/r/921384 (https://phabricator.wikimedia.org/T330884) [15:48:08] (03CR) 10Dzahn: [C: 03+2] httpbb: add tests for gerrit and gerrit-replica [puppet] - 10https://gerrit.wikimedia.org/r/921075 (https://phabricator.wikimedia.org/T326891) (owner: 10Dzahn) [15:48:18] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/921384 (https://phabricator.wikimedia.org/T330884) (owner: 10Ayounsi) [15:52:14] (03PS2) 10Ayounsi: Kafka: add netflow2003 to the allowed sources [puppet] - 10https://gerrit.wikimedia.org/r/921384 (https://phabricator.wikimedia.org/T330884) [15:52:23] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/921384 (https://phabricator.wikimedia.org/T330884) (owner: 10Ayounsi) [15:53:04] (03CR) 10Dzahn: [C: 03+2] "now deployed on deploy* and cumin*" [puppet] - 10https://gerrit.wikimedia.org/r/921075 (https://phabricator.wikimedia.org/T326891) (owner: 10Dzahn) [15:53:08] (03CR) 10Btullis: [C: 03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/921384 (https://phabricator.wikimedia.org/T330884) (owner: 10Ayounsi) [15:55:08] (03CR) 10Ayounsi: [C: 03+2] Kafka: add netflow2003 to the allowed sources [puppet] - 10https://gerrit.wikimedia.org/r/921384 (https://phabricator.wikimedia.org/T330884) (owner: 10Ayounsi) [16:02:44] (03PS1) 10Dzahn: httpbb: add tests for CI, https://integration.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/921387 (https://phabricator.wikimedia.org/T326891) [16:03:37] (03CR) 10Dzahn: [V: 03+1] "[deploy1002:~] $ httpbb --hosts integration.wikimedia.org ./test_integration.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/921387 (https://phabricator.wikimedia.org/T326891) (owner: 10Dzahn) [16:08:39] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:10:10] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (10ayounsi) >>! In T330884#8863775, @MoritzMuehlenhoff wrote: > @ayounsi There's now netflow2003 running Bookworm with FNM 1.2.4. If that works fine, we can reimage the other... 
[16:10:39] (03PS13) 10Eevans: (WIP) cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) [16:11:03] (03CR) 10CI reject: [V: 04-1] (WIP) cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans) [16:11:19] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:15:01] (03PS2) 10Dzahn: httpbb: add tests for CI, https://integration.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/921387 (https://phabricator.wikimedia.org/T326891) [16:15:03] (03PS1) 10Dzahn: httpbb: add tests for etherpad.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/921389 (https://phabricator.wikimedia.org/T326891) [16:16:29] (03CR) 10Ottomata: "Fine with me! But, won't this make troubleshooting problems harder? 
If the logs are removed on a failure, how will we know why something " [puppet] - 10https://gerrit.wikimedia.org/r/921382 (https://phabricator.wikimedia.org/T336951) (owner: 10Btullis) [16:20:46] (03PS1) 10Ayounsi: Fastnetmon: enable Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/921390 (https://phabricator.wikimedia.org/T330884) [16:22:57] (03PS1) 10Ssingh: WIP: P:bird::anycast_healthchecker: allow binding to multiple services [puppet] - 10https://gerrit.wikimedia.org/r/921391 [16:24:42] (03PS1) 10Dzahn: httpbb: add tests for rt.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/921392 (https://phabricator.wikimedia.org/T326891) [16:25:08] (03CR) 10CI reject: [V: 04-1] WIP: P:bird::anycast_healthchecker: allow binding to multiple services [puppet] - 10https://gerrit.wikimedia.org/r/921391 (owner: 10Ssingh) [16:31:20] (03PS1) 10Ayounsi: Prometheus: fetch FastNetMon metrics [puppet] - 10https://gerrit.wikimedia.org/r/921394 (https://phabricator.wikimedia.org/T330884) [16:36:45] (03CR) 10Btullis: Add an extra property 'CollectMode' to each user's jupyter service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/921382 (https://phabricator.wikimedia.org/T336951) (owner: 10Btullis) [16:42:22] (03CR) 10Dzahn: "sounds good to me! 
thank you" [puppet] - 10https://gerrit.wikimedia.org/r/918424 (https://phabricator.wikimedia.org/T336217) (owner: 10Jelto) [16:42:53] (03CR) 10Dzahn: [C: 03+1] miscweb/annualreport: bump image version to v1.0.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/921354 (https://phabricator.wikimedia.org/T336217) (owner: 10Jelto) [16:45:06] (03PS2) 10Ssingh: WIP: P:bird::anycast_healthchecker: allow binding to multiple services [puppet] - 10https://gerrit.wikimedia.org/r/921391 [16:46:09] (03PS2) 10Ayounsi: Prometheus: fetch FastNetMon metrics [puppet] - 10https://gerrit.wikimedia.org/r/921394 (https://phabricator.wikimedia.org/T330884) [16:51:17] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: hw troubleshooting: CPU error for mw2448.codfw.wmnet - https://phabricator.wikimedia.org/T334429 (10Dzahn) @jijiki @Clement_Goubert I got reminded of this today via this alert: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=mw2448&service=m... [16:51:40] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2448 is CRITICAL: Host mw2448 is not in mediawiki-installation dsh group daniel_zahn https://phabricator.wikimedia.org/T334429 https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:55:32] !log mw2448 - scap pull - T2334429 [16:55:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:52] (03CR) 10Ayounsi: [C: 03+1] Improve logic getting switch port when primary IP is on bridge device (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/921032 (https://phabricator.wikimedia.org/T296832) (owner: 10Cathal Mooney) [17:03:02] 10SRE, 10ops-codfw, 10decommission-hardware: decommission bast2002.wikimedia.org - https://phabricator.wikimedia.org/T336995 (10Dzahn) This host is still in Icinga.. so not removed from puppet db or something... 
[17:04:02] 10SRE, 10ops-codfw, 10decommission-hardware: decommission bast2002.wikimedia.org - https://phabricator.wikimedia.org/T336995 (10Dzahn) @MoritzMuehlenhoff https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=bast2002 [17:06:47] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:07:42] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Replacing SSH key for Itamar Givon - https://phabricator.wikimedia.org/T337037 (10Dzahn) 05Open→03Stalled [17:08:54] (03PS4) 10Hnowlan: wip: upgrade container and dependencies for bullseye [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/920760 (https://phabricator.wikimedia.org/T336881) [17:09:03] (03CR) 10CI reject: [V: 04-1] wip: upgrade container and dependencies for bullseye [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/920760 (https://phabricator.wikimedia.org/T336881) (owner: 10Hnowlan) [17:09:29] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review, 10User-ItamarWMDE: Replacing SSH key for Itamar Givon - https://phabricator.wikimedia.org/T337037 (10Dzahn) >>! In T337037#8864974, @ItamarWMDE wrote: > Seeing now that the `maint` and `stat` machines are still on buster. Don't mind stalling it until an upg... 
[17:11:38] (03CR) 10Ayounsi: Validators: improve device name, add interface/outlet (033 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/917876 (https://phabricator.wikimedia.org/T310590) (owner: 10Ayounsi) [17:11:44] 10SRE, 10Infrastructure-Foundations, 10netops: Create mechanism to disable IPv6 RA generation on irb interfaces when required - https://phabricator.wikimedia.org/T337057 (10cmooney) p:05Triage→03Medium [17:11:51] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:13:02] (03PS9) 10Ayounsi: Validators: improve device name, add interface/outlet [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/917876 (https://phabricator.wikimedia.org/T310590) [17:13:43] (03PS1) 10Cathal Mooney: Add disable_ra var to homer config to enable manual disabling of IPv6 RAs [homer/public] - 10https://gerrit.wikimedia.org/r/921400 (https://phabricator.wikimedia.org/T337057) [17:15:39] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Create mechanism to disable IPv6 RA generation on irb interfaces when required - https://phabricator.wikimedia.org/T337057 (10cmooney) [17:15:45] 10SRE, 10Infrastructure-Foundations, 10netops: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (10cmooney) [17:30:16] (03Abandoned) 10Ottomata: Revert "Add flink-app default log config and use it in page_content_change" [deployment-charts] - 10https://gerrit.wikimedia.org/r/920576 (owner: 10Gmodena) [17:30:37] (03PS3) 10Ssingh: WIP: P:bird::anycast_healthchecker: allow binding to multiple services [puppet] - 10https://gerrit.wikimedia.org/r/921391 [17:39:09] (03CR) 10Dzahn: [C: 03+2] httpbb: add tests for CI, https://integration.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/921387 
(https://phabricator.wikimedia.org/T326891) (owner: 10Dzahn) [17:41:30] (03PS4) 10Ssingh: WIP: P:bird::anycast_healthchecker: allow binding to multiple services [puppet] - 10https://gerrit.wikimedia.org/r/921391 [17:42:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10cmooney) @Jclark-ctr all went well with that today thank you for your help. For the next phase we need to move the following links: |No|Ra... [17:44:02] 10SRE, 10Infrastructure-Foundations, 10netops: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (10cmooney) The migration went fine today, very quick move and all came up as expected. EVPN MAC-move BGP signalling worked flawlessly was nice to see in... [17:45:33] (03CR) 10Dzahn: [C: 03+2] "[deploy1002:~] $ httpbb --hosts integration.wikimedia.org /srv/deployment/httpbb-tests/contint/test_integration.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/921387 (https://phabricator.wikimedia.org/T326891) (owner: 10Dzahn) [17:46:33] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Create mechanism to disable IPv6 RA generation on irb interfaces when required - https://phabricator.wikimedia.org/T337057 (10cmooney) Another option, similar to the above patch, would maybe to make it a global toggle for a device. So like... 
[17:47:23] (03CR) 10Dzahn: [C: 03+2] httpbb: add tests for CI, https://integration.wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/921387 (https://phabricator.wikimedia.org/T326891) (owner: 10Dzahn) [17:52:14] (03CR) 10Dzahn: [C: 03+2] httpbb: add tests for etherpad.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/921389 (https://phabricator.wikimedia.org/T326891) (owner: 10Dzahn) [17:52:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10ssingh) >>! In T292095#8865511, @cmooney wrote: > @Jclark-ctr all went well with that today thank you for your help. > > For the next phase... [18:22:11] (03PS1) 10Eevans: cassandra: add dummy secrets for services-dev (test env) [labs/private] - 10https://gerrit.wikimedia.org/r/921408 (https://phabricator.wikimedia.org/T313814) [18:22:48] (03CR) 10Eevans: [C: 03+2] cassandra: add dummy secrets for services-dev (test env) [labs/private] - 10https://gerrit.wikimedia.org/r/921408 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans) [18:22:58] (03CR) 10Eevans: [V: 03+2 C: 03+2] cassandra: add dummy secrets for services-dev (test env) [labs/private] - 10https://gerrit.wikimedia.org/r/921408 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans) [18:28:07] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (10Volans) Ok that sounds like a plan, let's try first if the FQDN link works and if not we'll fallback to the IP. Based on the test we might add t... [18:30:57] 10SRE, 10Infrastructure-Foundations, 10netops: Netbox network report failing - timeout errors - https://phabricator.wikimedia.org/T321704 (10Volans) I'll try to have a look next week, but for now I downtimed the alert so it doesn't spam too much until the end of the month. 
https://icinga.wikimedia.org/cgi-bi... [18:33:14] (03CR) 10Volans: sre.{ganeti,hosts}.reimage: Confirm with hostname (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/899772 (https://phabricator.wikimedia.org/T332202) (owner: 10BCornwall) [18:35:38] (03CR) 10Volans: sre.ganeti.makevm call reimage after VM creation (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/920203 (https://phabricator.wikimedia.org/T336491) (owner: 10Slyngshede) [18:35:41] (03CR) 10BCornwall: sre.{ganeti,hosts}.reimage: Confirm with hostname (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/899772 (https://phabricator.wikimedia.org/T332202) (owner: 10BCornwall) [18:41:41] (03PS5) 10BCornwall: Add cookbook to handle restarts of Wikimedia DNS [cookbooks] - 10https://gerrit.wikimedia.org/r/915848 (https://phabricator.wikimedia.org/T335533) [18:43:05] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:47:44] (03PS1) 10BCornwall: Create cookbook to upgrade Apache Traffic Server [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) [18:49:59] (03CR) 10CI reject: [V: 04-1] Create cookbook to upgrade Apache Traffic Server [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) (owner: 10BCornwall) [18:50:01] (NodeTextfileStale) firing: (2) Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:50:49] (03CR) 10Dzahn: [C: 03+2] "[deploy1002:~] $ httpbb --hosts etherpad.wikimedia.org /srv/deployment/httpbb-tests/etherpad/test_etherpad.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/921389 (https://phabricator.wikimedia.org/T326891) (owner: 10Dzahn) 
[18:59:38] (03PS14) 10Eevans: cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) [19:00:02] (03CR) 10CI reject: [V: 04-1] cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans) [19:00:45] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans) [19:04:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:09:07] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:10:45] (03PS15) 10Eevans: cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) [19:11:14] (03CR) 10CI reject: [V: 04-1] cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans) [19:11:39] (03PS5) 10Ssingh: WIP: P:bird::anycast_healthchecker: allow binding to multiple services [puppet] - 10https://gerrit.wikimedia.org/r/921391 [19:11:57] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans) [19:14:30] (03Abandoned) 10Ssingh: WIP: P:bird::anycast_healthchecker: allow binding to multiple services [puppet] - 10https://gerrit.wikimedia.org/r/921391 (owner: 10Ssingh) [19:17:07] (ProbeDown) firing: (2) Service 
jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:24:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:27:07] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:27:23] ^ keeping an eye on this [19:28:03] ACKed one of them and taking a look but probably we can just let it finish [19:28:37] PROBLEM - PHP7 rendering on mw1469 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [19:29:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:29:07] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:30:03] RECOVERY - PHP7 rendering on mw1469 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 1.232 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [19:30:30] that host 1469 was just busy 
running ffmpeg.. scaled a video [19:34:07] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:35:13] PROBLEM - PHP7 jobrunner on mw1469 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [19:36:24] !log htriedman@deploy1002 Started deploy [airflow-dags/platform_eng@b34c529]: (no justification provided) [19:36:33] !log htriedman@deploy1002 Finished deploy [airflow-dags/platform_eng@b34c529]: (no justification provided) (duration: 00m 09s) [19:36:41] RECOVERY - PHP7 jobrunner on mw1469 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 3.280 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [19:36:57] if it keeps doing that for much longer I am ready to depool mw1469 from videoscaler, but so far I think acceptable [19:38:15] the overall jobrunner health (linked from runbook) looks ok to me.. 
so no action taken [19:39:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:39:55] PROBLEM - PHP7 jobrunner on mw1469 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [19:41:23] RECOVERY - PHP7 jobrunner on mw1469 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 4.778 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [19:44:07] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:45:27] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: cluster=videoscaler,name=mw1469.eqiad.wmnet [19:45:42] !log depooled mw1469 from videoscaler, dedicating to just jobrunner [19:45:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:58] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: cluster=jobrunner,name=mw1469.eqiad.wmnet [19:46:11] PROBLEM - PHP7 jobrunner on mw1469 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [19:46:50] !log mw1469 - sudo pkill ffmpeg (per runbook) [19:46:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:35] RECOVERY - PHP7 jobrunner on mw1469 is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [19:48:09] the jobs are more important than the video scaling.. so video scaling got killed.. 
should be retried later ..per docs [19:48:20] on this one host that was overloaded and doing both [19:49:07] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:08:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:09:05] really not happy is it (: [20:13:07] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:15:07] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:20:07] (ProbeDown) firing: (4) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:23:07] PROBLEM - PHP7 rendering on mw1495 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:23:32] mutante: another sad server [20:24:33] RECOVERY - PHP7 rendering on mw1495 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 1.816 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:25:07] (ProbeDown) resolved: (4) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:26:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:30:22] (ProbeDown) firing: (3) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:31:07] (ProbeDown) resolved: (3) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:32:51] PROBLEM - PHP7 rendering on mw1495 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:33:47] PROBLEM - PHP7 jobrunner on mw1495 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [20:34:15] RECOVERY - PHP7 rendering on mw1495 is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.034 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:38:25] RECOVERY - PHP7 jobrunner on mw1495 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 6.597 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [20:41:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:46:07] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:48:07] 
(ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:50:51] PROBLEM - PHP7 jobrunner on mw1495 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [20:52:13] RECOVERY - PHP7 jobrunner on mw1495 is OK: HTTP OK: HTTP/1.1 200 OK - 282 bytes in 0.136 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [20:52:37] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: cluster=jobrunner,name=mw1495.eqiad.wmnet [20:53:07] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:19:23] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [21:21:19] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add entries for ssw link addresses in eqiad - cmooney@cumin1001" [21:22:22] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add entries for ssw link addresses in eqiad - cmooney@cumin1001" [21:22:22] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:29:11] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [21:43:21] PROBLEM - PHP7 jobrunner on mw1495 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Jobrunner [21:46:21] PROBLEM - PHP7 rendering on mw1495 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:46:27] RECOVERY - PHP7 jobrunner on mw1495 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 7.241 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [21:47:47] RECOVERY - PHP7 rendering on mw1495 is OK: HTTP OK: HTTP/1.1 200 OK - 280 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:51:34] (03CR) 10Dzahn: [C: 03+2] httpbb: add tests for rt.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/921392 (https://phabricator.wikimedia.org/T326891) (owner: 10Dzahn) [21:55:48] (03PS1) 10Dzahn: httpbb: fix path to test file for RT [puppet] - 10https://gerrit.wikimedia.org/r/921428 (https://phabricator.wikimedia.org/T326891) [21:55:51] PROBLEM - PHP7 jobrunner on mw1495 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [21:56:07] PROBLEM - PHP7 rendering on mw1495 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [21:58:40] (03CR) 10Dzahn: [C: 03+2] httpbb: fix path to test file for RT [puppet] - 10https://gerrit.wikimedia.org/r/921428 (https://phabricator.wikimedia.org/T326891) (owner: 10Dzahn) [22:00:33] (03CR) 10Superpes15: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921372 (owner: 10Robertsky) [22:02:07] (ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#jobrunner:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:03:43] RECOVERY - PHP7 jobrunner on mw1495 is OK: HTTP OK: HTTP/1.1 200 OK - 282 bytes in 
8.706 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [22:03:51] RECOVERY - PHP7 rendering on mw1495 is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [22:04:54] (CR) Dzahn: [C: +2] "[deploy1002:~] $ httpbb --hosts rt.wikimedia.org /srv/deployment/httpbb-tests/rt/test_rt.yaml" [puppet] - https://gerrit.wikimedia.org/r/921428 (https://phabricator.wikimedia.org/T326891) (owner: Dzahn) [22:07:07] (ProbeDown) resolved: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#jobrunner:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:08:27] PROBLEM - PHP7 jobrunner on mw1495 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [22:08:37] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:09:41] PROBLEM - PHP7 rendering on mw1495 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [22:12:35] RECOVERY - PHP7 rendering on mw1495 is OK: HTTP OK: HTTP/1.1 200 OK - 283 bytes in 1.283 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [22:13:37] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:15:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) -
https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:16:51] PROBLEM - PHP7 rendering on mw1495 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [22:17:37] RECOVERY - PHP7 jobrunner on mw1495 is OK: HTTP OK: HTTP/1.1 200 OK - 280 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [22:19:49] RECOVERY - PHP7 rendering on mw1495 is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [22:20:07] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:21:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:22:27] PROBLEM - PHP7 jobrunner on mw1495 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [22:24:03] PROBLEM - PHP7 rendering on mw1495 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [22:25:22] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:26:07] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:26:37] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:30:22] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:35:22] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:36:37] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:36:52] (ProbeDown) resolved: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#jobrunner:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:40:57] RECOVERY - PHP7 jobrunner on mw1495 is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.027 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [22:41:01] RECOVERY - PHP7 rendering on mw1495 is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 0.023 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [22:41:52] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:46:52] (ProbeDown) resolved: Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#jobrunner:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:48:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:50:01] (NodeTextfileStale) firing: (2) Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:53:07] (ProbeDown) firing: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:58:07] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:59:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:03:22] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:04:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:08:22] (ProbeDown) resolved: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:30:07] (ProbeDown) firing: Service videoscaler:443 has failed probes (http_videoscaler_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#videoscaler:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:40:07] (ProbeDown) resolved: (2) Service jobrunner:443 has failed probes (http_jobrunner_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown