[00:00:56] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:02:02] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:06:38] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:08:38] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:16:20] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:19:04] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:21:12] <wikibugs>	 (CR) Krinkle: [C: +1] arclamp: switch redis server to arclamp1001 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/920299 (https://phabricator.wikimedia.org/T327277) (owner: Herron)
[00:22:06] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:22:34] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:23:44] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:23:50] <wikibugs>	 (CR) Krinkle: [C: +1] "LGTM. Puppet patches should go out a few minutes before this, and remember to restart the arclamp-log process if it doesn't do so automati" [mediawiki-config] - https://gerrit.wikimedia.org/r/920298 (https://phabricator.wikimedia.org/T327277) (owner: Herron)
[00:26:46] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:28:22] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:28:26] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:30:26] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:38:08] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:39:33] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/920342
[00:39:39] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/920342 (owner: TrainBranchBot)
[00:45:58] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:53:50] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:56:25] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/920342 (owner: TrainBranchBot)
[01:00:10] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:01:20] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:04:24] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:08:00] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:15:50] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:20:04] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:23:40] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:24:55] <wikibugs>	 SRE, Thumbor, Wikimedia-SVG-rendering, Upstream: librsvg misinterpret quoted font family names that contain whitespaces - https://phabricator.wikimedia.org/T64987 (JoKalliauer) Stalled→Resolved a:JoKalliauer |file |https://commons.wikimedia.org/wiki/File:T184369.svg | | librsvg2.40 |...
[01:29:28] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:31:02] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:31:08] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:31:30] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:37:48] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:45:40] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:51:14] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs2022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:53:32] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:00:58] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:01:28] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:02:34] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:05:46] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:06:32] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:07:46] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:32] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:13:34] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:15:38] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:23:30] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:24:34] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:26:32] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:27:48] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:29:20] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:31:24] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:32:28] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:37:40] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:45:32] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:50:01] <jinxer-wm>	 (NodeTextfileStale) firing: (2) Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[02:53:24] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:01:16] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:07:36] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:15:26] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:23:18] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:30:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:31:10] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:35:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:39:02] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:45:18] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:53:06] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:00:46] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:08:28] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:10:57] <wikibugs>	 (CR) Santhosh: [C: +1] MinT: Set CT2_INTRA_THREADS to 0 [deployment-charts] - https://gerrit.wikimedia.org/r/920687 (owner: KartikMistry)
[04:16:10] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:22:45] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/919408 (https://phabricator.wikimedia.org/T336427) (owner: Dzahn)
[04:24:04] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:30:20] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:33:50] <wikibugs>	 (CR) Aaron Schulz: "I didn't make a puppet patch yet. I was thinking about just ignoring all these variables for probe connections instead." [mediawiki-config] - https://gerrit.wikimedia.org/r/918612 (owner: Aaron Schulz)
[04:34:03] <wikibugs>	 (CR) Marostegui: [C: +1] Add add_user_is_temp_T336886.py [software/schema-changes] - https://gerrit.wikimedia.org/r/920771 (https://phabricator.wikimedia.org/T336886) (owner: Ladsgroup)
[04:36:46] <wikibugs>	 (PS1) Marostegui: control-mariadb-10.4-bullseye: Bump version [software] - https://gerrit.wikimedia.org/r/920809 (https://phabricator.wikimedia.org/T336462)
[04:37:30] <wikibugs>	 (CR) Marostegui: [C: +2] control-mariadb-10.4-bullseye: Bump version [software] - https://gerrit.wikimedia.org/r/920809 (https://phabricator.wikimedia.org/T336462) (owner: Marostegui)
[04:38:08] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:45:50] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:47:59] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db[2096,2101,2115,2131].codfw.wmnet with reason: maintenance
[04:48:14] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db[2096,2101,2115,2131].codfw.wmnet with reason: maintenance
[04:53:32] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:01:12] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:07:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:08:58] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:12:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:15:16] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:23:06] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:28:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:30:54] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:33:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:38:34] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:46:18] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:51:14] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs2022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:52:36] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230518T0600)
[06:00:05] <jouncebot>	 kormat, marostegui, and Amir1: Time to snap out of that daydream and deploy Primary database switchover. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230518T0600).
[06:00:30] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:04:29] <wikibugs>	 (PS1) Marostegui: instances.yaml: Remove db1122 from dbctl [puppet] - https://gerrit.wikimedia.org/r/920969 (https://phabricator.wikimedia.org/T336833)
[06:07:08] <wikibugs>	 (CR) Marostegui: [C: +2] instances.yaml: Remove db1122 from dbctl [puppet] - https://gerrit.wikimedia.org/r/920969 (https://phabricator.wikimedia.org/T336833) (owner: Marostegui)
[06:07:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1122 from dbctl T336833', diff saved to https://phabricator.wikimedia.org/P48362 and previous config saved to /var/cache/conftool/dbconfig/20230518-060734-marostegui.json
[06:07:39] <stashbot>	 T336833: decommission db1122.eqiad.wmnet - https://phabricator.wikimedia.org/T336833
[06:08:20] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:14:29] <wikibugs>	 (PS2) KartikMistry: MinT: Update to 2023-05-18-060931-production and Set CT2_INTRA_THREADS to 0 [deployment-charts] - https://gerrit.wikimedia.org/r/920687
[06:15:58] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:17:52] <icinga-wm>	 PROBLEM - Check systemd state on mwlog1002 is CRITICAL: CRITICAL - degraded: The following units failed: mw-log-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:23:38] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:23:42] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db[2134,2160].codfw.wmnet,db[1159,1217].eqiad.wmnet with reason: maintenance
[06:23:57] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db[2134,2160].codfw.wmnet,db[1159,1217].eqiad.wmnet with reason: maintenance
[06:31:26] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:37:32] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:45:12] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:46:10] <wikibugs>	 (PS1) Marostegui: phabricator.my.cnf.erb: Set gtid_domain_id=0 [puppet] - https://gerrit.wikimedia.org/r/920986 (https://phabricator.wikimedia.org/T336228)
[06:46:28] <wikibugs>	 (Abandoned) Marostegui: control-mariadb-client-10.4: Remove file [software] - https://gerrit.wikimedia.org/r/920653 (owner: Marostegui)
[06:46:44] <wikibugs>	 (CR) Marostegui: [C: +2] phabricator.my.cnf.erb: Set gtid_domain_id=0 [puppet] - https://gerrit.wikimedia.org/r/920986 (https://phabricator.wikimedia.org/T336228) (owner: Marostegui)
[06:50:01] <jinxer-wm>	 (NodeTextfileStale) firing: (2) Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:52:52] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:53:56] <wikibugs>	 (PS3) KartikMistry: MinT: Update to 2023-05-18-060931-production and Set CT2_INTRA_THREADS to 0 [deployment-charts] - https://gerrit.wikimedia.org/r/920687 (https://phabricator.wikimedia.org/T336483)
[06:54:33] <wikibugs>	 (CR) Mvolz: rest-gateway: add citoid support (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/920710 (https://phabricator.wikimedia.org/T329049) (owner: Hnowlan)
[07:00:05] <jouncebot>	 Amir1, apergos, and jnuche: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230518T0700).
[07:00:05] <jouncebot>	 kart_: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:12] <apergos>	 morning! there are no trainees signed up today. kart_ I see you have just the one patch which looks straight-forward enough. will you be self-deploying today? 
[07:00:24] <kart_>	 apergos: yes :)
[07:00:32] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:00:32] <apergos>	 ok!  it's all yours :-)
[07:01:27] <kart_>	 Thanks!
[07:02:49] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/920577 (https://phabricator.wikimedia.org/T327868) (owner: KartikMistry)
[07:03:05] <wikibugs>	 (PS3) KartikMistry: Enable the new Special:Contribute page entry point for desktop on selected wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/920577 (https://phabricator.wikimedia.org/T327868)
[07:04:29] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/920577 (https://phabricator.wikimedia.org/T327868) (owner: KartikMistry)
[07:05:20] <wikibugs>	 (Merged) jenkins-bot: Enable the new Special:Contribute page entry point for desktop on selected wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/920577 (https://phabricator.wikimedia.org/T327868) (owner: KartikMistry)
[07:06:06] <logmsgbot>	 !log kartik@deploy1002 Started scap: Backport for [[gerrit:920577|Enable the new Special:Contribute page entry point for desktop on selected wikis (T327868)]]
[07:06:10] <stashbot>	 T327868: Enable the new Special:Contribute page entry point for desktop on selected wikis - https://phabricator.wikimedia.org/T327868
[07:07:34] <logmsgbot>	 !log kartik@deploy1002 kartik: Backport for [[gerrit:920577|Enable the new Special:Contribute page entry point for desktop on selected wikis (T327868)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet
[07:08:10] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:15:24] <logmsgbot>	 !log kartik@deploy1002 Finished scap: Backport for [[gerrit:920577|Enable the new Special:Contribute page entry point for desktop on selected wikis (T327868)]] (duration: 09m 18s)
[07:15:28] <stashbot>	 T327868: Enable the new Special:Contribute page entry point for desktop on selected wikis - https://phabricator.wikimedia.org/T327868
[07:15:48] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:16:40] <kart_>	 apergos: I'm done if anyone wants to continue..
[07:16:50] <apergos>	 thanks!  
[07:17:02] <apergos>	 I'll give it 5 minutes and then close up shop for today
[07:23:28] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:25:52] <apergos>	 !log UTC morning backport and config training window done
[07:25:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:00] <apergos>	 see folks next time!
[07:27:24] <wikibugs>	 (CR) Filippo Giunchedi: "Following up from an IRC conversation:" [puppet] - https://gerrit.wikimedia.org/r/919419 (https://phabricator.wikimedia.org/T291015) (owner: Krinkle)
[07:31:08] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:34:16] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] cadvisor: add explicity metrics enable [puppet] - https://gerrit.wikimedia.org/r/920660 (https://phabricator.wikimedia.org/T108027) (owner: Filippo Giunchedi)
[07:38:50] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:46:32] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:52:42] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:59:57] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: name=registry2003.codfw.wmnet
[08:00:05] <jouncebot>	 dancy and hashar: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230518T0800).
[08:00:24] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:00:49] <akosiaris>	 !log upgrade registry on registry2003 to 2.8.2 
[08:00:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:14] <wikibugs>	 (PS3) Filippo Giunchedi: cadvisor: disable percpu and cpuLoad metric classes [puppet] - https://gerrit.wikimedia.org/r/920661 (https://phabricator.wikimedia.org/T108027)
[08:04:46] <wikibugs>	 (PS1) KartikMistry: Beta: Enable the new Special:Contribute page entry point for desktop [mediawiki-config] - https://gerrit.wikimedia.org/r/920988 (https://phabricator.wikimedia.org/T327868)
[08:08:02] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:12:01] <wikibugs>	 (PS1) Filippo Giunchedi: profile: rollout cadvisor to PoPs [puppet] - https://gerrit.wikimedia.org/r/920991 (https://phabricator.wikimedia.org/T108027)
[08:12:45] <wikibugs>	 (CR) Elukey: [C: +2] ml-services: change isvc name to revertrisk-language-agnostic [deployment-charts] - https://gerrit.wikimedia.org/r/920725 (https://phabricator.wikimedia.org/T332998) (owner: AikoChou)
[08:13:00] <wikibugs>	 (CR) Elukey: [C: +2] services: change lift wing's kafka topic in changeprop's config [deployment-charts] - https://gerrit.wikimedia.org/r/920696 (https://phabricator.wikimedia.org/T333468) (owner: Elukey)
[08:14:24] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41229/console" [puppet] - https://gerrit.wikimedia.org/r/920991 (https://phabricator.wikimedia.org/T108027) (owner: Filippo Giunchedi)
[08:15:40] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:17:16] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[08:18:01] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1] "This is close to being a noop, since cadvisor already runs on a big chunk of hosts in PoPs anyways (cp hosts)" [puppet] - https://gerrit.wikimedia.org/r/920991 (https://phabricator.wikimedia.org/T108027) (owner: Filippo Giunchedi)
[08:19:05] <logmsgbot>	 !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop: sync
[08:19:19] <logmsgbot>	 !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync
[08:21:55] <logmsgbot>	 !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: sync
[08:22:11] <logmsgbot>	 !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync
[08:22:16] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[08:23:18] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:23:58] <wikibugs>	 (PS1) Filippo Giunchedi: prometheus: clean up eventgate prometheus alerts [puppet] - https://gerrit.wikimedia.org/r/921023 (https://phabricator.wikimedia.org/T309009)
[08:26:12] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: name=registry2003.codfw.wmnet
[08:27:26] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[08:28:28] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[08:29:00] <akosiaris>	 !log upgrade docker-registry to 2.8.2 on all registry hosts
[08:29:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:26] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[08:31:00] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:32:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[08:37:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[08:38:52] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:43:18] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:43:22] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:44:54] <wikibugs>	 (PS4) KartikMistry: MinT: Update to 2023-05-18-060931-production and Set CT2_INTRA_THREADS to 0 [deployment-charts] - https://gerrit.wikimedia.org/r/920687 (https://phabricator.wikimedia.org/T336483)
[08:45:08] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:53:02] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:00:54] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:08:46] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:16:30] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:22:48] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:26:12] <wikibugs>	 SRE, Infrastructure-Foundations, netbox, netops: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (cmooney) The updated PuppetDB -> Netbox import script has now been merged, and I've run it against all servers in Netbox in state 'active'...
[09:30:42] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:38:34] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:46:24] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:51:14] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs2022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:52:42] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:55:34] <TheresNoTime>	 jouncebot: nowandnext
[09:55:34] <jouncebot>	 For the next 0 hour(s) and 4 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230518T0800)
[09:55:35] <jouncebot>	 In 0 hour(s) and 4 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230518T1000)
[09:55:35] <jouncebot>	 In 0 hour(s) and 4 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230518T1000)
[09:55:46] <TheresNoTime>	 🚂
[09:58:59] <wikibugs>	 (CR) Elukey: [C: +2] changeprop: add liftwing outlink topic stream [deployment-charts] - https://gerrit.wikimedia.org/r/920282 (https://phabricator.wikimedia.org/T328899) (owner: AikoChou)
[10:00:05] <jouncebot>	 mvolz: My dear minions, it's time we take the moon! Just kidding. Time for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230518T1000).
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230518T1000)
[10:00:36] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:05:56] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: sync
[10:06:07] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: sync
[10:08:24] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:14:18] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:15:08] <icinga-wm>	 PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:16:02] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:17:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:22:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:23:52] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:24:41] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-cache1001.eqiad.wmnet
[10:24:54] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host ml-cache1001.eqiad.wmnet
[10:25:17] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-cache1001.eqiad.wmnet
[10:25:59] <wikibugs>	 (PS1) Cathal Mooney: Improve logic getting switch port when primary IP is on bridge device [cookbooks] - https://gerrit.wikimedia.org/r/921032 (https://phabricator.wikimedia.org/T296832)
[10:27:04] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:27:24] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:28:55] <wikibugs>	 (PS2) Cathal Mooney: Improve logic getting switch port when primary IP is on bridge device [cookbooks] - https://gerrit.wikimedia.org/r/921032 (https://phabricator.wikimedia.org/T296832)
[10:29:12] <wikibugs>	 (PS1) MVernon: hiera: remove ms-be204[0-3] from swift::storagehosts [puppet] - https://gerrit.wikimedia.org/r/921034 (https://phabricator.wikimedia.org/T335280)
[10:29:16] <icinga-wm>	 RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:29:58] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:29:59] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on an-worker1110.eqiad.wmnet with reason: Troubleshooting failed disk
[10:30:08] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:30:13] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on an-worker1110.eqiad.wmnet with reason: Troubleshooting failed disk
[10:31:00] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on an-worker1110 is CRITICAL: CRITICAL - degraded: The following units failed: var-lib-hadoop-data-f.mount Btullis Troubleshooting failed disk - T336929 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:31:00] <icinga-wm>	 ACKNOWLEDGEMENT - MegaRAID on an-worker1110 is CRITICAL: CRITICAL: 12 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough Btullis Troubleshooting failed disk - T336929 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:32:03] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-cache1001.eqiad.wmnet
[10:32:15] <wikibugs>	 (CR) MVernon: "I've updated our docs a bit about the decom process - all looks clear to you?" [puppet] - https://gerrit.wikimedia.org/r/921034 (https://phabricator.wikimedia.org/T335280) (owner: MVernon)
[10:37:58] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:45:50] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:50:01] <jinxer-wm>	 (NodeTextfileStale) firing: (2) Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[10:50:15] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-cache1002.eqiad.wmnet
[10:51:28] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] Add nginx logs for docker-registry host to rsyslog [puppet] - https://gerrit.wikimedia.org/r/919350 (https://phabricator.wikimedia.org/T322579) (owner: EoghanGaffney)
[10:51:45] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] Send nginx and docker-registry logs to kafka [puppet] - https://gerrit.wikimedia.org/r/919351 (https://phabricator.wikimedia.org/T322579) (owner: EoghanGaffney)
[10:53:42] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:57:15] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-cache1002.eqiad.wmnet
[11:00:37] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-cache1003.eqiad.wmnet
[11:01:30] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:03:23] <TheresNoTime>	 jouncebot: nowandnext
[11:03:23] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 56 minute(s)
[11:03:23] <jouncebot>	 In 1 hour(s) and 56 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230518T1300)
[11:03:24] <jouncebot>	 In 1 hour(s) and 56 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230518T1300)
[11:04:36] <wikibugs>	 (PS6) Samtar: InitialiseSettings: Set wgWatchersMaxAge=30days [mediawiki-config] - https://gerrit.wikimedia.org/r/919023 (https://phabricator.wikimedia.org/T336250) (owner: Sarah Mukuti)
[11:05:13] * kart_ is deploying MinT
[11:06:20] <wikibugs>	 (CR) KartikMistry: [C: +2] MinT: Update to 2023-05-18-060931-production and Set CT2_INTRA_THREADS to 0 [deployment-charts] - https://gerrit.wikimedia.org/r/920687 (https://phabricator.wikimedia.org/T336483) (owner: KartikMistry)
[11:07:03] <wikibugs>	 (Merged) jenkins-bot: MinT: Update to 2023-05-18-060931-production and Set CT2_INTRA_THREADS to 0 [deployment-charts] - https://gerrit.wikimedia.org/r/920687 (https://phabricator.wikimedia.org/T336483) (owner: KartikMistry)
[11:07:34] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-cache1003.eqiad.wmnet
[11:07:42] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:09:28] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[11:11:00] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[11:11:07] <wikibugs>	 SRE, Infrastructure-Foundations, cloud-services-team, netops: Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (cmooney)
[11:11:17] <wikibugs>	 SRE, Cloud-VPS, Infrastructure-Foundations, cloud-services-team, netops: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (cmooney)
[11:12:05] <wikibugs>	 SRE, ops-codfw, Cloud-VPS, Infrastructure-Foundations, and 2 others: cloudservices[2004/2005]-dev & cloudweb2002-dev: connect them to cloudsw so they can have cloud-private vlan - https://phabricator.wikimedia.org/T336587 (cmooney) Open→Resolved Couple of niggles getting this going on the...
[11:15:34] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:17:04] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1110 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:17:28] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1110 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[11:17:41] <TheresNoTime>	 Question: where is the `deployment.eqiad.wmnet` service alias set?
[11:18:29] <RhinosF1>	 TheresNoTime: dns repo I believe
[11:20:47] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply
[11:21:17] <RhinosF1>	 TheresNoTime: https://github.com/wikimedia/operations-dns/blob/master/templates/wmnet#L30
[11:21:55] <TheresNoTime>	 RhinosF1: smh, I was looking at https://wikitech.wikimedia.org/wiki/Deployment_server#Service and expecting to see `deploy2002`
[11:23:26] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:23:38] <RhinosF1>	 TheresNoTime: ah
[11:23:39] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply
[11:24:42] <RhinosF1>	 TheresNoTime: I think updating the page is manual and probably didn't follow the switchover if it's wrong
[11:25:11] <RhinosF1>	 https://github.com/wikimedia/operations-dns/commit/20df3f9118e9f0471066e224638d0501483390e0
[11:25:20] <RhinosF1>	 Ye
[11:26:25] * TheresNoTime shrug
[11:27:14] <RhinosF1>	 TheresNoTime: it's a wiki, you can fix it :)
[11:27:41] <TheresNoTime>	 :p
[11:28:05] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply
[11:29:18] <wikibugs>	 (PS3) Slyngshede: Offboarding: Allow managers to offboard users. [software/bitu] - https://gerrit.wikimedia.org/r/920665 (https://phabricator.wikimedia.org/T335476)
[11:31:18] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:31:54] <wikibugs>	 SRE, Bitu, Infrastructure-Foundations, Patch-For-Review: User offboarding - https://phabricator.wikimedia.org/T335476 (RhinosF1) Employees being off-boarded from the WMF may wish to continue in some roles as a volunteer.  Will this support keeping some roles or switching from 'wmf' to 'nda' ldap...
[11:34:02] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply
[11:36:45] <kart_>	 !log MinT: Update to 2023-05-18-060931-production and Set CT2_INTRA_THREADS to 0 (T336483)
[11:36:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:36:49] <stashbot>	 T336483: Long sequence of a repeated word appears only when using MinT but not NLLB-200 directly - https://phabricator.wikimedia.org/T336483
[11:36:54] <wikibugs>	 (CR) Cathal Mooney: [C: +1] "LGTM!  Some comments in line in terms of the approach with interface names but overall looks good I expect it should work and do what we n" [software/netbox-extras] - https://gerrit.wikimedia.org/r/917876 (https://phabricator.wikimedia.org/T310590) (owner: Ayounsi)
[11:37:34] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:38:03] <wikibugs>	 (CR) Cathal Mooney: [C: +2] Automate and update DHCP relay configuration [homer/public] - https://gerrit.wikimedia.org/r/908346 (https://phabricator.wikimedia.org/T312635) (owner: Cathal Mooney)
[11:38:41] <wikibugs>	 (Merged) jenkins-bot: Automate and update DHCP relay configuration [homer/public] - https://gerrit.wikimedia.org/r/908346 (https://phabricator.wikimedia.org/T312635) (owner: Cathal Mooney)
[11:41:23] <wikibugs>	 SRE, Bitu, Infrastructure-Foundations, Patch-For-Review: User offboarding - https://phabricator.wikimedia.org/T335476 (SLyngshede-WMF) As currently planned there will just be a list of roles/LDAP groups which is removed from users during off-boarding. Any other groups that user belongs to is not...
[11:45:22] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:49:09] <icinga-wm>	 ACKNOWLEDGEMENT - MegaRAID on an-worker1110 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T336932 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[11:49:13] <wikibugs>	 SRE, ops-eqiad: Degraded RAID on an-worker1110 - https://phabricator.wikimedia.org/T336932 (ops-monitoring-bot)
[11:51:10] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-etcd1001.eqiad.wmnet
[11:52:10] <wikibugs>	 (PS3) Hnowlan: wip: upgrade container and dependencies for bullseye [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/920760 (https://phabricator.wikimedia.org/T336881)
[11:53:08] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:55:02] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-etcd1001.eqiad.wmnet
[11:55:49] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1 C: +2] prometheus: don't fail on unknown blackbox probe type [puppet] - https://gerrit.wikimedia.org/r/919331 (https://phabricator.wikimedia.org/T320620) (owner: Filippo Giunchedi)
[11:56:08] <wikibugs>	 (CR) CI reject: [V: -1] wip: upgrade container and dependencies for bullseye [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/920760 (https://phabricator.wikimedia.org/T336881) (owner: Hnowlan)
[11:56:16] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[11:56:25] <topranks>	 !log reconfiguring DHCP relay function on eqiad core routers (T320508)
[11:56:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:56:29] <stashbot>	 T320508: Core routers: replace bootp with dhcp-relay - https://phabricator.wikimedia.org/T320508
[12:00:36] <wikibugs>	 SRE-swift-storage, Observability-Metrics, User-fgiunchedi: Investigate HW requirements for Thanos frontend - https://phabricator.wikimedia.org/T312201 (fgiunchedi) Open→Resolved a:fgiunchedi Resolving as we'll be getting the hardware (same specs as current thanos-fe)
[12:00:58] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:01:16] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[12:02:42] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd1001.eqiad.wmnet
[12:06:32] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd1001.eqiad.wmnet
[12:08:40] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:11:16] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd1002.eqiad.wmnet
[12:12:10] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm
[12:12:15] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Core routers: replace bootp with dhcp-relay - https://phabricator.wikimedia.org/T320508 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bookworm
[12:15:07] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd1002.eqiad.wmnet
[12:16:18] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:16:32] <wikibugs>	 (PS1) Filippo Giunchedi: sre: add systemd unit crashloop alert [alerts] - https://gerrit.wikimedia.org/r/921047 (https://phabricator.wikimedia.org/T293970)
[12:16:41] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd1003.eqiad.wmnet
[12:17:27] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bookworm
[12:17:32] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Core routers: replace bootp with dhcp-relay - https://phabricator.wikimedia.org/T320508 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors: - sretest1002...
[12:18:24] <wikibugs>	 (CR) Filippo Giunchedi: "Current units crashlooping: https://thanos.wikimedia.org/graph?g0.expr=%20%20%20%20%20%20%20%20%20%20increase(node_systemd_service_restart" [alerts] - https://gerrit.wikimedia.org/r/921047 (https://phabricator.wikimedia.org/T293970) (owner: Filippo Giunchedi)
[12:19:16] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm
[12:19:22] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Core routers: replace bootp with dhcp-relay - https://phabricator.wikimedia.org/T320508 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bookworm
[12:20:32] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd1003.eqiad.wmnet
[12:23:58] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:24:09] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bookworm
[12:24:15] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Core routers: replace bootp with dhcp-relay - https://phabricator.wikimedia.org/T320508 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors: - sretest1002...
[12:24:33] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm
[12:24:39] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Core routers: replace bootp with dhcp-relay - https://phabricator.wikimedia.org/T320508 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bookworm
[12:28:46] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-staging-ctrl2001.codfw.wmnet
[12:30:51] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:34:08] <logmsgbot>	 !log otto@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[12:35:13] <logmsgbot>	 !log otto@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[12:35:18] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-staging-ctrl2001.codfw.wmnet
[12:35:28] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-staging-ctrl2002.codfw.wmnet
[12:35:51] <wikibugs>	 (PS1) Ottomata: Revert "Enable First Input Delay events." [mediawiki-config] - https://gerrit.wikimedia.org/r/920735
[12:36:48] <wikibugs>	 (CR) Ottomata: [C: +2] Revert "Enable First Input Delay events." [mediawiki-config] - https://gerrit.wikimedia.org/r/920735 (owner: Ottomata)
[12:37:01] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:37:34] <wikibugs>	 (Merged) jenkins-bot: Revert "Enable First Input Delay events." [mediawiki-config] - https://gerrit.wikimedia.org/r/920735 (owner: Ottomata)
[12:41:49] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-staging-ctrl2002.codfw.wmnet
[12:44:08] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bookworm
[12:44:13] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Core routers: replace bootp with dhcp-relay - https://phabricator.wikimedia.org/T320508 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors: - sretest1002...
[12:44:30] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm
[12:44:35] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Core routers: replace bootp with dhcp-relay - https://phabricator.wikimedia.org/T320508 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bookworm
[12:46:13] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:46:21] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:46:47] <elukey>	 !log clean up old jupyterhub.service references (crash looping) on stat* nodes that had it
[12:46:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:46:51] <logmsgbot>	 !log otto@deploy1002 Synchronized wmf-config/ext-EventLogging.php: Revert Enable First Input Delay events. This is causing validation errors as well as breakages in the hadoop ingestion pipeline - T332012 (duration: 07m 00s)
[12:46:55] <stashbot>	 T332012: Collect first input delay - https://phabricator.wikimedia.org/T332012
[12:47:09] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:51:04] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bookworm
[12:51:09] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Core routers: replace bootp with dhcp-relay - https://phabricator.wikimedia.org/T320508 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors: - sretest1002...
[12:51:15] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:51:24] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm
[12:51:30] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Core routers: replace bootp with dhcp-relay - https://phabricator.wikimedia.org/T320508 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bookworm
[12:51:48] <wikibugs>	 (CR) Alexandros Kosiaris: [C: -1] "Couple of comments all around. The premise and logic sound fine to me. That being said, this is a huge patch, I did my best, but something" [puppet] - https://gerrit.wikimedia.org/r/909687 (https://phabricator.wikimedia.org/T325268) (owner: JMeybohm)
[12:51:48] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-staging-worker
[12:51:56] <wikibugs>	 (PS2) Samtar: Beta: Enable the new Special:Contribute page entry point for desktop [mediawiki-config] - https://gerrit.wikimedia.org/r/920988 (https://phabricator.wikimedia.org/T327868) (owner: KartikMistry)
[12:52:56] <wikibugs>	 (CR) Alexandros Kosiaris: [C: -1] "PCC btw appears fine as far as I can tell. Changes only to puppet resources, so this should, in theory, be a noop" [puppet] - https://gerrit.wikimedia.org/r/909687 (https://phabricator.wikimedia.org/T325268) (owner: JMeybohm)
[12:52:57] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:54:51] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on A:ml-staging-worker
[12:55:54] <TheresNoTime>	 kart_: if you're around, did you want me to push your beta-only change now?
[12:56:48] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bookworm
[12:56:53] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Core routers: replace bootp with dhcp-relay - https://phabricator.wikimedia.org/T320508 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors: - sretest1002...
[12:57:03] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm
[12:57:09] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Core routers: replace bootp with dhcp-relay - https://phabricator.wikimedia.org/T320508 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bookworm
[12:57:28] <jan_drewniak>	 TheresNoTime: I also have a last-minute beta-only change :D 
[12:57:45] <TheresNoTime>	 jan_drewniak: I'll do that now if that's okay?
[12:57:53] <jan_drewniak>	 TheresNoTime: yes please :) 
[12:57:54] <wikibugs>	 (PS3) Samtar: [Beta] Enable Vector AB test on beta spanish wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/920793 (https://phabricator.wikimedia.org/T335972) (owner: Jdrewniak)
[12:58:07] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49993 bytes in 0.076 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:58:13] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 20 Jun 2023 04:41:39 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:58:39] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.279 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:58:46] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/920793 (https://phabricator.wikimedia.org/T335972) (owner: Jdrewniak)
[12:59:18] <logmsgbot>	 !log otto@deploy1002 Synchronized wmf-config/ext-EventStreamConfig.php: Revert Enable First Input Delay events. This is causing validation errors as well as breakages in the hadoop ingestion pipeline - T332012 (duration: 06m 19s)
[12:59:22] <stashbot>	 T332012: Collect first input delay - https://phabricator.wikimedia.org/T332012
[12:59:29] <wikibugs>	 (Merged) jenkins-bot: [Beta] Enable Vector AB test on beta spanish wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/920793 (https://phabricator.wikimedia.org/T335972) (owner: Jdrewniak)
[13:00:04] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230518T1300)
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230518T1300).
[13:00:05] <jouncebot>	 kart_, TheresNoTime, and jan_drewniak: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:07] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:00:11] * TheresNoTime can deploy
[13:00:29] <TheresNoTime>	 jan_drewniak: done, will be on the next `beta-code-update-eqiad`
[13:00:39] <jan_drewniak>	 TheresNoTime: thanks!
[13:00:42] <kart_>	 TheresNoTime: sure. Go ahead.
[13:00:57] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/920988 (https://phabricator.wikimedia.org/T327868) (owner: KartikMistry)
[13:01:06] <wikibugs>	 (PS3) Samtar: Beta: Enable the new Special:Contribute page entry point for desktop [mediawiki-config] - https://gerrit.wikimedia.org/r/920988 (https://phabricator.wikimedia.org/T327868) (owner: KartikMistry)
[13:01:53] <wikibugs>	 (CR) TrainBranchBot: "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/920988 (https://phabricator.wikimedia.org/T327868) (owner: KartikMistry)
[13:02:42] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bookworm
[13:02:42] <wikibugs>	 (Merged) jenkins-bot: Beta: Enable the new Special:Contribute page entry point for desktop [mediawiki-config] - https://gerrit.wikimedia.org/r/920988 (https://phabricator.wikimedia.org/T327868) (owner: KartikMistry)
[13:02:49] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Core routers: replace bootp with dhcp-relay - https://phabricator.wikimedia.org/T320508 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors: - sretest1002...
[13:03:19] <TheresNoTime>	 kart_: done, in the next `beta-code-update-eqiad` :)
[13:03:38] <wikibugs>	 (PS7) Samtar: InitialiseSettings: Set wgWatchersMaxAge=30days [mediawiki-config] - https://gerrit.wikimedia.org/r/919023 (https://phabricator.wikimedia.org/T336250) (owner: Sarah Mukuti)
[13:04:38] <wikibugs>	 (CR) ArielGlenn: [C: +2] Add dump user subdirectories to support testing of new dumps nfs shares [puppet] - https://gerrit.wikimedia.org/r/915423 (https://phabricator.wikimedia.org/T325232) (owner: ArielGlenn)
[13:04:57] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/919023 (https://phabricator.wikimedia.org/T336250) (owner: Sarah Mukuti)
[13:05:40] <wikibugs>	 (Merged) jenkins-bot: InitialiseSettings: Set wgWatchersMaxAge=30days [mediawiki-config] - https://gerrit.wikimedia.org/r/919023 (https://phabricator.wikimedia.org/T336250) (owner: Sarah Mukuti)
[13:06:09] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:919023|InitialiseSettings: Set wgWatchersMaxAge=30days (T336250)]]
[13:06:13] <stashbot>	 T336250: Decrease wgWatchersMaxAge to 30 days, and display this on `action=info` - https://phabricator.wikimedia.org/T336250
[13:06:15] <wikibugs>	 (CR) ArielGlenn: [C: +2] add nfs tester to dumps worker (snapshot) testbed role [puppet] - https://gerrit.wikimedia.org/r/915437 (https://phabricator.wikimedia.org/T325232) (owner: ArielGlenn)
[13:06:34] <kart_>	 TheresNoTime: looks good. Had to refresh page multiple times! Please go ahead for full deployment.
[13:07:27] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:07:31] <TheresNoTime>	 kart_: that should be live proper on the beta cluster now
[13:07:40] <logmsgbot>	 !log samtar@deploy1002 samtar and s-mukuti: Backport for [[gerrit:919023|InitialiseSettings: Set wgWatchersMaxAge=30days (T336250)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet
[13:07:43] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm
[13:07:51] <TheresNoTime>	 (testing my patch ^)
[13:08:04] <kart_>	 TheresNoTime: Thanks!
[13:09:27] <TheresNoTime>	 (syncing mine)
[13:13:18] <wikibugs>	 (PS2) ArielGlenn: create custom db list files for testing of nfs shares for xml dumps [puppet] - https://gerrit.wikimedia.org/r/915447 (https://phabricator.wikimedia.org/T325232)
[13:14:54] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:919023|InitialiseSettings: Set wgWatchersMaxAge=30days (T336250)]] (duration: 08m 45s)
[13:14:59] <stashbot>	 T336250: Decrease wgWatchersMaxAge to 30 days, and display this on `action=info` - https://phabricator.wikimedia.org/T336250
[13:16:15] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:18:44] <TheresNoTime>	 !log closing backport window
[13:18:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:16] <wikibugs>	 (CR) Vgutierrez: [C: +1] cadvisor: disable percpu and cpuLoad metric classes [puppet] - https://gerrit.wikimedia.org/r/920661 (https://phabricator.wikimedia.org/T108027) (owner: Filippo Giunchedi)
[13:23:37] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:24:10] <wikibugs>	 (CR) Herron: [C: +1] sre: add systemd unit crashloop alert [alerts] - https://gerrit.wikimedia.org/r/921047 (https://phabricator.wikimedia.org/T293970) (owner: Filippo Giunchedi)
[13:25:52] <wikibugs>	 SRE, ops-codfw, Cloud-Services, cloud-services-team: cloudbackup2001 lockup on 2023-05-05 - https://phabricator.wikimedia.org/T336060 (Jhancock.wm) @Andrew I went back through the lifecycle logs on the idrac and I could not find a cause for the ssh going down or the lagged response. I inspected t...
[13:31:07] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:38:47] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:43:41] <wikibugs>	 (PS1) Andrew Bogott: Revert "envscripts: include OS_CLOUD in environment." [puppet] - https://gerrit.wikimedia.org/r/920736
[13:44:37] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Revert "envscripts: include OS_CLOUD in environment." [puppet] - https://gerrit.wikimedia.org/r/920736 (owner: Andrew Bogott)
[13:46:21] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:47:18] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-staging-worker
[13:49:21] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on A:ml-staging-worker
[13:50:54] <topranks>	 elukey: does part of what you're doing involve reimaging an-worker1156 ?
[13:50:55] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-staging-worker
[13:51:14] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs2022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:52:25] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:52:47] <elukey>	 topranks: nope! only ml nodes
[13:52:57] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on A:ml-staging-worker
[13:53:14] <topranks>	 ok - just checking - I may have borked DHCP for some things, working on it..... someone else must be trying
[13:53:15] <topranks>	 thanks!
[13:56:38] <wikibugs>	 (PS1) KartikMistry: Enable the new Special:Contribute page entry point for desktop on selected wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/921049 (https://phabricator.wikimedia.org/T327868)
[13:58:34] <wikibugs>	 (PS1) Samtar: logspam-watch: Add a fox emoji [puppet] - https://gerrit.wikimedia.org/r/921050
[13:59:40] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-serve-worker-codfw
[14:01:09] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:01:19] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:01:21] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:01:49] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on A:ml-serve-worker-codfw
[14:06:32] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:08:45] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:11:32] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:12:08] <wikibugs>	 (PS2) Hnowlan: engine: Remove custom XCF handler in favor of ImageMagick [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/619864 (https://phabricator.wikimedia.org/T260285) (owner: AntiCompositeNumber)
[14:12:43] <wikibugs>	 (PS1) Elukey: knative-serving: add replica count config for webhook [deployment-charts] - https://gerrit.wikimedia.org/r/921052
[14:15:23] <wikibugs>	 (CR) Ssingh: "Thanks for working on this patch! Comments inline:" [puppet] - https://gerrit.wikimedia.org/r/920794 (https://phabricator.wikimedia.org/T335533) (owner: BCornwall)
[14:16:07] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:16:13] <wikibugs>	 (CR) Elukey: [C: +2] knative-serving: add replica count config for webhook [deployment-charts] - https://gerrit.wikimedia.org/r/921052 (owner: Elukey)
[14:16:32] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:16:33] <wikibugs>	 (PS3) Hnowlan: engine: Remove custom XCF handler in favor of ImageMagick [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/619864 (https://phabricator.wikimedia.org/T260285) (owner: AntiCompositeNumber)
[14:20:54] <wikibugs>	 (PS1) Hnowlan: thumbor: move xcf support to imagemagick [deployment-charts] - https://gerrit.wikimedia.org/r/921053 (https://phabricator.wikimedia.org/T260285)
[14:23:29] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:23:56] <wikibugs>	 (CR) CI reject: [V: -1] engine: Remove custom XCF handler in favor of ImageMagick [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/619864 (https://phabricator.wikimedia.org/T260285) (owner: AntiCompositeNumber)
[14:24:21] <icinga-wm>	 PROBLEM - Host gitlab-runner1003 is DOWN: PING CRITICAL - Packet loss = 100%
[14:25:27] <icinga-wm>	 RECOVERY - Host gitlab-runner1003 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[14:30:02] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[14:30:28] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[14:30:55] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:31:19] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bookworm
[14:31:25] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Core routers: replace bootp with dhcp-relay - https://phabricator.wikimedia.org/T320508 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors: - sretest1002...
[14:31:43] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm
[14:31:48] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Core routers: replace bootp with dhcp-relay - https://phabricator.wikimedia.org/T320508 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bookworm
[14:34:30] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-staging-worker
[14:38:19] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:38:58] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts gitlab-runner1003.eqiad.wmnet
[14:41:12] <wikibugs>	 SRE, Traffic: Add systemd-level service bindings for Wikimedia DNS - https://phabricator.wikimedia.org/T336792 (ssingh)
[14:43:35] <wikibugs>	 (PS4) Hnowlan: engine: Remove custom XCF handler in favor of ImageMagick [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/619864 (https://phabricator.wikimedia.org/T260285) (owner: AntiCompositeNumber)
[14:45:43] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:46:16] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[14:47:21] <wikibugs>	 (PS1) Cathal Mooney: Add trust-option-82 to dhcp relay conf for core routers [homer/public] - https://gerrit.wikimedia.org/r/921054 (https://phabricator.wikimedia.org/T320508)
[14:49:45] <wikibugs>	 (CR) Dzahn: [C: +2] openstack: remove old Gerrit IP from cloudgw [puppet] - https://gerrit.wikimedia.org/r/919408 (https://phabricator.wikimedia.org/T336427) (owner: Dzahn)
[14:49:50] <wikibugs>	 (PS2) Dzahn: openstack: remove old Gerrit IP from cloudgw [puppet] - https://gerrit.wikimedia.org/r/919408 (https://phabricator.wikimedia.org/T336427)
[14:50:01] <jinxer-wm>	 (NodeTextfileStale) firing: (2) Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[14:50:02] <wikibugs>	 (PS1) Elukey: knative-serving: fix pdb for the webhook [deployment-charts] - https://gerrit.wikimedia.org/r/921055
[14:51:16] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[14:52:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:52:54] <wikibugs>	 (CR) Elukey: [C: +2] knative-serving: fix pdb for the webhook [deployment-charts] - https://gerrit.wikimedia.org/r/921055 (owner: Elukey)
[14:53:21] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:54:15] <wikibugs>	 (CR) Cathal Mooney: [C: +2] Add trust-option-82 to dhcp relay conf for core routers [homer/public] - https://gerrit.wikimedia.org/r/921054 (https://phabricator.wikimedia.org/T320508) (owner: Cathal Mooney)
[14:54:50] <wikibugs>	 (Merged) jenkins-bot: Add trust-option-82 to dhcp relay conf for core routers [homer/public] - https://gerrit.wikimedia.org/r/921054 (https://phabricator.wikimedia.org/T320508) (owner: Cathal Mooney)
[14:56:32] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[14:57:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:57:05] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[14:58:29] <wikibugs>	 (Abandoned) Samtar: diff: Only show inline legend for text slot [core] (wmf/1.41.0-wmf.9) - https://gerrit.wikimedia.org/r/920578 (https://phabricator.wikimedia.org/T336481) (owner: Samtar)
[14:59:35] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[14:59:56] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[15:00:04] <wikibugs>	 (CR) Dzahn: [C: +1] "I was personally for it but maybe there is no consensus on this one. Other people should also vote here." [cookbooks] - https://gerrit.wikimedia.org/r/899772 (https://phabricator.wikimedia.org/T332202) (owner: BCornwall)
[15:00:33] <wikibugs>	 (PS1) Ottomata: page_content_chnage - Shorten name of service and namespace in wikikube [deployment-charts] - https://gerrit.wikimedia.org/r/921056 (https://phabricator.wikimedia.org/T330507)
[15:00:35] <wikibugs>	 (Abandoned) Samtar: onDifferenceEngineBeforeDiffTable: Return early on Special pages [extensions/VisualEditor] (wmf/1.41.0-wmf.9) - https://gerrit.wikimedia.org/r/920579 (https://phabricator.wikimedia.org/T336582) (owner: Samtar)
[15:01:05] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:01:08] <wikibugs>	 (CR) CI reject: [V: -1] page_content_chnage - Shorten name of service and namespace in wikikube [deployment-charts] - https://gerrit.wikimedia.org/r/921056 (https://phabricator.wikimedia.org/T330507) (owner: Ottomata)
[15:02:38] <wikibugs>	 (CR) Dzahn: "yea, I mean I am not opposed to it but it probably needs testing and that is kind of the part I wanted to skip. so not sure how to vote, I" [puppet] - https://gerrit.wikimedia.org/r/918424 (https://phabricator.wikimedia.org/T336217) (owner: Jelto)
[15:02:58] <logmsgbot>	 !log stevemunene@deploy1002 Started deploy [airflow-dags/analytics_product@6e3358d]: (no justification provided)
[15:03:05] <logmsgbot>	 !log stevemunene@deploy1002 Finished deploy [airflow-dags/analytics_product@6e3358d]: (no justification provided) (duration: 00m 06s)
[15:04:19] <wikibugs>	 (PS2) Ottomata: page_content_chnage - Shorten name of service and namespace in wikikube [deployment-charts] - https://gerrit.wikimedia.org/r/921056 (https://phabricator.wikimedia.org/T330507)
[15:04:43] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-staging-etcd2001.codfw.wmnet
[15:06:31] <wikibugs>	 (03CR) 10Hnowlan: rest-gateway: add citoid support (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/920710 (https://phabricator.wikimedia.org/T329049) (owner: 10Hnowlan)
[15:08:37] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-staging-etcd2001.codfw.wmnet
[15:08:45] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:09:12] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-staging-etcd2002.codfw.wmnet
[15:10:48] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] page_content_chnage - Shorten name of service and namespace in wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/921056 (https://phabricator.wikimedia.org/T330507) (owner: 10Ottomata)
[15:13:06] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-staging-etcd2002.codfw.wmnet
[15:13:14] <wikibugs>	 (03Merged) 10jenkins-bot: page_content_chnage - Shorten name of service and namespace in wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/921056 (https://phabricator.wikimedia.org/T330507) (owner: 10Ottomata)
[15:15:23] <logmsgbot>	 !log otto@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[15:16:22] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-staging-etcd2003.codfw.wmnet
[15:16:27] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:17:19] <logmsgbot>	 !log otto@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[15:18:17] <logmsgbot>	 !log otto@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[15:18:29] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Core routers: replace bootp with dhcp-relay - https://phabricator.wikimedia.org/T320508 (10cmooney) >>! In T320508#8488549, @ayounsi wrote: > Marking this task dependent on DHCP option 97 to reduce the risk of DHCP oddities related to Option 82.  Ironic I hadn't...
[15:18:39] <logmsgbot>	 !log otto@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[15:19:14] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Core routers: replace bootp with dhcp-relay - https://phabricator.wikimedia.org/T320508 (10cmooney) 05Open→03Resolved
[15:19:23] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on A:ml-staging-worker
[15:20:15] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-staging-etcd2003.codfw.wmnet
[15:20:25] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Consolidate Automation Templates for DC Switches - https://phabricator.wikimedia.org/T312635 (10cmooney)
[15:20:48] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Allow managing drmrs DHCP settings with Homer - https://phabricator.wikimedia.org/T328737 (10cmooney) 05Open→03Resolved Complete now after merging above patch.
[15:22:31] <wikibugs>	 (03PS1) 10Ottomata: kubernetes::deployment_server::services - define mw-page-content-change-enrich [puppet] - 10https://gerrit.wikimedia.org/r/921058 (https://phabricator.wikimedia.org/T330507)
[15:22:35] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:23:13] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl1001.eqiad.wmnet
[15:23:59] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] kubernetes::deployment_server::services - define mw-page-content-change-enrich [puppet] - 10https://gerrit.wikimedia.org/r/921058 (https://phabricator.wikimedia.org/T330507) (owner: 10Ottomata)
[15:25:33] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bookworm
[15:25:39] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Core routers: replace bootp with dhcp-relay - https://phabricator.wikimedia.org/T320508 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors: - sretest1002...
[15:26:20] <wikibugs>	 (03CR) 10Brennen Bearnes: [C: 03+1] "Reasonable." [puppet] - 10https://gerrit.wikimedia.org/r/921050 (owner: 10Samtar)
[15:26:28] <TheresNoTime>	 :p
[15:26:43] <icinga-wm>	 PROBLEM - Check systemd state on ml-staging2002 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:28:56] <mutante>	 ignores that because it's staging.. but normally failed ferm is bad
[15:29:53] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl1001.eqiad.wmnet
[15:30:05] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:30:47] <elukey>	 mutante: temporary dns failure, fixed :)
[15:30:52] <elukey>	 (after a reboot)
[15:30:55] <mutante>	 elukey: cool:) ack!
[15:31:07] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl1002.eqiad.wmnet
[15:31:09] <mutante>	 failed DNS did come to mind when I saw failed ferm, indeed
[15:31:11] <icinga-wm>	 RECOVERY - Check systemd state on ml-staging2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:36:01] <wikibugs>	 (03CR) 10Eevans: [C: 03+1] hiera: remove ms-be204[0-3] from swift::storagehosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/921034 (https://phabricator.wikimedia.org/T335280) (owner: 10MVernon)
[15:37:33] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bullseye
[15:37:37] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:37:39] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Core routers: replace bootp with dhcp-relay - https://phabricator.wikimedia.org/T320508 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bullseye
[15:37:43] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl1002.eqiad.wmnet
[15:39:45] <wikibugs>	 (03CR) 10Bking: [C: 03+2] wcqs: Configure webproxy for federated queries [puppet] - 10https://gerrit.wikimedia.org/r/919216 (https://phabricator.wikimedia.org/T334470) (owner: 10Ebernhardson)
[15:43:28] <wikibugs>	 (03PS4) 10EoghanGaffney: doc: add password-protected rsync module for publishing from gitlab [puppet] - 10https://gerrit.wikimedia.org/r/920669 (https://phabricator.wikimedia.org/T336168) (owner: 10Jaime Nuche)
[15:44:06] <mutante>	 try typing the word "piwik" without your hand auto-completing to "wiki", I needed 3 tries :)
[15:45:17] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:45:36] <wikibugs>	 (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41231/console" [puppet] - 10https://gerrit.wikimedia.org/r/920669 (https://phabricator.wikimedia.org/T336168) (owner: 10Jaime Nuche)
[15:46:15] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - kubemaster_6443: Servers kubemaster2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[15:46:21] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - kubemaster_6443: Servers kubemaster2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[15:46:27] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic, 10Performance-Team (Radar): Mapping Client IPs to Resolver IPs - https://phabricator.wikimedia.org/T336947 (10JameelKaisar)
[15:47:15] <logmsgbot>	 !log mforns@deploy1002 Started deploy [airflow-dags/analytics_test@6e3358d]: (no justification provided)
[15:47:25] <logmsgbot>	 !log mforns@deploy1002 Finished deploy [airflow-dags/analytics_test@6e3358d]: (no justification provided) (duration: 00m 10s)
[15:49:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (19) High Kubernetes API latency (LIST apiservices) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:50:24] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1002.eqiad.wmnet with reason: host reimage
[15:52:41] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:53:37] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1002.eqiad.wmnet with reason: host reimage
[15:54:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (21) High Kubernetes API latency (LIST apiservices) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:57:49] <inflatador>	 !log bking@cumin1001 starting rolling restart of wcqs for java updates T334470
[15:57:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:57:53] <stashbot>	 T334470: Federated queries to Lingua Libre time out in the Commons query service - https://phabricator.wikimedia.org/T334470
[15:58:18] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[15:58:21] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[15:58:33] <wikibugs>	 (03PS1) 10Kimberly Sarabia: Reverts hewiki A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921059 (https://phabricator.wikimedia.org/T335309)
[15:59:59] <wikibugs>	 (03CR) 10EoghanGaffney: [V: 03+1 C: 03+1] "Looks good, with the additional change of switching the static list of gitlab-runner hosts to `wmflib::role::hosts` so that it resolves th" [puppet] - 10https://gerrit.wikimedia.org/r/920669 (https://phabricator.wikimedia.org/T336168) (owner: 10Jaime Nuche)
[16:00:04] <jouncebot>	 jbond and rzl: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230518T1600).
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:00:16] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:04:06] <wikibugs>	 (03PS2) 10Kimberly Sarabia: Reverts hewiki A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921059 (https://phabricator.wikimedia.org/T335309)
[16:07:24] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:10:02] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1002.eqiad.wmnet with OS bullseye
[16:10:08] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Core routers: replace bootp with dhcp-relay - https://phabricator.wikimedia.org/T320508 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bullseye completed: - sretest1002 (**PASS**)...
[16:11:36] <wikibugs>	 (03PS1) 10Ottomata: page_content_change - fix kafka SSL setting in wikikube staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/921060 (https://phabricator.wikimedia.org/T330507)
[16:13:01] <wikibugs>	 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T336949 (10phaultfinder)
[16:15:20] <wikibugs>	 (03CR) 10MVernon: [C: 03+2] hiera: remove ms-be204[0-3] from swift::storagehosts [puppet] - 10https://gerrit.wikimedia.org/r/921034 (https://phabricator.wikimedia.org/T335280) (owner: 10MVernon)
[16:15:20] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:05] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] page_content_change - fix kafka SSL setting in wikikube staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/921060 (https://phabricator.wikimedia.org/T330507) (owner: 10Ottomata)
[16:19:46] <wikibugs>	 (03PS1) 10Ottomata: page_content_change - move common kafka SSL settings into values-main.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/921061 (https://phabricator.wikimedia.org/T330507)
[16:20:58] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] page_content_change - move common kafka SSL settings into values-main.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/921061 (https://phabricator.wikimedia.org/T330507) (owner: 10Ottomata)
[16:21:21] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[16:21:25] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[16:22:38] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.67:443]) https://wikitech.wikimedia.org/wiki/PyBal
[16:22:40] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:24:14] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] engine: Remove custom XCF handler in favor of ImageMagick [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/619864 (https://phabricator.wikimedia.org/T260285) (owner: 10AntiCompositeNumber)
[16:25:19] <wikibugs>	 (03PS1) 10Andrew Bogott: cinder-backup: reduce backup workers to 2 per host [puppet] - 10https://gerrit.wikimedia.org/r/921063
[16:25:28] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.67:443]) https://wikitech.wikimedia.org/wiki/PyBal
[16:26:03] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] cinder-backup: reduce backup workers to 2 per host [puppet] - 10https://gerrit.wikimedia.org/r/921063 (owner: 10Andrew Bogott)
[16:29:18] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:29:47] <wikibugs>	 (03CR) 10Jdrewniak: [C: 03+1] Reverts hewiki A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921059 (https://phabricator.wikimedia.org/T335309) (owner: 10Kimberly Sarabia)
[16:29:52] <bblack>	 ^ pybal alerts about wcqs
[16:29:59] <bblack>	 known/ongoing maint stuff?
[16:30:09] <inflatador>	 bblack my bad, I forgot to repool. 1 sec
[16:30:25] <bblack>	 np, just checking
[16:30:26] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:31:00] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[16:31:11] <wikibugs>	 (03PS1) 10TChin: Allow managing leases in flink-operator namespace when using HA [deployment-charts] - 10https://gerrit.wikimedia.org/r/921064 (https://phabricator.wikimedia.org/T336185)
[16:33:44] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[16:34:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:38:12] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:42:16] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[16:43:38] <wikibugs>	 (03CR) 10Ottomata: "LGTM! Nits about comments." [deployment-charts] - 10https://gerrit.wikimedia.org/r/921064 (https://phabricator.wikimedia.org/T336185) (owner: 10TChin)
[16:45:58] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:47:16] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[16:49:20] <wikibugs>	 (03PS12) 10Eevans: (WIP) cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814)
[16:49:49] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] (WIP) cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans)
[16:53:44] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:54:19] <wikibugs>	 (03PS1) 10Dzahn: httpbb: add simple tests for gitlab.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/921065 (https://phabricator.wikimedia.org/T326891)
[16:55:22] <icinga-wm>	 RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:55:51] <XioNoX>	 !log push new pfw policies - T336896
[16:55:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:56:00] <wikibugs>	 (03CR) 10Dzahn: "This was also supposed to be like a demo of the syntax for other services, if you have ideas what are good test assertions." [puppet] - 10https://gerrit.wikimedia.org/r/921065 (https://phabricator.wikimedia.org/T326891) (owner: 10Dzahn)
[16:56:54] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:57:21] <wikibugs>	 (03PS2) 10Dzahn: httpbb: add simple tests for gitlab.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/921065 (https://phabricator.wikimedia.org/T326891)
[16:57:32] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:00:05] <jouncebot>	 bd808: Time to snap out of that daydream and deploy Technical Engagement weekly deploy (Toolhub, Developer portal, Striker). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230518T1700).
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230518T1700)
[17:00:16] <wikibugs>	 (03CR) 10Hokwelum: [C: 03+1] "checks out!" [puppet] - 10https://gerrit.wikimedia.org/r/915447 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn)
[17:00:38] <wikibugs>	 (03PS2) 10TChin: Allow managing leases in flink-operator namespace when using HA [deployment-charts] - 10https://gerrit.wikimedia.org/r/921064 (https://phabricator.wikimedia.org/T336185)
[17:01:15] <wikibugs>	 (03CR) 10ArielGlenn: [C: 03+2] create custom db list files for testing of nfs shares for xml dumps [puppet] - 10https://gerrit.wikimedia.org/r/915447 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn)
[17:01:28] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:03:30] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:03:33] <wikibugs>	 (03PS2) 10ArielGlenn: introduce a script to create output dirs on xml dumps test nfs shares [puppet] - 10https://gerrit.wikimedia.org/r/915455 (https://phabricator.wikimedia.org/T325232)
[17:05:03] <wikibugs>	 (03PS1) 10Clare Ming: Revert "Revert "VisualEditorFeatureUse sampling rate to 1 everywhere"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920742
[17:05:12] <wikibugs>	 (03PS2) 10Clare Ming: Revert "Revert "VisualEditorFeatureUse sampling rate to 1 everywhere"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920742
[17:05:14] <wikibugs>	 (03PS3) 10BCornwall: doh: Clearer expression of service dependencies [puppet] - 10https://gerrit.wikimedia.org/r/920794 (https://phabricator.wikimedia.org/T335533)
[17:06:41] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41233/console" [puppet] - 10https://gerrit.wikimedia.org/r/920794 (https://phabricator.wikimedia.org/T335533) (owner: 10BCornwall)
[17:07:29] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1] doh: Clearer expression of service dependencies (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/920794 (https://phabricator.wikimedia.org/T335533) (owner: 10BCornwall)
[17:07:31] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Allow managing leases in flink-operator namespace when using HA [deployment-charts] - 10https://gerrit.wikimedia.org/r/921064 (https://phabricator.wikimedia.org/T336185) (owner: 10TChin)
[17:07:40] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:08:45] <wikibugs>	 (03PS1) 10Ayounsi: Update ntp_servers list [homer/public] - 10https://gerrit.wikimedia.org/r/921066
[17:09:15] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Update ntp_servers list [homer/public] - 10https://gerrit.wikimedia.org/r/921066 (owner: 10Ayounsi)
[17:09:52] <wikibugs>	 (03Merged) 10jenkins-bot: Allow managing leases in flink-operator namespace when using HA [deployment-charts] - 10https://gerrit.wikimedia.org/r/921064 (https://phabricator.wikimedia.org/T336185) (owner: 10TChin)
[17:12:18] <wikibugs>	 (03PS2) 10Ayounsi: Update ntp_servers list [homer/public] - 10https://gerrit.wikimedia.org/r/921066
[17:13:05] <wikibugs>	 (03PS1) 10Ottomata: flink-operator - enable HA leader election and set replicas: 2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/921067 (https://phabricator.wikimedia.org/T336185)
[17:13:07] <wikibugs>	 (03PS2) 10Ayounsi: users: change my own key to ed25519 [homer/public] - 10https://gerrit.wikimedia.org/r/920638 (https://phabricator.wikimedia.org/T336769) (owner: 10Btullis)
[17:13:24] <wikibugs>	 (03PS1) 10Dzahn: httpbb: add simple test for VRTS [puppet] - 10https://gerrit.wikimedia.org/r/921068 (https://phabricator.wikimedia.org/T326891)
[17:14:35] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] "Checked active DNS hosts and the IPs, looks good!" [homer/public] - 10https://gerrit.wikimedia.org/r/921066 (owner: 10Ayounsi)
[17:15:05] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Update ntp_servers list [homer/public] - 10https://gerrit.wikimedia.org/r/921066 (owner: 10Ayounsi)
[17:15:25] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] users: change my own key to ed25519 [homer/public] - 10https://gerrit.wikimedia.org/r/920638 (https://phabricator.wikimedia.org/T336769) (owner: 10Btullis)
[17:15:26] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:15:32] <wikibugs>	 (03PS5) 10Btullis: Use the spark3 shuffle jars to yarn on a test host [puppet] - 10https://gerrit.wikimedia.org/r/901604 (https://phabricator.wikimedia.org/T329363)
[17:15:40] <wikibugs>	 (03Merged) 10jenkins-bot: Update ntp_servers list [homer/public] - 10https://gerrit.wikimedia.org/r/921066 (owner: 10Ayounsi)
[17:15:50] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 20 Jun 2023 04:41:39 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:16:00] <wikibugs>	 (03Merged) 10jenkins-bot: users: change my own key to ed25519 [homer/public] - 10https://gerrit.wikimedia.org/r/920638 (https://phabricator.wikimedia.org/T336769) (owner: 10Btullis)
[17:16:10] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.290 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:16:16] <wikibugs>	 (03PS3) 10Dzahn: httpbb: add simple tests for gitlab.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/921065 (https://phabricator.wikimedia.org/T326891)
[17:16:18] <wikibugs>	 (03PS2) 10Dzahn: httpbb: add simple test for VRTS [puppet] - 10https://gerrit.wikimedia.org/r/921068 (https://phabricator.wikimedia.org/T326891)
[17:17:06] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49993 bytes in 0.062 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:17:14] <wikibugs>	 (03PS1) 10TChin: "Bump flink-operator version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/921071
[17:18:43] <wikibugs>	 (03PS6) 10Btullis: Use the spark3 shuffle jars to yarn on a test host [puppet] - 10https://gerrit.wikimedia.org/r/901604 (https://phabricator.wikimedia.org/T329363)
[17:18:57] <wikibugs>	 (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/901604 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis)
[17:20:45] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] "Bump flink-operator version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/921071 (owner: 10TChin)
[17:20:54] <wikibugs>	 (03PS1) 10BCornwall: users: Update brett's key to ed25519 [homer/public] - 10https://gerrit.wikimedia.org/r/921073 (https://phabricator.wikimedia.org/T336769)
[17:23:14] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:25:59] <logmsgbot>	 !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[17:26:05] <logmsgbot>	 !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[17:26:20] <logmsgbot>	 !log otto@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[17:26:52] <logmsgbot>	 !log otto@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[17:26:58] <logmsgbot>	 !log otto@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[17:27:14] <logmsgbot>	 !log otto@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[17:28:45] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] flink-operator - enable HA leader election and set replicas: 2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/921067 (https://phabricator.wikimedia.org/T336185) (owner: 10Ottomata)
[17:29:21] <logmsgbot>	 !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[17:29:58] <logmsgbot>	 !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[17:31:04] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:35:42] <logmsgbot>	 !log otto@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[17:36:25] <logmsgbot>	 !log otto@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[17:37:35] <logmsgbot>	 !log otto@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[17:38:27] <logmsgbot>	 !log otto@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[17:38:52] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:38:52] <wikibugs>	 (03PS1) 10Dzahn: httpbb: add tests for gerrit and gerrit-replica [puppet] - 10https://gerrit.wikimedia.org/r/921075 (https://phabricator.wikimedia.org/T326891)
[17:40:12] <wikibugs>	 (03PS3) 10Dzahn: httpbb: add simple test for VRTS [puppet] - 10https://gerrit.wikimedia.org/r/921068 (https://phabricator.wikimedia.org/T326891)
[17:41:01] <wikibugs>	 (03PS4) 10Dzahn: httpbb: add simple tests for gitlab.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/921065 (https://phabricator.wikimedia.org/T326891)
[17:45:06] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:45:22] <icinga-wm>	 PROBLEM - Host ps1-c5-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[17:51:14] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs2022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:52:54] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:59:21] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+1] sre: add systemd unit crashloop alert [alerts] - 10https://gerrit.wikimedia.org/r/921047 (https://phabricator.wikimedia.org/T293970) (owner: 10Filippo Giunchedi)
[17:59:39] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge elasticsearch and plugin upgrade - bking@cumin1001 - T274204
[17:59:44] <stashbot>	 T274204: Deploy new version of Extra Plugin (with Khmer filter) to Elasticsearch cluster - https://phabricator.wikimedia.org/T274204
[18:00:04] <jouncebot>	 dancy, hashar, and brennen: Your horoscope predicts another unfortunate MediaWiki train - Utc-7+Utc-0 Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230518T1800).
[18:00:11] <brennen>	 o/
[18:00:30] <brennen>	 rolling forward as soon as i get a cup of tea and an english muffin made.
[18:00:42] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:04:21] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[18:07:35] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge elasticsearch and plugin upgrade - bking@cumin1001 - T274204
[18:07:39] <stashbot>	 T274204: Deploy new version of Extra Plugin (with Khmer filter) to Elasticsearch cluster - https://phabricator.wikimedia.org/T274204
[18:08:32] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:09:34] <wikibugs>	 (03PS1) 10TrainBranchBot: group2 wikis to 1.41.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921079 (https://phabricator.wikimedia.org/T330215)
[18:09:36] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.41.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921079 (https://phabricator.wikimedia.org/T330215) (owner: 10TrainBranchBot)
[18:09:50] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (2 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic elasticsearch and plugin upgrade - bking@cumin1001 - T332355
[18:09:55] <stashbot>	 T332355: Deploy Turkish Analyzer Plugin - https://phabricator.wikimedia.org/T332355
[18:10:41] <wikibugs>	 (03Merged) 10jenkins-bot: group2 wikis to 1.41.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921079 (https://phabricator.wikimedia.org/T330215) (owner: 10TrainBranchBot)
[18:11:27] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw plugin upgrade - bking@cumin1001 - T332355
[18:12:16] <wikibugs>	 (03PS1) 10Ebernhardson: Update prometheus-blazegraph-exporter for python 3.9 [puppet] - 10https://gerrit.wikimedia.org/r/921081
[18:12:36] <wikibugs>	 (03PS3) 10Hokwelum: introduce a script to create output dirs on xml dumps test nfs shares [puppet] - 10https://gerrit.wikimedia.org/r/915455 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn)
[18:16:20] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:18:22] <logmsgbot>	 !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.41.0-wmf.9  refs T330215
[18:18:27] <stashbot>	 T330215: 1.41.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T330215
[18:19:23] <wikibugs>	 (03PS4) 10Hokwelum: introduce a script to create output dirs on xml dumps test nfs shares [puppet] - 10https://gerrit.wikimedia.org/r/915455 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn)
[18:19:42] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Update network SSH keys to ssh-ed25519 - https://phabricator.wikimedia.org/T336769 (10ayounsi)
[18:19:57] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add ssw1 irb int dns - cmooney@cumin1001"
[18:20:54] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add ssw1 irb int dns - cmooney@cumin1001"
[18:20:55] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:22:32] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:27:43] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[18:29:20] <brennen>	 rolling back here.
[18:30:08] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add ssw1 irb int dns - cmooney@cumin1001"
[18:30:22] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:30:53] <wikibugs>	 (03PS1) 10TrainBranchBot: group2 wikis to 1.41.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921083 (https://phabricator.wikimedia.org/T330215)
[18:30:55] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.41.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921083 (https://phabricator.wikimedia.org/T330215) (owner: 10TrainBranchBot)
[18:31:09] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add ssw1 irb int dns - cmooney@cumin1001"
[18:31:09] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:31:42] <wikibugs>	 (03Merged) 10jenkins-bot: group2 wikis to 1.41.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921083 (https://phabricator.wikimedia.org/T330215) (owner: 10TrainBranchBot)
[18:33:38] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts gitlab-runner1003.eqiad.wmnet
[18:36:49] <wikibugs>	 10SRE, 10All-and-every-Wikisource, 10Search-Console-access-request: Search Console access request for Wikisource (Volunteer) - https://phabricator.wikimedia.org/T336255 (10KFrancis) Hi All, I know we're waiting on some approvals, but in the meantime, I will need the volunteer's full name, mailing address, an...
[18:38:02] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:40:34] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Hghani - https://phabricator.wikimedia.org/T322145 (10kzimmerman) 05Resolved→03Open Hi all, we have re-hired Hamid Ghani. I am the hiring manager. Can you please re-enable his accounts? Thank you!
[18:40:36] <wikibugs>	 10SRE, 10ops-eqiad: Degraded RAID on an-worker1110 - https://phabricator.wikimedia.org/T336932 (10Jclark-ctr) Submitted Dell ticket Confirmed: Service Request 168420493 was successfully submitted.
[18:40:51] <wikibugs>	 10SRE, 10ops-eqiad: Degraded RAID on an-worker1110 - https://phabricator.wikimedia.org/T336932 (10Jclark-ctr) a:03Jclark-ctr
[18:40:53] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Hghani - https://phabricator.wikimedia.org/T322145 (10kzimmerman)
[18:41:41] <wikibugs>	 10SRE, 10ops-eqiad, 10serviceops-collab, 10GitLab (Infrastructure): gitlab-runner1003 is not coming back online - https://phabricator.wikimedia.org/T336737 (10Jclark-ctr) @Jelto  Updated firmware on server additionally and performed reboot; still boots properly
[18:44:10] <wikibugs>	 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1112.eqiad.wmnet - https://phabricator.wikimedia.org/T336332 (10Jclark-ctr)
[18:44:21] <wikibugs>	 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1112.eqiad.wmnet - https://phabricator.wikimedia.org/T336332 (10Jclark-ctr) 05Open→03Resolved
[18:45:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:45:44] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:45:56] <wikibugs>	 (03PS1) 10Andrew Bogott: codfw1dev: move nova-fullstack test to cloudcontrol2005-dev [puppet] - 10https://gerrit.wikimedia.org/r/921085 (https://phabricator.wikimedia.org/T336963)
[18:48:33] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] codfw1dev: move nova-fullstack test to cloudcontrol2005-dev [puppet] - 10https://gerrit.wikimedia.org/r/921085 (https://phabricator.wikimedia.org/T336963) (owner: 10Andrew Bogott)
[18:50:01] <jinxer-wm>	 (NodeTextfileStale) firing: (2) Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[18:50:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:50:28] <logmsgbot>	 !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.41.0-wmf.8  refs T330215
[18:50:33] <stashbot>	 T330215: 1.41.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T330215
[18:50:57] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (10cmooney) lvs1020 is currently the "secondary" lvs in eqiad, so I'd propose we start with trying to do that one if we can.  It's c...
[18:53:30] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:55:21] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10cmooney) @Jclark-ctr hey.  It's taken a bit of time to line this up, hit a few bumps in the road with the Juniper config.  As detailed in T3...
[18:55:33] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (2 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic elasticsearch and plugin upgrade - bking@cumin1001 - T332355
[18:55:37] <stashbot>	 T332355: Deploy Turkish Analyzer Plugin - https://phabricator.wikimedia.org/T332355
[18:56:18] <wikibugs>	 10SRE, 10Traffic: Add systemd-level service bindings for Wikimedia DNS - https://phabricator.wikimedia.org/T336792 (10ssingh)
[18:56:38] <logmsgbot>	 !log xcollazo@deploy1002 Started deploy [airflow-dags/platform_eng@502ddae]: T333001
[18:56:42] <stashbot>	 T333001: Setup for allowing Airflow deployment via Git Repository - https://phabricator.wikimedia.org/T333001
[18:57:13] <logmsgbot>	 !log xcollazo@deploy1002 Finished deploy [airflow-dags/platform_eng@502ddae]: T333001 (duration: 00m 35s)
[19:00:32] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+1] Update prometheus-blazegraph-exporter for python 3.9 [puppet] - 10https://gerrit.wikimedia.org/r/921081 (owner: 10Ebernhardson)
[19:00:34] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+2] Update prometheus-blazegraph-exporter for python 3.9 [puppet] - 10https://gerrit.wikimedia.org/r/921081 (owner: 10Ebernhardson)
[19:01:43] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10Jclark-ctr) @cmooney  i am available tomorrow if you would like to address it that quickly. otherwise monday
[19:13:43] <wikibugs>	 (03PS1) 10Andrew Bogott: move nova-fullstack test to cloudcontrol2005-dev and cloudcontrol1007 [puppet] - 10https://gerrit.wikimedia.org/r/921087 (https://phabricator.wikimedia.org/T336963)
[19:15:35] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:17:01] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] move nova-fullstack test to cloudcontrol2005-dev and cloudcontrol1007 [puppet] - 10https://gerrit.wikimedia.org/r/921087 (https://phabricator.wikimedia.org/T336963) (owner: 10Andrew Bogott)
[19:19:57] <icinga-wm>	 PROBLEM - Check systemd state on elastic2065 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:21:14] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) elasticsearch-disable-readahead.service Failed on elastic2065:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:22:11] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:31:01] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:32:03] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:32:07] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:38:41] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:45:33] <icinga-wm>	 RECOVERY - Check systemd state on elastic2065 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:46:14] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) elasticsearch-disable-readahead.service Failed on elastic2065:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:46:31] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:50:36] <wikibugs>	 (03CR) 10Herron: [C: 03+2] mwlog: keep/reuse /srv filesystem across reimages [puppet] - 10https://gerrit.wikimedia.org/r/920698 (https://phabricator.wikimedia.org/T333614) (owner: 10Herron)
[19:52:43] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:54:53] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10cmooney) @Jclark-ctr thanks yeah I just had a word with @ssingh and I think tomorrow is probably possible.  What time suits you to be on site?
[19:59:18] <wikibugs>	 (03CR) 10Jdrewniak: Enable zebra ab test in hewiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920386 (https://phabricator.wikimedia.org/T335972) (owner: 10Kimberly Sarabia)
[20:00:06] <jouncebot>	 brennen and TheresNoTime: May I have your attention please! UTC late backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230518T2000)
[20:00:06] <jouncebot>	 kimberly_sarabia: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:31] <kimberly_sarabia>	 hello
[20:00:34] <urbanecm>	 hey!
[20:00:49] <TheresNoTime>	 part of the way through recovering a filesystem, if someone else is around to deploy..?
[20:00:55] <urbanecm>	 ok if i do some tests for an UBN first?
[20:00:59] <RhinosF1>	 Vector zebra sounds very fancy
[20:01:03] <urbanecm>	 ^^
[20:01:08] <urbanecm>	 i can deploy in ~5
[20:01:11] <urbanecm>	 (after the tests)
[20:01:12] <kimberly_sarabia>	 no problem
[20:01:23] <kimberly_sarabia>	 it is very fancy
[20:03:35] <urbanecm>	 okay, deploying
[20:03:44] <wikibugs>	 (03PS3) 10Urbanecm: Reverts hewiki A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921059 (https://phabricator.wikimedia.org/T335309) (owner: 10Kimberly Sarabia)
[20:03:47] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Reverts hewiki A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921059 (https://phabricator.wikimedia.org/T335309) (owner: 10Kimberly Sarabia)
[20:04:43] <kimberly_sarabia>	 ty
[20:04:50] <wikibugs>	 (03Merged) 10jenkins-bot: Reverts hewiki A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921059 (https://phabricator.wikimedia.org/T335309) (owner: 10Kimberly Sarabia)
[20:06:11] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:921059|Reverts hewiki A/B test (T335309)]]
[20:06:15] <stashbot>	 T335309: Add skin key for mediawiki_web_ab_test_enrollment schema firing events - https://phabricator.wikimedia.org/T335309
[20:07:42] <logmsgbot>	 !log urbanecm@deploy1002 ksarabia and urbanecm: Backport for [[gerrit:921059|Reverts hewiki A/B test (T335309)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet
[20:07:55] <urbanecm>	 kimberly_sarabia: can you test at mwdebug1002 please?
[20:08:03] <kimberly_sarabia>	 sure thing. one moment.
[20:09:05] <wikibugs>	 (03PS4) 10BCornwall: wikidough: bind hc to pdns-recursor and dnsdist [puppet] - 10https://gerrit.wikimedia.org/r/920794 (https://phabricator.wikimedia.org/T336792)
[20:10:45] <kimberly_sarabia>	 LGTM!
[20:11:06] <urbanecm>	 syncing!
[20:11:13] <kimberly_sarabia>	 ty
[20:11:32] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41235/console" [puppet] - 10https://gerrit.wikimedia.org/r/920794 (https://phabricator.wikimedia.org/T336792) (owner: 10BCornwall)
[20:11:56] <wikibugs>	 (03CR) 10BCornwall: wikidough: bind hc to pdns-recursor and dnsdist [puppet] - 10https://gerrit.wikimedia.org/r/920794 (https://phabricator.wikimedia.org/T336792) (owner: 10BCornwall)
[20:12:10] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1] wikidough: bind hc to pdns-recursor and dnsdist [puppet] - 10https://gerrit.wikimedia.org/r/920794 (https://phabricator.wikimedia.org/T336792) (owner: 10BCornwall)
[20:15:32] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:16:36] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:921059|Reverts hewiki A/B test (T335309)]] (duration: 10m 25s)
[20:16:41] <stashbot>	 T335309: Add skin key for mediawiki_web_ab_test_enrollment schema firing events - https://phabricator.wikimedia.org/T335309
[20:16:44] <urbanecm>	 kimberly_sarabia: should be done!
[20:17:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:18:01] <kimberly_sarabia>	 urbanecm: i appreciate it!
[20:18:07] <urbanecm>	 any time
[20:18:33] <wikibugs>	 (03PS1) 10BCornwall: dnsbox: bind hc to pdns-recursor and gdnsd [puppet] - 10https://gerrit.wikimedia.org/r/921095 (https://phabricator.wikimedia.org/T336973)
[20:22:16] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:22:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:29:31] <wikibugs>	 (03PS1) 10Urbanecm: Silently ignore istype-depicts image suggestion type [extensions/GrowthExperiments] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920743 (https://phabricator.wikimedia.org/T336962)
[20:29:41] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Silently ignore istype-depicts image suggestion type [extensions/GrowthExperiments] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920743 (https://phabricator.wikimedia.org/T336962) (owner: 10Urbanecm)
[20:29:43] <wikibugs>	 (03PS2) 10Gergő Tisza: Silently ignore istype-depicts image suggestion type [extensions/GrowthExperiments] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920743 (https://phabricator.wikimedia.org/T336962) (owner: 10Urbanecm)
[20:31:10] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:31:57] <wikibugs>	 (03CR) 10Gergő Tisza: [C: 03+2] Silently ignore istype-depicts image suggestion type [extensions/GrowthExperiments] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920743 (https://phabricator.wikimedia.org/T336962) (owner: 10Urbanecm)
[20:33:18] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw plugin upgrade - bking@cumin1001 - T332355
[20:33:23] <stashbot>	 T332355: Deploy Turkish Analyzer Plugin - https://phabricator.wikimedia.org/T332355
[20:34:23] <wikibugs>	 (03PS1) 10Brennen Bearnes: cache: Do not throw on empty set in LinkBatch::constructSet [core] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920744 (https://phabricator.wikimedia.org/T336964)
[20:36:01] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad plugin upgrade - bking@cumin1001 - T332355
[20:38:56] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:40:59] <wikibugs>	 (03PS7) 10Ebernhardson: search: Add alert based on age of titlesuggest indices [alerts] - 10https://gerrit.wikimedia.org/r/911945 (https://phabricator.wikimedia.org/T327199)
[20:41:01] <wikibugs>	 (03CR) 10Ebernhardson: search: Add alert based on age of titlesuggest indices (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/911945 (https://phabricator.wikimedia.org/T327199) (owner: 10Ebernhardson)
[20:41:53] <urbanecm>	 brennen: do you want to deploy the .9 backport you uploaded too?
[20:43:18] <brennen>	 urbanecm: i can go ahead and do that one once you're clear
[20:43:31] <brennen>	 and then i think we can probably safely roll the train forward
[20:43:42] <urbanecm>	 sure, was thinking about +2'ing it, since core's ci takes forever
[20:43:48] <brennen>	 yeah, good idea
[20:44:04] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] cache: Do not throw on empty set in LinkBatch::constructSet [core] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920744 (https://phabricator.wikimedia.org/T336964) (owner: 10Brennen Bearnes)
[20:44:10] <urbanecm>	 i'll ping you once i'm done :)
[20:45:06] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:45:06] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:46:04] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:47:26] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49993 bytes in 0.097 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:48:02] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.287 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:49:22] <brennen>	 sounds good
[20:49:27] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Hghani - https://phabricator.wikimedia.org/T322145 (10CDanis) a:05jbond→03CDanis
[20:50:52] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920743 (https://phabricator.wikimedia.org/T336962) (owner: 10Urbanecm)
[20:52:48] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:52:51] <wikibugs>	 (03Merged) 10jenkins-bot: Silently ignore istype-depicts image suggestion type [extensions/GrowthExperiments] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920743 (https://phabricator.wikimedia.org/T336962) (owner: 10Urbanecm)
[20:53:20] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:920743|Silently ignore istype-depicts image suggestion type (T336962)]]
[20:53:25] <stashbot>	 T336962: UnexpectedValueException: Unknown image suggestions API kind: istype-depicts - https://phabricator.wikimedia.org/T336962
[20:54:49] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:920743|Silently ignore istype-depicts image suggestion type (T336962)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet
[20:57:16] <urbanecm>	 okay, no error logged at mwdebug1002 with cswiki promoted to .9 via wikiversions.php, proceeding
[20:59:50] <wikibugs>	 (03Merged) 10jenkins-bot: cache: Do not throw on empty set in LinkBatch::constructSet [core] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920744 (https://phabricator.wikimedia.org/T336964) (owner: 10Brennen Bearnes)
[21:00:28] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:01:30] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:920743|Silently ignore istype-depicts image suggestion type (T336962)]] (duration: 08m 09s)
[21:01:34] <stashbot>	 T336962: UnexpectedValueException: Unknown image suggestions API kind: istype-depicts - https://phabricator.wikimedia.org/T336962
[21:01:42] <urbanecm>	 brennen: okay, i'm done, over to you
[21:02:26] <brennen>	 urbanecm: thanks, will deploy the linkbatch one.
[21:02:40] <urbanecm>	 brennen: no problem. could you ping me after train's promoted again please? i'd like to double check a few things (this feature is hard to test on non-Wikipedia).
[21:03:41] <logmsgbot>	 !log brennen@deploy1002 Started scap: Backport for [[gerrit:920744|cache: Do not throw on empty set in LinkBatch::constructSet (T336964)]]
[21:03:46] <stashbot>	 T336964: InvalidArgumentException: Data for lt_namespace and lt_title must be non-empty - https://phabricator.wikimedia.org/T336964
[21:03:49] <brennen>	 urbanecm: yeah - although i'm now wondering about https://phabricator.wikimedia.org/T330215#8862937
[21:04:17] <brennen>	 though i assume that wouldn't get any _worse_ with train rolled forward, if it is fallout of .9
[21:05:13] <logmsgbot>	 !log brennen@deploy1002 brennen: Backport for [[gerrit:920744|cache: Do not throw on empty set in LinkBatch::constructSet (T336964)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet
[21:08:12] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:13:19] <logmsgbot>	 !log brennen@deploy1002 Finished scap: Backport for [[gerrit:920744|cache: Do not throw on empty set in LinkBatch::constructSet (T336964)]] (duration: 09m 38s)
[21:13:24] <stashbot>	 T336964: InvalidArgumentException: Data for lt_namespace and lt_title must be non-empty - https://phabricator.wikimedia.org/T336964
[21:14:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:15:54] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:19:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:23:06] <wikibugs>	 (03PS1) 10TrainBranchBot: group2 wikis to 1.41.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921104 (https://phabricator.wikimedia.org/T330215)
[21:23:12] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.41.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921104 (https://phabricator.wikimedia.org/T330215) (owner: 10TrainBranchBot)
[21:23:32] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:24:13] <wikibugs>	 (03Merged) 10jenkins-bot: group2 wikis to 1.41.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921104 (https://phabricator.wikimedia.org/T330215) (owner: 10TrainBranchBot)
[21:28:58] <wikibugs>	 (03PS1) 10Dzahn: planet: add wikimediastatus.net to English feeds [puppet] - 10https://gerrit.wikimedia.org/r/921105
[21:29:49] <wikibugs>	 (03PS2) 10Dzahn: planet: add wikimediastatus.net to English feeds [puppet] - 10https://gerrit.wikimedia.org/r/921105 (https://phabricator.wikimedia.org/T336701)
[21:31:15] <logmsgbot>	 !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.41.0-wmf.9  refs T330215
[21:31:18] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:31:21] <stashbot>	 T330215: 1.41.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T330215
[21:37:30] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:41:34] <wikibugs>	 (CR) Dzahn: [V: +1] "https://puppet-compiler.wmflabs.org/output/917918/41236/contint1002.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/917918 (https://phabricator.wikimedia.org/T324659) (owner: Hashar)
[21:45:12] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:47:03] <mutante>	 !log maintenance for zuul (CI) on contint servers
[21:47:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:47:16] <brennen>	 urbanecm: train is at all wikis, at least for the moment.
[21:50:37] <urbanecm>	 thanks
[21:52:54] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:00:18] <hashar>	 brennen: congratulations :)
[22:00:32] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:00:45] <wikibugs>	 (CR) EoghanGaffney: [C: +1] httpbb: add simple test for VRTS [puppet] - https://gerrit.wikimedia.org/r/921068 (https://phabricator.wikimedia.org/T326891) (owner: Dzahn)
[22:00:52] <wikibugs>	 (CR) EoghanGaffney: [C: +1] httpbb: add simple tests for gitlab.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/921065 (https://phabricator.wikimedia.org/T326891) (owner: Dzahn)
[22:02:00] <wikibugs>	 (CR) Dzahn: [V: +1] "> sudo chown -R 923:923 /srv/zuul/git /var/lib/zuul" [puppet] - https://gerrit.wikimedia.org/r/917918 (https://phabricator.wikimedia.org/T324659) (owner: Hashar)
[22:04:01] <mutante>	 jouncebot: nowandnext
[22:04:01] <jouncebot>	 No deployments scheduled for the next 7 hour(s) and 55 minute(s)
[22:04:01] <jouncebot>	 In 7 hour(s) and 55 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230519T0600)
[22:08:18] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:08:24] <wikibugs>	 (CR) Dzahn: [V: +1] "puppet disabled on contint1002 and contint2001, testing on inactive host contint2002 first" [puppet] - https://gerrit.wikimedia.org/r/917918 (https://phabricator.wikimedia.org/T324659) (owner: Hashar)
[22:08:39] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] zuul: switch to fixed uid/gid 923 [puppet] - https://gerrit.wikimedia.org/r/917918 (https://phabricator.wikimedia.org/T324659) (owner: Hashar)
[22:14:43] <wikibugs>	 (PS5) EoghanGaffney: doc: add password-protected rsync module for publishing from gitlab [puppet] - https://gerrit.wikimedia.org/r/920669 (https://phabricator.wikimedia.org/T336168) (owner: Jaime Nuche)
[22:16:02] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:16:05] <wikibugs>	 (CR) EoghanGaffney: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41238/console" [puppet] - https://gerrit.wikimedia.org/r/920669 (https://phabricator.wikimedia.org/T336168) (owner: Jaime Nuche)
[22:18:15] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] "in addition to sudo chown -R 923:923 /srv/zuul/git /var/lib/zuul this also needs:" [puppet] - https://gerrit.wikimedia.org/r/917918 (https://phabricator.wikimedia.org/T324659) (owner: Hashar)
[22:20:16] <mutante>	 !log short down-time for zuul-merger on contint2001 
[22:20:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:21:34] <mutante>	 !log contint2001 - moving files owned by zuul to new UID/GID - in progress
[22:21:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:23:26] <icinga-wm>	 PROBLEM - zuul_merger_service_running on contint2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args bin/zuul-merger https://www.mediawiki.org/wiki/Continuous_integration/Zuul
[22:23:42] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:24:09] <icinga-wm>	 ACKNOWLEDGEMENT - zuul_merger_service_running on contint2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args bin/zuul-merger daniel_zahn maintenance work https://www.mediawiki.org/wiki/Continuous_integration/Zuul
[22:31:03] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] "there isn't directly a problem it's just that the recursive chmod on a server that actually has data in /srv/zuul/git will take a much lon" [puppet] - https://gerrit.wikimedia.org/r/917918 (https://phabricator.wikimedia.org/T324659) (owner: Hashar)
[22:31:26] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:31:30] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] "chown of course, not chmod" [puppet] - https://gerrit.wikimedia.org/r/917918 (https://phabricator.wikimedia.org/T324659) (owner: Hashar)
[22:32:35] <wikibugs>	 (PS6) EoghanGaffney: doc: add password-protected rsync module for publishing from gitlab [puppet] - https://gerrit.wikimedia.org/r/920669 (https://phabricator.wikimedia.org/T336168) (owner: Jaime Nuche)
[22:33:40] <wikibugs>	 (CR) CI reject: [V: -1] doc: add password-protected rsync module for publishing from gitlab [puppet] - https://gerrit.wikimedia.org/r/920669 (https://phabricator.wikimedia.org/T336168) (owner: Jaime Nuche)
[22:34:35] <wikibugs>	 (CR) EoghanGaffney: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41239/console" [puppet] - https://gerrit.wikimedia.org/r/920669 (https://phabricator.wikimedia.org/T336168) (owner: Jaime Nuche)
[22:37:36] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:45:18] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:50:01] <jinxer-wm>	 (NodeTextfileStale) firing: (2) Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[22:52:16] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[22:52:21] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] "on contint2001 there was a small extra problem, that was even though zuul-merger service was stopped there was still a process running own" [puppet] - https://gerrit.wikimedia.org/r/917918 (https://phabricator.wikimedia.org/T324659) (owner: Hashar)
[22:52:48] <icinga-wm>	 RECOVERY - zuul_merger_service_running on contint2001 is OK: PROCS OK: 1 process with regex args bin/zuul-merger https://www.mediawiki.org/wiki/Continuous_integration/Zuul
[22:53:00] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:54:34] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] "the service name is "zuul" not "zuul-server", the user running it is called "zuul-server" though, so the systemctl commands had not stoppe" [puppet] - https://gerrit.wikimedia.org/r/917918 (https://phabricator.wikimedia.org/T324659) (owner: Hashar)
[22:57:16] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[22:59:12] <icinga-wm>	 PROBLEM - zuul_gearman_service on contint2001 is CRITICAL: connect to address 127.0.0.1 and port 4730: Connection refused https://www.mediawiki.org/wiki/Continuous_integration/Zuul
[22:59:22] <icinga-wm>	 PROBLEM - zuul_service_running on contint2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args bin/zuul-server https://www.mediawiki.org/wiki/Continuous_integration/Zuul
[22:59:31] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad plugin upgrade - bking@cumin1001 - T332355
[22:59:36] <stashbot>	 T332355: Deploy Turkish Analyzer Plugin - https://phabricator.wikimedia.org/T332355
[23:00:14] <icinga-wm>	 ACKNOWLEDGEMENT - zuul_gearman_service on contint2001 is CRITICAL: connect to address 127.0.0.1 and port 4730: Connection refused daniel_zahn maintenance https://www.mediawiki.org/wiki/Continuous_integration/Zuul
[23:00:14] <icinga-wm>	 ACKNOWLEDGEMENT - zuul_service_running on contint2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args bin/zuul-server daniel_zahn maintenance https://www.mediawiki.org/wiki/Continuous_integration/Zuul
[23:00:44] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:02:20] <icinga-wm>	 RECOVERY - zuul_gearman_service on contint2001 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 4730 https://www.mediawiki.org/wiki/Continuous_integration/Zuul
[23:02:28] <icinga-wm>	 RECOVERY - zuul_service_running on contint2001 is OK: PROCS OK: 2 processes with regex args bin/zuul-server https://www.mediawiki.org/wiki/Continuous_integration/Zuul
[23:02:57] <wikibugs>	 (PS1) TrainBranchBot: group0 wikis to 1.41.0-wmf.9 [mediawiki-config] - https://gerrit.wikimedia.org/r/921109 (https://phabricator.wikimedia.org/T330215)
[23:02:59] <wikibugs>	 (CR) TrainBranchBot: [C: +2] group0 wikis to 1.41.0-wmf.9 [mediawiki-config] - https://gerrit.wikimedia.org/r/921109 (https://phabricator.wikimedia.org/T330215) (owner: TrainBranchBot)
[23:08:19] <wikibugs>	 (Merged) jenkins-bot: group0 wikis to 1.41.0-wmf.9 [mediawiki-config] - https://gerrit.wikimedia.org/r/921109 (https://phabricator.wikimedia.org/T330215) (owner: TrainBranchBot)
[23:08:30] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:09:39] <brennen>	 (currently doing a train rollback.)
[23:16:14] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:18:37] <wikibugs>	 (CR) Dzahn: "recheck" [puppet] - https://gerrit.wikimedia.org/r/921065 (https://phabricator.wikimedia.org/T326891) (owner: Dzahn)
[23:24:00] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:26:44] <logmsgbot>	 !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.41.0-wmf.9  refs T330215
[23:26:49] <stashbot>	 T330215: 1.41.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T330215
[23:28:12] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] "deployed on all 3 hosts, so yea, took a bit longer, zuul-server is just zuul, so I first didnt stop it right and killed it, added the comm" [puppet] - https://gerrit.wikimedia.org/r/917918 (https://phabricator.wikimedia.org/T324659) (owner: Hashar)
[23:29:23] <wikibugs>	 (CR) Dzahn: [V: +1 C: +2] zuul: switch to fixed uid/gid 923 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/917918 (https://phabricator.wikimedia.org/T324659) (owner: Hashar)
[23:30:14] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:33:31] <wikibugs>	 ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T336949 (wiki_willy) a:Jhancock.wm
[23:37:58] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:42:21] <wikibugs>	 SRE, Continuous-Integration-Infrastructure, serviceops-collab, Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (Dzahn) after deploying the change above carefully on all 3 contint* servers, stopping services, running manual chown commands ,...
[23:45:42] <icinga-wm>	 RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:46:14] <jinxer-wm>	 (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs2022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:46:43] <wikibugs>	 SRE, Continuous-Integration-Infrastructure, serviceops-collab, Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (Dzahn) - disable puppet - stop services - chown -R 923:923 /srv/zuul/git /var/lib/zuul - chown -R 923:923 /var/log/zuul_repack/...
[23:53:20] <icinga-wm>	 PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state