[00:00:56] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:02:02] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:06:38] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:08:38] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:16:20] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:19:04] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:21:12] (03CR) 10Krinkle: [C: 03+1] arclamp: switch redis server to arclamp1001 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/920299 (https://phabricator.wikimedia.org/T327277) (owner: 10Herron) [00:22:06] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:22:34] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:23:44] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:23:50] (03CR) 10Krinkle: [C: 03+1] "LGTM. 
Puppet patches should go out a few minutes before this, and remember to restart the arclamp-log process if it doesn't do so automati" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920298 (https://phabricator.wikimedia.org/T327277) (owner: 10Herron) [00:26:46] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:28:22] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:28:26] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:30:26] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:38:08] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:39:33] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/920342 [00:39:39] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/920342 (owner: 10TrainBranchBot) [00:45:58] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:53:50] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:56:25] (03Merged) 10jenkins-bot: Branch commit for 
wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/920342 (owner: 10TrainBranchBot) [01:00:10] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:01:20] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:04:24] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:08:00] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:15:50] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:20:04] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:23:40] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:24:55] 10SRE, 10Thumbor, 10Wikimedia-SVG-rendering, 10Upstream: librsvg misinterpret quoted font family names that contain whitespaces - https://phabricator.wikimedia.org/T64987 (10JoKalliauer) 05Stalled→03Resolved a:03JoKalliauer |file |https://commons.wikimedia.org/wiki/File:T184369.svg | | librsvg2.40 |... 
[01:29:28] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:31:02] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:31:08] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:31:30] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:37:48] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:45:40] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:51:14] (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs2022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:53:32] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:00:58] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:01:28] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:02:34] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:05:46] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:06:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:07:46] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:11:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:13:34] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:15:38] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:23:30] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:24:34] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:26:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:27:48] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:29:20] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:31:24] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:32:28] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:37:40] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:45:32] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:50:01] (NodeTextfileStale) firing: (2) Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:53:24] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:01:16] 
RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:07:36] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:15:26] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:23:18] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:30:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:31:10] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:35:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:39:02] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:45:18] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:53:06] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:00:46] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:08:28] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:10:57] (03CR) 10Santhosh: [C: 03+1] MinT: Set CT2_INTRA_THREADS to 0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/920687 (owner: 10KartikMistry) [04:16:10] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:22:45] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/919408 (https://phabricator.wikimedia.org/T336427) (owner: 10Dzahn) [04:24:04] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:30:20] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:33:50] (03CR) 10Aaron Schulz: "I didn't make a puppet patch yet. I was thinking about just ignoring all these variables for probe connections instead." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/918612 (owner: 10Aaron Schulz) [04:34:03] (03CR) 10Marostegui: [C: 03+1] Add add_user_is_temp_T336886.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/920771 (https://phabricator.wikimedia.org/T336886) (owner: 10Ladsgroup) [04:36:46] (03PS1) 10Marostegui: control-mariadb-10.4-bullseye: Bump version [software] - 10https://gerrit.wikimedia.org/r/920809 (https://phabricator.wikimedia.org/T336462) [04:37:30] (03CR) 10Marostegui: [C: 03+2] control-mariadb-10.4-bullseye: Bump version [software] - 10https://gerrit.wikimedia.org/r/920809 (https://phabricator.wikimedia.org/T336462) (owner: 10Marostegui) [04:38:08] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:45:50] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:47:59] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db[2096,2101,2115,2131].codfw.wmnet with reason: maintenance [04:48:14] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db[2096,2101,2115,2131].codfw.wmnet with reason: maintenance [04:53:32] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:01:12] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:07:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:08:58] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:12:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:15:16] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:23:06] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:28:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:30:54] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:33:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:38:34] PROBLEM - Check systemd state on stat1009 is 
CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:46:18] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:51:14] (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs2022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:52:36] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:00:05] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230518T0600) [06:00:05] kormat, marostegui, and Amir1: Time to snap out of that daydream and deploy Primary database switchover. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230518T0600). 
[06:00:30] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:04:29] (03PS1) 10Marostegui: instances.yaml: Remove db1122 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/920969 (https://phabricator.wikimedia.org/T336833) [06:07:08] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Remove db1122 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/920969 (https://phabricator.wikimedia.org/T336833) (owner: 10Marostegui) [06:07:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1122 from dbctl T336833', diff saved to https://phabricator.wikimedia.org/P48362 and previous config saved to /var/cache/conftool/dbconfig/20230518-060734-marostegui.json [06:07:39] T336833: decommission db1122.eqiad.wmnet - https://phabricator.wikimedia.org/T336833 [06:08:20] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:14:29] (03PS2) 10KartikMistry: MinT: Update to 2023-05-18-060931-production and Set CT2_INTRA_THREADS to 0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/920687 [06:15:58] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:17:52] PROBLEM - Check systemd state on mwlog1002 is CRITICAL: CRITICAL - degraded: The following units failed: mw-log-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:23:38] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:23:42] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db[2134,2160].codfw.wmnet,db[1159,1217].eqiad.wmnet with reason: 
maintenance [06:23:57] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db[2134,2160].codfw.wmnet,db[1159,1217].eqiad.wmnet with reason: maintenance [06:31:26] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:37:32] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:45:12] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:46:10] (03PS1) 10Marostegui: phabricator.my.cnf.erb: Set gtid_domain_id=0 [puppet] - 10https://gerrit.wikimedia.org/r/920986 (https://phabricator.wikimedia.org/T336228) [06:46:28] (03Abandoned) 10Marostegui: control-mariadb-client-10.4: Remove file [software] - 10https://gerrit.wikimedia.org/r/920653 (owner: 10Marostegui) [06:46:44] (03CR) 10Marostegui: [C: 03+2] phabricator.my.cnf.erb: Set gtid_domain_id=0 [puppet] - 10https://gerrit.wikimedia.org/r/920986 (https://phabricator.wikimedia.org/T336228) (owner: 10Marostegui) [06:50:01] (NodeTextfileStale) firing: (2) Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:52:52] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:53:56] (03PS3) 10KartikMistry: MinT: Update to 2023-05-18-060931-production and Set CT2_INTRA_THREADS to 0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/920687 (https://phabricator.wikimedia.org/T336483) 
[06:54:33] (03CR) 10Mvolz: rest-gateway: add citoid support (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/920710 (https://phabricator.wikimedia.org/T329049) (owner: 10Hnowlan) [07:00:05] Amir1, apergos, and jnuche: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230518T0700). [07:00:05] kart_: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:12] morning! there are no trainees signed up today. kart_ I see you have just the one patch which looks straight-forward enough. will you be self-deploying today? [07:00:24] apergos: yes :) [07:00:32] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:00:32] ok! it's all yours :-) [07:01:27] Thanks! 
[07:02:49] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920577 (https://phabricator.wikimedia.org/T327868) (owner: 10KartikMistry) [07:03:05] (03PS3) 10KartikMistry: Enable the new Special:Contribute page entry point for desktop on selected wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920577 (https://phabricator.wikimedia.org/T327868) [07:04:29] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920577 (https://phabricator.wikimedia.org/T327868) (owner: 10KartikMistry) [07:05:20] (03Merged) 10jenkins-bot: Enable the new Special:Contribute page entry point for desktop on selected wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920577 (https://phabricator.wikimedia.org/T327868) (owner: 10KartikMistry) [07:06:06] !log kartik@deploy1002 Started scap: Backport for [[gerrit:920577|Enable the new Special:Contribute page entry point for desktop on selected wikis (T327868)]] [07:06:10] T327868: Enable the new Special:Contribute page entry point for desktop on selected wikis - https://phabricator.wikimedia.org/T327868 [07:07:34] !log kartik@deploy1002 kartik: Backport for [[gerrit:920577|Enable the new Special:Contribute page entry point for desktop on selected wikis (T327868)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [07:08:10] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:15:24] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:920577|Enable the new Special:Contribute page entry point for desktop on selected wikis (T327868)]] (duration: 09m 18s) [07:15:28] T327868: Enable the new Special:Contribute page entry point for 
desktop on selected wikis - https://phabricator.wikimedia.org/T327868 [07:15:48] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:16:40] apergos: I'm done if anyone wants to continue.. [07:16:50] thanks! [07:17:02] I'll give it 5 minutes and then close up shop for today [07:23:28] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:25:52] !log UTC morning backport and config training window done [07:25:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:00] see folks next time! [07:27:24] (03CR) 10Filippo Giunchedi: "Following up from an IRC conversation:" [puppet] - 10https://gerrit.wikimedia.org/r/919419 (https://phabricator.wikimedia.org/T291015) (owner: 10Krinkle) [07:31:08] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:34:16] (03CR) 10Filippo Giunchedi: [C: 03+2] cadvisor: add explicity metrics enable [puppet] - 10https://gerrit.wikimedia.org/r/920660 (https://phabricator.wikimedia.org/T108027) (owner: 10Filippo Giunchedi) [07:38:50] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:46:32] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:52:42] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:59:57] !log akosiaris@cumin1001 conftool action 
: set/pooled=no; selector: name=registry2003.codfw.wmnet [08:00:05] dancy and hashar: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230518T0800). [08:00:24] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:00:49] !log upgrade registry on registry2003 to 2.8.2 [08:00:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:14] (03PS3) 10Filippo Giunchedi: cadvisor: disable percpu and cpuLoad metric classes [puppet] - 10https://gerrit.wikimedia.org/r/920661 (https://phabricator.wikimedia.org/T108027) [08:04:46] (03PS1) 10KartikMistry: Beta: Enable the new Special:Contribute page entry point for desktop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920988 (https://phabricator.wikimedia.org/T327868) [08:08:02] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:12:01] (03PS1) 10Filippo Giunchedi: profile: rollout cadvisor to PoPs [puppet] - 10https://gerrit.wikimedia.org/r/920991 (https://phabricator.wikimedia.org/T108027) [08:12:45] (03CR) 10Elukey: [C: 03+2] ml-services: change isvc name to revertrisk-language-agnostic [deployment-charts] - 10https://gerrit.wikimedia.org/r/920725 (https://phabricator.wikimedia.org/T332998) (owner: 10AikoChou) [08:13:00] (03CR) 10Elukey: [C: 03+2] services: change lift wing's kafka topic in changeprop's config [deployment-charts] - 10https://gerrit.wikimedia.org/r/920696 (https://phabricator.wikimedia.org/T333468) (owner: 10Elukey) [08:14:24] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41229/console" 
[puppet] - 10https://gerrit.wikimedia.org/r/920991 (https://phabricator.wikimedia.org/T108027) (owner: 10Filippo Giunchedi) [08:15:40] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:17:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:18:01] (03CR) 10Filippo Giunchedi: [V: 03+1] "This is close to being a noop, since cadvisor already runs on a big chunk of hosts in PoPs anyways (cp hosts)" [puppet] - 10https://gerrit.wikimedia.org/r/920991 (https://phabricator.wikimedia.org/T108027) (owner: 10Filippo Giunchedi) [08:19:05] !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop: sync [08:19:19] !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync [08:21:55] !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: sync [08:22:11] !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync [08:22:16] (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:23:18] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:23:58] (03PS1) 10Filippo Giunchedi: prometheus: clean up eventgate prometheus alerts [puppet] - 
10https://gerrit.wikimedia.org/r/921023 (https://phabricator.wikimedia.org/T309009) [08:26:12] !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: name=registry2003.codfw.wmnet [08:27:26] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [08:28:28] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [08:29:00] !log upgrade docker-registry to 2.8.2 on all registry hosts [08:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:26] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [08:31:00] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:32:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [08:37:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [08:38:52] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:43:18] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:43:22] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:44:54] (03PS4) 10KartikMistry: MinT: Update to 2023-05-18-060931-production and Set CT2_INTRA_THREADS to 0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/920687 (https://phabricator.wikimedia.org/T336483) [08:45:08] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:53:02] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:00:54] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:08:46] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:16:30] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:22:48] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:26:12] 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (10cmooney) The updated PuppetDB -> Netbox import script has now been merged, and I've run it against all servers in Netbox in state 'active'... 
[09:30:42] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:38:34] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:46:24] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:51:14] (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs2022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:52:42] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:55:34] jouncebot: nowandnext [09:55:34] For the next 0 hour(s) and 4 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230518T0800) [09:55:35] In 0 hour(s) and 4 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230518T1000) [09:55:35] In 0 hour(s) and 4 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230518T1000) [09:55:46] 🚂 [09:58:59] (03CR) 10Elukey: [C: 03+2] changeprop: add liftwing outlink topic stream [deployment-charts] - 10https://gerrit.wikimedia.org/r/920282 (https://phabricator.wikimedia.org/T328899) (owner: 10AikoChou) [10:00:05] mvolz: My dear minions, it's time we take the moon! Just kidding. Time for Services – Citoid / Zotero deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230518T1000). [10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230518T1000) [10:00:36] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:05:56] !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: sync [10:06:07] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: sync [10:08:24] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:14:18] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:15:08] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:16:02] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:17:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:22:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:23:52] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:24:41] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-cache1001.eqiad.wmnet [10:24:54] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host ml-cache1001.eqiad.wmnet [10:25:17] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-cache1001.eqiad.wmnet [10:25:59] (03PS1) 10Cathal Mooney: Improve logic getting switch port when primary IP is on bridge device [cookbooks] - 10https://gerrit.wikimedia.org/r/921032 (https://phabricator.wikimedia.org/T296832) [10:27:04] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:27:24] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:28:55] (03PS2) 10Cathal Mooney: Improve logic getting switch port when primary IP is on bridge device [cookbooks] - 10https://gerrit.wikimedia.org/r/921032 (https://phabricator.wikimedia.org/T296832) [10:29:12] (03PS1) 10MVernon: hiera: remove ms-be204[0-3] from swift::storagehosts [puppet] - 10https://gerrit.wikimedia.org/r/921034 (https://phabricator.wikimedia.org/T335280) [10:29:16] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:29:58] RECOVERY - 
Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:29:59] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on an-worker1110.eqiad.wmnet with reason: Troubleshooting failed disk [10:30:08] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:30:13] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on an-worker1110.eqiad.wmnet with reason: Troubleshooting failed disk [10:31:00] ACKNOWLEDGEMENT - Check systemd state on an-worker1110 is CRITICAL: CRITICAL - degraded: The following units failed: var-lib-hadoop-data-f.mount Btullis Troubleshooting failed disk - T336929 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:31:00] ACKNOWLEDGEMENT - MegaRAID on an-worker1110 is CRITICAL: CRITICAL: 12 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough Btullis Troubleshooting failed disk - T336929 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [10:32:03] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-cache1001.eqiad.wmnet [10:32:15] (03CR) 10MVernon: "I've updated our docs a bit about the decom process - all looks clear to you?" 
[puppet] - 10https://gerrit.wikimedia.org/r/921034 (https://phabricator.wikimedia.org/T335280) (owner: 10MVernon) [10:37:58] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:45:50] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:50:01] (NodeTextfileStale) firing: (2) Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:50:15] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-cache1002.eqiad.wmnet [10:51:28] (03CR) 10Alexandros Kosiaris: [C: 03+1] Add nginx logs for docker-registry host to rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/919350 (https://phabricator.wikimedia.org/T322579) (owner: 10EoghanGaffney) [10:51:45] (03CR) 10Alexandros Kosiaris: [C: 03+1] Send nginx and docker-registry logs to kafka [puppet] - 10https://gerrit.wikimedia.org/r/919351 (https://phabricator.wikimedia.org/T322579) (owner: 10EoghanGaffney) [10:53:42] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:57:15] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-cache1002.eqiad.wmnet [11:00:37] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-cache1003.eqiad.wmnet [11:01:30] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:03:23] jouncebot: nowandnext [11:03:23] No 
deployments scheduled for the next 1 hour(s) and 56 minute(s) [11:03:23] In 1 hour(s) and 56 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230518T1300) [11:03:24] In 1 hour(s) and 56 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230518T1300) [11:04:36] (03PS6) 10Samtar: InitialiseSettings: Set wgWatchersMaxAge=30days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/919023 (https://phabricator.wikimedia.org/T336250) (owner: 10Sarah Mukuti) [11:05:13] * kart_ is deploying MinT [11:06:20] (03CR) 10KartikMistry: [C: 03+2] MinT: Update to 2023-05-18-060931-production and Set CT2_INTRA_THREADS to 0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/920687 (https://phabricator.wikimedia.org/T336483) (owner: 10KartikMistry) [11:07:03] (03Merged) 10jenkins-bot: MinT: Update to 2023-05-18-060931-production and Set CT2_INTRA_THREADS to 0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/920687 (https://phabricator.wikimedia.org/T336483) (owner: 10KartikMistry) [11:07:34] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-cache1003.eqiad.wmnet [11:07:42] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:09:28] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [11:11:00] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [11:11:07] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10cmooney) [11:11:17] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, 10netops: Move cloud vps ns-recursor IPs to host/row-independent 
addressing - https://phabricator.wikimedia.org/T307357 (10cmooney) [11:12:05] 10SRE, 10ops-codfw, 10Cloud-VPS, 10Infrastructure-Foundations, and 2 others: cloudservices[2004/2005]-dev & cloudweb2002-dev: connect them to cloudsw so they can have cloud-private vlan - https://phabricator.wikimedia.org/T336587 (10cmooney) 05Open→03Resolved Couple of niggles getting this going on the... [11:15:34] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:17:04] RECOVERY - Check systemd state on an-worker1110 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:17:28] RECOVERY - MegaRAID on an-worker1110 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:17:41] Question: where is the `deployment.eqiad.wmnet` service alias set? [11:18:29] TheresNoTime: dns repo I believe [11:20:47] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply [11:21:17] TheresNoTime: https://github.com/wikimedia/operations-dns/blob/master/templates/wmnet#L30 [11:21:55] RhinosF1: smh, I was looking at https://wikitech.wikimedia.org/wiki/Deployment_server#Service and expecting to see `deploy2002` [11:23:26] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:23:38] TheresNoTime: ah [11:23:39] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply [11:24:42] TheresNoTime: I think updating the page is manual and probably didn't follow the switchover if it's wrong [11:25:11] https://github.com/wikimedia/operations-dns/commit/20df3f9118e9f0471066e224638d0501483390e0 [11:25:20] Ye [11:26:25] * TheresNoTime shrug [11:27:14] TheresNoTime: 
it's a wiki, you can fix it :) [11:27:41] :p [11:28:05] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply [11:29:18] (03PS3) 10Slyngshede: Offboarding: Allow managers to offboard users. [software/bitu] - 10https://gerrit.wikimedia.org/r/920665 (https://phabricator.wikimedia.org/T335476) [11:31:18] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:31:54] 10SRE, 10Bitu, 10Infrastructure-Foundations, 10Patch-For-Review: User offboarding - https://phabricator.wikimedia.org/T335476 (10RhinosF1) Employees being off-boarded from the WMF may wish to continue in some roles as a volunteer. Will this support keeping some roles or switching from 'wmf' to 'nda' ldap... [11:34:02] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply [11:36:45] !log MinT: Update to 2023-05-18-060931-production and Set CT2_INTRA_THREADS to 0 (T336483) [11:36:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:49] T336483: Long sequence of a repeated word appears only when using MinT but not NLLB-200 directly - https://phabricator.wikimedia.org/T336483 [11:36:54] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM! 
Some comments in line in terms of the approach with interface names but overall looks good I expect it should work and do what we n" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/917876 (https://phabricator.wikimedia.org/T310590) (owner: 10Ayounsi) [11:37:34] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:38:03] (03CR) 10Cathal Mooney: [C: 03+2] Automate and update DHCP relay configuration [homer/public] - 10https://gerrit.wikimedia.org/r/908346 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [11:38:41] (03Merged) 10jenkins-bot: Automate and update DHCP relay configuration [homer/public] - 10https://gerrit.wikimedia.org/r/908346 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [11:41:23] 10SRE, 10Bitu, 10Infrastructure-Foundations, 10Patch-For-Review: User offboarding - https://phabricator.wikimedia.org/T335476 (10SLyngshede-WMF) As currently planned there will just be a list of roles/LDAP groups which is removed from users during off-boarding. Any other groups that user belongs to is not... 
[11:45:22] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:49:09] ACKNOWLEDGEMENT - MegaRAID on an-worker1110 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T336932 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:49:13] 10SRE, 10ops-eqiad: Degraded RAID on an-worker1110 - https://phabricator.wikimedia.org/T336932 (10ops-monitoring-bot) [11:51:10] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-etcd1001.eqiad.wmnet [11:52:10] (03PS3) 10Hnowlan: wip: upgrade container and dependencies for bullseye [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/920760 (https://phabricator.wikimedia.org/T336881) [11:53:08] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:55:02] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-etcd1001.eqiad.wmnet [11:55:49] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] prometheus: don't fail on unknown blackbox probe type [puppet] - 10https://gerrit.wikimedia.org/r/919331 (https://phabricator.wikimedia.org/T320620) (owner: 10Filippo Giunchedi) [11:56:08] (03CR) 10CI reject: [V: 04-1] wip: upgrade container and dependencies for bullseye [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/920760 (https://phabricator.wikimedia.org/T336881) (owner: 10Hnowlan) [11:56:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate 
[11:56:25] !log reconfiguring DHCP relay function on eqiad core routers (T320508) [11:56:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:29] T320508: Core routers: replace bootp with dhcp-relay - https://phabricator.wikimedia.org/T320508 [12:00:36] 10SRE-swift-storage, 10Observability-Metrics, 10User-fgiunchedi: Investigate HW requirements for Thanos frontend - https://phabricator.wikimedia.org/T312201 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Resolving as we'll be getting the hardware (same specs as current thanos-fe) [12:00:58] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:01:16] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [12:02:42] !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd1001.eqiad.wmnet [12:06:32] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd1001.eqiad.wmnet [12:08:40] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:11:16] !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd1002.eqiad.wmnet [12:12:10] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [12:12:15] 10SRE, 10Infrastructure-Foundations, 10netops: Core routers: replace bootp with dhcp-relay - https://phabricator.wikimedia.org/T320508 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bookworm [12:15:07] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm 
(exit_code=0) for VM ml-etcd1002.eqiad.wmnet [12:16:18] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:16:32] (03PS1) 10Filippo Giunchedi: sre: add systemd unit crashloop alert [alerts] - 10https://gerrit.wikimedia.org/r/921047 (https://phabricator.wikimedia.org/T293970) [12:16:41] !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd1003.eqiad.wmnet [12:17:27] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bookworm [12:17:32] 10SRE, 10Infrastructure-Foundations, 10netops: Core routers: replace bootp with dhcp-relay - https://phabricator.wikimedia.org/T320508 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors: - sretest1002... [12:18:24] (03CR) 10Filippo Giunchedi: "Current units crashlooping: https://thanos.wikimedia.org/graph?g0.expr=%20%20%20%20%20%20%20%20%20%20increase(node_systemd_service_restart" [alerts] - 10https://gerrit.wikimedia.org/r/921047 (https://phabricator.wikimedia.org/T293970) (owner: 10Filippo Giunchedi) [12:19:16] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [12:19:22] 10SRE, 10Infrastructure-Foundations, 10netops: Core routers: replace bootp with dhcp-relay - https://phabricator.wikimedia.org/T320508 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bookworm [12:20:32] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd1003.eqiad.wmnet [12:23:58] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[12:24:09] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bookworm [12:24:15] 10SRE, 10Infrastructure-Foundations, 10netops: Core routers: replace bootp with dhcp-relay - https://phabricator.wikimedia.org/T320508 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors: - sretest1002... [12:24:33] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [12:24:39] 10SRE, 10Infrastructure-Foundations, 10netops: Core routers: replace bootp with dhcp-relay - https://phabricator.wikimedia.org/T320508 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bookworm [12:28:46] !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-staging-ctrl2001.codfw.wmnet [12:30:51] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:34:08] !log otto@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [12:35:13] !log otto@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [12:35:18] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-staging-ctrl2001.codfw.wmnet [12:35:28] !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-staging-ctrl2002.codfw.wmnet [12:35:51] (03PS1) 10Ottomata: Revert "Enable First Input Delay events." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920735 [12:36:48] (03CR) 10Ottomata: [C: 03+2] Revert "Enable First Input Delay events." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/920735 (owner: 10Ottomata) [12:37:01] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:37:34] (03Merged) 10jenkins-bot: Revert "Enable First Input Delay events." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920735 (owner: 10Ottomata) [12:41:49] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-staging-ctrl2002.codfw.wmnet [12:44:08] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bookworm [12:44:13] 10SRE, 10Infrastructure-Foundations, 10netops: Core routers: replace bootp with dhcp-relay - https://phabricator.wikimedia.org/T320508 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors: - sretest1002... 
[12:44:30] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [12:44:35] 10SRE, 10Infrastructure-Foundations, 10netops: Core routers: replace bootp with dhcp-relay - https://phabricator.wikimedia.org/T320508 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bookworm [12:46:13] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:46:21] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:46:47] !log clean up old jupyterhub.service references (crash looping) on stat* nodes that had it [12:46:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:51] !log otto@deploy1002 Synchronized wmf-config/ext-EventLogging.php: Revert Enable First Input Delay events. This is causing validation errors as well as breakages in the hadoop ingestion pipepine - T332012 (duration: 07m 00s) [12:46:55] T332012: Collect first input delay - https://phabricator.wikimedia.org/T332012 [12:47:09] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:51:04] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bookworm [12:51:09] 10SRE, 10Infrastructure-Foundations, 10netops: Core routers: replace bootp with dhcp-relay - https://phabricator.wikimedia.org/T320508 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors: - sretest1002... 
[12:51:15] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:51:24] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [12:51:30] 10SRE, 10Infrastructure-Foundations, 10netops: Core routers: replace bootp with dhcp-relay - https://phabricator.wikimedia.org/T320508 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bookworm [12:51:48] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Couple of comments all around. The premise and logic sound fine to me. That being said, this is a huge patch, I did my best, but something" [puppet] - 10https://gerrit.wikimedia.org/r/909687 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [12:51:48] !log elukey@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-staging-worker [12:51:56] (03PS2) 10Samtar: Beta: Enable the new Special:Contribute page entry point for desktop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920988 (https://phabricator.wikimedia.org/T327868) (owner: 10KartikMistry) [12:52:56] (03CR) 10Alexandros Kosiaris: [C: 04-1] "PCC btw appears fine as far as I can tell. Changes only to puppet resources, so this should, in theory, be a noop" [puppet] - 10https://gerrit.wikimedia.org/r/909687 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm) [12:52:57] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:54:51] !log elukey@cumin1001 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on A:ml-staging-worker [12:55:54] kart_: if you're around, did you want me to push your beta-only change now? 
[12:56:48] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bookworm [12:56:53] 10SRE, 10Infrastructure-Foundations, 10netops: Core routers: replace bootp with dhcp-relay - https://phabricator.wikimedia.org/T320508 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors: - sretest1002... [12:57:03] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [12:57:09] 10SRE, 10Infrastructure-Foundations, 10netops: Core routers: replace bootp with dhcp-relay - https://phabricator.wikimedia.org/T320508 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bookworm [12:57:28] TheresNoTime: I also have a last-minute beta-only change :D [12:57:45] jan_drewniak: I'll do that now if that's okay? [12:57:53] TheresNoTime: yes please :) [12:57:54] (03PS3) 10Samtar: [Beta] Enable Vector AB test on beta spanish wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920793 (https://phabricator.wikimedia.org/T335972) (owner: 10Jdrewniak) [12:58:07] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49993 bytes in 0.076 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:58:13] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 20 Jun 2023 04:41:39 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:58:39] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.279 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:58:46] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920793 (https://phabricator.wikimedia.org/T335972) (owner: 10Jdrewniak) [12:59:18] !log otto@deploy1002 Synchronized wmf-config/ext-EventStreamConfig.php: Revert Enable First Input Delay events. This is causing validation errors as well as breakages in the hadoop ingestion pipepine - T332012 (duration: 06m 19s) [12:59:22] T332012: Collect first input delay - https://phabricator.wikimedia.org/T332012 [12:59:29] (03Merged) 10jenkins-bot: [Beta] Enable Vector AB test on beta spanish wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920793 (https://phabricator.wikimedia.org/T335972) (owner: 10Jdrewniak) [13:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230518T1300) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230518T1300). [13:00:05] kart_, TheresNoTime, and jan_drewniak: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:07] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:00:11] * TheresNoTime can deploy [13:00:29] jan_drewniak: done, will be on the next `beta-code-update-eqiad` [13:00:39] TheresNoTime: thanks! [13:00:42] TheresNoTime: sure. 
Go ahead. [13:00:57] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920988 (https://phabricator.wikimedia.org/T327868) (owner: 10KartikMistry) [13:01:06] (03PS3) 10Samtar: Beta: Enable the new Special:Contribute page entry point for desktop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920988 (https://phabricator.wikimedia.org/T327868) (owner: 10KartikMistry) [13:01:53] (03CR) 10TrainBranchBot: "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920988 (https://phabricator.wikimedia.org/T327868) (owner: 10KartikMistry) [13:02:42] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bookworm [13:02:42] (03Merged) 10jenkins-bot: Beta: Enable the new Special:Contribute page entry point for desktop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920988 (https://phabricator.wikimedia.org/T327868) (owner: 10KartikMistry) [13:02:49] 10SRE, 10Infrastructure-Foundations, 10netops: Core routers: replace bootp with dhcp-relay - https://phabricator.wikimedia.org/T320508 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors: - sretest1002... 
[13:03:19] kart_: done, in the next `beta-code-update-eqiad` :) [13:03:38] (03PS7) 10Samtar: InitialiseSettings: Set wgWatchersMaxAge=30days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/919023 (https://phabricator.wikimedia.org/T336250) (owner: 10Sarah Mukuti) [13:04:38] (03CR) 10ArielGlenn: [C: 03+2] Add dump user subdirectories to support testing of new dumps nfs shares [puppet] - 10https://gerrit.wikimedia.org/r/915423 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [13:04:57] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/919023 (https://phabricator.wikimedia.org/T336250) (owner: 10Sarah Mukuti) [13:05:40] (03Merged) 10jenkins-bot: InitialiseSettings: Set wgWatchersMaxAge=30days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/919023 (https://phabricator.wikimedia.org/T336250) (owner: 10Sarah Mukuti) [13:06:09] !log samtar@deploy1002 Started scap: Backport for [[gerrit:919023|InitialiseSettings: Set wgWatchersMaxAge=30days (T336250)]] [13:06:13] T336250: Decrease wgWatchersMaxAge to 30 days, and display this on `action=info` - https://phabricator.wikimedia.org/T336250 [13:06:15] (03CR) 10ArielGlenn: [C: 03+2] add nfs tester to dumps worker (snapshot) testbed role [puppet] - 10https://gerrit.wikimedia.org/r/915437 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [13:06:34] TheresNoTime: looks good. Had to refresh page multiple times! Please go ahead for full deployment. 
[13:07:27] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:07:31] kart_: that should be live proper on the beta cluster now [13:07:40] !log samtar@deploy1002 samtar and s-mukuti: Backport for [[gerrit:919023|InitialiseSettings: Set wgWatchersMaxAge=30days (T336250)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [13:07:43] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [13:07:51] (testing my patch ^) [13:08:04] TheresNoTime: Thanks! [13:09:27] (syncing mine) [13:13:18] (03PS2) 10ArielGlenn: create custom db list files for testing of nfs shares for xml dumps [puppet] - 10https://gerrit.wikimedia.org/r/915447 (https://phabricator.wikimedia.org/T325232) [13:14:54] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:919023|InitialiseSettings: Set wgWatchersMaxAge=30days (T336250)]] (duration: 08m 45s) [13:14:59] T336250: Decrease wgWatchersMaxAge to 30 days, and display this on `action=info` - https://phabricator.wikimedia.org/T336250 [13:16:15] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:18:44] !log closing backport window [13:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:16] (03CR) 10Vgutierrez: [C: 03+1] cadvisor: disable percpu and cpuLoad metric classes [puppet] - 10https://gerrit.wikimedia.org/r/920661 (https://phabricator.wikimedia.org/T108027) (owner: 10Filippo Giunchedi) [13:23:37] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:24:10] (03CR) 10Herron: 
[C: 03+1] sre: add systemd unit crashloop alert [alerts] - 10https://gerrit.wikimedia.org/r/921047 (https://phabricator.wikimedia.org/T293970) (owner: 10Filippo Giunchedi) [13:25:52] 10SRE, 10ops-codfw, 10Cloud-Services, 10cloud-services-team: cloudbackup2001 lockup on 2023-05-05 - https://phabricator.wikimedia.org/T336060 (10Jhancock.wm) @Andrew I went back through the lifecycle logs on the idrac and I could not find a cause for the ssh going down or the lagged response. I inspected t... [13:31:07] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:38:47] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:43:41] (03PS1) 10Andrew Bogott: Revert "envscripts: include OS_CLOUD in environment." [puppet] - 10https://gerrit.wikimedia.org/r/920736 [13:44:37] (03CR) 10Andrew Bogott: [C: 03+2] Revert "envscripts: include OS_CLOUD in environment." [puppet] - 10https://gerrit.wikimedia.org/r/920736 (owner: 10Andrew Bogott) [13:46:21] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:47:18] !log elukey@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-staging-worker [13:49:21] !log elukey@cumin1001 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on A:ml-staging-worker [13:50:54] elukey: does part of what you're doing involve reimaging an-worker1156 ? 
[13:50:55] !log elukey@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-staging-worker [13:51:14] (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs2022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:52:25] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:52:47] topranks: nope! only ml nodes [13:52:57] !log elukey@cumin1001 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on A:ml-staging-worker [13:53:14] ok - just checking - I may have borked DHCP for some things, working on it..... someone else must be trying [13:53:15] thanks! [13:56:38] (03PS1) 10KartikMistry: Enable the new Special:Contribute page entry point for desktop on selected wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921049 (https://phabricator.wikimedia.org/T327868) [13:58:34] (03PS1) 10Samtar: logspam-watch: Add a fox emoji [puppet] - 10https://gerrit.wikimedia.org/r/921050 [13:59:40] !log elukey@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-serve-worker-codfw [14:01:09] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:01:19] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:01:21] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:01:49] !log elukey@cumin1001 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on A:ml-serve-worker-codfw [14:06:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:08:45] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:12:08] (03PS2) 10Hnowlan: engine: Remove custom XCF handler in favor of ImageMagick [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/619864 (https://phabricator.wikimedia.org/T260285) (owner: 10AntiCompositeNumber) [14:12:43] (03PS1) 10Elukey: knative-serving: add replica count config for webhook [deployment-charts] - 10https://gerrit.wikimedia.org/r/921052 [14:15:23] (03CR) 10Ssingh: "Thanks for working on this patch! 
Comments inline:" [puppet] - 10https://gerrit.wikimedia.org/r/920794 (https://phabricator.wikimedia.org/T335533) (owner: 10BCornwall) [14:16:07] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:13] (03CR) 10Elukey: [C: 03+2] knative-serving: add replica count config for webhook [deployment-charts] - 10https://gerrit.wikimedia.org/r/921052 (owner: 10Elukey) [14:16:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:16:33] (03PS3) 10Hnowlan: engine: Remove custom XCF handler in favor of ImageMagick [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/619864 (https://phabricator.wikimedia.org/T260285) (owner: 10AntiCompositeNumber) [14:20:54] (03PS1) 10Hnowlan: thumbor: move xcf support to imagemagick [deployment-charts] - 10https://gerrit.wikimedia.org/r/921053 (https://phabricator.wikimedia.org/T260285) [14:23:29] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:23:56] (03CR) 10CI reject: [V: 04-1] engine: Remove custom XCF handler in favor of ImageMagick [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/619864 (https://phabricator.wikimedia.org/T260285) (owner: 10AntiCompositeNumber) [14:24:21] PROBLEM - Host gitlab-runner1003 is DOWN: PING CRITICAL - Packet loss = 100% [14:25:27] RECOVERY - Host gitlab-runner1003 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [14:30:02] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:30:28] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[14:30:55] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:31:19] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bookworm [14:31:25] 10SRE, 10Infrastructure-Foundations, 10netops: Core routers: replace bootp with dhcp-relay - https://phabricator.wikimedia.org/T320508 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors: - sretest1002... [14:31:43] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [14:31:48] 10SRE, 10Infrastructure-Foundations, 10netops: Core routers: replace bootp with dhcp-relay - https://phabricator.wikimedia.org/T320508 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bookworm [14:34:30] !log elukey@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-staging-worker [14:38:19] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:38:58] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts gitlab-runner1003.eqiad.wmnet [14:41:12] 10SRE, 10Traffic: Add systemd-level service bindings for Wikimedia DNS - https://phabricator.wikimedia.org/T336792 (10ssingh) [14:43:35] (03PS4) 10Hnowlan: engine: Remove custom XCF handler in favor of ImageMagick [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/619864 (https://phabricator.wikimedia.org/T260285) (owner: 10AntiCompositeNumber) [14:45:43] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:46:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:47:21] (03PS1) 10Cathal Mooney: Add trust-option-82 to dhcp relay conf for core routers [homer/public] - 10https://gerrit.wikimedia.org/r/921054 (https://phabricator.wikimedia.org/T320508) [14:49:45] (03CR) 10Dzahn: [C: 03+2] openstack: remove old Gerrit IP from cloudgw [puppet] - 10https://gerrit.wikimedia.org/r/919408 (https://phabricator.wikimedia.org/T336427) (owner: 10Dzahn) [14:49:50] (03PS2) 10Dzahn: openstack: remove old Gerrit IP from cloudgw [puppet] - 10https://gerrit.wikimedia.org/r/919408 (https://phabricator.wikimedia.org/T336427) [14:50:01] (NodeTextfileStale) firing: (2) Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:50:02] (03PS1) 10Elukey: knative-serving: fix pdb for the webhook [deployment-charts] - 10https://gerrit.wikimedia.org/r/921055 [14:51:16] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:52:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:52:54] (03CR) 
10Elukey: [C: 03+2] knative-serving: fix pdb for the webhook [deployment-charts] - 10https://gerrit.wikimedia.org/r/921055 (owner: 10Elukey) [14:53:21] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:54:15] (03CR) 10Cathal Mooney: [C: 03+2] Add trust-option-82 to dhcp relay conf for core routers [homer/public] - 10https://gerrit.wikimedia.org/r/921054 (https://phabricator.wikimedia.org/T320508) (owner: 10Cathal Mooney) [14:54:50] (03Merged) 10jenkins-bot: Add trust-option-82 to dhcp relay conf for core routers [homer/public] - 10https://gerrit.wikimedia.org/r/921054 (https://phabricator.wikimedia.org/T320508) (owner: 10Cathal Mooney) [14:56:32] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [14:57:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:57:05] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [14:58:29] (03Abandoned) 10Samtar: diff: Only show inline legend for text slot [core] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920578 (https://phabricator.wikimedia.org/T336481) (owner: 10Samtar) [14:59:35] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [14:59:56] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [15:00:04] (03CR) 10Dzahn: [C: 03+1] "I was personally for it but maybe there is no consensus on this one. Other people should also vote here." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/899772 (https://phabricator.wikimedia.org/T332202) (owner: 10BCornwall) [15:00:33] (03PS1) 10Ottomata: page_content_chnage - Shorten name of service and namespace in wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/921056 (https://phabricator.wikimedia.org/T330507) [15:00:35] (03Abandoned) 10Samtar: onDifferenceEngineBeforeDiffTable: Return early on Special pages [extensions/VisualEditor] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920579 (https://phabricator.wikimedia.org/T336582) (owner: 10Samtar) [15:01:05] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:01:08] (03CR) 10CI reject: [V: 04-1] page_content_chnage - Shorten name of service and namespace in wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/921056 (https://phabricator.wikimedia.org/T330507) (owner: 10Ottomata) [15:02:38] (03CR) 10Dzahn: "yea, I mean I am not opposed to it but it probably needs testing and that is kind of the part I wanted to skip. 
so not sure how to vote, I" [puppet] - 10https://gerrit.wikimedia.org/r/918424 (https://phabricator.wikimedia.org/T336217) (owner: 10Jelto) [15:02:58] !log stevemunene@deploy1002 Started deploy [airflow-dags/analytics_product@6e3358d]: (no justification provided) [15:03:05] !log stevemunene@deploy1002 Finished deploy [airflow-dags/analytics_product@6e3358d]: (no justification provided) (duration: 00m 06s) [15:04:19] (03PS2) 10Ottomata: page_content_chnage - Shorten name of service and namespace in wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/921056 (https://phabricator.wikimedia.org/T330507) [15:04:43] !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-staging-etcd2001.codfw.wmnet [15:06:31] (03CR) 10Hnowlan: rest-gateway: add citoid support (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/920710 (https://phabricator.wikimedia.org/T329049) (owner: 10Hnowlan) [15:08:37] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-staging-etcd2001.codfw.wmnet [15:08:45] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:09:12] !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-staging-etcd2002.codfw.wmnet [15:10:48] (03CR) 10Ottomata: [C: 03+2] page_content_chnage - Shorten name of service and namespace in wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/921056 (https://phabricator.wikimedia.org/T330507) (owner: 10Ottomata) [15:13:06] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-staging-etcd2002.codfw.wmnet [15:13:14] (03Merged) 10jenkins-bot: page_content_chnage - Shorten name of service and namespace in wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/921056 (https://phabricator.wikimedia.org/T330507) (owner: 10Ottomata) [15:15:23] !log 
otto@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [15:16:22] !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-staging-etcd2003.codfw.wmnet [15:16:27] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:17:19] !log otto@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [15:18:17] !log otto@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [15:18:29] 10SRE, 10Infrastructure-Foundations, 10netops: Core routers: replace bootp with dhcp-relay - https://phabricator.wikimedia.org/T320508 (10cmooney) >>! In T320508#8488549, @ayounsi wrote: > Marking this task dependent on DHCP option 97 to reduce the risk of DHCP oddities related to Option 82. Ironic I hadn't... [15:18:39] !log otto@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [15:19:14] 10SRE, 10Infrastructure-Foundations, 10netops: Core routers: replace bootp with dhcp-relay - https://phabricator.wikimedia.org/T320508 (10cmooney) 05Open→03Resolved [15:19:23] !log elukey@cumin1001 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on A:ml-staging-worker [15:20:15] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-staging-etcd2003.codfw.wmnet [15:20:25] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Consolidate Automation Templates for DC Switches - https://phabricator.wikimedia.org/T312635 (10cmooney) [15:20:48] 10SRE, 10Infrastructure-Foundations, 10netops: Allow managing drmrs DHCP settings with Homer - https://phabricator.wikimedia.org/T328737 (10cmooney) 05Open→03Resolved Complete now after merging above patch.
[15:22:31] (03PS1) 10Ottomata: kubernetes::deployment_server::services - define mw-page-content-change-enrich [puppet] - 10https://gerrit.wikimedia.org/r/921058 (https://phabricator.wikimedia.org/T330507) [15:22:35] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:23:13] !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl1001.eqiad.wmnet [15:23:59] (03CR) 10Ottomata: [C: 03+2] kubernetes::deployment_server::services - define mw-page-content-change-enrich [puppet] - 10https://gerrit.wikimedia.org/r/921058 (https://phabricator.wikimedia.org/T330507) (owner: 10Ottomata) [15:25:33] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bookworm [15:25:39] 10SRE, 10Infrastructure-Foundations, 10netops: Core routers: replace bootp with dhcp-relay - https://phabricator.wikimedia.org/T320508 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors: - sretest1002... [15:26:20] (03CR) 10Brennen Bearnes: [C: 03+1] "Reasonable." [puppet] - 10https://gerrit.wikimedia.org/r/921050 (owner: 10Samtar) [15:26:28] :p [15:26:43] PROBLEM - Check systemd state on ml-staging2002 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:28:56] ignores that because it's staging.. 
but normally failed ferm is bad [15:29:53] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl1001.eqiad.wmnet [15:30:05] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:30:47] mutante: temporary dns failure, fixed :) [15:30:52] (after a reboot) [15:30:55] elukey: cool:) ack! [15:31:07] !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl1002.eqiad.wmnet [15:31:09] failed DNS did come to mind when I saw failed ferm, indeed [15:31:11] RECOVERY - Check systemd state on ml-staging2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:36:01] (03CR) 10Eevans: [C: 03+1] hiera: remove ms-be204[0-3] from swift::storagehosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/921034 (https://phabricator.wikimedia.org/T335280) (owner: 10MVernon) [15:37:33] !log cmooney@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bullseye [15:37:37] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:37:39] 10SRE, 10Infrastructure-Foundations, 10netops: Core routers: replace bootp with dhcp-relay - https://phabricator.wikimedia.org/T320508 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bullseye [15:37:43] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl1002.eqiad.wmnet [15:39:45] (03CR) 10Bking: [C: 03+2] wcqs: Configure webproxy for federated queries [puppet] - 10https://gerrit.wikimedia.org/r/919216 (https://phabricator.wikimedia.org/T334470) (owner: 10Ebernhardson) [15:43:28] (03PS4) 
10EoghanGaffney: doc: add password-protected rsync module for publishing from gitlab [puppet] - 10https://gerrit.wikimedia.org/r/920669 (https://phabricator.wikimedia.org/T336168) (owner: 10Jaime Nuche) [15:44:06] try typing the word "piwik" without your hand auto-completing to "wiki", I needed 3 times :) [15:45:17] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:36] (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41231/console" [puppet] - 10https://gerrit.wikimedia.org/r/920669 (https://phabricator.wikimedia.org/T336168) (owner: 10Jaime Nuche) [15:46:15] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - kubemaster_6443: Servers kubemaster2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:46:21] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - kubemaster_6443: Servers kubemaster2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:46:27] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10Performance-Team (Radar): Mapping Client IPs to Resolver IPs - https://phabricator.wikimedia.org/T336947 (10JameelKaisar) [15:47:15] !log mforns@deploy1002 Started deploy [airflow-dags/analytics_test@6e3358d]: (no justification provided) [15:47:25] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics_test@6e3358d]: (no justification provided) (duration: 00m 10s) [15:49:34] (KubernetesAPILatency) firing: (19) High Kubernetes API latency (LIST apiservices) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:50:24] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime 
for 2:00:00 on sretest1002.eqiad.wmnet with reason: host reimage [15:52:41] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:53:37] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1002.eqiad.wmnet with reason: host reimage [15:54:34] (KubernetesAPILatency) resolved: (21) High Kubernetes API latency (LIST apiservices) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:57:49] !log bking@cumin1001 starting rolling restart of wcqs for java updates T334470 [15:57:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:53] T334470: Federated queries to Lingua Libre time out in the Commons query service - https://phabricator.wikimedia.org/T334470 [15:58:18] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [15:58:21] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [15:58:33] (03PS1) 10Kimberly Sarabia: Reverts hewiki A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921059 (https://phabricator.wikimedia.org/T335309) [15:59:59] (03CR) 10EoghanGaffney: [V: 03+1 C: 03+1] "Looks good, with the additional change of switching the static list of gitlab-runner hosts to `wmflib::role::hosts` so that it resolves th" [puppet] - 10https://gerrit.wikimedia.org/r/920669 (https://phabricator.wikimedia.org/T336168) (owner: 10Jaime Nuche) [16:00:04] jbond and rzl: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230518T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:00:16] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:04:06] (03PS2) 10Kimberly Sarabia: Reverts hewiki A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921059 (https://phabricator.wikimedia.org/T335309) [16:07:24] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:10:02] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1002.eqiad.wmnet with OS bullseye [16:10:08] 10SRE, 10Infrastructure-Foundations, 10netops: Core routers: replace bootp with dhcp-relay - https://phabricator.wikimedia.org/T320508 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bullseye completed: - sretest1002 (**PASS**)... 
[16:11:36] (03PS1) 10Ottomata: page_content_change - fix kafka SSL setting in wikikube staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/921060 (https://phabricator.wikimedia.org/T330507) [16:13:01] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T336949 (10phaultfinder) [16:15:20] (03CR) 10MVernon: [C: 03+2] hiera: remove ms-be204[0-3] from swift::storagehosts [puppet] - 10https://gerrit.wikimedia.org/r/921034 (https://phabricator.wikimedia.org/T335280) (owner: 10MVernon) [16:15:20] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:05] (03CR) 10Ottomata: [C: 03+2] page_content_change - fix kafka SSL setting in wikikube staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/921060 (https://phabricator.wikimedia.org/T330507) (owner: 10Ottomata) [16:19:46] (03PS1) 10Ottomata: page_content_change - move common kafka SSL settings into values-main.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/921061 (https://phabricator.wikimedia.org/T330507) [16:20:58] (03CR) 10Ottomata: [C: 03+2] page_content_change - move common kafka SSL settings into values-main.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/921061 (https://phabricator.wikimedia.org/T330507) (owner: 10Ottomata) [16:21:21] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [16:21:25] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [16:22:38] PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.67:443]) https://wikitech.wikimedia.org/wiki/PyBal [16:22:40] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:24:14] (03CR) 
10Hnowlan: [C: 03+1] engine: Remove custom XCF handler in favor of ImageMagick [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/619864 (https://phabricator.wikimedia.org/T260285) (owner: 10AntiCompositeNumber) [16:25:19] (03PS1) 10Andrew Bogott: cinder-backup: reduce backup workers to 2 per host [puppet] - 10https://gerrit.wikimedia.org/r/921063 [16:25:28] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.67:443]) https://wikitech.wikimedia.org/wiki/PyBal [16:26:03] (03CR) 10Andrew Bogott: [C: 03+2] cinder-backup: reduce backup workers to 2 per host [puppet] - 10https://gerrit.wikimedia.org/r/921063 (owner: 10Andrew Bogott) [16:29:18] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:29:47] (03CR) 10Jdrewniak: [C: 03+1] Reverts hewiki A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921059 (https://phabricator.wikimedia.org/T335309) (owner: 10Kimberly Sarabia) [16:29:52] ^ pybal alerts about wcqs [16:29:59] known/ongoing maint stuff? [16:30:09] bblack my bad, I forgot to repool. 
1 sec [16:30:25] np, just checking [16:30:26] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:31:00] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:31:11] (03PS1) 10TChin: Allow managing leases in flink-operator namespace when using HA [deployment-charts] - 10https://gerrit.wikimedia.org/r/921064 (https://phabricator.wikimedia.org/T336185) [16:33:44] RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:34:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:38:12] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:42:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:43:38] (03CR) 10Ottomata: "LGTM! Nits about comments." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/921064 (https://phabricator.wikimedia.org/T336185) (owner: 10TChin) [16:45:58] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:47:16] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:49:20] (03PS12) 10Eevans: (WIP) cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) [16:49:49] (03CR) 10CI reject: [V: 04-1] (WIP) cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans) [16:53:44] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:54:19] (03PS1) 10Dzahn: httpbb: add simple tests for gitlab.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/921065 (https://phabricator.wikimedia.org/T326891) [16:55:22] RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:55:51] !log push new pfw policies - T336896 [16:55:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:00] (03CR) 10Dzahn: "This was also supposed to be like a demo of the syntax for other services, if you have ideas what are good test assertions." 
[puppet] - 10https://gerrit.wikimedia.org/r/921065 (https://phabricator.wikimedia.org/T326891) (owner: 10Dzahn) [16:56:54] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:57:21] (03PS2) 10Dzahn: httpbb: add simple tests for gitlab.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/921065 (https://phabricator.wikimedia.org/T326891) [16:57:32] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:00:05] bd808: Time to snap out of that daydream and deploy Technical Engagement weekly deploy (Toolhub, Developer portal, Striker). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230518T1700). [17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230518T1700) [17:00:16] (03CR) 10Hokwelum: [C: 03+1] "checks out!" 
[puppet] - 10https://gerrit.wikimedia.org/r/915447 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [17:00:38] (03PS2) 10TChin: Allow managing leases in flink-operator namespace when using HA [deployment-charts] - 10https://gerrit.wikimedia.org/r/921064 (https://phabricator.wikimedia.org/T336185) [17:01:15] (03CR) 10ArielGlenn: [C: 03+2] create custom db list files for testing of nfs shares for xml dumps [puppet] - 10https://gerrit.wikimedia.org/r/915447 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [17:01:28] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:03:30] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:03:33] (03PS2) 10ArielGlenn: introduce a script to create output dirs on xml dumps test nfs shares [puppet] - 10https://gerrit.wikimedia.org/r/915455 (https://phabricator.wikimedia.org/T325232) [17:05:03] (03PS1) 10Clare Ming: Revert "Revert "VisualEditorFeatureUse sampling rate to 1 everywhere"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920742 [17:05:12] (03PS2) 10Clare Ming: Revert "Revert "VisualEditorFeatureUse sampling rate to 1 everywhere"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920742 [17:05:14] (03PS3) 10BCornwall: doh: Clearer expression of service dependencies [puppet] - 10https://gerrit.wikimedia.org/r/920794 (https://phabricator.wikimedia.org/T335533) [17:06:41] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41233/console" [puppet] - 10https://gerrit.wikimedia.org/r/920794 (https://phabricator.wikimedia.org/T335533) (owner: 10BCornwall) [17:07:29] (03CR) 10BCornwall: [V: 03+1] doh: Clearer expression of service dependencies (032 comments) [puppet] - 
10https://gerrit.wikimedia.org/r/920794 (https://phabricator.wikimedia.org/T335533) (owner: 10BCornwall) [17:07:31] (03CR) 10Ottomata: [C: 03+2] Allow managing leases in flink-operator namespace when using HA [deployment-charts] - 10https://gerrit.wikimedia.org/r/921064 (https://phabricator.wikimedia.org/T336185) (owner: 10TChin) [17:07:40] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:08:45] (03PS1) 10Ayounsi: Update ntp_servers list [homer/public] - 10https://gerrit.wikimedia.org/r/921066 [17:09:15] (03CR) 10CI reject: [V: 04-1] Update ntp_servers list [homer/public] - 10https://gerrit.wikimedia.org/r/921066 (owner: 10Ayounsi) [17:09:52] (03Merged) 10jenkins-bot: Allow managing leases in flink-operator namespace when using HA [deployment-charts] - 10https://gerrit.wikimedia.org/r/921064 (https://phabricator.wikimedia.org/T336185) (owner: 10TChin) [17:12:18] (03PS2) 10Ayounsi: Update ntp_servers list [homer/public] - 10https://gerrit.wikimedia.org/r/921066 [17:13:05] (03PS1) 10Ottomata: flink-operator - enable HA leader election and set replicas: 2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/921067 (https://phabricator.wikimedia.org/T336185) [17:13:07] (03PS2) 10Ayounsi: users: change my own key to ed25519 [homer/public] - 10https://gerrit.wikimedia.org/r/920638 (https://phabricator.wikimedia.org/T336769) (owner: 10Btullis) [17:13:24] (03PS1) 10Dzahn: httpbb: add simple test for VRTS [puppet] - 10https://gerrit.wikimedia.org/r/921068 (https://phabricator.wikimedia.org/T326891) [17:14:35] (03CR) 10Ssingh: [C: 03+1] "Checked active DNS hosts and the IPs, looks good!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/921066 (owner: 10Ayounsi) [17:15:05] (03CR) 10Ayounsi: [C: 03+2] Update ntp_servers list [homer/public] - 10https://gerrit.wikimedia.org/r/921066 (owner: 10Ayounsi) [17:15:25] (03CR) 10Ayounsi: [C: 03+2] users: change my own key to ed25519 [homer/public] - 10https://gerrit.wikimedia.org/r/920638 (https://phabricator.wikimedia.org/T336769) (owner: 10Btullis) [17:15:26] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:15:32] (03PS5) 10Btullis: Use the spark3 shuffle jars to yarn on a test host [puppet] - 10https://gerrit.wikimedia.org/r/901604 (https://phabricator.wikimedia.org/T329363) [17:15:40] (03Merged) 10jenkins-bot: Update ntp_servers list [homer/public] - 10https://gerrit.wikimedia.org/r/921066 (owner: 10Ayounsi) [17:15:50] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 20 Jun 2023 04:41:39 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:16:00] (03Merged) 10jenkins-bot: users: change my own key to ed25519 [homer/public] - 10https://gerrit.wikimedia.org/r/920638 (https://phabricator.wikimedia.org/T336769) (owner: 10Btullis) [17:16:10] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.290 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:16:16] (03PS3) 10Dzahn: httpbb: add simple tests for gitlab.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/921065 (https://phabricator.wikimedia.org/T326891) [17:16:18] (03PS2) 10Dzahn: httpbb: add simple test for VRTS [puppet] - 10https://gerrit.wikimedia.org/r/921068 (https://phabricator.wikimedia.org/T326891) [17:17:06] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49993 bytes in 0.062 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:17:14] (03PS1) 10TChin: "Bump flink-operator version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/921071 [17:18:43] (03PS6) 10Btullis: Use the spark3 shuffle jars to yarn on a test host [puppet] - 10https://gerrit.wikimedia.org/r/901604 (https://phabricator.wikimedia.org/T329363) [17:18:57] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/901604 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [17:20:45] (03CR) 10Ottomata: [C: 03+2] "Bump flink-operator version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/921071 (owner: 10TChin) [17:20:54] (03PS1) 10BCornwall: users: Update brett's key to ed25519 [homer/public] - 10https://gerrit.wikimedia.org/r/921073 (https://phabricator.wikimedia.org/T336769) [17:23:14] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:25:59] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START 
helmfile.d/admin 'apply'. [17:26:05] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [17:26:20] !log otto@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [17:26:52] !log otto@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [17:26:58] !log otto@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [17:27:14] !log otto@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [17:28:45] (03CR) 10Ottomata: [C: 03+2] flink-operator - enable HA leader election and set replicas: 2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/921067 (https://phabricator.wikimedia.org/T336185) (owner: 10Ottomata) [17:29:21] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [17:29:58] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [17:31:04] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:35:42] !log otto@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [17:36:25] !log otto@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [17:37:35] !log otto@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [17:38:27] !log otto@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[17:38:52] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:38:52] (03PS1) 10Dzahn: httpbb: add tests for gerrit and gerrit-replica [puppet] - 10https://gerrit.wikimedia.org/r/921075 (https://phabricator.wikimedia.org/T326891) [17:40:12] (03PS3) 10Dzahn: httpbb: add simple test for VRTS [puppet] - 10https://gerrit.wikimedia.org/r/921068 (https://phabricator.wikimedia.org/T326891) [17:41:01] (03PS4) 10Dzahn: httpbb: add simple tests for gitlab.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/921065 (https://phabricator.wikimedia.org/T326891) [17:45:06] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:45:22] PROBLEM - Host ps1-c5-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [17:51:14] (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs2022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:52:54] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:59:21] (03CR) 10Andrea Denisse: [C: 03+1] sre: add systemd unit crashloop alert [alerts] - 10https://gerrit.wikimedia.org/r/921047 (https://phabricator.wikimedia.org/T293970) (owner: 10Filippo Giunchedi) [17:59:39] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge elasticsearch and plugin upgrade - bking@cumin1001 - T274204 [17:59:44] T274204: Deploy new version of Extra Plugin (with 
Khmer filter) to Elasticsearch cluster - https://phabricator.wikimedia.org/T274204 [18:00:04] dancy, hashar, and brennen: Your horoscope predicts another unfortunate MediaWiki train - Utc-7+Utc-0 Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230518T1800). [18:00:11] o/ [18:00:30] rolling forward as soon as i get a cup of tea and an english muffin made. [18:00:42] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:04:21] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [18:07:35] !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge elasticsearch and plugin upgrade - bking@cumin1001 - T274204 [18:07:39] T274204: Deploy new version of Extra Plugin (with Khmer filter) to Elasticsearch cluster - https://phabricator.wikimedia.org/T274204 [18:08:32] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:09:34] (03PS1) 10TrainBranchBot: group2 wikis to 1.41.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921079 (https://phabricator.wikimedia.org/T330215) [18:09:36] (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.41.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921079 (https://phabricator.wikimedia.org/T330215) (owner: 10TrainBranchBot) [18:09:50] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (2 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic elasticsearch and plugin upgrade - bking@cumin1001 - T332355 [18:09:55] T332355: Deploy Turkish Analyzer Plugin - https://phabricator.wikimedia.org/T332355 [18:10:41] (03Merged) 
10jenkins-bot: group2 wikis to 1.41.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921079 (https://phabricator.wikimedia.org/T330215) (owner: 10TrainBranchBot) [18:11:27] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw plugin upgrade - bking@cumin1001 - T332355 [18:12:16] (03PS1) 10Ebernhardson: Update prometheus-blazegraph-exporter for python 3.9 [puppet] - 10https://gerrit.wikimedia.org/r/921081 [18:12:36] (03PS3) 10Hokwelum: introduce a script to create output dirs on xml dumps test nfs shares [puppet] - 10https://gerrit.wikimedia.org/r/915455 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [18:16:20] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:18:22] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.41.0-wmf.9 refs T330215 [18:18:27] T330215: 1.41.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T330215 [18:19:23] (03PS4) 10Hokwelum: introduce a script to create output dirs on xml dumps test nfs shares [puppet] - 10https://gerrit.wikimedia.org/r/915455 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [18:19:42] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Update network SSH keys to ssh-ed25519 - https://phabricator.wikimedia.org/T336769 (10ayounsi) [18:19:57] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add ssw1 irb int dns - cmooney@cumin1001" [18:20:54] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add ssw1 irb int dns - cmooney@cumin1001" [18:20:55] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox 
(exit_code=0) [18:22:32] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:27:43] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [18:29:20] rolling back here. [18:30:08] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add ssw1 irb int dns - cmooney@cumin1001" [18:30:22] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:30:53] (03PS1) 10TrainBranchBot: group2 wikis to 1.41.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921083 (https://phabricator.wikimedia.org/T330215) [18:30:55] (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.41.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921083 (https://phabricator.wikimedia.org/T330215) (owner: 10TrainBranchBot) [18:31:09] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add ssw1 irb int dns - cmooney@cumin1001" [18:31:09] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:31:42] (03Merged) 10jenkins-bot: group2 wikis to 1.41.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921083 (https://phabricator.wikimedia.org/T330215) (owner: 10TrainBranchBot) [18:33:38] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts gitlab-runner1003.eqiad.wmnet [18:36:49] 10SRE, 10All-and-every-Wikisource, 10Search-Console-access-request: Search Console access request for Wikisource (Volunteer) - https://phabricator.wikimedia.org/T336255 (10KFrancis) Hi All, I know we're waiting on some approvals, but in the meantime, I will need the volunteer's full 
name, mailing address, an... [18:38:02] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:40:34] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Hghani - https://phabricator.wikimedia.org/T322145 (10kzimmerman) 05Resolved→03Open Hi all, we have re-hired Hamid Ghani. I am the hiring manager. Can you please re-enable his accounts? Thank you! [18:40:36] 10SRE, 10ops-eqiad: Degraded RAID on an-worker1110 - https://phabricator.wikimedia.org/T336932 (10Jclark-ctr) Submitted Dell ticket Confirmed: Service Request 168420493 was successfully submitted. [18:40:51] 10SRE, 10ops-eqiad: Degraded RAID on an-worker1110 - https://phabricator.wikimedia.org/T336932 (10Jclark-ctr) a:03Jclark-ctr [18:40:53] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Hghani - https://phabricator.wikimedia.org/T322145 (10kzimmerman) [18:41:41] 10SRE, 10ops-eqiad, 10serviceops-collab, 10GitLab (Infrastructure): gitlab-runner1003 is not coming back online - https://phabricator.wikimedia.org/T336737 (10Jclark-ctr) @Jelto Updated Firmware on Server additionally and performed reboot still boots properly [18:44:10] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1112.eqiad.wmnet - https://phabricator.wikimedia.org/T336332 (10Jclark-ctr) [18:44:21] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1112.eqiad.wmnet - https://phabricator.wikimedia.org/T336332 (10Jclark-ctr) 05Open→03Resolved [18:45:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:45:44] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:45:56] (03PS1) 10Andrew Bogott: codfw1dev: move nova-fullstack test to cloudcontrol2005-dev [puppet] - 10https://gerrit.wikimedia.org/r/921085 (https://phabricator.wikimedia.org/T336963) [18:48:33] (03CR) 10Andrew Bogott: [C: 03+2] codfw1dev: move nova-fullstack test to cloudcontrol2005-dev [puppet] - 10https://gerrit.wikimedia.org/r/921085 (https://phabricator.wikimedia.org/T336963) (owner: 10Andrew Bogott) [18:50:01] (NodeTextfileStale) firing: (2) Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:50:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:50:28] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.41.0-wmf.8 refs T330215 [18:50:33] T330215: 1.41.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T330215 [18:50:57] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (10cmooney) lvs1020 is currently the "secondary" lvs in eqiad, so I'd propose we start with trying to do that one if we can. It's c... 
[18:53:30] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:55:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10cmooney) @Jclark-ctr hey. It's taken a bit of time to line this up, hit a few bumps in the road with the Juniper config. As detailed in T3... [18:55:33] !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (2 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic elasticsearch and plugin upgrade - bking@cumin1001 - T332355 [18:55:37] T332355: Deploy Turkish Analyzer Plugin - https://phabricator.wikimedia.org/T332355 [18:56:18] 10SRE, 10Traffic: Add systemd-level service bindings for Wikimedia DNS - https://phabricator.wikimedia.org/T336792 (10ssingh) [18:56:38] !log xcollazo@deploy1002 Started deploy [airflow-dags/platform_eng@502ddae]: T333001 [18:56:42] T333001: Setup for allowing Airflow deployment via Git Repository - https://phabricator.wikimedia.org/T333001 [18:57:13] !log xcollazo@deploy1002 Finished deploy [airflow-dags/platform_eng@502ddae]: T333001 (duration: 00m 35s) [19:00:32] (03CR) 10Ryan Kemper: [C: 03+1] Update prometheus-blazegraph-exporter for python 3.9 [puppet] - 10https://gerrit.wikimedia.org/r/921081 (owner: 10Ebernhardson) [19:00:34] (03CR) 10Ryan Kemper: [C: 03+2] Update prometheus-blazegraph-exporter for python 3.9 [puppet] - 10https://gerrit.wikimedia.org/r/921081 (owner: 10Ebernhardson) [19:01:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10Jclark-ctr) @cmooney i am available tomorrow if you would like to address it that quickly. 
otherwise monday [19:13:43] (03PS1) 10Andrew Bogott: move nova-fullstack test to cloudcontrol2005-dev and cloudcontrol1007 [puppet] - 10https://gerrit.wikimedia.org/r/921087 (https://phabricator.wikimedia.org/T336963) [19:15:35] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:17:01] (03CR) 10Andrew Bogott: [C: 03+2] move nova-fullstack test to cloudcontrol2005-dev and cloudcontrol1007 [puppet] - 10https://gerrit.wikimedia.org/r/921087 (https://phabricator.wikimedia.org/T336963) (owner: 10Andrew Bogott) [19:19:57] PROBLEM - Check systemd state on elastic2065 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:21:14] (SystemdUnitFailed) firing: (3) elasticsearch-disable-readahead.service Failed on elastic2065:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:22:11] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:31:01] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:32:03] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:32:07] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down 
[19:38:41] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:45:33] RECOVERY - Check systemd state on elastic2065 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:46:14] (SystemdUnitFailed) firing: (3) elasticsearch-disable-readahead.service Failed on elastic2065:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:46:31] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:50:36] (03CR) 10Herron: [C: 03+2] mwlog: keep/reuse /srv filesystem across reimages [puppet] - 10https://gerrit.wikimedia.org/r/920698 (https://phabricator.wikimedia.org/T333614) (owner: 10Herron) [19:52:43] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:54:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10cmooney) @Jclark-ctr thanks yeah I just had a word with @ssingh and I think tomorrow if probably possible. What time suits you to be on site? [19:59:18] (03CR) 10Jdrewniak: Enable zebra ab test in hewiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/920386 (https://phabricator.wikimedia.org/T335972) (owner: 10Kimberly Sarabia) [20:00:06] brennen and TheresNoTime: May I have your attention please! UTC late backport and config training. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230518T2000) [20:00:06] kimberly_sarabia: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:31] hello [20:00:34] hey! [20:00:49] part the way through recovering a filesystem, if someone else is around to deploy..? [20:00:55] ok if i do some tests for an UBN first? [20:00:59] Vector zebra sounds very fancy [20:01:03] ^^ [20:01:08] i can deploy in ~5 [20:01:11] (after the tests) [20:01:12] no problem [20:01:23] it is very fancy [20:03:35] okay, deploying [20:03:44] (03PS3) 10Urbanecm: Reverts hewiki A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921059 (https://phabricator.wikimedia.org/T335309) (owner: 10Kimberly Sarabia) [20:03:47] (03CR) 10Urbanecm: [C: 03+2] Reverts hewiki A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921059 (https://phabricator.wikimedia.org/T335309) (owner: 10Kimberly Sarabia) [20:04:43] ty [20:04:50] (03Merged) 10jenkins-bot: Reverts hewiki A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921059 (https://phabricator.wikimedia.org/T335309) (owner: 10Kimberly Sarabia) [20:06:11] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:921059|Reverts hewiki A/B test (T335309)]] [20:06:15] T335309: Add skin key for mediawiki_web_ab_test_enrollment schema firing events - https://phabricator.wikimedia.org/T335309 [20:07:42] !log urbanecm@deploy1002 ksarabia and urbanecm: Backport for [[gerrit:921059|Reverts hewiki A/B test (T335309)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [20:07:55] kimberly_sarabia: can you test at mwdebug1002 please? [20:08:03] sure thing. one moment. 
[20:09:05] (03PS4) 10BCornwall: wikidough: bind hc to pdns-recursor and dnsdist [puppet] - 10https://gerrit.wikimedia.org/r/920794 (https://phabricator.wikimedia.org/T336792) [20:10:45] LGTM! [20:11:06] syncing! [20:11:13] ty [20:11:32] (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41235/console" [puppet] - 10https://gerrit.wikimedia.org/r/920794 (https://phabricator.wikimedia.org/T336792) (owner: 10BCornwall) [20:11:56] (03CR) 10BCornwall: wikidough: bind hc to pdns-recursor and dnsdist [puppet] - 10https://gerrit.wikimedia.org/r/920794 (https://phabricator.wikimedia.org/T336792) (owner: 10BCornwall) [20:12:10] (03CR) 10BCornwall: [V: 03+1] wikidough: bind hc to pdns-recursor and dnsdist [puppet] - 10https://gerrit.wikimedia.org/r/920794 (https://phabricator.wikimedia.org/T336792) (owner: 10BCornwall) [20:15:32] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:16:36] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:921059|Reverts hewiki A/B test (T335309)]] (duration: 10m 25s) [20:16:41] T335309: Add skin key for mediawiki_web_ab_test_enrollment schema firing events - https://phabricator.wikimedia.org/T335309 [20:16:44] kimberly_sarabia: should be done! [20:17:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:18:01] urbanecm: i appreciate it! 
[20:18:07] any time [20:18:33] (03PS1) 10BCornwall: dnsbox: bind hc to pdns-recursor and gdnsd [puppet] - 10https://gerrit.wikimedia.org/r/921095 (https://phabricator.wikimedia.org/T336973) [20:22:16] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:22:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:29:31] (03PS1) 10Urbanecm: Silently ignore istype-depicts image suggestion type [extensions/GrowthExperiments] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920743 (https://phabricator.wikimedia.org/T336962) [20:29:41] (03CR) 10Urbanecm: [C: 03+2] Silently ignore istype-depicts image suggestion type [extensions/GrowthExperiments] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920743 (https://phabricator.wikimedia.org/T336962) (owner: 10Urbanecm) [20:29:43] (03PS2) 10Gergő Tisza: Silently ignore istype-depicts image suggestion type [extensions/GrowthExperiments] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920743 (https://phabricator.wikimedia.org/T336962) (owner: 10Urbanecm) [20:31:10] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:31:57] (03CR) 10Gergő Tisza: [C: 03+2] Silently ignore istype-depicts image suggestion type [extensions/GrowthExperiments] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920743 (https://phabricator.wikimedia.org/T336962) (owner: 10Urbanecm) [20:33:18] !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster 
search_codfw: codfw plugin upgrade - bking@cumin1001 - T332355 [20:33:23] T332355: Deploy Turkish Analyzer Plugin - https://phabricator.wikimedia.org/T332355 [20:34:23] (03PS1) 10Brennen Bearnes: cache: Do not throw on empty set in LinkBatch::constructSet [core] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920744 (https://phabricator.wikimedia.org/T336964) [20:36:01] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad plugin upgrade - bking@cumin1001 - T332355 [20:38:56] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:40:59] (03PS7) 10Ebernhardson: search: Add alert based on age of titlesuggest indices [alerts] - 10https://gerrit.wikimedia.org/r/911945 (https://phabricator.wikimedia.org/T327199) [20:41:01] (03CR) 10Ebernhardson: search: Add alert based on age of titlesuggest indices (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/911945 (https://phabricator.wikimedia.org/T327199) (owner: 10Ebernhardson) [20:41:53] brennen: do you want to deploy the .9 backport you uploaded too? 
[20:43:18] urbanecm: i can go ahead and do that one once you're clear [20:43:31] and then i think we can probably safely roll the train forward [20:43:42] sure, was thinking about +2'ing it, since core's ci takes forever [20:43:48] yeah, good idea [20:44:04] (03CR) 10Urbanecm: [C: 03+2] cache: Do not throw on empty set in LinkBatch::constructSet [core] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920744 (https://phabricator.wikimedia.org/T336964) (owner: 10Brennen Bearnes) [20:44:10] i'll ping you once i'm done :) [20:45:06] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:45:06] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:46:04] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:47:26] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49993 bytes in 0.097 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:48:02] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.287 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:49:22] sounds good [20:49:27] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Hghani - https://phabricator.wikimedia.org/T322145 (10CDanis) a:05jbond→03CDanis [20:50:52] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920743 (https://phabricator.wikimedia.org/T336962) (owner: 10Urbanecm) [20:52:48] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: 
rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:52:51] (03Merged) 10jenkins-bot: Silently ignore istype-depicts image suggestion type [extensions/GrowthExperiments] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920743 (https://phabricator.wikimedia.org/T336962) (owner: 10Urbanecm) [20:53:20] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:920743|Silently ignore istype-depicts image suggestion type (T336962)]] [20:53:25] T336962: UnexpectedValueException: Unknown image suggestions API kind: istype-depicts - https://phabricator.wikimedia.org/T336962 [20:54:49] !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:920743|Silently ignore istype-depicts image suggestion type (T336962)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [20:57:16] okay, no error logged at mwdebug1002 with cswiki promoted to .9 via wikiversions.php, proceeding [20:59:50] (03Merged) 10jenkins-bot: cache: Do not throw on empty set in LinkBatch::constructSet [core] (wmf/1.41.0-wmf.9) - 10https://gerrit.wikimedia.org/r/920744 (https://phabricator.wikimedia.org/T336964) (owner: 10Brennen Bearnes) [21:00:28] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:01:30] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:920743|Silently ignore istype-depicts image suggestion type (T336962)]] (duration: 08m 09s) [21:01:34] T336962: UnexpectedValueException: Unknown image suggestions API kind: istype-depicts - https://phabricator.wikimedia.org/T336962 [21:01:42] brennen: okay, i'm done, over to you [21:02:26] urbanecm: thanks, will deploy the linkbatch one. [21:02:40] brennen: no problem. could you ping me after train's promoted again please? i'd like to double check few things (this feature is hard to test on non-Wikipedia). 
[21:03:41] !log brennen@deploy1002 Started scap: Backport for [[gerrit:920744|cache: Do not throw on empty set in LinkBatch::constructSet (T336964)]] [21:03:46] T336964: InvalidArgumentException: Data for lt_namespace and lt_title must be non-empty - https://phabricator.wikimedia.org/T336964 [21:03:49] urbanecm: yeah - although i'm now wondering about https://phabricator.wikimedia.org/T330215#8862937 [21:04:17] though i assume that wouldn't get any _worse_ with train rolled forward, if it is fallout of .9 [21:05:13] !log brennen@deploy1002 brennen: Backport for [[gerrit:920744|cache: Do not throw on empty set in LinkBatch::constructSet (T336964)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [21:08:12] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:13:19] !log brennen@deploy1002 Finished scap: Backport for [[gerrit:920744|cache: Do not throw on empty set in LinkBatch::constructSet (T336964)]] (duration: 09m 38s) [21:13:24] T336964: InvalidArgumentException: Data for lt_namespace and lt_title must be non-empty - https://phabricator.wikimedia.org/T336964 [21:14:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:15:54] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:19:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:23:06] (03PS1) 10TrainBranchBot: group2 wikis to 1.41.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921104 (https://phabricator.wikimedia.org/T330215) [21:23:12] (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.41.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921104 (https://phabricator.wikimedia.org/T330215) (owner: 10TrainBranchBot) [21:23:32] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:24:13] (03Merged) 10jenkins-bot: group2 wikis to 1.41.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921104 (https://phabricator.wikimedia.org/T330215) (owner: 10TrainBranchBot) [21:28:58] (03PS1) 10Dzahn: planet: add wikimediastatus.net to English feeds [puppet] - 10https://gerrit.wikimedia.org/r/921105 [21:29:49] (03PS2) 10Dzahn: planet: add wikimediastatus.net to English feeds [puppet] - 10https://gerrit.wikimedia.org/r/921105 (https://phabricator.wikimedia.org/T336701) [21:31:15] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.41.0-wmf.9 refs T330215 [21:31:18] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:31:21] T330215: 1.41.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T330215 [21:37:30] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:41:34] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/output/917918/41236/contint1002.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/917918 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [21:45:12] 
RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:47:03] !log maintenance for zuul (CI) on contint servers [21:47:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:16] urbanecm: train is at all wikis, at least for the moment. [21:50:37] thanks [21:52:54] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:00:18] brennen: congratulations :) [22:00:32] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:00:45] (03CR) 10EoghanGaffney: [C: 03+1] httpbb: add simple test for VRTS [puppet] - 10https://gerrit.wikimedia.org/r/921068 (https://phabricator.wikimedia.org/T326891) (owner: 10Dzahn) [22:00:52] (03CR) 10EoghanGaffney: [C: 03+1] httpbb: add simple tests for gitlab.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/921065 (https://phabricator.wikimedia.org/T326891) (owner: 10Dzahn) [22:02:00] (03CR) 10Dzahn: [V: 03+1] "> sudo chown -R 923:923 /srv/zuul/git /var/lib/zuul" [puppet] - 10https://gerrit.wikimedia.org/r/917918 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [22:04:01] jouncebot: nowandnext [22:04:01] No deployments scheduled for the next 7 hour(s) and 55 minute(s) [22:04:01] In 7 hour(s) and 55 minute(s): MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230519T0600) [22:08:18] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:08:24] (03CR) 10Dzahn: [V: 03+1] "puppet disabled on contint1002 and contint2001, testing on inactive host 
contint2002 first" [puppet] - 10https://gerrit.wikimedia.org/r/917918 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [22:08:39] (03CR) 10Dzahn: [V: 03+1 C: 03+2] zuul: switch to fixed uid/gid 923 [puppet] - 10https://gerrit.wikimedia.org/r/917918 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [22:14:43] (03PS5) 10EoghanGaffney: doc: add password-protected rsync module for publishing from gitlab [puppet] - 10https://gerrit.wikimedia.org/r/920669 (https://phabricator.wikimedia.org/T336168) (owner: 10Jaime Nuche) [22:16:02] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:16:05] (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41238/console" [puppet] - 10https://gerrit.wikimedia.org/r/920669 (https://phabricator.wikimedia.org/T336168) (owner: 10Jaime Nuche) [22:18:15] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "in addition to sudo chown -R 923:923 /srv/zuul/git /var/lib/zuul this also needs:" [puppet] - 10https://gerrit.wikimedia.org/r/917918 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [22:20:16] !log short down-time for zuul-merger on contint2001 [22:20:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:34] !log contint2001 - moving files owned by zuul to new UID/GID - in progress [22:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:26] PROBLEM - zuul_merger_service_running on contint2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args bin/zuul-merger https://www.mediawiki.org/wiki/Continuous_integration/Zuul [22:23:42] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:24:09] ACKNOWLEDGEMENT - 
zuul_merger_service_running on contint2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args bin/zuul-merger daniel_zahn maintenance work https://www.mediawiki.org/wiki/Continuous_integration/Zuul [22:31:03] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "there isn't directly a problem it's just that the recursive chmod on a server that actually has data in /srv/zuul/git will take a much lon" [puppet] - 10https://gerrit.wikimedia.org/r/917918 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [22:31:26] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:31:30] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "chown of course, not chmod" [puppet] - 10https://gerrit.wikimedia.org/r/917918 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [22:32:35] (03PS6) 10EoghanGaffney: doc: add password-protected rsync module for publishing from gitlab [puppet] - 10https://gerrit.wikimedia.org/r/920669 (https://phabricator.wikimedia.org/T336168) (owner: 10Jaime Nuche) [22:33:40] (03CR) 10CI reject: [V: 04-1] doc: add password-protected rsync module for publishing from gitlab [puppet] - 10https://gerrit.wikimedia.org/r/920669 (https://phabricator.wikimedia.org/T336168) (owner: 10Jaime Nuche) [22:34:35] (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41239/console" [puppet] - 10https://gerrit.wikimedia.org/r/920669 (https://phabricator.wikimedia.org/T336168) (owner: 10Jaime Nuche) [22:37:36] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:45:18] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:50:01] (NodeTextfileStale) 
firing: (2) Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:52:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [22:52:21] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "on contint2001 there was a small extra problem, that was even though zuul-merger service was stopped there was still a process running own" [puppet] - 10https://gerrit.wikimedia.org/r/917918 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [22:52:48] RECOVERY - zuul_merger_service_running on contint2001 is OK: PROCS OK: 1 process with regex args bin/zuul-merger https://www.mediawiki.org/wiki/Continuous_integration/Zuul [22:53:00] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:54:34] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "the service name is "zuul" not "zuul-server", the user running it is called "zuul-server" though, so the systemctl commands had not stoppe" [puppet] - 10https://gerrit.wikimedia.org/r/917918 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [22:57:16] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [22:59:12] PROBLEM - zuul_gearman_service on contint2001 is CRITICAL: connect to address 127.0.0.1 and port 4730: Connection refused 
https://www.mediawiki.org/wiki/Continuous_integration/Zuul [22:59:22] PROBLEM - zuul_service_running on contint2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args bin/zuul-server https://www.mediawiki.org/wiki/Continuous_integration/Zuul [22:59:31] !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad plugin upgrade - bking@cumin1001 - T332355 [22:59:36] T332355: Deploy Turkish Analyzer Plugin - https://phabricator.wikimedia.org/T332355 [23:00:14] ACKNOWLEDGEMENT - zuul_gearman_service on contint2001 is CRITICAL: connect to address 127.0.0.1 and port 4730: Connection refused daniel_zahn maintenance https://www.mediawiki.org/wiki/Continuous_integration/Zuul [23:00:14] ACKNOWLEDGEMENT - zuul_service_running on contint2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args bin/zuul-server daniel_zahn maintenance https://www.mediawiki.org/wiki/Continuous_integration/Zuul [23:00:44] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:02:20] RECOVERY - zuul_gearman_service on contint2001 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 4730 https://www.mediawiki.org/wiki/Continuous_integration/Zuul [23:02:28] RECOVERY - zuul_service_running on contint2001 is OK: PROCS OK: 2 processes with regex args bin/zuul-server https://www.mediawiki.org/wiki/Continuous_integration/Zuul [23:02:57] (03PS1) 10TrainBranchBot: group0 wikis to 1.41.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921109 (https://phabricator.wikimedia.org/T330215) [23:02:59] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.41.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921109 (https://phabricator.wikimedia.org/T330215) (owner: 10TrainBranchBot) [23:08:19] (03Merged) 10jenkins-bot: group0 wikis to 1.41.0-wmf.9 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/921109 (https://phabricator.wikimedia.org/T330215) (owner: 10TrainBranchBot) [23:08:30] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:09:39] (currently doing a train rollback.) [23:16:14] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:18:37] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/921065 (https://phabricator.wikimedia.org/T326891) (owner: 10Dzahn) [23:24:00] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:26:44] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.41.0-wmf.9 refs T330215 [23:26:49] T330215: 1.41.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T330215 [23:28:12] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "deployed on all 3 hosts, so yea, took a bit longer, zuul-server is just zuul, so I first didnt stop it right and killed it, added the comm" [puppet] - 10https://gerrit.wikimedia.org/r/917918 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [23:29:23] (03CR) 10Dzahn: [V: 03+1 C: 03+2] zuul: switch to fixed uid/gid 923 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/917918 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [23:30:14] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:33:31] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T336949 (10wiki_willy) a:03Jhancock.wm [23:37:58] PROBLEM - Check systemd state on stat1009 is CRITICAL: 
CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:42:21] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops-collab, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10Dzahn) after deploying the change above carefully on all 3 contint* servers, stopping services, running manual chown commands ,... [23:45:42] RECOVERY - Check systemd state on stat1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:46:14] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service Failed on wdqs2022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:46:43] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops-collab, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10Dzahn) - disable puppet - stop services - chown -R 923:923 /srv/zuul/git /var/lib/zuul - chown -R 923:923 /var/log/zuul_repack/... [23:53:20] PROBLEM - Check systemd state on stat1009 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state