[00:00:59] (PS3) Andrew Bogott: Remove scripts for cross-region migration [puppet] - https://gerrit.wikimedia.org/r/956926
[00:01:01] (PS3) Andrew Bogott: mwopenstackclients: add methods to correlate project id with name [puppet] - https://gerrit.wikimedia.org/r/956927 (https://phabricator.wikimedia.org/T343158)
[00:01:03] (PS3) Andrew Bogott: wmcs-cold-migrate: remove instance_fqdn output hint [puppet] - https://gerrit.wikimedia.org/r/956928 (https://phabricator.wikimedia.org/T343158)
[00:01:05] (PS3) Andrew Bogott: wmcs-instance-fqdns: support cases where project_name != project_id [puppet] - https://gerrit.wikimedia.org/r/956929 (https://phabricator.wikimedia.org/T343158)
[00:01:07] (PS4) Andrew Bogott: wmcs-novastats-dnsleaks.py: Support project_id != project_name [puppet] - https://gerrit.wikimedia.org/r/956930 (https://phabricator.wikimedia.org/T343158)
[00:01:09] (PS3) Andrew Bogott: dynamicproxy: clarify that 'project name' was actually project_id all along [puppet] - https://gerrit.wikimedia.org/r/956925 (https://phabricator.wikimedia.org/T343158)
[00:01:11] (PS1) Andrew Bogott: wmf_sink: correct calls to get_keystone_session [puppet] - https://gerrit.wikimedia.org/r/956981
[00:02:31] (CR) CI reject: [V: -1] wmf_sink: correct calls to get_keystone_session [puppet] - https://gerrit.wikimedia.org/r/956981 (owner: Andrew Bogott)
[00:03:24] PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump_cloud_ip_ranges.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:03:27] (PS2) Andrew Bogott: wmf_sink: correct calls to get_keystone_session [puppet] - https://gerrit.wikimedia.org/r/956981
[00:03:31] (PS4) Andrew Bogott: Remove scripts for cross-region migration [puppet] - https://gerrit.wikimedia.org/r/956926
[00:03:35] (PS4) Andrew Bogott: mwopenstackclients: add methods to correlate project id with name [puppet] - https://gerrit.wikimedia.org/r/956927 (https://phabricator.wikimedia.org/T343158)
[00:03:39] (PS4) Andrew Bogott: wmcs-cold-migrate: remove instance_fqdn output hint [puppet] - https://gerrit.wikimedia.org/r/956928 (https://phabricator.wikimedia.org/T343158)
[00:03:43] (PS4) Andrew Bogott: wmcs-instance-fqdns: support cases where project_name != project_id [puppet] - https://gerrit.wikimedia.org/r/956929 (https://phabricator.wikimedia.org/T343158)
[00:03:47] (PS5) Andrew Bogott: wmcs-novastats-dnsleaks.py: Support project_id != project_name [puppet] - https://gerrit.wikimedia.org/r/956930 (https://phabricator.wikimedia.org/T343158)
[00:03:51] (PS4) Andrew Bogott: dynamicproxy: clarify that 'project name' was actually project_id all along [puppet] - https://gerrit.wikimedia.org/r/956925 (https://phabricator.wikimedia.org/T343158)
[00:04:36] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:05] (CR) Andrew Bogott: [C: +2] wmf_sink: correct calls to get_keystone_session [puppet] - https://gerrit.wikimedia.org/r/956981 (owner: Andrew Bogott)
[00:08:44] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:08:46] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:09:22] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:11:10] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:12:22] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:12:47] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:20:10] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50569 bytes in 4.413 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:20:12] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 4.686 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:20:42] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:34:40] SRE, Traffic: Varnish mobile redirection misses some sites - https://phabricator.wikimedia.org/T344175 (nshahquinn-wmf) a:nshahquinn-wmf→Fabfur Thanks for the reminder! The list is at P52488. A few wikis listed seem not to have a mobile site, either lacking the footer link entirely or giving a D...
[00:38:34] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/956853
[00:38:40] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/956853 (owner: TrainBranchBot)
[00:47:19] (PS1) Tim Starling: sshd: Disable keyboard-interactive authentication [puppet] - https://gerrit.wikimedia.org/r/956983
[00:48:38] (Wikidata Reliability Metrics - Median loading time alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[00:54:07] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/956853 (owner: TrainBranchBot)
[01:08:38] (Wikidata Reliability Metrics - Median loading time alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[02:03:46] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: generate_os_reports.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:07:33] (JobUnavailable) firing: (4) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:19:56] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:22:33] (JobUnavailable) firing: (4) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:37:33] (JobUnavailable) firing: (4) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:51:46] (CR) Krinkle: [C: +1] Remove PHP 7.2 fallback for array_key_first() [mediawiki-config] - https://gerrit.wikimedia.org/r/956364 (owner: Tim Starling)
[02:51:59] (PS2) Krinkle: Enable source maps on group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/954878 (https://phabricator.wikimedia.org/T47514) (owner: Tim Starling)
[02:52:15] (CR) Krinkle: [C: +1] "Approved for deployment!" [mediawiki-config] - https://gerrit.wikimedia.org/r/954878 (https://phabricator.wikimedia.org/T47514) (owner: Tim Starling)
[04:10:20] (CR) Tim Starling: [C: +2] Do not enable entire OOUI in PHP on page load [extensions/Phonos] (wmf/1.41.0-wmf.25) - https://gerrit.wikimedia.org/r/956814 (https://phabricator.wikimedia.org/T345414) (owner: Jdlrobson)
[04:10:22] (CR) Tim Starling: [C: +2] Do not enable entire OOUI in PHP on page load [extensions/Phonos] (wmf/1.41.0-wmf.26) - https://gerrit.wikimedia.org/r/956815 (https://phabricator.wikimedia.org/T345414) (owner: Jdlrobson)
[04:12:26] (Merged) jenkins-bot: Do not enable entire OOUI in PHP on page load [extensions/Phonos] (wmf/1.41.0-wmf.25) - https://gerrit.wikimedia.org/r/956814 (https://phabricator.wikimedia.org/T345414) (owner: Jdlrobson)
[04:12:40] (Merged) jenkins-bot: Do not enable entire OOUI in PHP on page load [extensions/Phonos] (wmf/1.41.0-wmf.26) - https://gerrit.wikimedia.org/r/956815 (https://phabricator.wikimedia.org/T345414) (owner: Jdlrobson)
[04:16:11] !log hmonroy@deploy1002 Started scap: Backport for [[gerrit:956815|Do not enable entire OOUI in PHP on page load (T345414)]]
[04:16:19] T345414: Enabling Phonos on all projects increased JavaScript and CSS size, Phonos should not use OOUI on page load - https://phabricator.wikimedia.org/T345414
[04:17:46] !log hmonroy@deploy1002 hmonroy and jdlrobson: Backport for [[gerrit:956815|Do not enable entire OOUI in PHP on page load (T345414)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[04:19:04] !log hmonroy@deploy1002 hmonroy and jdlrobson: Continuing with sync
[04:26:07] !log hmonroy@deploy1002 Finished scap: Backport for [[gerrit:956815|Do not enable entire OOUI in PHP on page load (T345414)]] (duration: 09m 56s)
[04:26:12] T345414: Enabling Phonos on all projects increased JavaScript and CSS size, Phonos should not use OOUI on page load - https://phabricator.wikimedia.org/T345414
[04:27:50] !log hmonroy@deploy1002 Started scap: Backport for [[gerrit:956814|Do not enable entire OOUI in PHP on page load (T345414)]]
[04:29:26] !log hmonroy@deploy1002 hmonroy and jdlrobson: Backport for [[gerrit:956814|Do not enable entire OOUI in PHP on page load (T345414)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[04:29:33] !log hmonroy@deploy1002 hmonroy and jdlrobson: Continuing with sync
[04:35:48] !log hmonroy@deploy1002 Finished scap: Backport for [[gerrit:956814|Do not enable entire OOUI in PHP on page load (T345414)]] (duration: 07m 58s)
[04:35:52] T345414: Enabling Phonos on all projects increased JavaScript and CSS size, Phonos should not use OOUI on page load - https://phabricator.wikimedia.org/T345414
[04:42:10] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:43:28] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.278 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:53:38] (Wikidata Reliability Metrics - Median loading time alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[05:02:33] (PS1) Marostegui: control-mariadb-10.6-bullseye: Update version [software] - https://gerrit.wikimedia.org/r/956997 (https://phabricator.wikimedia.org/T344309)
[05:03:17] (CR) Marostegui: [C: +2] control-mariadb-10.6-bullseye: Update version [software] - https://gerrit.wikimedia.org/r/956997 (https://phabricator.wikimedia.org/T344309) (owner: Marostegui)
[05:03:50] (Merged) jenkins-bot: control-mariadb-10.6-bullseye: Update version [software] - https://gerrit.wikimedia.org/r/956997 (https://phabricator.wikimedia.org/T344309) (owner: Marostegui)
[05:06:14] (CR) Tim Starling: [C: +2] Remove PHP 7.2 fallback for array_key_first() [mediawiki-config] - https://gerrit.wikimedia.org/r/956364 (owner: Tim Starling)
[05:06:56] (Merged) jenkins-bot: Remove PHP 7.2 fallback for array_key_first() [mediawiki-config] - https://gerrit.wikimedia.org/r/956364 (owner: Tim Starling)
[05:09:55] (PS1) Marostegui: install_server: Do not reimage db2194 [puppet] - https://gerrit.wikimedia.org/r/956998
[05:10:28] (PS2) Marostegui: install_server: Do not reimage db2193 [puppet] - https://gerrit.wikimedia.org/r/956998
[05:11:58] (CR) Marostegui: [C: +2] install_server: Do not reimage db2193 [puppet] - https://gerrit.wikimedia.org/r/956998 (owner: Marostegui)
[05:13:38] (Wikidata Reliability Metrics - Median loading time alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[05:15:14] !log tstarling@deploy1002 Synchronized wmf-config/etcd.php: Remove PHP 7.2 fallback for array_key_first g 956364 (duration: 07m 03s)
[05:30:13] (PS3) Tim Starling: Enable source maps on group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/954878 (https://phabricator.wikimedia.org/T47514)
[05:31:08] (CR) Tim Starling: [C: +2] Enable source maps on group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/954878 (https://phabricator.wikimedia.org/T47514) (owner: Tim Starling)
[05:31:47] (Merged) jenkins-bot: Enable source maps on group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/954878 (https://phabricator.wikimedia.org/T47514) (owner: Tim Starling)
[05:40:24] !log tstarling@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Enable source maps on group0 wikis T47514 (duration: 07m 14s)
[05:40:27] T47514: ResourceLoader: Implement support for Source Maps - https://phabricator.wikimedia.org/T47514
[05:55:25] !log tstarling@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Enable source maps on group0 wikis attempt 2 (duration: 07m 37s)
[06:00:05] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230913T0600)
[06:03:00] (HelmReleaseBadStatus) firing: (2) Helm release mw-api-ext/main on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[06:06:39] !log tstarling@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Running again following connection refused errors from kubemaster (duration: 07m 24s)
[06:08:00] (HelmReleaseBadStatus) resolved: (2) Helm release mw-api-ext/main on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[06:36:18] PROBLEM - Check systemd state on kubemaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: kube-controller-manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:37:48] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:38:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:39:34] (KubernetesAPILatency) firing: (5) High Kubernetes API latency (LIST configmaps) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:43:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:44:34] (KubernetesAPILatency) resolved: (5) High Kubernetes API latency (LIST configmaps) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:52:20] RECOVERY - Check systemd state on kubemaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:00:05] Amir1, Urbanecm, and taavi: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230913T0700).
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:09:12] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:19:40] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:21:13] (PS5) Slyngshede: P:idm allow for installation via Debian packages. [puppet] - https://gerrit.wikimedia.org/r/956836
[07:22:58] (CR) Brouberol: [C: +2] Add: roll restart/reboot command for opensearch cluster nodes (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/955717 (https://phabricator.wikimedia.org/T344798) (owner: Brouberol)
[07:36:45] (CR) Filippo Giunchedi: [C: +2] Don't require dummy 'team' label for multi-owner alerts [alerts] - https://gerrit.wikimedia.org/r/956794 (owner: Filippo Giunchedi)
[07:37:42] PROBLEM - BGP status on cr2-eqdfw is CRITICAL: BGP CRITICAL - AS2914/IPv4: Active - NTT, AS2914/IPv6: Active - NTT https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:38:59] !log filippo@cumin1001 START - Cookbook sre.hosts.reimage for host titan1001.eqiad.wmnet with OS bookworm
[07:39:04] SRE-swift-storage, Observability-Metrics, SRE Observability (FY2023/2024-Q1), User-fgiunchedi: Create 'titan' role and put new hosts in service - https://phabricator.wikimedia.org/T341999 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by filippo@cumin1001 for host titan1001...
[07:42:27] !log ayounsi@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS bullseye
[07:42:38] (CR) Volans: [C: +1] Add: roll restart/reboot command for opensearch cluster nodes (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/955717 (https://phabricator.wikimedia.org/T344798) (owner: Brouberol)
[07:43:07] !log filippo@cumin1001 conftool action : set/pooled=yes; selector: name=thanos-fe1004.eqiad.wmnet,service=thanos-web
[07:43:25] !log filippo@cumin1001 conftool action : set/pooled=no; selector: name=titan1001.eqiad.wmnet,service=thanos-web
[07:43:28] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[07:43:38] !log filippo@cumin1001 conftool action : set/pooled=no; selector: name=titan2001.codfdw.wmnet,service=thanos-web
[07:43:51] !log filippo@cumin1001 conftool action : set/pooled=yes; selector: name=thanos-fe2004.codfdw.wmnet,service=thanos-web
[07:43:52] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[07:44:04] !log filippo@cumin1001 conftool action : set/pooled=yes; selector: name=thanos-fe2004.codfw.wmnet,service=thanos-web
[07:45:36] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:45:43] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[07:45:56] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[07:46:03] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T343198)', diff saved to https://phabricator.wikimedia.org/P52491 and previous config saved to /var/cache/conftool/dbconfig/20230913-074602-arnaudb.json
[07:46:06] !log filippo@cumin1001 START - Cookbook sre.hosts.reimage for host titan2001.codfw.wmnet with OS bookworm
[07:46:06] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[07:46:12] SRE-swift-storage, Observability-Metrics, SRE Observability (FY2023/2024-Q1), User-fgiunchedi: Create 'titan' role and put new hosts in service - https://phabricator.wikimedia.org/T341999 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by filippo@cumin1001 for host titan2001...
[07:46:28] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:48:08] (PS1) D3r1ck01: rdbms: Use `debugSql` instead of `debugDumpSql` which is unuset [core] (wmf/1.41.0-wmf.26) - https://gerrit.wikimedia.org/r/956818 (https://phabricator.wikimedia.org/T318272)
[07:49:48] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers titan2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[07:50:54] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers titan2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[07:51:06] (PS1) Filippo Giunchedi: reimage: use install-console [cookbooks] - https://gerrit.wikimedia.org/r/957238
[07:51:10] yes that's me
[07:51:18] PROBLEM - Router interfaces on cr2-eqdfw is CRITICAL: CRITICAL: host 208.80.153.198, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:51:25] !log filippo@cumin1001 conftool action : set/pooled=no; selector: name=titan2001.codfw.wmnet,service=thanos-web
[07:51:58] !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host titan1001.eqiad.wmnet with OS bookworm
[07:52:02] SRE-swift-storage, Observability-Metrics, SRE Observability (FY2023/2024-Q1), User-fgiunchedi: Create 'titan' role and put new hosts in service - https://phabricator.wikimedia.org/T341999 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by filippo@cumin1001 for host titan1001.eqi...
[07:53:55] !log repool cp1075 && cp1076
[07:53:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:12] !log filippo@cumin1001 START - Cookbook sre.hosts.reimage for host titan1001.eqiad.wmnet with OS bookworm
[07:54:17] SRE-swift-storage, Observability-Metrics, SRE Observability (FY2023/2024-Q1), User-fgiunchedi: Create 'titan' role and put new hosts in service - https://phabricator.wikimedia.org/T341999 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by filippo@cumin1001 for host titan1001...
[07:56:36] hmmm logmsgbot is down?
[07:56:57] !log ayounsi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage
[07:57:07] maybe just really lagged? :)
[08:00:06] jnuche and hashar: My dear minions, it's time we take the moon! Just kidding. Time for MediaWiki train - Utc-0 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230913T0800).
[08:00:22] 👋 Andre and I will deploy in a bit
[08:01:39] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage
[08:01:47] vgutierrez: the state of logmsgbot might be checkable via alert1001 using `journalctl -u tcpircbot-logmsgbot.service` (based on 2021 comment https://phabricator.wikimedia.org/T284123#7128231 )
[08:02:22] (CR) Volans: [C: +1] "LGTM, thx" [cookbooks] - https://gerrit.wikimedia.org/r/957238 (owner: Filippo Giunchedi)
[08:02:31] at least messages land on https://sal.toolforge.org/
[08:02:49] looks like the issue is stashbot not replying despite indeed logging to the SAL
[08:02:59] hashar: nope...
[08:03:11] I've repooled cp1075 and cp1076 via confctl and that never got logged
[08:03:19] (CR) Filippo Giunchedi: [C: +2] reimage: use install-console [cookbooks] - https://gerrit.wikimedia.org/r/957238 (owner: Filippo Giunchedi)
[08:03:37] at least https://sal.toolforge.org/ has: 07:53 repool cp1075 && cp1076
[08:03:47] hashar: yeah.. that's my manual !log entry
[08:03:52] RECOVERY - Router interfaces on cr2-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:03:54] WARNING:conftool.announce:conftool action : set/pooled=yes; selector: name=cp1075.*
[08:03:57] that's missing
[08:04:08] and WARNING:conftool.announce:conftool action : set/pooled=yes; selector: name=cp1076.* too
[08:04:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[08:04:39] stashbot was intentionally changed some time ago not to reply to logmsgbot !logs to avoid spamming this channel
[08:04:39] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help.
[08:04:58] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:05:15] ahhh less spam from stashbot, great :-]
[08:05:31] (PS2) Cathal Mooney: Do not try to configure DHCP relay on L3 switches without IRB ints [homer/public] - https://gerrit.wikimedia.org/r/956908 (https://phabricator.wikimedia.org/T322937)
[08:06:02] (CR) Jelto: [V: +1] "PCC SUCCESS (CORE_DIFF 1 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43237/console" [puppet] - https://gerrit.wikimedia.org/r/953631 (https://phabricator.wikimedia.org/T340027) (owner: AOkoth)
[08:07:22] (PS2) Brouberol: Improve sre.opensearch.roll-restart-reboot cookbook [cookbooks] - https://gerrit.wikimedia.org/r/956916 (https://phabricator.wikimedia.org/T344798)
[08:07:34] !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host titan1001.eqiad.wmnet with OS bookworm
[08:07:38] SRE-swift-storage, Observability-Metrics, SRE Observability (FY2023/2024-Q1), User-fgiunchedi: Create 'titan' role and put new hosts in service - https://phabricator.wikimedia.org/T341999 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by filippo@cumin1001 for host titan1001.eqi...
[08:08:19] !log filippo@cumin1001 START - Cookbook sre.hosts.reimage for host titan1001.eqiad.wmnet with OS bookworm
[08:08:24] SRE-swift-storage, Observability-Metrics, SRE Observability (FY2023/2024-Q1), User-fgiunchedi: Create 'titan' role and put new hosts in service - https://phabricator.wikimedia.org/T341999 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by filippo@cumin1001 for host titan1001...
[08:09:09] (PS1) TrainBranchBot: group1 wikis to 1.41.0-wmf.26 [mediawiki-config] - https://gerrit.wikimedia.org/r/957240 (https://phabricator.wikimedia.org/T343728)
[08:09:11] (CR) TrainBranchBot: [C: +2] group1 wikis to 1.41.0-wmf.26 [mediawiki-config] - https://gerrit.wikimedia.org/r/957240 (https://phabricator.wikimedia.org/T343728) (owner: TrainBranchBot)
[08:09:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[08:09:52] (PS3) Brouberol: Define the opensearch service name as a pattern [cookbooks] - https://gerrit.wikimedia.org/r/956916 (https://phabricator.wikimedia.org/T344798)
[08:09:55] (Merged) jenkins-bot: group1 wikis to 1.41.0-wmf.26 [mediawiki-config] - https://gerrit.wikimedia.org/r/957240 (https://phabricator.wikimedia.org/T343728) (owner: TrainBranchBot)
[08:10:20] (PS1) Clément Goubert: mw-on-k8s: Fix test-commons redirect [puppet] - https://gerrit.wikimedia.org/r/957241 (https://phabricator.wikimedia.org/T290536)
[08:12:05] (CR) Jelto: [V: +1 C: +1] "lgtm, thanks for adding the variables for the different domain names" [puppet] - https://gerrit.wikimedia.org/r/953631 (https://phabricator.wikimedia.org/T340027) (owner: AOkoth)
[08:14:16] !log filippo@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host titan2001.codfw.wmnet with OS bookworm
[08:14:20] SRE-swift-storage, Observability-Metrics, SRE Observability (FY2023/2024-Q1), User-fgiunchedi: Create 'titan' role and put new hosts in service - https://phabricator.wikimedia.org/T341999 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by filippo@cumin1001 for host titan2001.cod...
[08:14:52] !log filippo@cumin1001 START - Cookbook sre.hosts.reimage for host titan2001.codfw.wmnet with OS bookworm
[08:14:56] SRE-swift-storage, Observability-Metrics, SRE Observability (FY2023/2024-Q1), User-fgiunchedi: Create 'titan' role and put new hosts in service - https://phabricator.wikimedia.org/T341999 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by filippo@cumin1001 for host titan2001...
[08:15:11] (CR) Jelto: [C: +2] Add lucaswerkmeister.de to Planet [puppet] - https://gerrit.wikimedia.org/r/948203 (owner: Amire80)
[08:15:18] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:16:20] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:18:12] !log aklapper@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.41.0-wmf.26 refs T343728
[08:18:16] T343728: 1.41.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T343728
[08:22:43] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on titan1001.eqiad.wmnet with reason: host reimage
[08:27:43] (PS4) Brouberol: Define the opensearch service name as a pattern [cookbooks] - https://gerrit.wikimedia.org/r/956916 (https://phabricator.wikimedia.org/T344798)
[08:28:47] (PS3) Cathal Mooney: Do not try to configure DHCP relay on L3 switches without IRB ints [homer/public] - https://gerrit.wikimedia.org/r/956908 (https://phabricator.wikimedia.org/T322937)
[08:30:18] (CR) Brouberol: Define the opensearch service name as a pattern (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/956916 (https://phabricator.wikimedia.org/T344798) (owner: Brouberol)
[08:30:47] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on titan2001.codfw.wmnet with reason: host reimage
[08:33:44] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on titan2001.codfw.wmnet with reason: host reimage
[08:35:34] (CR) Volans: Define the opensearch service name as a pattern (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/956916 (https://phabricator.wikimedia.org/T344798) (owner: Brouberol)
[08:36:07] (PS3) AikoChou: helmfile.d: Add config bits to move readability isvc to prod [deployment-charts] - https://gerrit.wikimedia.org/r/951461 (https://phabricator.wikimedia.org/T334182) (owner: Klausman)
[08:36:31] (PS2) Jelto: Add Wikimedia Deutschland's tech news blog [puppet] - https://gerrit.wikimedia.org/r/955941 (owner: Amire80)
[08:37:24] (CR) Volans: Define the opensearch service name as a pattern (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/956916 (https://phabricator.wikimedia.org/T344798) (owner: Brouberol)
[08:37:56] (CR) Elukey: [C: +1] Add systemd dependencies to kube-apiserver (1 comment) [debs/kubernetes] (v1.23) - https://gerrit.wikimedia.org/r/956900 (https://phabricator.wikimedia.org/T329826) (owner: JMeybohm)
[08:38:28] (CR) Jelto: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/955941 (owner: Amire80)
[08:39:46] (CR) Jelto: [C: +2] Add Wikimedia Deutschland's tech news blog [puppet] - https://gerrit.wikimedia.org/r/955941 (owner: Amire80)
[08:40:07] @Lucas_WMDE: Hi, https://phabricator.wikimedia.org/T345856 is creating a lot of deprecation logspam in group1 on the train. If you're already around and log into this we could quickly backport; otherwise we might roll back
[08:40:17] *look
[08:40:23] SRE-tools, Infrastructure-Foundations, Spicerack: Cookbook should ask for confirmation at beginning of execution - https://phabricator.wikimedia.org/T345370 (Vgutierrez) >>! In T345370#9147408, @BCornwall wrote: > @Vgutierrez Is this something that should be addressed in the cookbook? > > Your idea...
[08:41:39] (CR) Giuseppe Lavagetto: [C: +1] mw-on-k8s: Fix test-commons redirect [puppet] - https://gerrit.wikimedia.org/r/957241 (https://phabricator.wikimedia.org/T290536) (owner: Clément Goubert)
[08:42:19] (CR) Elukey: k8s::apiserver: Use a separate systemd service for safe restarts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/956842 (https://phabricator.wikimedia.org/T329826) (owner: JMeybohm)
[08:42:31] (CR) Clément Goubert: [C: +2] mw-on-k8s: Fix test-commons redirect [puppet] - https://gerrit.wikimedia.org/r/957241 (https://phabricator.wikimedia.org/T290536) (owner: Clément Goubert)
[08:43:09] (PS7) Ayounsi: Update mappings for subregions of CA/US based on the Probenet data [dns] - https://gerrit.wikimedia.org/r/931992 (https://phabricator.wikimedia.org/T337318) (owner: Jameel Kaisar)
[08:46:04] (CR) Elukey: [C: +1] kubernetes::master: control-plane components should use the local api (1 comment) [puppet] - https://gerrit.wikimedia.org/r/956889 (https://phabricator.wikimedia.org/T329826) (owner: JMeybohm)
[08:46:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[08:46:30] !log Running puppet on cp-text P:trafficserver::backend - T290536
[08:46:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:34] T290536: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536
[08:47:31] (CR) AikoChou: "@klausmann I rebased the commit and made two
modifications: 1) updated the model uri to reflect its move out of the experimental stage. 2)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/951461 (https://phabricator.wikimedia.org/T334182) (owner: 10Klausman) [08:47:45] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host titan1001.eqiad.wmnet with OS bookworm [08:47:49] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Create 'titan' role and put new hosts in service - https://phabricator.wikimedia.org/T341999 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by filippo@cumin1001 for host titan1001.eqi... [08:48:51] (03PS1) 10JMeybohm: Remove conf2* from etcd client srv records [dns] - 10https://gerrit.wikimedia.org/r/957246 (https://phabricator.wikimedia.org/T332010) [08:48:55] (03CR) 10Elukey: "Looks good! What is the follow up needed for all the clusters that get the new option enabled?" [puppet] - 10https://gerrit.wikimedia.org/r/956449 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [08:49:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance [08:49:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance [08:51:08] (03CR) 10Klausman: helmfile.d: Add config bits to move readability isvc to prod (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/951461 (https://phabricator.wikimedia.org/T334182) (owner: 10Klausman) [08:51:52] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43238/console" [puppet] - 10https://gerrit.wikimedia.org/r/956515 (owner: 10Dduvall) [08:51:57] MichaelG_WMDE: hi, you helped review the patches for https://phabricator.wikimedia.org/T345856 and we need a revert for it, is it something maybe you 
could help with? [08:53:06] jnuche: yes, I think reverting the one that hard-deprecates those methods should be fine [08:54:17] jnuche do you want to create the revert and I see if I can give it its +2? [08:55:09] MichaelG_WMDE: thanks! doing that now [08:55:53] (03PS1) 10JMeybohm: Switch pybals from conf2 to conf1 [puppet] - 10https://gerrit.wikimedia.org/r/957248 (https://phabricator.wikimedia.org/T332010) [08:56:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:56:17] (03CR) 10CI reject: [V: 04-1] Switch pybals from conf2 to conf1 [puppet] - 10https://gerrit.wikimedia.org/r/957248 (https://phabricator.wikimedia.org/T332010) (owner: 10JMeybohm) [08:56:54] (03PS2) 10JMeybohm: Switch pybals from conf2 to conf1 [puppet] - 10https://gerrit.wikimedia.org/r/957248 (https://phabricator.wikimedia.org/T332010) [08:56:55] 10SRE, 10serviceops, 10Patch-For-Review: Migrate conf2* hosts to bullseye - https://phabricator.wikimedia.org/T332010 (10JMeybohm) a:03JMeybohm [08:58:01] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43239/console" [puppet] - 10https://gerrit.wikimedia.org/r/957248 (https://phabricator.wikimedia.org/T332010) (owner: 10JMeybohm) [08:58:19] MichaelG_WMDE: the errors spike seems to correspond to WikibaseMediaInfo only, this is the only revert we would need, correct? 
https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikibaseMediaInfo/+/955900/ [08:58:38] (Wikidata Reliability Metrics - Median loading time alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [08:59:02] (03PS4) 10Jelto: gitlab: Fix conditional end in gitlab.rb template [puppet] - 10https://gerrit.wikimedia.org/r/956515 (owner: 10Dduvall) [08:59:51] jnuche: Yes, only the change that adds the hard deprecation needs to be reverted. We apparently overlooked some of the usages in MediaInfo [09:00:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:01:04] (03PS5) 10Brouberol: sre.opensearch.roll-restart-reboot: Define the opensearch service name as a pattern [cookbooks] - 10https://gerrit.wikimedia.org/r/956916 (https://phabricator.wikimedia.org/T344798) [09:01:11] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43240/console" [puppet] - 10https://gerrit.wikimedia.org/r/956515 (owner: 10Dduvall) [09:01:26] MichaelG_WMDE: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikibaseMediaInfo/+/956819 [09:02:53] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: Fix conditional end in gitlab.rb template [puppet] - 10https://gerrit.wikimedia.org/r/956515 (owner: 10Dduvall) [09:03:10] jnuche: mh, I was more thinking of this one: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/955898/ [09:03:32] sorry, I only glanced at the change number [09:03:51] it is a property of a statement that gets serialized with the old method [09:03:52] MichaelG_WMDE: Can you access Logstash? 
Our interpretation is that it's this one about WikibaseMediaInfo [09:04:18] because: PHP Deprecated: Use of Wikibase\DataModel\Entity\NumericPropertyId::serialize was deprecated in MediaWiki 1.41. [Called from Wikibase\MediaInfo\WikibaseMediaInfoHooks::doBeforePageDisplay] [09:04:39] Yes, but that logstash stack trace leads us to: `$existingPropertyTypes[$qualifierPropertyId->serialize()]` and `$existingPropertyTypes[$propertyId->serialize()] ` [09:05:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:05:25] MichaelG_WMDE: Ah thanks, we'll prepare another revert to backport then for 955898. Two minutes, please :) [09:05:26] though having both probably doesn't hurt. And when we do them again, we should probably make sure the message includes something about which entity type it is on [09:05:45] Thank you 🙏 [09:07:13] MichaelG_WMDE: reverts: [09:07:13] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/956820 [09:07:13] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikibaseMediaInfo/+/956819 [09:07:37] (03PS3) 10JMeybohm: Switch pybals from conf2 to conf1 [puppet] - 10https://gerrit.wikimedia.org/r/957248 (https://phabricator.wikimedia.org/T332010) [09:10:07] andre: jnuche: do you want revert adding the hard deprecation MediaInfo as well? I'm not sure it is necessary, but would be fine with it if you think it is better to be on the safe side? 
[09:11:07] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 10 CORE_DIFF 10): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43241/console" [puppet] - 10https://gerrit.wikimedia.org/r/957248 (https://phabricator.wikimedia.org/T332010) (owner: 10JMeybohm) [09:11:16] *to MediaInfo [09:11:30] MichaelG_WMDE: I'd say it's up to you, you know best :) [09:11:35] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host titan2001.codfw.wmnet with OS bookworm [09:11:46] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Create 'titan' role and put new hosts in service - https://phabricator.wikimedia.org/T341999 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by filippo@cumin1001 for host titan2001.cod... [09:12:03] MichaelG_WMDE, if https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/956820 is sufficient and you think https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikibaseMediaInfo/+/956819 is not needed that's fine [09:12:30] (03PS1) 10Ladsgroup: mariadb: Add grants for testreduce1002 [puppet] - 10https://gerrit.wikimedia.org/r/957251 (https://phabricator.wikimedia.org/T345220) [09:13:12] at least from my looking at logstash, they seem to be coming all from the exact same place [09:13:20] MichaelG_WMDE: ok, we'll do https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/956820 only [09:13:41] (03PS4) 10AikoChou: helmfile.d: Add config bits to move readability isvc to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/951461 (https://phabricator.wikimedia.org/T334182) (owner: 10Klausman) [09:14:21] !log filippo@cumin1001 START - Cookbook sre.hosts.reimage for host titan2002.codfw.wmnet with OS bookworm [09:14:25] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Create 'titan' role and put new hosts in service - https://phabricator.wikimedia.org/T341999 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by filippo@cumin1001 for host titan2002... [09:14:53] !log aklapper@deploy1002 backport Cancelled [09:15:19] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] nftables::sets: Don't add elements if no addresses are passed [puppet] - 10https://gerrit.wikimedia.org/r/956869 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [09:15:25] (03CR) 10AikoChou: [C: 03+1] helmfile.d: Add config bits to move readability isvc to prod (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/951461 (https://phabricator.wikimedia.org/T334182) (owner: 10Klausman) [09:16:47] !log filippo@cumin1001 START - Cookbook sre.hosts.reimage for host titan1002.eqiad.wmnet with OS bookworm [09:16:52] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Create 'titan' role and put new hosts in service - https://phabricator.wikimedia.org/T341999 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by filippo@cumin1001 for host titan1002... 
[09:17:48] (03CR) 10Elukey: [C: 03+1] "LGTM, this also needs Traffic involved though (so they know etc..)" [puppet] - 10https://gerrit.wikimedia.org/r/957248 (https://phabricator.wikimedia.org/T332010) (owner: 10JMeybohm) [09:18:38] (Wikidata Reliability Metrics - Median loading time alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [09:20:31] (03CR) 10Klausman: [C: 03+2] helmfile.d: Add config bits to move readability isvc to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/951461 (https://phabricator.wikimedia.org/T334182) (owner: 10Klausman) [09:22:25] (03PS7) 10Func: SiteConfiguration: Make sure the array is a list before appending [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956060 (https://phabricator.wikimedia.org/T346052) [09:23:10] (03Merged) 10jenkins-bot: helmfile.d: Add config bits to move readability isvc to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/951461 (https://phabricator.wikimedia.org/T334182) (owner: 10Klausman) [09:24:23] (03PS3) 10Klausman: profile::k8s::deployment_server: Add config for readability isvc [puppet] - 10https://gerrit.wikimedia.org/r/951460 (https://phabricator.wikimedia.org/T334182) [09:24:42] (03CR) 10Klausman: [V: 03+2 C: 03+2] profile::k8s::deployment_server: Add config for readability isvc [puppet] - 10https://gerrit.wikimedia.org/r/951460 (https://phabricator.wikimedia.org/T334182) (owner: 10Klausman) [09:25:05] (03PS1) 10Majavah: dynamicproxy: improve connection error pages [puppet] - 10https://gerrit.wikimedia.org/r/957254 (https://phabricator.wikimedia.org/T200616) [09:26:16] Alright, the revert has been merged, let's hope it works out this time 🤞 [09:26:26] * MichaelG_WMDE watches logstash [09:27:07] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43243/console" [puppet] - 10https://gerrit.wikimedia.org/r/957254 
(https://phabricator.wikimedia.org/T200616) (owner: 10Majavah) [09:32:19] (03CR) 10Majavah: [V: 03+1] "This has been tested on toolsbeta and on proxy-codf1dev." [puppet] - 10https://gerrit.wikimedia.org/r/957254 (https://phabricator.wikimedia.org/T200616) (owner: 10Majavah) [09:33:08] (03CR) 10Elukey: [C: 03+1] Remove conf2* from etcd client srv records [dns] - 10https://gerrit.wikimedia.org/r/957246 (https://phabricator.wikimedia.org/T332010) (owner: 10JMeybohm) [09:34:15] !log klausman@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [09:34:43] !log klausman@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [09:35:12] !log klausman@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [09:35:26] (03CR) 10Vgutierrez: "looks good, assuming that we haven't done any active effort in the past to move traffic from eqiad to codfw." [dns] - 10https://gerrit.wikimedia.org/r/931992 (https://phabricator.wikimedia.org/T337318) (owner: 10Jameel Kaisar) [09:35:30] (03PS1) 10Jaime Nuche: Revert "EntityId: Hard-deprecate Serializable methods" [extensions/Wikibase] (wmf/1.41.0-wmf.26) - 10https://gerrit.wikimedia.org/r/957257 [09:35:41] !log klausman@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [09:36:22] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/952047 (https://phabricator.wikimedia.org/T344884) (owner: 10Aklapper) [09:37:43] MichaelG_WMDE: sorry, we had some trouble creating the revert for the right branch, can you please take a look at this one? https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/957257 [09:38:06] (the previous revert was targeting master) [09:38:43] (03CR) 10Michael Große: [C: 03+1] "Looks good to me!" 
[extensions/Wikibase] (wmf/1.41.0-wmf.26) - 10https://gerrit.wikimedia.org/r/957257 (owner: 10Jaime Nuche) [09:39:42] jnuche: That should be the right change for the right branch, as far as I can tell [09:41:26] (I have only +1 rights there) [09:41:31] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by aklapper@deploy1002 using scap backport" [extensions/Wikibase] (wmf/1.41.0-wmf.26) - 10https://gerrit.wikimedia.org/r/957257 (owner: 10Jaime Nuche) [09:48:24] !log klausman@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . [09:51:29] (03CR) 10Giuseppe Lavagetto: [C: 03+1] jobqueue, thumbor: attempt to limit impact of thumbnailrender job [deployment-charts] - 10https://gerrit.wikimedia.org/r/956370 (https://phabricator.wikimedia.org/T337649) (owner: 10Hnowlan) [09:51:32] !log klausman@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [09:56:51] (03Merged) 10jenkins-bot: Revert "EntityId: Hard-deprecate Serializable methods" [extensions/Wikibase] (wmf/1.41.0-wmf.26) - 10https://gerrit.wikimedia.org/r/957257 (owner: 10Jaime Nuche) [09:57:28] !log aklapper@deploy1002 Started scap: Backport for [[gerrit:957257|Revert "EntityId: Hard-deprecate Serializable methods"]] [09:58:33] (03PS6) 10Slyngshede: P:idm allow for installation via Debian packages. [puppet] - 10https://gerrit.wikimedia.org/r/956836 [09:58:57] (03CR) 10CI reject: [V: 04-1] P:idm allow for installation via Debian packages. 
[puppet] - 10https://gerrit.wikimedia.org/r/956836 (owner: 10Slyngshede) [09:59:02] !log aklapper@deploy1002 aklapper and jnuche: Backport for [[gerrit:957257|Revert "EntityId: Hard-deprecate Serializable methods"]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [09:59:45] !log aklapper@deploy1002 aklapper and jnuche: Continuing with sync [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230913T1000) [10:00:14] hi, we're still working on the train [10:00:44] ack :) [10:00:51] 10SRE, 10ops-codfw, 10serviceops: mw2444 down - https://phabricator.wikimedia.org/T345884 (10Clement_Goubert) I'm putting mw2444 back into `pooled=no` (instead of `pooled=inactive`) so it gets scap updates and stops warning, however I'll wait until we're sure it's stable before actually putting it back in pr... [10:01:33] jnuche: I'm just going to put mw2444 back to pooled=no so it gets updates from scap (it's inactive right now). I'm not actually pooling it back into production until we know it's stable. [10:01:39] I can wait until sync's done though [10:02:14] claime: yeah, we should be close to finishing, if you don't mind waiting a few mins [10:02:15] (03PS7) 10Slyngshede: P:idm allow for installation via Debian packages. [puppet] - 10https://gerrit.wikimedia.org/r/956836 [10:02:22] yep, no worries [10:02:26] danke [10:02:30] bitte [10:02:40] (03CR) 10CI reject: [V: 04-1] P:idm allow for installation via Debian packages. [puppet] - 10https://gerrit.wikimedia.org/r/956836 (owner: 10Slyngshede) [10:05:35] (03PS8) 10Slyngshede: P:idm allow for installation via Debian packages. 
[puppet] - 10https://gerrit.wikimedia.org/r/956836 [10:06:01] !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host titan2002.codfw.wmnet with OS bookworm [10:06:05] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Create 'titan' role and put new hosts in service - https://phabricator.wikimedia.org/T341999 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by filippo@cumin1001 for host titan2002.cod... [10:06:06] !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host titan1002.eqiad.wmnet with OS bookworm [10:06:10] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Create 'titan' role and put new hosts in service - https://phabricator.wikimedia.org/T341999 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by filippo@cumin1001 for host titan1002.eqi... [10:06:17] !log aklapper@deploy1002 Finished scap: Backport for [[gerrit:957257|Revert "EntityId: Hard-deprecate Serializable methods"]] (duration: 08m 49s) [10:06:34] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:06:37] !log filippo@cumin1001 START - Cookbook sre.hosts.reimage for host titan2002.codfw.wmnet with OS bookworm [10:06:39] !log filippo@cumin1001 START - Cookbook sre.hosts.reimage for host titan1002.eqiad.wmnet with OS bookworm [10:06:43] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Create 'titan' role and put new hosts in service - https://phabricator.wikimedia.org/T341999 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by filippo@cumin1001 for host titan2002... 
[10:06:45] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Create 'titan' role and put new hosts in service - https://phabricator.wikimedia.org/T341999 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by filippo@cumin1001 for host titan1002... [10:09:35] 10SRE, 10Cloud-VPS: Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org, toolserver.org - https://phabricator.wikimedia.org/T346177 (10cmooney) I had a chat with Brandon about this on irc. He confirmed that glue records were not strictly needed for toolforge.org / wiki... [10:09:41] MichaelG_WMDE: backport is done now, errors are back to normal :) [10:09:54] claime: you're free to go [10:10:00] Train to group1 is done [10:10:01] Thanks <3 [10:10:11] Thank you! <3 [10:10:21] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . [10:10:25] !log cgoubert@cumin1001 conftool action : set/pooled=no; selector: name=mw2444.codfw.wmnet [10:11:06] !log set/pooled=no; selector: name=mw2444.codfw.wmnet - T345884 [10:11:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:09] T345884: mw2444 down - https://phabricator.wikimedia.org/T345884 [10:11:34] (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:15:03] (03PS9) 10Slyngshede: P:idm allow for installation via Debian packages. [puppet] - 10https://gerrit.wikimedia.org/r/956836 [10:15:28] (03CR) 10CI reject: [V: 04-1] P:idm allow for installation via Debian packages. [puppet] - 10https://gerrit.wikimedia.org/r/956836 (owner: 10Slyngshede) [10:17:17] (03PS10) 10Slyngshede: P:idm allow for installation via Debian packages. 
[puppet] - 10https://gerrit.wikimedia.org/r/956836 [10:17:42] (03CR) 10jenkins-bot: P:idm allow for installation via Debian packages. [puppet] - 10https://gerrit.wikimedia.org/r/956836 (owner: 10Slyngshede) [10:18:21] RECOVERY - haproxy failover on dbproxy1018 is OK: OK check_failover servers up 16 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [10:19:12] (03PS11) 10Slyngshede: P:idm allow for installation via Debian packages. [puppet] - 10https://gerrit.wikimedia.org/r/956836 [10:19:32] (03CR) 10JMeybohm: kubernetes::master: Switch staging to use PKI for SA signing (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/956449 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [10:19:37] (03CR) 10CI reject: [V: 04-1] P:idm allow for installation via Debian packages. [puppet] - 10https://gerrit.wikimedia.org/r/956836 (owner: 10Slyngshede) [10:20:57] (03PS12) 10Slyngshede: P:idm allow for installation via Debian packages. [puppet] - 10https://gerrit.wikimedia.org/r/956836 [10:21:02] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on titan1002.eqiad.wmnet with reason: host reimage [10:22:03] (03Abandoned) 10Stang: enwiki: Raise wgPageTriageMaxAge to indefinite [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808424 (https://phabricator.wikimedia.org/T310974) (owner: 10Stang) [10:24:09] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on titan1002.eqiad.wmnet with reason: host reimage [10:24:50] (03CR) 10JMeybohm: [V: 03+1] k8s::apiserver: Use a separate systemd service for safe restarts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/956842 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [10:25:06] (03PS3) 10JMeybohm: kubernetes::master: control-plane components should use the local api [puppet] - 10https://gerrit.wikimedia.org/r/956889 (https://phabricator.wikimedia.org/T329826) [10:25:08] (03PS9) 10JMeybohm: kubernetes::master: Switch staging to use 
PKI for SA signing [puppet] - 10https://gerrit.wikimedia.org/r/956449 (https://phabricator.wikimedia.org/T329826) [10:25:26] 10SRE, 10Cloud-VPS: Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org, toolserver.org - https://phabricator.wikimedia.org/T346177 (10aborrero) >>! In T346177#9162954, @cmooney wrote: > Decommissioning the old servers/IPs before everything has updated / propagated is a... [10:25:45] (03CR) 10JMeybohm: Add systemd dependencies to kube-apiserver (031 comment) [debs/kubernetes] (v1.23) - 10https://gerrit.wikimedia.org/r/956900 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [10:25:47] (03CR) 10JMeybohm: [C: 03+2] Add systemd dependencies to kube-apiserver [debs/kubernetes] (v1.23) - 10https://gerrit.wikimedia.org/r/956900 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [10:25:51] 10SRE, 10Cloud-VPS: Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org, toolserver.org - https://phabricator.wikimedia.org/T346177 (10aborrero) [10:25:53] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10aborrero) [10:26:29] (03PS13) 10Slyngshede: P:idm allow for installation via Debian packages. [puppet] - 10https://gerrit.wikimedia.org/r/956836 [10:26:57] !log filippo@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host titan2002.codfw.wmnet with OS bookworm [10:27:03] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Create 'titan' role and put new hosts in service - https://phabricator.wikimedia.org/T341999 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by filippo@cumin1001 for host titan2002.cod... 
[10:27:17] !log filippo@cumin1001 START - Cookbook sre.hosts.reimage for host titan2002.codfw.wmnet with OS bookworm [10:27:23] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Create 'titan' role and put new hosts in service - https://phabricator.wikimedia.org/T341999 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by filippo@cumin1001 for host titan2002... [10:28:44] !log klausman@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [10:28:57] !log klausman@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [10:29:26] !log klausman@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . [10:31:54] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10aborrero) [10:32:01] 10SRE, 10Cloud-VPS, 10User-aborrero: Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org, toolserver.org - https://phabricator.wikimedia.org/T346177 (10aborrero) [10:32:05] 10SRE, 10Cloud-VPS, 10User-aborrero: Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org, toolserver.org - https://phabricator.wikimedia.org/T346177 (10aborrero) [10:34:16] !log filippo@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host titan2002.codfw.wmnet with OS bookworm [10:34:21] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Create 'titan' role and put new hosts in service - https://phabricator.wikimedia.org/T341999 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by filippo@cumin1001 for host titan2002.cod... 
[10:34:42] !log filippo@cumin1001 START - Cookbook sre.hosts.reimage for host titan2002.codfw.wmnet with OS bookworm [10:34:48] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Create 'titan' role and put new hosts in service - https://phabricator.wikimedia.org/T341999 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by filippo@cumin1001 for host titan2002... [10:36:30] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43246/console" [puppet] - 10https://gerrit.wikimedia.org/r/956836 (owner: 10Slyngshede) [10:37:48] (JobUnavailable) firing: (2) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:40:47] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host titan1002.eqiad.wmnet with OS bookworm [10:40:51] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Create 'titan' role and put new hosts in service - https://phabricator.wikimedia.org/T341999 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by filippo@cumin1001 for host titan1002.eqi... 
[10:44:05] (03PS1) 10Ladsgroup: Fix typo in Jade content type name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957260 (https://phabricator.wikimedia.org/T345874) [10:46:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2105.codfw.wmnet with reason: Maintenance [10:46:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2105.codfw.wmnet with reason: Maintenance [10:49:41] !log imported kubernetes_1.23.14-3 to bullseye-wikimedia component/kubernetes123 - T329826 [10:49:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:45] T329826: Kubernetes v1.23 use PKI for service-account signing (instead of cergen) - https://phabricator.wikimedia.org/T329826 [10:51:51] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on titan2002.codfw.wmnet with reason: host reimage [10:54:53] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on titan2002.codfw.wmnet with reason: host reimage [10:55:39] PROBLEM - puppet last run on kubestagemaster2001 is CRITICAL: CRITICAL: Puppet last ran 23 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:00:43] RECOVERY - puppet last run on kubestagemaster2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:01:18] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (LIST customresourcedefinitions) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:04:49] (03CR) 10Hnowlan: [C: 03+2] jobqueue, thumbor: attempt to limit impact of thumbnailrender job [deployment-charts] - 10https://gerrit.wikimedia.org/r/956370 
(https://phabricator.wikimedia.org/T337649) (owner: 10Hnowlan) [11:05:43] (03Merged) 10jenkins-bot: jobqueue, thumbor: attempt to limit impact of thumbnailrender job [deployment-charts] - 10https://gerrit.wikimedia.org/r/956370 (https://phabricator.wikimedia.org/T337649) (owner: 10Hnowlan) [11:06:18] (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (LIST customresourcedefinitions) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:11:53] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host titan2002.codfw.wmnet with OS bookworm [11:11:58] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Create 'titan' role and put new hosts in service - https://phabricator.wikimedia.org/T341999 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by filippo@cumin1001 for host titan2002.cod... 
[11:15:06] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [11:15:26] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [11:16:20] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [11:17:18] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [11:18:34] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T343198)', diff saved to https://phabricator.wikimedia.org/P52495 and previous config saved to /var/cache/conftool/dbconfig/20230913-111834-arnaudb.json [11:18:39] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [11:18:45] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [11:19:18] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [11:24:55] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply [11:25:09] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [11:26:46] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: apply [11:27:23] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [11:29:17] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [11:30:44] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [11:43:15] PROBLEM - Check systemd state on sretest1001 is CRITICAL: CRITICAL - degraded: The following units failed: nftables.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:46:29] 10SRE, 10Infrastructure-Foundations, 10Traffic-Icebox, 10netops, and 2 others: varnish filtering: should we automatically update public_cloud_nets - https://phabricator.wikimedia.org/T270391 (10Volans) I found a bit of time to play 
with some of the above mentioned solutions and those are my findings. ####... [11:49:07] (03CR) 10Aklapper: "Thanks! Who's in a position to +2?" [puppet] - 10https://gerrit.wikimedia.org/r/952047 (https://phabricator.wikimedia.org/T344884) (owner: 10Aklapper) [11:53:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P52497 and previous config saved to /var/cache/conftool/dbconfig/20230913-115314-ladsgroup.json [12:04:45] PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS2914/IPv4: Idle - NTT, AS2914/IPv6: Idle - NTT https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:06:31] PROBLEM - BGP status on cr1-drmrs is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect - HE, AS13030/IPv6: Connect - Init7, AS6939/IPv4: Connect - HE, AS13030/IPv4: Connect - Init7 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:07:49] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 61, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:08:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P52498 and previous config saved to /var/cache/conftool/dbconfig/20230913-120818-ladsgroup.json [12:08:31] 10SRE, 10Cloud-VPS, 10User-aborrero: Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org, toolserver.org - https://phabricator.wikimedia.org/T346177 (10aborrero) >>! In T346177#9161543, @cmooney wrote: > I'd not noticed in my initial comment above, but *neither* IP th... 
[12:09:17] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 62, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:13:19] (03PS14) 10Slyngshede: P:idm allow for installation via Debian packages. [puppet] - 10https://gerrit.wikimedia.org/r/956836 [12:13:35] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 61, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:15:07] 10SRE, 10Ganeti, 10Infrastructure-Foundations, 10netops: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (10cmooney) After speaking to @ayounsi I have a better idea of how we intend to use the "routed mode" ganeti. In many ways it's similar to what I propose above: * Both ha... [12:15:30] (03CR) 10CI reject: [V: 04-1] P:idm allow for installation via Debian packages. [puppet] - 10https://gerrit.wikimedia.org/r/956836 (owner: 10Slyngshede) [12:17:39] !log pool only titan hosts for thanos-web and thanos-query services - T341488 [12:17:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:42] T341488: Split Thanos components from thanos-fe hosts into titan hosts - https://phabricator.wikimedia.org/T341488 [12:17:54] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Split Thanos components from thanos-fe hosts into titan hosts - https://phabricator.wikimedia.org/T341488 (10fgiunchedi) [12:18:09] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Create 'titan' role and put new hosts in service - https://phabricator.wikimedia.org/T341999 (10fgiunchedi) 05Open→03Resolved Hosts reimaged with raid0, resolving [12:18:15] (03PS15) 10Slyngshede: P:idm allow for installation via Debian packages. 
[puppet] - 10https://gerrit.wikimedia.org/r/956836 [12:19:02] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Split Thanos components from thanos-fe hosts into titan hosts - https://phabricator.wikimedia.org/T341488 (10fgiunchedi) [12:23:16] (03PS16) 10Slyngshede: P:idm allow for installation via Debian packages. [puppet] - 10https://gerrit.wikimedia.org/r/956836 [12:23:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P52499 and previous config saved to /var/cache/conftool/dbconfig/20230913-122323-ladsgroup.json [12:24:23] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43252/console" [puppet] - 10https://gerrit.wikimedia.org/r/956836 (owner: 10Slyngshede) [12:26:15] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43253/console" [puppet] - 10https://gerrit.wikimedia.org/r/956836 (owner: 10Slyngshede) [12:28:07] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 62, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:28:29] (03PS1) 10Muehlenhoff: Switch os-reports to the new puppetdb servers [puppet] - 10https://gerrit.wikimedia.org/r/957283 [12:31:37] (03CR) 10Muehlenhoff: [C: 03+2] Switch os-reports to the new puppetdb servers [puppet] - 10https://gerrit.wikimedia.org/r/957283 (owner: 10Muehlenhoff) [12:33:31] (03PS1) 10Muehlenhoff: Update example command to point to new puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/957284 [12:34:01] (03CR) 10Muehlenhoff: [C: 03+2] Update example command to point to new puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/957284 (owner: 10Muehlenhoff) [12:45:01] 10SRE, 10ops-eqiad, 10User-aborrero, 
10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10aborrero) so here is the plan: 1. we will let the markmonitor updates in {T346177} get applied, for both ns1 and ns0 (as of this writing, we are... [12:46:48] 10SRE, 10Cloud-VPS, 10User-aborrero: Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org, toolserver.org - https://phabricator.wikimedia.org/T346177 (10aborrero) How to continue with this DNS migration is described in T346042#9163506. I will close this ticket when w... [12:47:49] (03CR) 10Muehlenhoff: [C: 03+2] nftables::sets: Don't add elements if no addresses are passed [puppet] - 10https://gerrit.wikimedia.org/r/956869 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:52:12] (03CR) 10Alexandros Kosiaris: [C: 03+1] jobqueue, thumbor: attempt to limit impact of thumbnailrender job [deployment-charts] - 10https://gerrit.wikimedia.org/r/956370 (https://phabricator.wikimedia.org/T337649) (owner: 10Hnowlan) [12:55:18] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10aborrero) Update: the plan outlined above may not work after all. We just did step 2 and things broke, notably the resolver stopped working. We a... [12:56:18] andre, jnuche, MichaelG_WMDE: sorry I wasn’t around for the deprecation earlier, thanks a lot for taking care of it! [12:56:34] no problem, thanks! [12:57:19] yeah, no worries, thanks for responding on the ticket! :) [12:57:24] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10aborrero) To clarify: we won't be shutting down cloudservices1005 today. 
[12:57:42] (03PS2) 10Mhorsey: Enable Campaign Events email feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956447 (https://phabricator.wikimedia.org/T345704) [12:58:33] (03PS1) 10Brouberol: Define dataceneter-local cumin aliases for the logstash cluster [puppet] - 10https://gerrit.wikimedia.org/r/957287 (https://phabricator.wikimedia.org/T344798) [12:59:32] 10SRE, 10Ganeti, 10Infrastructure-Foundations, 10netops: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (10akosiaris) [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230913T1300). [13:00:04] xSavitar: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:39] o/ [13:01:16] If anyone is around, they can backport, otherwise I can do it. [13:01:38] * TheresNoTime will be around in 15 [13:01:46] xSavitar: but feel free to self-deploy if you want :) [13:01:50] I’m around [13:02:00] but if you want to self-serve that’s also fine by me [13:02:04] Lucas_WMDE, nice! You can deploy and I test :) [13:02:06] TheresNoTime, thanks1 [13:02:10] *! [13:02:11] (03PS17) 10Slyngshede: P:idm allow for installation via Debian packages. 
[puppet] - 10https://gerrit.wikimedia.org/r/956836 [13:02:12] ok ^^ [13:02:27] (03PS1) 10Jcrespo: dbbackups: Add new check (focused on ES) of long running backups [puppet] - 10https://gerrit.wikimedia.org/r/957288 (https://phabricator.wikimedia.org/T346233) [13:02:54] (03CR) 10CI reject: [V: 04-1] dbbackups: Add new check (focused on ES) of long running backups [puppet] - 10https://gerrit.wikimedia.org/r/957288 (https://phabricator.wikimedia.org/T346233) (owner: 10Jcrespo) [13:03:07] (03CR) 10Brouberol: "Resolve review comments" [cookbooks] - 10https://gerrit.wikimedia.org/r/956916 (https://phabricator.wikimedia.org/T344798) (owner: 10Brouberol) [13:03:30] (03CR) 10Brouberol: sre.opensearch.roll-restart-reboot: Define the opensearch service name as a pattern (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/956916 (https://phabricator.wikimedia.org/T344798) (owner: 10Brouberol) [13:03:38] (Wikidata Reliability Metrics - Median loading time alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [13:03:45] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "reviewed, makes sense (the corresponding code is in ServiceWiring.php)" [core] (wmf/1.41.0-wmf.26) - 10https://gerrit.wikimedia.org/r/956818 (https://phabricator.wikimedia.org/T318272) (owner: 10D3r1ck01) [13:03:53] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [core] (wmf/1.41.0-wmf.26) - 10https://gerrit.wikimedia.org/r/956818 (https://phabricator.wikimedia.org/T318272) (owner: 10D3r1ck01) [13:04:26] (I was confused for a second because I didn’t see debugSql or debugDumpSql in mediawiki-config, but that’s not where those $params come from directly) [13:06:52] (03PS2) 10Jcrespo: dbbackups: Add new check (focused on ES) of long running backups [puppet] - 10https://gerrit.wikimedia.org/r/957288 (https://phabricator.wikimedia.org/T346233) [13:07:15] (03CR) 10CI reject: [V: 04-1] dbbackups: Add 
new check (focused on ES) of long running backups [puppet] - 10https://gerrit.wikimedia.org/r/957288 (https://phabricator.wikimedia.org/T346233) (owner: 10Jcrespo) [13:08:38] Lucas_WMDE, this is available only for logging purposes: https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/9c63e5a279c13c66db49f16b2221e167affd2408/wmf-config/logging.php#67 [13:09:04] So it's hiding behind an environment flag. It's not enabled by default in prod. [13:09:29] https://gerrit.wikimedia.org/g/mediawiki/core/+/refs/changes/18/956818/1/includes/ServiceWiring.php#635 is the line I was looking for [13:09:35] where it maps that config to the 'debugSql' key [13:09:41] So one way of triggering it is when making a request via WikimediaDebug browser extension with "verbose log" selected. [13:09:45] (rather than 'debugDumpSql') [13:09:47] Oh yeah, you're right about that. [13:09:54] Absolutely [13:10:04] (03PS3) 10Jcrespo: dbbackups: Add new check (focused on ES) of long running backups [puppet] - 10https://gerrit.wikimedia.org/r/957288 (https://phabricator.wikimedia.org/T346233) [13:12:22] (03CR) 10CI reject: [V: 04-1] dbbackups: Add new check (focused on ES) of long running backups [puppet] - 10https://gerrit.wikimedia.org/r/957288 (https://phabricator.wikimedia.org/T346233) (owner: 10Jcrespo) [13:15:25] PROBLEM - SSH on sretest1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:16:46] (03PS4) 10Jcrespo: dbbackups: Add new check (focused on ES) of long running backups [puppet] - 10https://gerrit.wikimedia.org/r/957288 (https://phabricator.wikimedia.org/T346233) [13:18:01] (03Merged) 10jenkins-bot: rdbms: Use `debugSql` instead of `debugDumpSql` which is unuset [core] (wmf/1.41.0-wmf.26) - 10https://gerrit.wikimedia.org/r/956818 (https://phabricator.wikimedia.org/T318272) (owner: 10D3r1ck01) [13:18:11] (03PS2) 10Brouberol: Define datacenter-local cumin aliases for the logstash cluster [puppet] - 
10https://gerrit.wikimedia.org/r/957287 (https://phabricator.wikimedia.org/T344798) [13:18:31] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:956818|rdbms: Use `debugSql` instead of `debugDumpSql` which is unuset (T318272)]] [13:18:35] T318272: MultiWriteBagOStuff caches are missing DI defaults - https://phabricator.wikimedia.org/T318272 [13:20:13] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and d3r1ck01: Backport for [[gerrit:956818|rdbms: Use `debugSql` instead of `debugDumpSql` which is unuset (T318272)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:20:19] xSavitar: ^ [13:20:25] Testing now... [13:20:27] Thanks [13:20:29] ok [13:20:39] (03PS5) 10Jcrespo: dbbackups: Add new check (focused on ES) of long running backups [puppet] - 10https://gerrit.wikimedia.org/r/957288 (https://phabricator.wikimedia.org/T346233) [13:20:54] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/957288 (https://phabricator.wikimedia.org/T346233) (owner: 10Jcrespo) [13:23:38] (Wikidata Reliability Metrics - Median loading time alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [13:26:47] (03PS6) 10Jcrespo: dbbackups: Add new check (focused on ES) of long running backups [puppet] - 10https://gerrit.wikimedia.org/r/957288 (https://phabricator.wikimedia.org/T346233) [13:27:11] Lucas_WMDE, it works 🎉 [13:27:15] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and d3r1ck01: Continuing with sync [13:27:16] You can burn the world now :D [13:27:18] ok, thanks for testing! [13:27:20] what D: [13:27:27] Heh, I'm just kidding [13:27:32] :) [13:28:05] But yeah, queries are being logged now as expected. 
[13:28:43] yay ^^ [13:28:55] log ALL teh queries [13:28:57] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/957288 (https://phabricator.wikimedia.org/T346233) (owner: 10Jcrespo) [13:29:17] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-c.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:31:33] restart ALL the php-fpms [13:33:33] RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:33:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:34:13] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:956818|rdbms: Use `debugSql` instead of `debugDumpSql` which is unuset (T318272)]] (duration: 15m 42s) [13:34:17] T318272: MultiWriteBagOStuff caches are missing DI defaults - https://phabricator.wikimedia.org/T318272 [13:34:34] !log UTC afternoon backport+config window done [13:34:35] * Lucas_WMDE done [13:34:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:48] Thank you so much Lucas_WMDE, I appreciate. [13:34:53] have a great rest of your day [13:35:28] you too! 
[13:36:17] (03CR) 10Herron: [C: 03+1] thanos: move rule evaluation to titan hosts [puppet] - 10https://gerrit.wikimedia.org/r/956904 (https://phabricator.wikimedia.org/T346143) (owner: 10Filippo Giunchedi) [13:36:33] (03CR) 10Herron: [C: 03+1] thanos: move thanos-compact to titan host [puppet] - 10https://gerrit.wikimedia.org/r/956905 (https://phabricator.wikimedia.org/T346143) (owner: 10Filippo Giunchedi) [13:37:11] PROBLEM - Check systemd state on sretest1001 is CRITICAL: CRITICAL - degraded: The following units failed: nftables.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:37:25] (03PS2) 10Cparle: Disable UploadWizard CTA for MachineVision [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955967 (https://phabricator.wikimedia.org/T345187) [13:38:19] RECOVERY - SSH on sretest1001 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:38:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:39:09] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10Jclark-ctr) @aborrero thanks for updated please advise when we are good to proceed [13:44:28] (03PS1) 10Fabfur: varnish: add more domains for mobile redirect (*.wikimedia.org) [puppet] - 10https://gerrit.wikimedia.org/r/957292 (https://phabricator.wikimedia.org/T344175) [13:48:08] (03PS18) 10Slyngshede: P:idm allow for installation via Debian packages. 
[puppet] - 10https://gerrit.wikimedia.org/r/956836 [13:48:22] 10SRE, 10Traffic, 10Patch-For-Review: Varnish mobile redirection misses some sites - https://phabricator.wikimedia.org/T344175 (10Fabfur) Thanks @nshahquinn-wmf , I've started working on this, obviously we will add rules for mobile domain redirect only for domains that have a real mobile counterpart... [13:48:33] (03PS3) 10Cparle: Disable UploadWizard CTA for MachineVision [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955967 (https://phabricator.wikimedia.org/T345187) [13:50:02] (03CR) 10Ssingh: [V: 03+1 C: 03+2] varnish:common: add bookworm version for Python [puppet] - 10https://gerrit.wikimedia.org/r/956914 (https://phabricator.wikimedia.org/T342154) (owner: 10Ssingh) [13:50:43] (03PS4) 10Cparle: Disable UploadWizard CTA for MachineVision [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955967 (https://phabricator.wikimedia.org/T345187) [13:51:29] (03PS1) 10Jclark-ctr: new backup node kubernetes10[27-56] [puppet] - 10https://gerrit.wikimedia.org/r/957296 (https://phabricator.wikimedia.org/T342533) [13:51:35] PROBLEM - SSH on sretest1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:51:41] (ATSBackendErrorsHigh) firing: ATS: elevated 5xx errors from restbase.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-datasource=esams%20prometheus/ops&var-cluster=text&var-origin=restbase.discovery.wmnet&editPanel=12 - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [13:51:51] (03CR) 10CI reject: [V: 04-1] new backup node kubernetes10[27-56] [puppet] - 10https://gerrit.wikimedia.org/r/957296 (https://phabricator.wikimedia.org/T342533) (owner: 10Jclark-ctr) [13:51:59] (03PS1) 10Muehlenhoff: nftables sets: Fix the template to properly wrap the element block [puppet] - 
10https://gerrit.wikimedia.org/r/957297 (https://phabricator.wikimedia.org/T336497) [13:52:09] (03PS19) 10Slyngshede: P:idm allow for installation via Debian packages. [puppet] - 10https://gerrit.wikimedia.org/r/956836 [13:52:13] restbase? checking [13:52:55] !log jmm@cumin2002 START - Cookbook sre.aqs.roll-restart-reboot rolling reboot on A:aqs-codfw [13:53:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:53:34] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/957297 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [13:53:46] was there any deployment related to restbase?
[13:54:27] 5xx started increasing at 13:34 [13:54:49] I see some logs in the ECS dashboard like: ResponseError: Server timeout during read query at consistency LOCAL_QUORUM (2 replica(s) responded over 3 required) [13:55:01] (03PS1) 10Bking: flink-app: Correct ZK config for dse-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/957298 (https://phabricator.wikimedia.org/T344614) [13:55:17] looks like restbase1030 recovered just as those errors started to happen [13:55:22] urandom_ might have some context [13:55:28] https://logstash.wikimedia.org/goto/2e560a87a71dfc4c70786655bf2cd0f7 [13:55:33] volans: --^ [13:55:43] it also aligns with the last scap [13:55:44] (03CR) 10Marco Fossati: [C: 03+1] Disable UploadWizard CTA for MachineVision [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955967 (https://phabricator.wikimedia.org/T345187) (owner: 10Cparle) [13:55:47] but could be unrelated [13:55:51] thx elukey [13:55:55] restbase1030 has been reimaged a few times [13:56:13] (03CR) 10Vgutierrez: varnish: add more domains for mobile redirect (*.wikimedia.org) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/957292 (https://phabricator.wikimedia.org/T344175) (owner: 10Fabfur) [13:56:15] Lucas_WMDE: could it have any connection? [13:56:16] Eric was testing some partman recipes on it [13:56:40] although seems more related to restbase so far [13:57:05] https://phabricator.wikimedia.org/T331713 is the task for restbase1030 [13:57:14] (03PS2) 10Bking: flink-app: Correct ZK config for dse-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/957298 (https://phabricator.wikimedia.org/T344614) [13:57:24] from a cassandra perspective restbase1030-[ab] are down, but restbase1030-c is trying to join the cluster [13:57:53] volans: I would be very surprised [13:57:59] hnowlan: what does the LOCAL_QUORUM mean in this scenario? [13:58:02] sounds very unrelated to my deploy [13:58:06] could it be that a single instance down causes failures?
[13:58:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:58:21] from restbase1030's perspective a and b are joined to the cluster but only have a load of ~88GB when it should have 1TB+ [13:58:30] Lucas_WMDE: ack thx [13:58:37] to me too, but just in case [13:58:41] I was checking [13:58:47] if I understand the patch I deployed correctly, it would be okay-ish to revert, we’d just lose some debug logging (cc xSavitar) – if you want to try it [13:58:58] nah for now not needed [13:58:59] but it seems more likely that any deploy might have caused something, and a revert won’t fix it [13:59:00] ok [13:59:03] hnowlan: how can I help? [13:59:09] the URI seems always the same, /en.wikipedia.org/sys/table/title_revisions-ng [13:59:31] 10SRE, 10ops-codfw, 10decommission-hardware: Decommission furud - https://phabricator.wikimedia.org/T345867 (10Jhancock.wm) 05Open→03Resolved [13:59:50] (03CR) 10DCausse: flink-app: Correct ZK config for dse-k8s (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/957298 (https://phabricator.wikimedia.org/T344614) (owner: 10Bking) [13:59:51] elukey: that means the operation (read I'm guessing? could be writes) can't get the right number of responses - which I suspect means restbase1030 is getting hit somehow [14:00:05] Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230913T1400) [14:00:11] volans: not certain, this is a bit of an odd state. 
I am tempted to err on the side of shutting down restbase1030 [14:00:19] but it would be nice to know *what* caused it to recover like this [14:00:26] hnowlan: yeah but what is the quorum number? It seems 3, why? [14:01:41] (ATSBackendErrorsHigh) resolved: ATS: elevated 5xx errors from restbase.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-datasource=esams%20prometheus/ops&var-cluster=text&var-origin=restbase.discovery.wmnet&editPanel=12 - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [14:02:13] although the page resolved the 5xx didn't recover yet [14:02:57] elukey: afair the NetworkTopologyStrategy is 3 [14:03:29] hnowlan: shutting down or just stopping cassandra? [14:03:37] hmmm that's impacting just eqiad side of things [14:03:46] or block the network to it [14:04:05] things that can be reverted quickly if needed, as opposed to a full reboot [14:04:21] volans: yeah just stopping cassandra [14:04:26] hnowlan: yeah I wanted to check the config of the title_revisions-ng schema, but can't find the cassandra pass for cqlsh [14:04:46] the c instance started at 13:32 which explains the timing [14:04:49] why it started I don't know [14:04:54] elukey: /etc/cassandra-a/cqlshrc should have it [14:04:55] (03PS2) 10Jclark-ctr: new backup node kubernetes10[27-56] [puppet] - 10https://gerrit.wikimedia.org/r/957296 (https://phabricator.wikimedia.org/T342533) [14:05:06] or just c-cqlsh a [14:05:13] yeah it seems reasonable to kill restbase1030 for now and see if it recovers [14:05:17] just going to read logs for a sec [14:05:21] (03CR) 10CI reject: [V: 04-1] new backup node kubernetes10[27-56] [puppet] - 10https://gerrit.wikimedia.org/r/957296 (https://phabricator.wikimedia.org/T342533) (owner: 10Jclark-ctr) [14:05:28] takes a second with cassandra's verbosity [14:05:32] eheh [14:05:50] (03PS3) 10Bking: 
flink-app: Correct ZK config for dse-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/957298 (https://phabricator.wikimedia.org/T344614) [14:07:33] (JobUnavailable) firing: (4) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:07:39] alright, I'll stop cassandra so? [14:07:55] +1 for me [14:07:58] +1 [14:08:47] !log stopping cassandra on restbase1030-c [14:08:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:51] if needed disable puppet temporarily [14:08:52] (03CR) 10DCausse: [C: 03+1] flink-app: Correct ZK config for dse-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/957298 (https://phabricator.wikimedia.org/T344614) (owner: 10Bking) [14:08:59] if it will start it again [14:09:03] (03CR) 10Bking: [C: 03+2] flink-app: Correct ZK config for dse-k8s (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/957298 (https://phabricator.wikimedia.org/T344614) (owner: 10Bking) [14:09:11] (03CR) 10MVernon: [C: 03+1] thanos: move thanos-compact to titan host [puppet] - 10https://gerrit.wikimedia.org/r/956905 (https://phabricator.wikimedia.org/T346143) (owner: 10Filippo Giunchedi) [14:09:18] volans: already stopped [14:09:22] k [14:09:35] (03PS1) 10Muehlenhoff: Drop the rsync migration setup for testreduce [puppet] - 10https://gerrit.wikimedia.org/r/957301 (https://phabricator.wikimedia.org/T345831) [14:09:47] hnowlan: ok got it, network topology 3 and [14:09:48] WITH replication = {'class': 'NetworkTopologyStrategy', 'eqiad': '3', 'codfw': '3'} [14:09:52] (03CR) 10MVernon: [C: 03+1] thanos: move rule evaluation to titan hosts [puppet] - 10https://gerrit.wikimedia.org/r/956904 (https://phabricator.wikimedia.org/T346143) (owner: 10Filippo Giunchedi) [14:09:59] (03Merged) 10jenkins-bot: flink-app: 
Correct ZK config for dse-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/957298 (https://phabricator.wikimedia.org/T344614) (owner: 10Bking) [14:10:03] it is crazy that this fails with a single instance down [14:10:13] so far no big change in 5xx, monitoring [14:10:42] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:10:47] (03CR) 10MVernon: [C: 03+1] conftool-data: split thanos-fe / titan hosts' services [puppet] - 10https://gerrit.wikimedia.org/r/956888 (https://phabricator.wikimedia.org/T346143) (owner: 10Filippo Giunchedi) [14:10:56] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:11:09] the stop is hanging but the service itself is trying to drain it seems [14:11:16] which should mean no new connections but who knows :/ [14:11:34] we can probably kill it at this point, if it hangs [14:11:58] stopped [14:12:20] elukey: we've hit stuff somewhat like this before where an instance isn't really down but is up and impaired (sessionstore reboots caused an outage through this) [14:13:38] graph going slightly down, giving it another minute to be sure [14:13:47] (03CR) 10Muehlenhoff: [C: 03+2] Drop the rsync migration setup for testreduce [puppet] - 10https://gerrit.wikimedia.org/r/957301 (https://phabricator.wikimedia.org/T345831) (owner: 10Muehlenhoff) [14:14:01] yeah it looks recovered [14:14:40] yep definitely recovering [14:15:23] need to attend a meeting, ping me if needed [14:15:37] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service,cassandra-b.service,cassandra-c.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:07] thanks hnowlan! [14:16:46] np! 
Hopefully urandom_ can fill us in on what happened [14:17:00] notable data point on this is that restbase1030 is the first bullseye host [14:17:00] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:17:06] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:17:07] *bullseye restbase host [14:17:09] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:17:33] (JobUnavailable) firing: (4) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:17:36] hnowlan: solution: skip bullseye and go to bookworm directly [14:17:38] :D [14:17:45] :D [14:18:01] I'm not certain as to whether it was supposed to join the cluster [14:20:06] (03CR) 10Jameel Kaisar: "good to see some activity on this 😄" [dns] - 10https://gerrit.wikimedia.org/r/931992 (https://phabricator.wikimedia.org/T337318) (owner: 10Jameel Kaisar) [14:20:09] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:09] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:22:37] (03PS1) 10Bking: flink-app: remove high-availability.cluster-id [deployment-charts] - 10https://gerrit.wikimedia.org/r/957303 
(https://phabricator.wikimedia.org/T344614) [14:23:08] (03CR) 10Muehlenhoff: [C: 03+2] nftables sets: Fix the template to properly wrap the element block [puppet] - 10https://gerrit.wikimedia.org/r/957297 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:23:19] (03CR) 10DCausse: [C: 03+1] flink-app: remove high-availability.cluster-id [deployment-charts] - 10https://gerrit.wikimedia.org/r/957303 (https://phabricator.wikimedia.org/T344614) (owner: 10Bking) [14:23:52] (03CR) 10Bking: [C: 03+2] flink-app: remove high-availability.cluster-id [deployment-charts] - 10https://gerrit.wikimedia.org/r/957303 (https://phabricator.wikimedia.org/T344614) (owner: 10Bking) [14:24:35] (03PS3) 10Jclark-ctr: new backup node kubernetes10[27-56] [puppet] - 10https://gerrit.wikimedia.org/r/957296 (https://phabricator.wikimedia.org/T342533) [14:24:40] (03Merged) 10jenkins-bot: flink-app: remove high-availability.cluster-id [deployment-charts] - 10https://gerrit.wikimedia.org/r/957303 (https://phabricator.wikimedia.org/T344614) (owner: 10Bking) [14:25:02] (03CR) 10CI reject: [V: 04-1] new backup node kubernetes10[27-56] [puppet] - 10https://gerrit.wikimedia.org/r/957296 (https://phabricator.wikimedia.org/T342533) (owner: 10Jclark-ctr) [14:25:31] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:26:01] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:26:05] (03PS2) 10Fabfur: varnish: add more domains for mobile redirect (*.wikimedia.org) [puppet] - 10https://gerrit.wikimedia.org/r/957292 (https://phabricator.wikimedia.org/T344175) [14:29:10] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:29:20] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:30:04] (03PS4) 10Jclark-ctr: new backup node kubernetes10[27-56] 
[puppet] - 10https://gerrit.wikimedia.org/r/957296 (https://phabricator.wikimedia.org/T342533) [14:30:36] (03PS3) 10Fabfur: varnish: add more domains for mobile redirect (*.wikimedia.org) [puppet] - 10https://gerrit.wikimedia.org/r/957292 (https://phabricator.wikimedia.org/T344175) [14:31:25] (03CR) 10Fabfur: varnish: add more domains for mobile redirect (*.wikimedia.org) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/957292 (https://phabricator.wikimedia.org/T344175) (owner: 10Fabfur) [14:31:29] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1001.eqiad.wmnet with OS bullseye [14:31:30] (03CR) 10Jclark-ctr: [C: 03+2] new backup node kubernetes10[27-56] [puppet] - 10https://gerrit.wikimedia.org/r/957296 (https://phabricator.wikimedia.org/T342533) (owner: 10Jclark-ctr) [14:36:08] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on sretest1001.eqiad.wmnet with reason: WIP towards puppetised nftables firewall [14:36:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on sretest1001.eqiad.wmnet with reason: WIP towards puppetised nftables firewall [14:37:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10Jclark-ctr) [14:39:40] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [14:41:39] !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [14:41:45] (03CR) 10Vgutierrez: [C: 03+1] trafficserver: route to wikifeeds via rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/956895 (https://phabricator.wikimedia.org/T339119) (owner: 10Hnowlan) [14:42:00] (03CR) 10RobH: [C: 03+2] new backup node kubernetes10[27-56] [puppet] - 10https://gerrit.wikimedia.org/r/957296 (https://phabricator.wikimedia.org/T342533) (owner: 10Jclark-ctr) [14:42:23] (03CR) 
10Vgutierrez: [C: 03+1] Switch pybals from conf2 to conf1 [puppet] - 10https://gerrit.wikimedia.org/r/957248 (https://phabricator.wikimedia.org/T332010) (owner: 10JMeybohm) [14:45:34] (03CR) 10Btullis: [C: 03+1] "Looks good to me. Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/957287 (https://phabricator.wikimedia.org/T344798) (owner: 10Brouberol) [14:46:25] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-tails-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:47:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install dbstore100[89] - https://phabricator.wikimedia.org/T342862 (10Jhancock.wm) raids configured [14:49:34] (03CR) 10Brouberol: [C: 03+2] Define datacenter-local cumin aliases for the logstash cluster [puppet] - 10https://gerrit.wikimedia.org/r/957287 (https://phabricator.wikimedia.org/T344798) (owner: 10Brouberol) [14:50:25] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10cmooney) Thanks @aborrero, that plan is how I expected based on our chat earlier. Re step 2, we should retry when we think we've ironed out our... 
[14:50:28] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply [14:51:27] !log updated kubernetes-* packages fleet wide to 1.23.14-3 - T329826 [14:51:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:30] T329826: Kubernetes v1.23 use PKI for service-account signing (instead of cergen) - https://phabricator.wikimedia.org/T329826 [14:51:45] !log depooled service=ats-be,name=cp2037.codfw.wmnet [14:51:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:10] (03PS6) 10Brouberol: sre.opensearch.roll-restart-reboot: Define the opensearch service name as a pattern [cookbooks] - 10https://gerrit.wikimedia.org/r/956916 (https://phabricator.wikimedia.org/T344798) [14:52:39] (03CR) 10Brouberol: "We can leverage datacenter-level aliases for logstash now that https://gerrit.wikimedia.org/r/c/operations/puppet/+/957287 was merged" [cookbooks] - 10https://gerrit.wikimedia.org/r/956916 (https://phabricator.wikimedia.org/T344798) (owner: 10Brouberol) [14:52:47] !log disable puppet on A:cp [14:52:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:41] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10aborrero) @taavi wrote: ` NEXT STEPS I think the steps to complete this migration without any further user impact are roughly the following:... 
[14:54:34] (03CR) 10Hnowlan: [C: 03+2] trafficserver: route to wikifeeds via rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/956895 (https://phabricator.wikimedia.org/T339119) (owner: 10Hnowlan) [14:55:25] PROBLEM - Check systemd state on kubestagemaster1002 is CRITICAL: CRITICAL - degraded: The following units failed: kube-controller-manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:55:36] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply [14:56:30] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply [15:00:01] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:00:31] PROBLEM - BFD status on cr1-drmrs is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:01:15] RECOVERY - Check systemd state on kubestagemaster1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:01:18] !log repooling cp2037 and enabling puppet on A:cp [15:01:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:43] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply [15:03:45] (03CR) 10JMeybohm: [C: 03+2] kubernetes::master: control-plane components should use the local api (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/956889 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [15:03:48] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] k8s::apiserver: Use a separate systemd service for safe restarts [puppet] - 10https://gerrit.wikimedia.org/r/956842 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [15:04:12] !log stopped puppet on all k8s control planes for 956842 rollout [15:04:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[15:05:32] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Switch pybals from conf2 to conf1 [puppet] - 10https://gerrit.wikimedia.org/r/957248 (https://phabricator.wikimedia.org/T332010) (owner: 10JMeybohm) [15:06:33] (03PS1) 10DCausse: flink-kubernetes-operator: use networkpolicy_1.2.0 and configure zookeepter [deployment-charts] - 10https://gerrit.wikimedia.org/r/957311 [15:08:23] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Remove conf2* from etcd client srv records [dns] - 10https://gerrit.wikimedia.org/r/957246 (https://phabricator.wikimedia.org/T332010) (owner: 10JMeybohm) [15:09:18] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:11:39] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:12:09] RECOVERY - BFD status on cr1-drmrs is OK: UP: 3 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:12:35] (03CR) 10Ebernhardson: [C: 04-1] flink-kubernetes-operator: use networkpolicy_1.2.0 and configure zookeepter (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/957311 (owner: 10DCausse) [15:14:16] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [15:14:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:14:30] (03CR) 10DCausse: flink-kubernetes-operator: use networkpolicy_1.2.0 and configure zookeepter (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/957311 (owner: 10DCausse) 
[15:14:32] (03CR) 10Ebernhardson: flink-kubernetes-operator: use networkpolicy_1.2.0 and configure zookeepter (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/957311 (owner: 10DCausse) [15:14:34] (03PS2) 10DCausse: flink-kubernetes-operator: use networkpolicy_1.2.0 and configure zookeepter [deployment-charts] - 10https://gerrit.wikimedia.org/r/957311 [15:17:01] (03CR) 10CI reject: [V: 04-1] flink-kubernetes-operator: use networkpolicy_1.2.0 and configure zookeepter [deployment-charts] - 10https://gerrit.wikimedia.org/r/957311 (owner: 10DCausse) [15:17:53] !log Starting LibreNMS upgrade in codfw. [15:17:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:01] !log Start reimage of netmon2002 [15:19:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:52] !log denisse@cumin1001 START - Cookbook sre.hosts.reimage for host netmon2002.wikimedia.org with OS bookworm [15:21:48] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:22:03] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:23:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (main) at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:24:47] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy 
https://wikitech.wikimedia.org/wiki/PyBal [15:24:59] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:25:07] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:25:27] (03PS1) 10Muehlenhoff: firewall::service: Fix logic error in passing srange/drange to nftables [puppet] - 10https://gerrit.wikimedia.org/r/957313 (https://phabricator.wikimedia.org/T336497) [15:26:15] 10SRE, 10Infrastructure-Foundations, 10Traffic-Icebox, 10netops, and 2 others: varnish filtering: should we automatically update public_cloud_nets - https://phabricator.wikimedia.org/T270391 (10cmooney) The problem getting them by ASN is that there may be "collateral damage" sometimes. i.e. If you pull th... 
[15:26:25] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:26:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.aqs.roll-restart-reboot (exit_code=0) rolling reboot on A:aqs-codfw [15:26:37] !log re-enabled puppet on all k8s control planes [15:26:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:48] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST customresourcedefinitions) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:27:33] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:28:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (main) at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:28:12] (03PS3) 10DCausse: flink-kubernetes-operator: use networkpolicy_1.2.0 and configure zookeepter [deployment-charts] - 10https://gerrit.wikimedia.org/r/957311 [15:29:32] PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:29:46] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/957313 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) 
[15:29:50] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [15:30:31] RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:30:35] (03CR) 10CI reject: [V: 04-1] flink-kubernetes-operator: use networkpolicy_1.2.0 and configure zookeepter [deployment-charts] - 10https://gerrit.wikimedia.org/r/957311 (owner: 10DCausse) [15:31:46] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [15:31:48] (KubernetesAPILatency) resolved: (7) High Kubernetes API latency (LIST certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:32:55] PROBLEM - restbase endpoints health on restbase1031 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:33:57] RECOVERY - restbase endpoints health on restbase1031 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:34:31] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [15:34:55] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [15:35:48] (03PS4) 10DCausse: flink-kubernetes-operator: use networkpolicy_1.2.0 and configure zookeepter [deployment-charts] - 10https://gerrit.wikimedia.org/r/957311 [15:36:45] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:36:45] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:37:49] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:38:23] PROBLEM - restbase endpoints health on restbase2027 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:38:52] !log denisse@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on netmon2002.wikimedia.org with reason: host reimage [15:38:55] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:39:23] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:39:29] RECOVERY - restbase endpoints health on restbase2027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:40:01] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:40:01] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/feed/announcements 
(Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:40:16] (KubernetesRsyslogDown) firing: rsyslog on kubernetes1024:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes1024 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:40:26] Hmm restbase acting out again hnowlan [15:40:31] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:41:03] claime: we just migrated wikifeeds away from it - trying to debug this atm [15:41:11] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:41:11] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:41:16] feeds is not being served to the public via restbase atm [15:41:19] hnowlan: Ah so semi-expected, sorry for the distraction [15:41:32] not expected unfortunately :( [15:41:36] :/ [15:41:54] !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netmon2002.wikimedia.org with reason: host reimage [15:42:17] (03CR) 10JMeybohm: [C: 04-1] "admin_ng services don't usually include global defaults (as the services do). 
I think you need to add those to helmfile.d/admin_ng/flink-o" [deployment-charts] - 10https://gerrit.wikimedia.org/r/957311 (owner: 10DCausse) [15:42:30] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:42:40] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:43:26] PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:43:26] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:43:26] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:43:32] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:43:42] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:43:50] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:44:08] PROBLEM - restbase endpoints health on restbase2024 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve 
announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:44:42] claime hnowlan : I'm here if there's anything I can do to help with the restbase alerts. [15:45:30] RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:45:35] denisse: thanks! this is mostly noise afaict, we're not serving any 5xx from restbase (although please correct me if I'm wrong) [15:45:47] Trying to figure out what's causing this in restbase and if I can't in the next few I'll roll the change back [15:45:54] PROBLEM - restbase endpoints health on restbase2027 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:46:12] RECOVERY - restbase endpoints health on restbase2024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:46:23] (03PS10) 10JMeybohm: kubernetes::master: Switch to PKI for SA signing [puppet] - 10https://gerrit.wikimedia.org/r/956449 (https://phabricator.wikimedia.org/T329826) [15:46:36] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:46:56] RECOVERY - restbase endpoints health on restbase2027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:47:28] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected
status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:47:48] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:48:02] PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:48:30] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:48:44] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:49:56] PROBLEM - restbase endpoints health on restbase2025 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:50:00] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:50:14] RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:52:02] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:52:14] RECOVERY - restbase endpoints health on 
restbase2025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:52:14] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:52:20] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:53:14] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:53:26] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:53:32] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:53:38] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (LIST clusterroles) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:54:16] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:55:14] (03PS5) 10DCausse: flink-kubernetes-operator: use networkpolicy_1.2.0 and configure zookeepter [deployment-charts] - 10https://gerrit.wikimedia.org/r/957311 [15:55:36] (03CR) 
10DCausse: flink-kubernetes-operator: use networkpolicy_1.2.0 and configure zookeepter (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/957311 (owner: 10DCausse) [15:56:46] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:58:04] (03PS6) 10DCausse: flink-kubernetes-operator: use networkpolicy_1.2.0 and configure zookeepter [deployment-charts] - 10https://gerrit.wikimedia.org/r/957311 [15:58:38] (KubernetesAPILatency) resolved: (6) High Kubernetes API latency (LIST clusterroles) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:59:34] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:00:34] (03CR) 10CI reject: [V: 04-1] flink-kubernetes-operator: use networkpolicy_1.2.0 and configure zookeepter [deployment-charts] - 10https://gerrit.wikimedia.org/r/957311 (owner: 10DCausse) [16:00:52] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:01:01] (03CR) 10AOkoth: [C: 03+2] vrts: apply role and setup hiera values [puppet] - 10https://gerrit.wikimedia.org/r/953631 (https://phabricator.wikimedia.org/T340027) (owner: 10AOkoth) [16:03:43] (03CR) 10Bking: [C: 03+2] flink-app: Allow declaring zookeeper clusters by name [deployment-charts] - 10https://gerrit.wikimedia.org/r/955033 (owner: 10Ebernhardson) [16:04:11] !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netmon2002.wikimedia.org with OS bookworm [16:04:16] !log aokoth@cumin1001 START - Cookbook sre.hosts.downtime for 30 days, 
0:00:00 on vrts1002.eqiad.wmnet with reason: Testing [16:04:34] (03Merged) 10jenkins-bot: flink-app: Allow declaring zookeeper clusters by name [deployment-charts] - 10https://gerrit.wikimedia.org/r/955033 (owner: 10Ebernhardson) [16:04:40] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on vrts1002.eqiad.wmnet with reason: Testing [16:07:33] (JobUnavailable) firing: (3) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:11:00] 10SRE, 10ops-codfw, 10serviceops: Fail event on /dev/md/0:kubernetes2028 - https://phabricator.wikimedia.org/T345853 (10Jhancock.wm) 05Open→03Resolved @Vgutierrez faulty disk has been replaced and I see two disks on the server now. returning the bad disk to dell under 783662118185 [16:15:37] (03PS7) 10DCausse: [WIP] flink-kubernetes-operator: use networkpolicy_1.2.0 and configure zookeepter [deployment-charts] - 10https://gerrit.wikimedia.org/r/957311 [16:16:03] (03PS1) 10AOkoth: wmnet: add ticket-test -> vrts1002 [dns] - 10https://gerrit.wikimedia.org/r/957322 [16:18:19] (03CR) 10Vgutierrez: [C: 03+2] "after double checking with bblack it looks like we (traffic) haven't made in the past any effort to steer traffic from eqiad to other US D" [dns] - 10https://gerrit.wikimedia.org/r/931992 (https://phabricator.wikimedia.org/T337318) (owner: 10Jameel Kaisar) [16:23:48] (03CR) 10CI reject: [V: 04-1] [WIP] flink-kubernetes-operator: use networkpolicy_1.2.0 and configure zookeepter [deployment-charts] - 10https://gerrit.wikimedia.org/r/957311 (owner: 10DCausse) [16:25:14] (03CR) 10BCornwall: varnish: add more domains for mobile redirect (*.wikimedia.org) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/957292 (https://phabricator.wikimedia.org/T344175) (owner: 10Fabfur) [16:27:05] 
(03CR) 10BCornwall: varnish: add more domains for mobile redirect (*.wikimedia.org) (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/957292 (https://phabricator.wikimedia.org/T344175) (owner: 10Fabfur) [16:28:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (10VRiley-WMF) db1244 - C 6. U 09. port 6 CableID 3188 db1245 - C 6. U 10. port 7 CableID 3189 db1246 - D 3. U 03. port 0 CableID 3377 db1247 - D 3. U 04. port 1 CableID 3378 db12... [16:32:42] (03PS9) 10Brouberol: sre.opensearch.roll-restart-reboot: Define the opensearch service name as a pattern [cookbooks] - 10https://gerrit.wikimedia.org/r/956916 (https://phabricator.wikimedia.org/T344798) [16:33:33] (03PS1) 10Brouberol: Enable cumin hosts to reach the opensearch API on logstash clusters [puppet] - 10https://gerrit.wikimedia.org/r/957324 [16:33:42] (03PS10) 10Brouberol: sre.opensearch.roll-restart-reboot: Define the opensearch service name as a pattern [cookbooks] - 10https://gerrit.wikimedia.org/r/956916 (https://phabricator.wikimedia.org/T344798) [16:33:44] (03CR) 10Subramanya Sastry: mariadb: Add grants for testreduce1002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/957251 (https://phabricator.wikimedia.org/T345220) (owner: 10Ladsgroup) [16:34:03] (03PS2) 10Brouberol: Enable cumin hosts to reach the opensearch API on logstash clusters [puppet] - 10https://gerrit.wikimedia.org/r/957324 (https://phabricator.wikimedia.org/T344798) [16:34:10] !log denisse@deploy1002 Started deploy [librenms/librenms@f049593]: Upgrade LibreNMS to 23.8.2 - T344136 [16:34:17] T344136: Upgrade LibreNMS to 23.7.0 or higher - https://phabricator.wikimedia.org/T344136 [16:34:27] !log denisse@deploy1002 Finished deploy [librenms/librenms@f049593]: Upgrade LibreNMS to 23.8.2 - T344136 (duration: 00m 16s) [16:34:48] (03PS4) 10Fabfur: varnish: add more domains for mobile redirect (*.wikimedia.org) [puppet] - 
10https://gerrit.wikimedia.org/r/957292 (https://phabricator.wikimedia.org/T344175) [16:35:43] (03PS3) 10Brouberol: Enable cumin hosts to reach the opensearch API on logstash clusters [puppet] - 10https://gerrit.wikimedia.org/r/957324 (https://phabricator.wikimedia.org/T344798) [16:38:02] (03CR) 10CI reject: [V: 04-1] Enable cumin hosts to reach the opensearch API on logstash clusters [puppet] - 10https://gerrit.wikimedia.org/r/957324 (https://phabricator.wikimedia.org/T344798) (owner: 10Brouberol) [16:40:37] (03CR) 10Ladsgroup: mariadb: Add grants for testreduce1002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/957251 (https://phabricator.wikimedia.org/T345220) (owner: 10Ladsgroup) [16:43:38] 10SRE, 10Traffic, 10Patch-For-Review: Varnish mobile redirection misses some sites - https://phabricator.wikimedia.org/T344175 (10Fabfur) We started adding the `*.wikimedia.org` domains to the Varnish configuration, some notes: Currently we have these domains without a mobile (m..wikimedia.org) coun... 
[16:44:25] (03PS3) 10Andrea Denisse: wikimedia: Failover LibreNMS from eqiad to codfw [dns] - 10https://gerrit.wikimedia.org/r/956455 (https://phabricator.wikimedia.org/T344136) [16:44:52] (03CR) 10Subramanya Sastry: mariadb: Add grants for testreduce1002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/957251 (https://phabricator.wikimedia.org/T345220) (owner: 10Ladsgroup) [16:44:54] (03CR) 10Andrea Denisse: [C: 03+2] netmon: Failover from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/956452 (https://phabricator.wikimedia.org/T344136) (owner: 10Andrea Denisse) [16:46:45] (03CR) 10Andrea Denisse: [C: 03+2] wikimedia: Failover LibreNMS from eqiad to codfw [dns] - 10https://gerrit.wikimedia.org/r/956455 (https://phabricator.wikimedia.org/T344136) (owner: 10Andrea Denisse) [16:50:12] (03PS8) 10Ryan Kemper: [WIP] flink-kubernetes-operator: use networkpolicy_1.2.0 and configure zookeeper [deployment-charts] - 10https://gerrit.wikimedia.org/r/957311 (owner: 10DCausse) [16:54:34] PROBLEM - Check systemd state on netmon1003 is CRITICAL: CRITICAL - degraded: The following units failed: librenms-poller-all.service,librenms-poller-all.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:57:33] (JobUnavailable) firing: (4) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:59:15] (03CR) 10CI reject: [V: 04-1] [WIP] flink-kubernetes-operator: use networkpolicy_1.2.0 and configure zookeeper [deployment-charts] - 10https://gerrit.wikimedia.org/r/957311 (owner: 10DCausse) [17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230913T1700) [17:00:17] PROBLEM - Check systemd state on netmon2002 is CRITICAL: CRITICAL - degraded: The following units failed: 
rancid-differ.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:02:51] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:03:53] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:07:33] (JobUnavailable) firing: (5) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:08:38] (Wikidata Reliability Metrics - Median loading time alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [17:14:12] (03PS9) 10DCausse: [WIP] flink-kubernetes-operator: use networkpolicy_1.2.0 and configure zookeeper [deployment-charts] - 10https://gerrit.wikimedia.org/r/957311 [17:14:51] (03PS10) 10DCausse: flink-kubernetes-operator: use networkpolicy_1.2.0 and configure zookeeper [deployment-charts] - 10https://gerrit.wikimedia.org/r/957311 [17:22:33] (JobUnavailable) firing: (7) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:22:34] !log fnegri@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet2006-dev.codfw.wmnet with OS bookworm [17:23:32] (03CR) 10Cwhite: "LGTM modulo appeasing Jenkins" [puppet] - 10https://gerrit.wikimedia.org/r/957324 (https://phabricator.wikimedia.org/T344798) (owner: 10Brouberol) [17:26:46] (03PS4) 10Brouberol: Enable 
cumin hosts to reach the opensearch API on logstash clusters [puppet] - 10https://gerrit.wikimedia.org/r/957324 (https://phabricator.wikimedia.org/T344798) [17:27:39] (03PS5) 10Brouberol: Enable cumin hosts to reach the opensearch API on logstash clusters [puppet] - 10https://gerrit.wikimedia.org/r/957324 (https://phabricator.wikimedia.org/T344798) [17:27:55] (03CR) 10Brouberol: Enable cumin hosts to reach the opensearch API on logstash clusters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/957324 (https://phabricator.wikimedia.org/T344798) (owner: 10Brouberol) [17:28:38] (Wikidata Reliability Metrics - Median loading time alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [17:28:44] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS bookworm [17:28:56] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp4052.ulsfo.wmnet with OS bookworm [17:37:26] (03PS20) 10Slyngshede: P:idm allow for installation via Debian packages. [puppet] - 10https://gerrit.wikimedia.org/r/956836 [17:39:06] (03PS21) 10Slyngshede: P:idm allow for installation via Debian packages. [puppet] - 10https://gerrit.wikimedia.org/r/956836 [17:40:11] (03CR) 10Cwhite: [C: 03+1] "LGTM! Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/957324 (https://phabricator.wikimedia.org/T344798) (owner: 10Brouberol) [17:40:13] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43260/console" [puppet] - 10https://gerrit.wikimedia.org/r/956836 (owner: 10Slyngshede) [17:42:09] !log fnegri@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudnet2006-dev.codfw.wmnet with reason: host reimage [17:42:39] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [17:43:23] (03PS1) 10Ahmon Dancy: Sync ldap/ops into GitLab repos/sre group [puppet] - 10https://gerrit.wikimedia.org/r/957329 (https://phabricator.wikimedia.org/T343035) [17:43:47] (03CR) 10CI reject: [V: 04-1] Sync ldap/ops into GitLab repos/sre group [puppet] - 10https://gerrit.wikimedia.org/r/957329 (https://phabricator.wikimedia.org/T343035) (owner: 10Ahmon Dancy) [17:45:10] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet2006-dev.codfw.wmnet with reason: host reimage [17:45:50] (03PS2) 10Ahmon Dancy: Sync ldap/ops into GitLab repos/sre group [puppet] - 10https://gerrit.wikimedia.org/r/957329 (https://phabricator.wikimedia.org/T343035) [17:47:36] (03PS1) 10Ayounsi: Add esams RIPE Atlas to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/957330 (https://phabricator.wikimedia.org/T307021) [18:00:05] jnuche and hashar: Time to snap out of that daydream and deploy Train log triage with CPT. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230913T1800). 
[18:01:20] (03PS4) 10Btullis: Refactor spark support to build multiple minor versions [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/956374 (https://phabricator.wikimedia.org/T344910) [18:01:38] (03CR) 10Jelto: [C: 03+1] "lgtm" [dns] - 10https://gerrit.wikimedia.org/r/957322 (owner: 10AOkoth) [18:02:23] (03PS5) 10Btullis: Refactor spark support to build multiple minor versions [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/956374 (https://phabricator.wikimedia.org/T344910) [18:02:47] (03CR) 10Muehlenhoff: [C: 04-1] Enable cumin hosts to reach the opensearch API on logstash clusters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/957324 (https://phabricator.wikimedia.org/T344798) (owner: 10Brouberol) [18:03:33] PROBLEM - MariaDB Replica Lag: m1 on db2132 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 738.95 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:05:02] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudnet2006-dev.codfw.wmnet with OS bookworm [18:07:27] (03CR) 10Btullis: "I'm going to proceed with this method for now." 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/956374 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [18:08:31] (03CR) 10Btullis: Refactor spark support to build multiple minor versions (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/956374 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [18:16:03] RECOVERY - MariaDB Replica Lag: m1 on db2132 is OK: OK slave_sql_lag Replication lag: 0.29 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:17:06] (03CR) 10Fabfur: varnish: add more domains for mobile redirect (*.wikimedia.org) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/957292 (https://phabricator.wikimedia.org/T344175) (owner: 10Fabfur) [18:18:42] 10SRE, 10Cloud-VPS, 10User-aborrero: Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org, toolserver.org - https://phabricator.wikimedia.org/T346177 (10cmooney) @aborerro I think it's probably relatively safe to close this in the morning. Changes went through in the... 
[18:19:36] !log resuming bootstrap of restbase1030-c — [18:19:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:47] PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:24:13] PROBLEM - MariaDB Replica Lag: m1 on db2160 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1978.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:25:13] RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:31:06] 10SRE, 10Cloud-VPS, 10User-aborrero: Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org, toolserver.org - https://phabricator.wikimedia.org/T346177 (10cmooney) p:05High→03Low [18:33:38] !log restarting restbase service (restbase1031) — T331713 [18:33:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:48] T331713: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 [18:35:10] (03CR) 10Slyngshede: [V: 03+1] P:idm allow for installation via Debian packages. 
(034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/956836 (owner: 10Slyngshede) [18:35:17] PROBLEM - MariaDB Replica Lag: m1 on db1217 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 876.48 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:35:47] PROBLEM - MariaDB Replica Lag: m1 on db2132 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 907.10 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:36:20] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43266/console" [puppet] - 10https://gerrit.wikimedia.org/r/956836 (owner: 10Slyngshede) [18:38:04] (03CR) 10Slyngshede: [V: 03+1] P:idm allow for installation via Debian packages. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/956836 (owner: 10Slyngshede) [18:38:08] !log run schema migrations for librenms on m1 (backdated, started ~1h ago) [18:38:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:13] (ATSBackendErrorsHigh) firing: ATS: elevated 5xx errors from restbase.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-datasource=eqiad%20prometheus/ops&var-cluster=text&var-origin=restbase.discovery.wmnet&editPanel=12 - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [18:41:17] (03CR) 10Neil Shah-Quinn (WMF): varnish: add more domains for mobile redirect (*.wikimedia.org) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/957292 (https://phabricator.wikimedia.org/T344175) (owner: 10Fabfur) [18:42:19] RECOVERY - MariaDB Replica Lag: m1 on db1217 is OK: OK slave_sql_lag Replication lag: 0.22 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:42:38] !log stopping bootstrap of 
restbase1030-c — T331713 [18:42:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:42] T331713: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 [18:44:13] (ATSBackendErrorsHigh) firing: (2) ATS: elevated 5xx errors from restbase.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [18:44:15] RECOVERY - MariaDB Replica Lag: m1 on db2132 is OK: OK slave_sql_lag Replication lag: 36.61 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:44:32] !incidents [18:44:32] 4035 (UNACKED) ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet eqiad) [18:44:32] 4033 (RESOLVED) ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet esams) [18:44:50] !ack 4035 [18:44:50] 4035 (ACKED) ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet eqiad) [18:44:57] (03CR) 10Xcollazo: [C: 03+1] "LGTM!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/956374 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [18:45:49] denisse: I guess that was me... though for the life of me I do not understand the problem here [18:46:21] 10SRE, 10Infrastructure-Foundations, 10vm-requests: 1 codfw VM requested for search-loader - https://phabricator.wikimedia.org/T346272 (10bking) [18:46:30] urandom: No worries, I'm taking a look at it. :) [18:47:00] it should start clearing up [18:47:01] 10SRE, 10Infrastructure-Foundations, 10vm-requests: 1 codfw VM requested for search-loader - https://phabricator.wikimedia.org/T346272 (10bking) [18:47:11] again...why? no clue. 
[18:47:40] (03CR) 10Cwhite: [C: 03+1] Enable cumin hosts to reach the opensearch API on logstash clusters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/957324 (https://phabricator.wikimedia.org/T344798) (owner: 10Brouberol) [18:49:12] !log brett@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp4052.ulsfo.wmnet with OS bookworm [18:49:13] (ATSBackendErrorsHigh) resolved: (2) ATS: elevated 5xx errors from restbase.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [18:49:20] 🤔 [18:49:21] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp4052.ulsfo.wmnet with OS bookworm executed with errors: - cp4052 (**FAIL**) - Downtimed on Ic... [18:49:28] sorry for the delay, my laptop was having wifi woes [18:51:26] !log initiating rebuild of restbase1018-a [18:51:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:56] urandom: do we think that the pages for restbase 5xx will continue as you rebuild nodes? [18:52:57] cdanis: I don't think so, no. But I don't know why the bootstrap of 1030-c is causing them [18:53:08] I love distributed systems [18:53:09] so I'll proceed cautiously [18:53:11] yeah. 
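Note on the quorum arithmetic raised in the messages that follow: a minimal sketch, assuming "quorum" means a strict majority of replicas (floor(RF / 2) + 1). The function names `quorum` and `meets_quorum` are hypothetical and exist only for illustration; this does not reproduce how restbase or its storage backend counts responses, and it does not explain the error seen here.

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a strict majority."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return replication_factor // 2 + 1


def meets_quorum(responses: int, replication_factor: int) -> bool:
    """True when the number of replica responses reaches a majority."""
    return responses >= quorum(replication_factor)


if __name__ == "__main__":
    # The case from the log: 2 answers out of 3 replicas.
    print(quorum(3))
    print(meets_quorum(2, 3))
```

Under this definition two responses out of three replicas satisfy a quorum, which is why the reported failure is surprising.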
[18:54:15] the error that's causing these errors is a "failure to achieve local_quorum", where it is "only" getting 2 answers [18:54:17] (03PS5) 10Fabfur: varnish: add more domains for mobile redirect (*.wikimedia.org) [puppet] - 10https://gerrit.wikimedia.org/r/957292 (https://phabricator.wikimedia.org/T344175) [18:54:30] but two *is* a quorum for three replicas [18:55:14] (03CR) 10Fabfur: varnish: add more domains for mobile redirect (*.wikimedia.org) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/957292 (https://phabricator.wikimedia.org/T344175) (owner: 10Fabfur) [19:00:47] !log initiating rebuild of restbase1025-a [19:00:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:45] 10SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad: 1 VM requested for search-loader - https://phabricator.wikimedia.org/T346273 (10bking) [19:08:04] 10SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad: 1 VM requested for search-loader - https://phabricator.wikimedia.org/T346273 (10bking) [19:08:19] !log initiating rebuild of restbase1026-a [19:08:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:11] !log initiating rebuild of restbase1027-a & restbase1033-a [19:09:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:49] !log denisse@cumin1001 START - Cookbook sre.hosts.reimage for host netmon1003.wikimedia.org with OS bookworm [19:10:46] (03PS1) 10Bking: site.pp: add new search-loader hostnames [puppet] - 10https://gerrit.wikimedia.org/r/957336 (https://phabricator.wikimedia.org/T346039) [19:11:55] (03PS2) 10Bking: site.pp: add new search-loader hostnames [puppet] - 10https://gerrit.wikimedia.org/r/957336 (https://phabricator.wikimedia.org/T346039) [19:14:54] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS bookworm [19:15:04] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - 
https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp4052.ulsfo.wmnet with OS bookworm [19:16:26] (03CR) 10Bking: [C: 03+2] wdqs: silence alerts on new hosts [puppet] - 10https://gerrit.wikimedia.org/r/954350 (https://phabricator.wikimedia.org/T345475) (owner: 10Ryan Kemper) [19:16:48] (03CR) 10Thcipriani: [C: 03+1] Sync ldap/ops into GitLab repos/sre group [puppet] - 10https://gerrit.wikimedia.org/r/957329 (https://phabricator.wikimedia.org/T343035) (owner: 10Ahmon Dancy) [19:17:33] (JobUnavailable) firing: (9) Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:21:46] !log denisse@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on netmon1003.wikimedia.org with reason: host reimage [19:22:33] (JobUnavailable) firing: (9) Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:24:16] !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netmon1003.wikimedia.org with reason: host reimage [19:30:25] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-int_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:32:09] (03PS1) 10Ladsgroup: Enable pagelinks write both in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957339 (https://phabricator.wikimedia.org/T345732) [19:35:14] ACKNOWLEDGEMENT - MD RAID on netmon1003 is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.154.141. 
Check system logs on 208.80.154.141 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T346275 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [19:35:23] 10SRE, 10ops-eqiad: Degraded RAID on netmon1003 - https://phabricator.wikimedia.org/T346275 (10ops-monitoring-bot) [19:35:52] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:37:00] RECOVERY - MariaDB Replica Lag: m1 on db2160 is OK: OK slave_sql_lag Replication lag: 0.10 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:40:16] (KubernetesRsyslogDown) firing: rsyslog on kubernetes1024:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes1024 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [19:40:21] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4052.ulsfo.wmnet with reason: host reimage [19:43:27] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4052.ulsfo.wmnet with reason: host reimage [19:43:45] RECOVERY - Check systemd state on netmon1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:44:45] !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netmon1003.wikimedia.org with OS bookworm [19:44:55] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:45:39] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:46:33] (03PS1) 10BBlack: fe_mem_gb_reserved: merge esams settings [nop] [puppet] - 10https://gerrit.wikimedia.org/r/957343 [19:46:35] (03PS1) 10BBlack: fe_mem_gb_reserved:170 for all single-backend [puppet] - 10https://gerrit.wikimedia.org/r/957344 [19:46:51] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 1.479 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:46:53] PROBLEM - LibreNMS HTTPS on netmon2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LibreNMS [19:47:09] (03PS1) 10BBlack: beta: haproxy->varnish single UDS config [puppet] - 10https://gerrit.wikimedia.org/r/957345 (https://phabricator.wikimedia.org/T333965) [19:47:11] (03PS1) 10BBlack: Varnish: listen on only 1x UDS [puppet] - 10https://gerrit.wikimedia.org/r/957346 (https://phabricator.wikimedia.org/T333965) [19:47:35] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50569 bytes in 1.448 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:48:35] PROBLEM - librenms.wikimedia.org tls expiry on netmon2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [19:48:53] RECOVERY - LibreNMS HTTPS on netmon2002 is OK: HTTP OK: HTTP/1.1 302 Found - 661 bytes in 6.711 second response time https://wikitech.wikimedia.org/wiki/LibreNMS [19:50:39] RECOVERY - librenms.wikimedia.org tls expiry on netmon2002 is OK: OK - Certificate librenms.wikimedia.org will expire on Sat 25 Nov 2023 11:20:32 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [19:51:27] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:51:33] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:56:43] (CR) Brouberol: Enable cumin hosts to reach the opensearch API on logstash clusters (1 comment) [puppet] - https://gerrit.wikimedia.org/r/957324 (https://phabricator.wikimedia.org/T344798) (owner: Brouberol) [19:59:01] (CR) Brouberol: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43267/console" [puppet] - https://gerrit.wikimedia.org/r/957324 (https://phabricator.wikimedia.org/T344798) (owner: Brouberol) [20:00:07] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: Your horoscope predicts another unfortunate UTC late backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230913T2000). [20:00:07] No Gerrit patches in the queue for this window AFAICS. [20:00:17] indeed!
[20:00:20] mhm [20:00:27] hey TheresNoTime :) [20:00:29] o/ [20:03:47] PROBLEM - librenms.wikimedia.org requires authentication on netmon2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [20:07:41] RECOVERY - librenms.wikimedia.org requires authentication on netmon2002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 661 bytes in 5.191 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [20:10:25] (PS1) BBlack: varnish: only listen on a single TCP port [puppet] - https://gerrit.wikimedia.org/r/957348 (https://phabricator.wikimedia.org/T333965) [20:10:27] (PS1) BBlack: varnish: remove TCP monitoring [puppet] - https://gerrit.wikimedia.org/r/957349 (https://phabricator.wikimedia.org/T333965) [20:10:29] (PS1) BBlack: varnish: limit TCP listener to localhost [puppet] - https://gerrit.wikimedia.org/r/957350 (https://phabricator.wikimedia.org/T333965) [20:13:51] (CR) BCornwall: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43268/console" [puppet] - https://gerrit.wikimedia.org/r/957343 (owner: BBlack) [20:14:05] (CR) BCornwall: [V: +1 C: +1] fe_mem_gb_reserved: merge esams settings [nop] [puppet] - https://gerrit.wikimedia.org/r/957343 (owner: BBlack) [20:15:52] (CR) BCornwall: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43269/console" [puppet] - https://gerrit.wikimedia.org/r/957344 (owner: BBlack) [20:16:33] (CR) BCornwall: [V: +1 C: +1] fe_mem_gb_reserved:170 for all single-backend [puppet] - https://gerrit.wikimedia.org/r/957344 (owner: BBlack) [20:20:29] (CR) BCornwall: [V: +1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43270/console" [puppet] - https://gerrit.wikimedia.org/r/957346
(https://phabricator.wikimedia.org/T333965) (owner: BBlack) [20:20:54] (PS2) BBlack: varnish: remove TCP monitoring [puppet] - https://gerrit.wikimedia.org/r/957349 (https://phabricator.wikimedia.org/T333965) [20:20:55] (PS2) BBlack: varnish: only listen on a single TCP port [puppet] - https://gerrit.wikimedia.org/r/957348 (https://phabricator.wikimedia.org/T333965) [20:20:58] (PS2) BBlack: varnish: limit TCP listener to localhost [puppet] - https://gerrit.wikimedia.org/r/957350 (https://phabricator.wikimedia.org/T333965) [20:21:46] (CR) Fabfur: [C: +1] beta: haproxy->varnish single UDS config [puppet] - https://gerrit.wikimedia.org/r/957345 (https://phabricator.wikimedia.org/T333965) (owner: BBlack) [20:21:48] (CR) BCornwall: [V: +1 C: +1] Varnish: listen on only 1x UDS [puppet] - https://gerrit.wikimedia.org/r/957346 (https://phabricator.wikimedia.org/T333965) (owner: BBlack) [20:21:52] (CR) BCornwall: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43271/console" [puppet] - https://gerrit.wikimedia.org/r/957345 (https://phabricator.wikimedia.org/T333965) (owner: BBlack) [20:22:05] (PS1) Andrew Bogott: nova_fullstack_test: minor spelling fix [puppet] - https://gerrit.wikimedia.org/r/957351 [20:22:11] PROBLEM - LibreNMS HTTPS on netmon2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LibreNMS [20:22:28] (CR) BCornwall: [V: +1 C: +1] beta: haproxy->varnish single UDS config [puppet] - https://gerrit.wikimedia.org/r/957345 (https://phabricator.wikimedia.org/T333965) (owner: BBlack) [20:22:44] (CR) Andrew Bogott: [C: +2] nova_fullstack_test: minor spelling fix [puppet] - https://gerrit.wikimedia.org/r/957351 (owner: Andrew Bogott) [20:24:44] (CR) Andrew Bogott: [C: +2] mwopenstackclients: add methods to correlate project id with name [puppet] -
https://gerrit.wikimedia.org/r/956927 (https://phabricator.wikimedia.org/T343158) (owner: Andrew Bogott) [20:24:51] RECOVERY - LibreNMS HTTPS on netmon2002 is OK: HTTP OK: HTTP/1.1 302 Found - 661 bytes in 0.738 second response time https://wikitech.wikimedia.org/wiki/LibreNMS [20:26:45] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:27:11] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:29:37] (CR) Andrew Bogott: [C: +2] wmcs-cold-migrate: remove instance_fqdn output hint [puppet] - https://gerrit.wikimedia.org/r/956928 (https://phabricator.wikimedia.org/T343158) (owner: Andrew Bogott) [20:33:27] PROBLEM - LibreNMS HTTPS on netmon2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LibreNMS [20:37:15] (Abandoned) Andrew Bogott: wmcs-instance-fqdns: support cases where project_name != project_id [puppet] - https://gerrit.wikimedia.org/r/956929 (https://phabricator.wikimedia.org/T343158) (owner: Andrew Bogott) [20:37:39] RECOVERY - LibreNMS HTTPS on netmon2002 is OK: HTTP OK: HTTP/1.1 302 Found - 661 bytes in 7.717 second response time https://wikitech.wikimedia.org/wiki/LibreNMS [20:37:46] (PS1) BBlack: fe_mem_gb_reserved:170 for test hosts in other dcs [puppet] - https://gerrit.wikimedia.org/r/957352 [20:39:15] (CR) Cwhite: [C: +1] "NOOP puppet compiler is unexpected..."
[puppet] - https://gerrit.wikimedia.org/r/957324 (https://phabricator.wikimedia.org/T344798) (owner: Brouberol) [20:41:09] (CR) Muehlenhoff: [C: -1] Enable cumin hosts to reach the opensearch API on logstash clusters (1 comment) [puppet] - https://gerrit.wikimedia.org/r/957324 (https://phabricator.wikimedia.org/T344798) (owner: Brouberol) [20:45:47] (CR) Cwhite: Enable cumin hosts to reach the opensearch API on logstash clusters (1 comment) [puppet] - https://gerrit.wikimedia.org/r/957324 (https://phabricator.wikimedia.org/T344798) (owner: Brouberol) [20:50:45] PROBLEM - LibreNMS HTTPS on netmon2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LibreNMS [20:51:47] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:52:03] RECOVERY - LibreNMS HTTPS on netmon2002 is OK: HTTP OK: HTTP/1.1 302 Found - 661 bytes in 0.326 second response time https://wikitech.wikimedia.org/wiki/LibreNMS [20:53:15] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:00:07] Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230913T2100) [21:13:38] (Wikidata Reliability Metrics - Median loading time alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [21:18:18] (PS11) Ebernhardson: flink-kubernetes-operator: use networkpolicy_1.2.0 and configure zookeeper [deployment-charts] - https://gerrit.wikimedia.org/r/957311 (owner: DCausse) [21:18:20] (CR) Ebernhardson: flink-kubernetes-operator: use networkpolicy_1.2.0 and configure zookeeper (2 comments)
[deployment-charts] - https://gerrit.wikimedia.org/r/957311 (owner: DCausse) [21:21:06] (CR) Ebernhardson: flink-kubernetes-operator: use networkpolicy_1.2.0 and configure zookeeper (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/957311 (owner: DCausse) [21:21:11] !log bking@deploy1002 Started deploy [wdqs/wdqs@16e3dcf]: 0.3.129 use allowlist T344284 [21:21:18] T344284: Rename usages of whitelist to allowlist in query service rdf repo - https://phabricator.wikimedia.org/T344284 [21:22:10] !log bking@deploy1002 Finished deploy [wdqs/wdqs@16e3dcf]: 0.3.129 use allowlist T344284 (duration: 00m 59s) [21:24:09] !log bking@deploy1002 Started deploy [wdqs/wdqs@3e0a913]: 0.3.129 use allowlist T344284 [21:27:39] (KeyholderUnarmed) firing: (2) 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [21:29:01] ^ Working on it. [21:33:38] (Wikidata Reliability Metrics - Median loading time alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [21:35:37] !log bking@deploy1002 Finished deploy [wdqs/wdqs@3e0a913]: 0.3.129 use allowlist T344284 (duration: 11m 27s) [21:35:41] T344284: Rename usages of whitelist to allowlist in query service rdf repo - https://phabricator.wikimedia.org/T344284 [21:36:24] PROBLEM - MariaDB Replica Lag: s1 #page on db1128 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 875.16 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:37:39] (KeyholderUnarmed) resolved: (2) 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [21:41:49] !incidents [21:41:49] 4036 (UNACKED) db1128 (paged)/MariaDB Replica Lag: s1 (paged) [21:41:49] 4035 (RESOLVED) ATSBackendErrorsHigh cache_text sre
(restbase.discovery.wmnet eqiad) [21:41:50] 4033 (RESOLVED) ATSBackendErrorsHigh cache_text sre (restbase.discovery.wmnet esams) [21:41:59] !ack 4036 [21:41:59] 4036 (ACKED) db1128 (paged)/MariaDB Replica Lag: s1 (paged) [21:43:17] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp4052 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [21:43:17] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp4052 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [21:43:19] PROBLEM - haproxy process on cp4052 is CRITICAL: PROCS CRITICAL: 0 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [21:43:19] PROBLEM - Check systemd state on cp4052 is CRITICAL: CRITICAL - degraded: The following units failed: haproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:44:47] * denisse looking at 4036 [21:45:59] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:46:21] denisse: yo need a hand? [21:46:32] cwhite: Yes, please!! [21:46:42] I'm looking at our docs: https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Replication_lag [21:47:00] denisse: you want this one: https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica [21:47:23] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:47:34] cwhite: Looking at it, thank you. 
:) [21:48:03] PROBLEM - LibreNMS HTTPS sl expiry on netmon2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LibreNMS [21:48:33] !log depooling db1128 [21:48:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:21] RECOVERY - LibreNMS HTTPS sl expiry on netmon2002 is OK: OK - Certificate librenms.wikimedia.org will expire on Sat 25 Nov 2023 11:20:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/LibreNMS [21:49:30] !log denisse@cumin1001 dbctl commit (dc=all): 'Depool db1128', diff saved to https://phabricator.wikimedia.org/P52504 and previous config saved to /var/cache/conftool/dbconfig/20230913-214930-denisse.json [21:50:03] !log downtiming db1128 [21:50:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:29] yeah, depooling db1128 is enough for now [21:50:38] thanks [21:50:41] !log denisse@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on db1128.eqiad.wmnet with reason: HW issues [21:50:53] that confirms that 10.4.31 has issues [21:50:55] !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1128.eqiad.wmnet with reason: HW issues [21:51:31] bblack: are you looking at cp4052? [21:51:36] T345509 [21:51:37] T345509: db1128 mariadb crashed - https://phabricator.wikimedia.org/T345509 [21:51:52] go get rest Manuel! [21:52:22] maybe brett? [21:52:41] ? [21:52:54] no db work from me! [21:52:54] I was about to open task but there is one now. [21:52:57] brett: cp4052? [21:53:04] oh yeah [21:53:07] cwhite: Thanks!! :D [21:53:37] (PS1) Ladsgroup: db1128: Disable notification [puppet] - https://gerrit.wikimedia.org/r/957361 (https://phabricator.wikimedia.org/T345509) [21:53:49] denisse: Sorry for not claiming the errors [21:54:02] brett: new host needing something for ocsp? [21:54:10] brett: What errors?
:o [21:54:15] yeah :( [21:54:24] No worries, feel free to let me know if there's anything I can do to help. :) [21:54:40] ack, sounds like you got it in hand :) [21:54:44] Make sure to resolve the db1128 page otherwise it pages tomorrow [21:54:47] not just ack it [21:55:08] (CR) Ladsgroup: [V: +2 C: +2] db1128: Disable notification [puppet] - https://gerrit.wikimedia.org/r/957361 (https://phabricator.wikimedia.org/T345509) (owner: Ladsgroup) [21:55:18] Amir1: Sure, that'd be from inside SplunkOnCall, right? [21:55:25] yup [21:55:33] (PS1) Fabfur: add support for unix sockets [software/purged] - https://gerrit.wikimedia.org/r/957362 [21:55:41] Resolved, thanks!! [21:55:59] I disabled notification for now [21:56:06] * Amir1 goes back to playing RE4 [21:56:16] call me if things go bad [21:59:12] (PS1) Andrew Bogott: designate-sink nova_fixed_multi: update _create context [puppet] - https://gerrit.wikimedia.org/r/957363 [22:02:49] (PS6) Srishakatux: Add Akan language [mediawiki-config] - https://gerrit.wikimedia.org/r/955007 (https://phabricator.wikimedia.org/T333765) [22:13:59] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:15:27] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:23:21] PROBLEM - librenms.wikimedia.org requires authentication on netmon2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [22:24:41] RECOVERY - librenms.wikimedia.org requires authentication on netmon2002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 661 bytes in 1.763 second response time
https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [22:27:29] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:34:01] (PS1) BBlack: OpenSSL 3 compat for update-ocsp script [draft] [puppet] - https://gerrit.wikimedia.org/r/957368 (https://phabricator.wikimedia.org/T342154) [22:34:39] (CR) Ryan Kemper: "LGTM, minus super minor nitpick regarding redundant brackets" [puppet] - https://gerrit.wikimedia.org/r/957336 (https://phabricator.wikimedia.org/T346039) (owner: Bking) [22:36:25] (CR) CI reject: [V: -1] OpenSSL 3 compat for update-ocsp script [draft] [puppet] - https://gerrit.wikimedia.org/r/957368 (https://phabricator.wikimedia.org/T342154) (owner: BBlack) [22:38:45] (PS2) BBlack: OpenSSL 3 compat for update-ocsp script [draft] [puppet] - https://gerrit.wikimedia.org/r/957368 (https://phabricator.wikimedia.org/T342154) [22:47:17] Maybe gonna get a page in a sec.. [22:48:35] Intermittent 503s [22:49:04] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:49:07] (ProbeDown) firing: (7) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:49:29] TheresNoTime: ACK'd.
:) [22:49:30] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw1351.eqiad.wmnet, mw1389.eqiad.wmnet, mw1433.eqiad.wmnet, mw1414.eqiad.wmnet, mw1417.eqiad.wmnet, mw1371.eqiad.wmnet, mw1365.eqiad.wmnet, mw1367.eqiad.wmnet, mw1453.eqiad.wmnet, mw1442.eqiad.wmnet, mw1434.eqiad.wmnet, mw1432.eqiad.wmnet, mw1478.eqiad.wmnet, mw1349.eqiad.wmnet, mw1384.eqiad.wmnet, mw1454.eqiad.wmnet, mw [22:49:30] ad.wmnet, mw1413.eqiad.wmnet, mw1456.eqiad.wmnet, mw1407.eqiad.wmnet, mw1430.eqiad.wmnet, mw1415.eqiad.wmnet, mw1476.eqiad.wmnet, mw1480.eqiad.wmnet, mw1441.eqiad.wmnet, mw1416.eqiad.wmnet, mw1405.eqiad.wmnet, mw1352.eqiad.wmnet, mw1399.eqiad.wmnet, mw1391.eqiad.wmnet, mw1435.eqiad.wmnet, mw1420.eqiad.wmnet, mw1366.eqiad.wmnet, mw1431.eqiad.wmnet, mw1355.eqiad.wmnet, mw1393.eqiad.wmnet, mw1488.eqiad.wmnet, mw1481.eqiad.wmnet, mw1418.eqiad [22:49:30] mw1487.eqiad.wmnet, mw1372.eqiad.wmnet, mw1370.eqiad.wmnet, mw1397.eqiad.wmnet, mw1477.eqiad.wmnet, mw1479.eqiad.wmnet, mw1451.eqiad.wmnet, mw1496.eqiad.wmnet, mw1473.eqiad.wmnet, mw140 https://wikitech.wikimedia.org/wiki/PyBal [22:49:44] (VarnishUnavailable) firing: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [22:49:45] (HaproxyUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [22:50:07] (ProbeDown) firing: (7) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:50:09] (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [22:50:29] Looking at it... [22:50:30] PROBLEM - PyBal backends health check on lvs5006 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp5024.eqsin.wmnet, cp5021.eqsin.wmnet, cp5018.eqsin.wmnet, cp5019.eqsin.wmnet, cp5023.eqsin.wmnet are marked down but pooled: textlb_443: Servers cp5017.eqsin.wmnet, cp5018.eqsin.wmnet, cp5021.eqsin.wmnet, cp5019.eqsin.wmnet, cp5023.eqsin.wmnet, cp5020.eqsin.wmnet are marked down but pooled: testlb6_443: Servers cp5024.eqsin.wmne [22:50:30] 3.eqsin.wmnet, cp5017.eqsin.wmnet, cp5022.eqsin.wmnet, cp5020.eqsin.wmnet, cp5019.eqsin.wmnet are marked down but pooled: textlb6_443: Servers cp5017.eqsin.wmnet, cp5024.eqsin.wmnet, cp5021.eqsin.wmnet, cp5018.eqsin.wmnet, cp5020.eqsin.wmnet, cp5023.eqsin.wmnet, cp5019.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:50:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [22:50:54] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.130 for 1.3.6.1.2.1.2.2.1.7 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:51:02] PROBLEM - Router interfaces on cr3-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.131 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:51:16] (PHPFPMTooBusy) firing: Not enough idle php7.4-fpm.service 
workers for Mediawiki appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-3h&orgId=1&to=now&var-cluster=appserver&var-site=eqiad&viewPanel=64 - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [22:51:50] RECOVERY - PyBal backends health check on lvs5006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:52:09] (MediaWikiLatencyExceeded) firing: (2) Average latency high: codfw mw-web (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:52:09] (HttpdUnreachable) firing: httpd unavailable for deployment mw-web at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=257&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DHttpdUnreachable [22:52:10] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:52:16] (AppserversUnreachable) firing: Appserver unavailable for cluster appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [22:52:16] (MediaWikiLatencyExceeded) firing: (3) Average latency high: codfw appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:52:20] RECOVERY - Router interfaces on cr3-eqsin is OK: OK: host 103.102.166.131, interfaces up: 71, down: 0, dormant: 0, 
excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:54:07] (ProbeDown) resolved: (21) Service commons.wikimedia.org:443 has failed probes (http_commons_wikimedia_org_ip4) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:55:07] (ProbeDown) resolved: (19) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:55:09] (PHPFPMTooBusy) resolved: (2) Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [22:56:12] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:56:16] (PHPFPMTooBusy) resolved: Not enough idle php7.4-fpm.service workers for Mediawiki appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-3h&orgId=1&to=now&var-cluster=appserver&var-site=eqiad&viewPanel=64 - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [22:56:32] (PS3) BCornwall: OpenSSL 3 compat for update-ocsp script [puppet] - https://gerrit.wikimedia.org/r/957368 (https://phabricator.wikimedia.org/T342154) (owner: BBlack) [22:57:09] (MediaWikiLatencyExceeded) resolved: (2) Average latency high: codfw mw-web (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:57:09] (HttpdUnreachable) resolved: httpd unavailable for deployment mw-web at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers -
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=257&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DHttpdUnreachable [22:57:16] (AppserversUnreachable) resolved: Appserver unavailable for cluster appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [22:57:16] (MediaWikiLatencyExceeded) resolved: (3) Average latency high: codfw appserver GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:57:24] RECOVERY - haproxy process on cp4052 is OK: PROCS OK: 2 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [22:57:26] RECOVERY - Check systemd state on cp4052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:57:44] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4052.ulsfo.wmnet with OS bookworm [22:57:54] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp4052.ulsfo.wmnet with OS bookworm executed with errors: - cp4052 (**FAIL**) - Removed from Pu...
[22:58:26] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp4052 is OK: SSL OK - OCSP staple validity for wikipedia.org has 471694 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-11-20 06:51:11 +0000 (expires in 67 days) https://wikitech.wikimedia.org/wiki/HTTPS [22:58:26] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp4052 is OK: SSL OK - OCSP staple validity for wikipedia.org has 471694 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-11-20 06:23:26 +0000 (expires in 67 days) https://wikitech.wikimedia.org/wiki/HTTPS [22:59:26] * TheresNoTime predicted those pages 😌 [22:59:38] You can call me the page whisperer /s [22:59:44] (VarnishUnavailable) resolved: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [22:59:45] (HaproxyUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [23:00:06] Thanks page whisperer!! 
[23:00:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [23:06:52] !log starting Cassandra node rebuilds, restbase/row D — T331713 [23:06:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:55] T331713: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 [23:09:15] (CR) BCornwall: [V: +1 C: +1] "Tested on a bookworm (3.0.9-1) release and an existing bullseye (1.1.1n-0+deb11u5) release." [puppet] - https://gerrit.wikimedia.org/r/957368 (https://phabricator.wikimedia.org/T342154) (owner: BBlack) [23:09:51] (CR) Andrew Bogott: [C: +2] designate-sink nova_fixed_multi: update _create context [puppet] - https://gerrit.wikimedia.org/r/957363 (owner: Andrew Bogott) [23:10:56] (PS6) Andrew Bogott: wmcs-novastats-dnsleaks.py: Support project_id != project_name [puppet] - https://gerrit.wikimedia.org/r/956930 (https://phabricator.wikimedia.org/T343158) [23:10:58] (PS5) Andrew Bogott: dynamicproxy: clarify that 'project name' was actually project_id all along [puppet] - https://gerrit.wikimedia.org/r/956925 (https://phabricator.wikimedia.org/T343158) [23:11:00] (PS1) Andrew Bogott: designate nova_fixed_multi: create A record using project_id and project_name [puppet] - https://gerrit.wikimedia.org/r/957371 (https://phabricator.wikimedia.org/T343158) [23:11:02] (CR) BCornwall: [V: +1 C: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43273/console" [puppet] - https://gerrit.wikimedia.org/r/957368 (https://phabricator.wikimedia.org/T342154) (owner: BBlack) [23:14:35] (CR) Andrew Bogott: [C: +2]
wmcs-novastats-dnsleaks.py: Support project_id != project_name [puppet] - https://gerrit.wikimedia.org/r/956930 (https://phabricator.wikimedia.org/T343158) (owner: Andrew Bogott) [23:14:56] (PS7) Andrew Bogott: wmcs-novastats-dnsleaks.py: Support project_id != project_name [puppet] - https://gerrit.wikimedia.org/r/956930 (https://phabricator.wikimedia.org/T343158) [23:14:58] (PS2) Andrew Bogott: designate nova_fixed_multi: create A record using project_id and project_name [puppet] - https://gerrit.wikimedia.org/r/957371 (https://phabricator.wikimedia.org/T343158) [23:15:00] (PS6) Andrew Bogott: dynamicproxy: clarify that 'project name' was actually project_id all along [puppet] - https://gerrit.wikimedia.org/r/956925 (https://phabricator.wikimedia.org/T343158) [23:18:40] (CR) Andrew Bogott: "@fnegri, I realized that since this code already supports adding multiple A records per host it would be a lot simpler to create two A rec" [puppet] - https://gerrit.wikimedia.org/r/957371 (https://phabricator.wikimedia.org/T343158) (owner: Andrew Bogott) [23:22:48] (JobUnavailable) firing: (8) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:29:08] PROBLEM - LibreNMS HTTPS sl expiry on netmon2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LibreNMS [23:30:30] RECOVERY - LibreNMS HTTPS sl expiry on netmon2002 is OK: OK - Certificate librenms.wikimedia.org will expire on Sat 25 Nov 2023 11:20:32 AM GMT +0000.
https://wikitech.wikimedia.org/wiki/LibreNMS [23:40:16] (KubernetesRsyslogDown) firing: rsyslog on kubernetes1024:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes1024 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:56:14] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:57:40] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase