[00:01:50] (03PS1) 10Andrea Denisse: netmon: Remove netmon2001 from DSH node group. [puppet] - 10https://gerrit.wikimedia.org/r/868204 (https://phabricator.wikimedia.org/T322695) [00:02:26] PROBLEM - Check systemd state on an-master1002 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-namenode-backup-fetchimage.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:03:30] RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:04:07] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:05:06] !log tgr@deploy1002 Synchronized php-1.40.0-wmf.14/extensions/GrowthExperiments/: Backport: [[gerrit:868052|User impact: read edit count from primary db in save complete hook (T324930)]] (duration: 07m 03s) [00:05:10] T324930: NewImpact: Cannot read properties of undefined (reading 'days') - https://phabricator.wikimedia.org/T324930 [00:05:36] (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38795/console" [puppet] - 10https://gerrit.wikimedia.org/r/868204 (https://phabricator.wikimedia.org/T322695) (owner: 10Andrea Denisse) [00:05:44] !log EU late backports done [00:05:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:06:39] (03CR) 10Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/output/868204/38795/" [puppet] - 10https://gerrit.wikimedia.org/r/868204 (https://phabricator.wikimedia.org/T322695) (owner: 10Andrea Denisse) [00:09:00] PROBLEM - Check systemd state on 
aphlict1001 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:09:11] (03PS1) 10Andrea Denisse: netmon: Remove netmon1002 from DSH node group [puppet] - 10https://gerrit.wikimedia.org/r/868207 (https://phabricator.wikimedia.org/T322321) [00:10:10] PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 83, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:10:36] (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38796/console" [puppet] - 10https://gerrit.wikimedia.org/r/868207 (https://phabricator.wikimedia.org/T322321) (owner: 10Andrea Denisse) [00:12:34] RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:14:15] !log cwhite@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host logstash2026.codfw.wmnet with OS bullseye [00:15:08] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash2026.codfw.wmnet with OS bullseye [00:16:31] (03CR) 10Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/output/868207/38796/" [puppet] - 10https://gerrit.wikimedia.org/r/868207 (https://phabricator.wikimedia.org/T322321) (owner: 10Andrea Denisse) [00:17:28] (KubernetesAPILatency) firing: (10) High Kubernetes API latency (LIST authorizationpolicies) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:17:44] PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 
83, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:19:30] !log cwhite@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host logstash2026.codfw.wmnet with OS bullseye [00:19:44] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash2026.codfw.wmnet with OS bullseye [00:20:06] SRE, ops-eqiad, DC-Ops, serviceops: hw troubleshooting: CPU alerts for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T324949 (Dzahn) Thanks for adding docs! That's the perfect reaction. I just wanted to create awareness originally. Your edit https://wikitech.wikimedia.org/w/index.php... [00:20:52] SRE, SRE-Access-Requests: Requesting access to Turnilo for USER:wfan - https://phabricator.wikimedia.org/T324057 (AnnWF) @jhathaway Hi there, sorry for the late reply, I am still not able to login to the https://turnilo.wikimedia.org/ as getting "Service access denied due to missing privileges." when re... 
[00:27:44] RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:30:05] !log releases2002 - rebooting [00:30:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:32:27] !log releases1002 - rebooting [00:32:28] PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 83, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:38:46] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 200 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:40:16] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:41:46] RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:42:20] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 61, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:49:58] PROBLEM - Check systemd state on parse1003 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:55:48] !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash2026.codfw.wmnet with reason: host reimage [00:58:53] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash2026.codfw.wmnet with reason: host reimage [01:05:10] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 60, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:05:12] PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 83, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:06:16] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 61, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:06:18] RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 84, down: 0, dormant: 0, excluded: 
0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:13:16] PROBLEM - Check systemd state on mwdebug1002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:30:28] PROBLEM - Check systemd state on mw2271 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:37:45] (JobUnavailable) firing: Reduced availability for job workhorse in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:42:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:46:00] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash2026.codfw.wmnet with OS bullseye [01:57:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:59:48] cccccclrjgugfkgfegifdcgblcgbntukllfiegikivlt [01:59:54] uh I mean, hi [02:01:51] didn't expect to see keysmashing in -operations [02:01:54] >:D [02:02:06] yubikey smashing [02:05:46] PROBLEM - Check systemd state on mw1416 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[02:07:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:17:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:22:45] (JobUnavailable) resolved: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:18:04] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:25:33] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [03:44:38] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:00:18] PROBLEM - Check systemd state on mwdebug2001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:17:28] (KubernetesAPILatency) firing: (10) High Kubernetes API latency (LIST authorizationpolicies) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [04:29:05] (03PS1) 10RLazarus: httpbb: Add tests for test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/868211 (https://phabricator.wikimedia.org/T290536) [04:29:07] (03PS1) 10RLazarus: httpbb: Run hourly tests from the cumin hosts against mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/868212 (https://phabricator.wikimedia.org/T290536) [04:31:22] (03CR) 10RLazarus: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38797/console" [puppet] - 10https://gerrit.wikimedia.org/r/868212 (https://phabricator.wikimedia.org/T290536) (owner: 10RLazarus) [05:08:38] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 89, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:17:01] (03PS3) 10PleaseStand: Remove obsolete setting $wgAutoloadAttemptLowercase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/867307 (https://phabricator.wikimedia.org/T231412) [05:42:16] PROBLEM - Check systemd state on mw1447 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:52:26] PROBLEM - Check systemd state on mw1448 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:54:24] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:13:08] (03PS1) 10Marostegui: Revert "production-m2.sql.erb: Add new user" [puppet] - 10https://gerrit.wikimedia.org/r/868062 [06:17:13] (03CR) 10Marostegui: [C: 03+2] Revert "production-m2.sql.erb: Add new user" [puppet] - 
https://gerrit.wikimedia.org/r/868062 (owner: Marostegui) [06:30:41] (CR) Marostegui: [C: -1] mariadb: Setup new definitive databases for mediabackups & bacula (1 comment) [puppet] - https://gerrit.wikimedia.org/r/868072 (https://phabricator.wikimedia.org/T313582) (owner: Jcrespo) [06:36:42] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [06:38:12] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [06:44:02] PROBLEM - Check systemd state on mw1417 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:00:04] kormat, marostegui, and Amir1: #bothumor I � Unicode. All rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221215T0700). 
[07:00:30] (03PS1) 10Marostegui: phabricator.my.cnf.erb: Remove unix_socket mention [puppet] - 10https://gerrit.wikimedia.org/r/868213 (https://phabricator.wikimedia.org/T325154) [07:00:56] (03CR) 10Marostegui: [C: 03+2] phabricator.my.cnf.erb: Remove unix_socket mention [puppet] - 10https://gerrit.wikimedia.org/r/868213 (https://phabricator.wikimedia.org/T325154) (owner: 10Marostegui) [07:05:18] (03PS1) 10KartikMistry: Enable Section Translation on 6 WPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/868215 (https://phabricator.wikimedia.org/T319177) [07:24:19] (03PS5) 10Jcrespo: mariadb: Setup new definitive databases for mediabackups & bacula [puppet] - 10https://gerrit.wikimedia.org/r/868072 (https://phabricator.wikimedia.org/T313582) [07:24:24] (03CR) 10Jcrespo: mariadb: Setup new definitive databases for mediabackups & bacula (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868072 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo) [07:24:49] (03PS6) 10Jcrespo: mariadb: Setup new definitive databases for mediabackups & bacula [puppet] - 10https://gerrit.wikimedia.org/r/868072 (https://phabricator.wikimedia.org/T313582) [07:25:33] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [07:34:46] (03PS2) 10Ryan Kemper: [WIP] wdqs: extract and validate kafka timestamp [cookbooks] - 10https://gerrit.wikimedia.org/r/868198 (https://phabricator.wikimedia.org/T325114) (owner: 10Bking) [07:36:22] (03CR) 10CI reject: [V: 04-1] [WIP] wdqs: extract and validate kafka timestamp [cookbooks] - 10https://gerrit.wikimedia.org/r/868198 (https://phabricator.wikimedia.org/T325114) (owner: 10Bking) [07:42:48] (ThanosQueryInstantLatencyHigh) firing: Thanos Query Frontend has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [07:47:48] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [07:52:56] (03CR) 10Muehlenhoff: [C: 03+2] Don't install quickstack on Bookworm, revisit later [puppet] - 10https://gerrit.wikimedia.org/r/868078 (https://phabricator.wikimedia.org/T321783) (owner: 10Muehlenhoff) [07:56:46] PROBLEM - Check systemd state on mw1415 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:57:19] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetdb2003.codfw.wmnet [08:00:04] Amir1, apergos, and jnuche: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221215T0800). [08:00:05] kart_: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:13] morning! there are no trainees signed up for the window, and one patch scheduled for deployment. kart_ I assume you wil self-deploy? [08:00:14] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:00:22] you can self serve? [08:00:30] Yeah. I can self deploy. [08:00:40] apergos: Amir1 ^^ [08:00:48] it's all you, take it away, kart_! 
[08:00:53] :) [08:00:53] awesome, less work for me :P_ [08:01:00] :D [08:01:54] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/868215 (https://phabricator.wikimedia.org/T319177) (owner: 10KartikMistry) [08:02:37] (03Merged) 10jenkins-bot: Enable Section Translation on 6 WPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/868215 (https://phabricator.wikimedia.org/T319177) (owner: 10KartikMistry) [08:03:03] !log kartik@deploy1002 Started scap: Backport for [[gerrit:868215|Enable Section Translation on 6 WPs (T319177)]] [08:03:07] T319177: Enable Section Translation on 6 Wikipedias where Content Translation is available by default - https://phabricator.wikimedia.org/T319177 [08:04:57] !log kartik@deploy1002 kartik and kartik: Backport for [[gerrit:868215|Enable Section Translation on 6 WPs (T319177)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [08:08:30] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host puppetdb2003.codfw.wmnet [08:13:59] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:868215|Enable Section Translation on 6 WPs (T319177)]] (duration: 10m 55s) [08:14:03] T319177: Enable Section Translation on 6 Wikipedias where Content Translation is available by default - https://phabricator.wikimedia.org/T319177 [08:17:18] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:17:28] (KubernetesAPILatency) firing: (10) High Kubernetes API latency (LIST authorizationpolicies) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency 
[08:18:37] (03PS1) 10Majavah: base: puppet_alert: don't advertise the disable file [puppet] - 10https://gerrit.wikimedia.org/r/868221 [08:20:36] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:21:34] 10SRE, 10SRE-Access-Requests, 10Product-Analytics, 10Search-Console-access-request: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (10akosiaris) 05Open→03Resolved >>! In T238090#8469500, @mpopov wrote: > I just updated @Fuzzy's permissions for he.m.wikisource. U... [08:22:07] kart_: how's it looking? still testing? [08:24:24] (03CR) 10Marostegui: [C: 03+1] mariadb: Setup new definitive databases for mediabackups & bacula [puppet] - 10https://gerrit.wikimedia.org/r/868072 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo) [08:27:29] heads up, I am reboot rdb1009 for kernel upgrades. [08:28:19] !log reboot rdb1009 for kernel upgrades. possibly (but probably not) affected applications: changeprop, cpjobqueue, api-gateway, redisLockManager [08:28:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:29] good morning [08:30:47] (03CR) 10David Caro: "I really think that it's as useful not advertising it, with the downside that then people will start sending those emails to spam/trash au" [puppet] - 10https://gerrit.wikimedia.org/r/868221 (owner: 10Majavah) [08:30:54] PROBLEM - Host rdb1009 is DOWN: PING CRITICAL - Packet loss = 100% [08:32:10] RECOVERY - Host rdb1009 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [08:33:04] morning [08:38:23] (03CR) 10Elukey: [C: 03+1] "Let's wait for Chris' approval before proceeding but it looks good!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/867601 (https://phabricator.wikimedia.org/T324567) (owner: 10AikoChou) [08:43:11] (03CR) 10JMeybohm: [C: 03+1] echostore: Tighten egress to explit host/port list [deployment-charts] - 10https://gerrit.wikimedia.org/r/868146 (owner: 10Eevans) [08:44:29] (03PS1) 10Matthias Mullie: [SearchVue] Change protocol-less urls to https [mediawiki-config] - 10https://gerrit.wikimedia.org/r/868224 (https://phabricator.wikimedia.org/T324446) [08:47:55] kart_: have you completed deployment, or still working? (I don't want to rush you - just want to merge a non-urgent beta-only config patch & want to make sure I'm staying out of your way!) [08:50:34] since no reply 20 minutes after I pinged for a check-in, I am assuming they completed and forgot to mention it here [08:51:10] !log nothing noticed with rdb1007 reboot for mw, jobqueue, api-gateway. changeprop had a minor backlog increase, but everything appears fine now. [08:51:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:22] !log reboot rdb1007 for kernel upgrades [08:51:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:33] !log correction: reboot rdb1011 for kernel upgrades [08:52:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:25] !log reboot rdb2009 for kernel upgrades [08:53:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:48] (03PS1) 10Marostegui: parsercache.my.cnf.erb: Remove unix_socket mention [puppet] - 10https://gerrit.wikimedia.org/r/868225 (https://phabricator.wikimedia.org/T325154) [08:54:54] PROBLEM - Host rdb1011 is DOWN: PING CRITICAL - Packet loss = 100% [08:55:06] RECOVERY - Host rdb1011 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [08:55:18] (ProbeDown) firing: Service docker-registry:443 has failed probes (http_docker-registry_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#docker-registry:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:55:32] (03CR) 10Marostegui: [C: 03+2] parsercache.my.cnf.erb: Remove unix_socket mention [puppet] - 10https://gerrit.wikimedia.org/r/868225 (https://phabricator.wikimedia.org/T325154) (owner: 10Marostegui) [08:55:50] PROBLEM - Host rdb2009 is DOWN: PING CRITICAL - Packet loss = 100% [08:56:18] docker registry is probably the rdb2009 reboot, it should resolve quickly [08:56:18] RECOVERY - Host rdb2009 is UP: PING OK - Packet loss = 0%, RTA = 33.21 ms [08:57:18] (ProbeDown) firing: Service docker-registry:443 has failed probes (http_docker-registry_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#docker-registry:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:58:35] akosiaris: ack [08:58:47] akosiaris: I got an ORES alert for workers down, now resolved IIUC [08:58:55] I will promote all wikis to 1.40.0-wmf.14 in a few minutes [08:59:26] acked the alert in VOps [08:59:40] thanks :) [08:59:45] are you getting those topranks? [09:00:05] hashar and ^demon: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221215T0900). 
[09:00:13] (03PS1) 10TrainBranchBot: all wikis to 1.40.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/868226 (https://phabricator.wikimedia.org/T320519) [09:00:16] (03CR) 10TrainBranchBot: [C: 03+2] all wikis to 1.40.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/868226 (https://phabricator.wikimedia.org/T320519) (owner: 10TrainBranchBot) [09:00:18] (ProbeDown) resolved: (2) Service docker-registry:443 has failed probes (http_docker-registry_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#docker-registry:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:00:52] (03Merged) 10jenkins-bot: all wikis to 1.40.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/868226 (https://phabricator.wikimedia.org/T320519) (owner: 10TrainBranchBot) [09:02:18] (ProbeDown) resolved: Service docker-registry:443 has failed probes (http_docker-registry_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#docker-registry:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:06:21] we had a little outage for ORES, nothing big though: [09:06:22] https://grafana.wikimedia.org/d/HIRrxQ6mk/ores?orgId=1&from=1671094279394&to=1671094818210 [09:08:18] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.40.0-wmf.14 refs T320519 [09:08:22] T320519: 1.40.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T320519 [09:10:39] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host urldownloader1001.wikimedia.org [09:11:01] elukey: I 'll have to repeat it, got more hosts I need to reboot [09:11:21] actually only 1 is left [09:12:21] !log reboot rdb2007 for kernel upgrades [09:12:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:09] 
akosiaris: next time we can do a failover if you want, I can prep the code reviews in advance etc.. [09:13:12] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:13:12] PROBLEM - Host rdb2007 is DOWN: PING CRITICAL - Packet loss = 100% [09:13:28] elukey: we could, but is it worth it? [09:13:32] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:14:20] my understanding from last time we did this is was that no, but I maybe I misremember [09:15:04] RECOVERY - Host rdb2007 is UP: PING OK - Packet loss = 0%, RTA = 31.62 ms [09:15:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader1001.wikimedia.org [09:15:09] I am trying to remember as well, did it break the same way? IIRC no, this time an alert fired (never seen it to be honest) [09:17:20] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host urldownloader2002.wikimedia.org [09:17:36] I remember something similar in the graphs, but I don't remember if an alert fired or not. 
[09:18:08] uptime on the redis hosts was 155 days, so we can pin it down and figure out if it fired or not [09:21:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader2002.wikimedia.org [09:27:59] !log elukey@cumin1001 START - Cookbook sre.kafka.reboot-workers for Kafka test-eqiad cluster: Reboot kafka nodes [09:30:14] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host acmechief-test1001.eqiad.wmnet [09:30:38] !log vgutierrez@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host acmechief-test1001.eqiad.wmnet [09:31:28] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [09:31:50] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host acmechief-test1001.eqiad.wmnet [09:32:40] hmmm is that dashboard broken ^^? [09:33:35] vgutierrez: the URL is cut after some amount of bytes, probably by the IRC bot [09:34:01] that alarms also keeps triggering but there is no real spike showing up in the graph [09:34:16] IIRC claime said he is aware of it / investigating [09:34:40] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [09:34:42] we have a few spikes on POSTs [09:34:45] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?from=now-3h&orgId=1&to=now&var-cluster=api_appserver&var-datasource=codfw%20prometheus%2Fops&var-method=POST&viewPanel=9&var-site=codfw&var-code=200&var-php_version=All [09:34:49] and there when it recovers it has the proper URL [09:35:02] the PROBLEM alarm lacks var-method=POST ;) [09:35:17] especially if compare it against eqiad [09:36:28] and the alarm triggers 6 minutes after the initial spike (I guess cause the observed window is a few minutes wide AND Icinga might recheck it 3 times before turning the alarm in hard state which triggers the notification) [09:36:36] but that is a different issue ;) [09:37:32] So we really need to do something about this one [09:37:41] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief-test1001.eqiad.wmnet [09:37:41] Because there's like 2 POST/s [09:38:02] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host acmechief-test2001.codfw.wmnet [09:38:09] (03PS1) 10Muehlenhoff: Failover URL downloaders [dns] - 10https://gerrit.wikimedia.org/r/868229 [09:38:18] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host apt1001.wikimedia.org [09:38:33] (03PS1) 10Marostegui: db1206: Testing RAID controller [puppet] - 10https://gerrit.wikimedia.org/r/868230 (https://phabricator.wikimedia.org/T324181) [09:38:56] (03CR) 10Marostegui: [C: 03+2] db1206: Testing RAID controller [puppet] - 10https://gerrit.wikimedia.org/r/868230 (https://phabricator.wikimedia.org/T324181) (owner: 10Marostegui) [09:40:14] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host acmechief-test2001.codfw.wmnet [09:41:26] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host acmechief2001.codfw.wmnet [09:42:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt1001.wikimedia.org [09:42:53] (03CR) 10Ilias Sarantopoulos: [C: 03+1] ml-services: remove translatewiki and frwikisource isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/867601 (https://phabricator.wikimedia.org/T324567) (owner: 10AikoChou) [09:43:37] (03PS7) 10Jcrespo: mariadb: Setup new definitive databases for mediabackups & bacula [puppet] - 10https://gerrit.wikimedia.org/r/868072 (https://phabricator.wikimedia.org/T313582) [09:43:56] (03CR) 10CI reject: [V: 04-1] mariadb: Setup new definitive databases for mediabackups & bacula [puppet] - 10https://gerrit.wikimedia.org/r/868072 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo) [09:45:08] (03CR) 10Btullis: Backing up HDFS FSImage to HDFS on Monday morning (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/866650 (https://phabricator.wikimedia.org/T324850) (owner: 10Aqu) [09:45:33] (KeyholderUnarmed) firing: (2) 1 unarmed Keyholder key(s) on acmechief-test2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [09:45:40] (03CR) 10Jcrespo: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/868072 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo) [09:48:08] (03CR) 10Muehlenhoff: [C: 03+2] Failover URL downloaders [dns] - 10https://gerrit.wikimedia.org/r/868229 (owner: 10Muehlenhoff) [09:49:49] (03CR) 10Jcrespo: [C: 03+2] mariadb: Setup new definitive databases for mediabackups & bacula [puppet] - 10https://gerrit.wikimedia.org/r/868072 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo) [09:51:23] !log vgutierrez@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host acmechief2001.codfw.wmnet 
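A minimal sketch of the soft/hard state behaviour guessed at in the 09:36:28 message: a check that turns non-OK stays in a soft state until it has failed a configured number of consecutive attempts, and only the transition to a hard state sends a notification. The function name, the three-attempt limit, and the two-minute spacing between checks are assumptions for illustration, not values read from the actual Icinga configuration.

```python
from typing import Optional, Sequence


def first_notification_index(
    results: Sequence[str], max_check_attempts: int = 3
) -> Optional[int]:
    """Return the index of the check result that turns the problem HARD.

    `results` is a chronological list of check outcomes ("OK", or anything
    else for a problem). The problem becomes HARD, and a notification goes
    out, once `max_check_attempts` consecutive non-OK results are seen.
    Returns None if that never happens.
    """
    consecutive_failures = 0
    for index, result in enumerate(results):
        if result == "OK":
            # Recovery while still in a soft state: counter resets, no alert.
            consecutive_failures = 0
            continue
        consecutive_failures += 1
        if consecutive_failures == max_check_attempts:
            return index
    return None


# Hypothetical timeline with one check every 2 minutes (assumed interval).
CHECK_SPACING_MINUTES = 2
timeline = ["OK", "CRITICAL", "CRITICAL", "CRITICAL", "OK"]

hard_at = first_notification_index(timeline)
first_failure = timeline.index("CRITICAL")
delay_minutes = (hard_at - first_failure) * CHECK_SPACING_MINUTES
print(f"notified {delay_minutes} minutes after the first failing check")
```

With these assumed numbers the notification trails the first failing check by 4 minutes. The width of the metric's averaging window adds further delay before the first check fails at all, which is the other half of the guess in the conversation.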
[09:52:34] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38799/console" [puppet] - 10https://gerrit.wikimedia.org/r/868035 (https://phabricator.wikimedia.org/T325069) (owner: 10Jelto) [09:53:07] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host acmechief1001.eqiad.wmnet [09:54:03] !log stopping and masking nutcracker on mw servers - T277183 [09:54:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:07] T277183: Phase out nutcracker from mediawiki servers - https://phabricator.wikimedia.org/T277183 [09:55:33] (KeyholderUnarmed) firing: (3) 1 unarmed Keyholder key(s) on acmechief-test1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [09:56:50] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief1001.eqiad.wmnet [10:01:52] PROBLEM - Host acmechief2001 is DOWN: PING CRITICAL - Packet loss = 100% [10:04:36] (03PS2) 10Effie Mouzeli: Redis sessions: Goodbye [puppet] - 10https://gerrit.wikimedia.org/r/864830 (https://phabricator.wikimedia.org/T267581) [10:05:36] PROBLEM - Check systemd state on acmechief1001 is CRITICAL: CRITICAL - degraded: The following units failed: acme-chief-certs-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:06:17] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab_runner: add trusted tag to Trusted Runners [puppet] - 10https://gerrit.wikimedia.org/r/868035 (https://phabricator.wikimedia.org/T325069) (owner: 10Jelto) [10:06:37] (03CR) 10Jelto: [C: 03+2] gitlab_runner: add wmcs tag to Shared Runners [puppet] - 10https://gerrit.wikimedia.org/r/868036 (https://phabricator.wikimedia.org/T325069) (owner: 10Jelto) [10:06:41] ^^ acmechief2002 is down [10:07:28] (KubernetesAPILatency) firing: (11) High Kubernetes API latency (LIST authorizationpolicies) on aux-k8s@eqiad - 
https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:07:34] (03PS1) 10Sergio Gimeno: Vue components: react to binding updates of v-click-outside directive [extensions/GrowthExperiments] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868063 (https://phabricator.wikimedia.org/T325041) [10:08:14] (03PS1) 10Sergio Gimeno: User impact: instrument info tooltip clicks [extensions/GrowthExperiments] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868064 (https://phabricator.wikimedia.org/T325041) [10:08:18] (03PS2) 10Jelto: gitlab_runner: add trusted tag to Trusted Runners [puppet] - 10https://gerrit.wikimedia.org/r/868035 (https://phabricator.wikimedia.org/T325069) [10:08:22] (03PS2) 10Jelto: gitlab_runner: add wmcs tag to Shared Runners [puppet] - 10https://gerrit.wikimedia.org/r/868036 (https://phabricator.wikimedia.org/T325069) [10:12:26] PROBLEM - Check systemd state on mw1351 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:12:28] (KubernetesAPILatency) firing: (11) High Kubernetes API latency (LIST authorizationpolicies) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:13:28] 10SRE, 10Infrastructure-Foundations: Enhance account handling (meta bug) - https://phabricator.wikimedia.org/T142815 (10akosiaris) [10:13:32] 10SRE, 10Striker, 10LDAP: Store Wikimedia unified account name (SUL) in LDAP directory - https://phabricator.wikimedia.org/T148048 (10akosiaris) 05Open→03Stalled This is open since 2016 with minimal updates, probably inaccurate now (as far as I know the corp LDAP doesn't exist now) and it is unclear, at... 
[10:17:52] PROBLEM - Check systemd state on mw2286 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:18:34] (03PS1) 10JMeybohm: If-guard admin_ng objects no longer relevant for k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/868232 (https://phabricator.wikimedia.org/T299236) [10:19:22] PROBLEM - Check systemd state on mw2367 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:20:48] PROBLEM - Check systemd state on mw2382 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:21:00] 10SRE, 10Striker, 10LDAP: Store Wikimedia unified account name (SUL) in LDAP directory - https://phabricator.wikimedia.org/T148048 (10MoritzMuehlenhoff) >>! In T148048#8470183, @akosiaris wrote: > This is open since 2016 with minimal updates, probably inaccurate now (as far as I know the corp LDAP doesn't ex... 
[10:21:56] 10SRE, 10Infrastructure-Foundations: IDM milestone 2 "Initial limited deployment" - https://phabricator.wikimedia.org/T320603 (10MoritzMuehlenhoff) [10:22:01] 10SRE, 10Striker, 10LDAP: Store Wikimedia unified account name (SUL) in LDAP directory - https://phabricator.wikimedia.org/T148048 (10MoritzMuehlenhoff) [10:22:32] (03CR) 10CI reject: [V: 04-1] If-guard admin_ng objects no longer relevant for k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/868232 (https://phabricator.wikimedia.org/T299236) (owner: 10JMeybohm) [10:23:22] PROBLEM - Check systemd state on mw1441 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:26:34] (03PS1) 10DCausse: team-search-platform: relax kafka burrow check [alerts] - 10https://gerrit.wikimedia.org/r/868234 [10:27:34] PROBLEM - Check systemd state on mw2259 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:28:11] (03CR) 10CI reject: [V: 04-1] team-search-platform: relax kafka burrow check [alerts] - 10https://gerrit.wikimedia.org/r/868234 (owner: 10DCausse) [10:28:50] PROBLEM - Check systemd state on mw2371 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:30:34] (03PS1) 10Marostegui: orchestrator.conf.json.erb: Add Jaime to powerusers [puppet] - 10https://gerrit.wikimedia.org/r/868235 [10:31:24] (03CR) 10Marostegui: orchestrator.conf.json.erb: Add Jaime to powerusers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868235 (owner: 10Marostegui) [10:31:52] PROBLEM - Check systemd state on mw1401 is CRITICAL: CRITICAL - degraded: The following units failed: 
wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:32:28] (KubernetesAPILatency) resolved: (10) High Kubernetes API latency (LIST authorizationpolicies) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:32:30] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host urldownloader1002.wikimedia.org [10:34:08] !log restarted istiod pods in aux-k8s because of T303184 [10:34:08] PROBLEM - MariaDB Replica SQL: backup1-eqiad on db1205 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:34:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:11] T303184: High API server request latencies (LIST) - https://phabricator.wikimedia.org/T303184 [10:34:25] jynus: ^ should I downtime those? 
[10:34:47] in theory they should have notifications disabled [10:35:00] PROBLEM - Check systemd state on mw1475 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:35:24] jynus: I just noticed you used profile::base::notifications: disabled and I always use profile::monitoring::notifications_enabled: false [10:35:26] PROBLEM - Check systemd state on mw1489 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:35:44] PROBLEM - MariaDB read only backup1-codfw on db2184 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [10:35:55] I see [10:36:02] I used the old syntax [10:36:13] I missed that in the review :( [10:36:48] PROBLEM - Check systemd state on mw2406 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:36:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader1002.wikimedia.org [10:37:15] I disabled manually on icinga [10:37:24] do you want me to send a patch? 
[10:37:25] but alertmanager will complain, I guess [10:37:50] I have a meeting, but I will do a proper patch after that [10:37:59] don't worry, I will do it [10:38:01] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host urldownloader2001.wikimedia.org [10:39:40] (03PS1) 10Marostegui: mariadb: Disable notifications on new databases [puppet] - 10https://gerrit.wikimedia.org/r/868337 (https://phabricator.wikimedia.org/T313582) [10:39:48] jynus: going to merge ^ [10:40:18] RECOVERY - Host acmechief2001 is UP: PING OK - Packet loss = 0%, RTA = 31.85 ms [10:41:06] (03CR) 10Marostegui: [C: 03+2] mariadb: Disable notifications on new databases [puppet] - 10https://gerrit.wikimedia.org/r/868337 (https://phabricator.wikimedia.org/T313582) (owner: 10Marostegui) [10:41:14] (03CR) 10Jcrespo: [C: 03+1] mariadb: Disable notifications on new databases [puppet] - 10https://gerrit.wikimedia.org/r/868337 (https://phabricator.wikimedia.org/T313582) (owner: 10Marostegui) [10:41:29] jynus: https://gerrit.wikimedia.org/r/c/operations/puppet/+/868235/1 is that jcrespo or jynus? 
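The mix-up discussed above (the older `profile::base::notifications: disabled` key versus `profile::monitoring::notifications_enabled: false`) is the kind of thing a small lint pass could flag during review. A rough sketch, assuming the hiera data has already been loaded into a plain dict; the `DEPRECATED_KEYS` table and the `find_deprecated_keys` helper are hypothetical and are not part of the puppet repository's tooling.

```python
from typing import Dict, List

# Old hiera key -> key that replaces it. Only the pair mentioned in the
# conversation is listed; the table itself is a hypothetical illustration.
DEPRECATED_KEYS: Dict[str, str] = {
    "profile::base::notifications": "profile::monitoring::notifications_enabled",
}


def find_deprecated_keys(hiera: Dict[str, object]) -> List[str]:
    """Return one warning per deprecated key present in `hiera`."""
    warnings = []
    for key in hiera:
        replacement = DEPRECATED_KEYS.get(key)
        if replacement is not None:
            warnings.append(f"{key} is old syntax, use {replacement}")
    return warnings


# Example host data using the old key, as described in the chat.
host_data = {"profile::base::notifications": "disabled"}
for warning in find_deprecated_keys(host_data):
    print(warning)
```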
[10:41:31] thank you, I have too many things on my plate right now [10:41:38] jynus: don't worry, I will take care of it [10:42:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader2001.wikimedia.org [10:43:27] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host krb2001.codfw.wmnet [10:43:38] (03PS2) 10Marostegui: mariadb: Disable notifications on new databases [puppet] - 10https://gerrit.wikimedia.org/r/868337 (https://phabricator.wikimedia.org/T313582) [10:44:14] (03PS3) 10Marostegui: mariadb: Disable notifications on new databases [puppet] - 10https://gerrit.wikimedia.org/r/868337 (https://phabricator.wikimedia.org/T313582) [10:45:16] RECOVERY - Check systemd state on acmechief1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:45:33] (KeyholderUnarmed) firing: (2) 1 unarmed Keyholder key(s) on acmechief2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [10:46:16] (03PS1) 10Marostegui: mariadb: Disable notifications on new databases [puppet] - 10https://gerrit.wikimedia.org/r/868338 (https://phabricator.wikimedia.org/T313582) [10:46:24] (03Abandoned) 10Marostegui: mariadb: Disable notifications on new databases [puppet] - 10https://gerrit.wikimedia.org/r/868337 (https://phabricator.wikimedia.org/T313582) (owner: 10Marostegui) [10:47:00] !log disable ping offload in eqiad [10:47:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:04] (03CR) 10Marostegui: [C: 03+2] mariadb: Disable notifications on new databases [puppet] - 10https://gerrit.wikimedia.org/r/868338 (https://phabricator.wikimedia.org/T313582) (owner: 10Marostegui) [10:48:40] PROBLEM - Check systemd state on mw1497 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:49:33] (03PS1) 10Marostegui: orchestrator.conf.json.erb: Add Jaime to powerusers [puppet] - 10https://gerrit.wikimedia.org/r/868339 [10:49:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host krb2001.codfw.wmnet [10:49:39] (03Abandoned) 10Marostegui: orchestrator.conf.json.erb: Add Jaime to powerusers [puppet] - 10https://gerrit.wikimedia.org/r/868235 (owner: 10Marostegui) [10:50:10] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ping1002.eqiad.wmnet [10:50:15] (03CR) 10Marostegui: "Let me know if it is jcrespo or jynus" [puppet] - 10https://gerrit.wikimedia.org/r/868339 (owner: 10Marostegui) [10:50:33] (KeyholderUnarmed) firing: (2) 1 unarmed Keyholder key(s) on acmechief2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [10:51:20] :? [10:51:44] PROBLEM - Check systemd state on mw1402 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:52:23] (03PS2) 10DCausse: team-search-platform: relax kafka burrow check [alerts] - 10https://gerrit.wikimedia.org/r/868234 [10:53:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1002.eqiad.wmnet [10:53:57] (03PS2) 10Daniel Kinzler: Increase PC writes from parsoid API to 10% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/868127 [10:54:02] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Grant ssh access to analytics-admins to mnz - https://phabricator.wikimedia.org/T325072 (10akosiaris) @odimitrijevic, @Ottomata, we need the approval of one of you on this one. 
[10:55:18] PROBLEM - Check systemd state on mw2388 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:56:40] PROBLEM - Check systemd state on parse2018 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:58:07] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & analytics-product-users for Hxi-ctr - https://phabricator.wikimedia.org/T325004 (10akosiaris) @odimitrijevic, @Ottomata, we need the approval of one of you on this one. [10:59:02] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & analytics-product-users for Hxi-ctr - https://phabricator.wikimedia.org/T325004 (10akosiaris) analytics-product-users doesn't have an approver listed, let me chase that one down. [10:59:34] PROBLEM - Check systemd state on mw1439 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:59:58] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & analytics-product-users for Hxi-ctr - https://phabricator.wikimedia.org/T325004 (10akosiaris) [11:00:04] mvolz: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221215T1100). 
[11:00:38] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ping2002.codfw.wmnet [11:01:32] PROBLEM - Check systemd state on mw1454 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:04:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2002.codfw.wmnet [11:08:19] (03PS1) 10Effie Mouzeli: tegola-vector-tiles: add more replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/868343 (https://phabricator.wikimedia.org/T314472) [11:10:02] PROBLEM - Check systemd state on mw1443 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:10:22] PROBLEM - Check systemd state on mw2401 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:11:03] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ping3002.esams.wmnet [11:14:04] PROBLEM - Check systemd state on mw2387 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:14:25] (03PS1) 10Slyngshede: P:IDM Fix minor configuration issues. [puppet] - 10https://gerrit.wikimedia.org/r/868345 [11:14:44] (03CR) 10CI reject: [V: 04-1] P:IDM Fix minor configuration issues. 
[puppet] - 10https://gerrit.wikimedia.org/r/868345 (owner: 10Slyngshede) [11:15:22] (03PS2) 10Effie Mouzeli: tegola-vector-tiles: add more replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/868343 (https://phabricator.wikimedia.org/T314472) [11:15:45] !log elukey@cumin1001 END (PASS) - Cookbook sre.kafka.reboot-workers (exit_code=0) for Kafka test-eqiad cluster: Reboot kafka nodes [11:15:55] (03PS2) 10Slyngshede: P:IDM Fix minor configuration issues. [puppet] - 10https://gerrit.wikimedia.org/r/868345 [11:16:19] (03CR) 10Jgiannelos: [C: 03+1] tegola-vector-tiles: add more replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/868343 (https://phabricator.wikimedia.org/T314472) (owner: 10Effie Mouzeli) [11:16:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping3002.esams.wmnet [11:18:17] (03CR) 10Jcrespo: [C: 03+1] orchestrator.conf.json.erb: Add Jaime to powerusers [puppet] - 10https://gerrit.wikimedia.org/r/868339 (owner: 10Marostegui) [11:19:28] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:19:49] (03PS3) 10Slyngshede: P:IDM Fix minor configuration issues. [puppet] - 10https://gerrit.wikimedia.org/r/868345 [11:20:38] (03CR) 10Effie Mouzeli: [C: 03+2] tegola-vector-tiles: add more replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/868343 (https://phabricator.wikimedia.org/T314472) (owner: 10Effie Mouzeli) [11:21:09] (03PS4) 10Slyngshede: P:IDM Fix minor configuration issues. 
[puppet] - 10https://gerrit.wikimedia.org/r/868345 [11:21:48] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host flowspec1001.eqiad.wmnet [11:21:54] PROBLEM - Check systemd state on mw1435 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:22:12] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:22:24] PROBLEM - Check systemd state on mw1474 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:23:20] (03CR) 10Effie Mouzeli: [C: 03+1] maps: remove redis [puppet] - 10https://gerrit.wikimedia.org/r/865056 (https://phabricator.wikimedia.org/T298246) (owner: 10Hnowlan) [11:23:41] (03CR) 10Effie Mouzeli: [C: 03+2] Redis sessions: Goodbye [puppet] - 10https://gerrit.wikimedia.org/r/864830 (https://phabricator.wikimedia.org/T267581) (owner: 10Effie Mouzeli) [11:24:52] PROBLEM - Check systemd state on mw1423 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:25:43] (03Merged) 10jenkins-bot: tegola-vector-tiles: add more replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/868343 (https://phabricator.wikimedia.org/T314472) (owner: 10Effie Mouzeli) [11:26:07] (03CR) 10Slyngshede: [C: 03+2] P:IDM Fix minor configuration issues. 
[puppet] - 10https://gerrit.wikimedia.org/r/868345 (owner: 10Slyngshede) [11:27:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host flowspec1001.eqiad.wmnet [11:27:56] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49121 bytes in 0.104 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:27:58] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.249 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:30:16] (03CR) 10Marostegui: [C: 03+2] orchestrator.conf.json.erb: Add Jaime to powerusers [puppet] - 10https://gerrit.wikimedia.org/r/868339 (owner: 10Marostegui) [11:34:09] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host krb1001.eqiad.wmnet [11:34:15] PROBLEM - Check systemd state on mw2385 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:37:36] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@00c9a16] (codfw): codfw: Disable traffic mirroring [11:39:20] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@00c9a16] (codfw): codfw: Disable traffic mirroring (duration: 01m 43s) [11:39:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host krb1001.eqiad.wmnet [11:42:03] !log switching maps/kartotherian from codfw to eqiad [11:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:43] PROBLEM - Check systemd state on mw1431 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:43:06] !log jiji@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=eqiad [11:44:06] the failed nutcracker things are mine, 
I will deal with them in a bit [11:44:23] PROBLEM - Check systemd state on mw2311 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:45:05] (03PS1) 10Jbond: P:installserver::proxy: allow production to use squid to proxy ssh [puppet] - 10https://gerrit.wikimedia.org/r/868370 [11:50:53] PROBLEM - Check systemd state on mw1464 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:53:40] (03CR) 10Hashar: [C: 04-1] "Should be done after the role has been applied on the host since scap will run scripts upon deployment (such as restarting php-fpm when de" [puppet] - 10https://gerrit.wikimedia.org/r/867670 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn) [11:53:55] PROBLEM - Check systemd state on mw2316 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:53:55] PROBLEM - Check systemd state on mw2350 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:54:45] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host seaborgium.wikimedia.org [11:55:28] (03Abandoned) 10Hashar: build: manage dependencies with rules_jvm_external [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/816172 (owner: 10Hashar) [11:55:30] (03Abandoned) 10Hashar: Json schema from Gerrit Java event classes [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/830654 (https://phabricator.wikimedia.org/T304947) (owner: 10Hashar) [11:55:31] PROBLEM - Check systemd state on mw1378 is CRITICAL: 
CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:55:33] (03Abandoned) 10Hashar: Implement REST API and Ssh commands [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/814725 (https://phabricator.wikimedia.org/T304947) (owner: 10Hashar) [11:55:36] (03Abandoned) 10Hashar: Send events to Wikimedia EventGate [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/814807 (owner: 10Hashar) [11:58:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host seaborgium.wikimedia.org [11:58:47] PROBLEM - Check systemd state on mw2276 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:00:09] PROBLEM - Check systemd state on mw2269 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:02:49] (03PS1) 10Jbond: P:installserver::proxy: add ability to proxy ssh ports [puppet] - 10https://gerrit.wikimedia.org/r/868372 (https://phabricator.wikimedia.org/T319401) [12:03:30] (03PS1) 10Btullis: Increase max_connections on analytics_meta MariaDB [puppet] - 10https://gerrit.wikimedia.org/r/868373 (https://phabricator.wikimedia.org/T325278) [12:07:05] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netmon1002.wikimedia.org [12:07:12] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@00c9a16] (eqiad): codfw: Disable traffic mirroring [12:08:12] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@00c9a16] (eqiad): codfw: Disable traffic mirroring (duration: 01m 00s) [12:08:22] (03PS2) 10Jbond: P:installserver::proxy: add ability to proxy ssh ports [puppet] - 
10https://gerrit.wikimedia.org/r/868372 (https://phabricator.wikimedia.org/T319401) [12:10:07] (03PS3) 10Jbond: P:installserver::proxy: add ability to proxy ssh ports [puppet] - 10https://gerrit.wikimedia.org/r/868372 (https://phabricator.wikimedia.org/T319401) [12:10:09] (03CR) 10Marostegui: [C: 03+1] "I have not much to say here as we do not maintain this DB. My recommendation would be to closely monitor connections and if needing more a" [puppet] - 10https://gerrit.wikimedia.org/r/868373 (https://phabricator.wikimedia.org/T325278) (owner: 10Btullis) [12:10:39] PROBLEM - Check systemd state on mw1482 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:11:17] PROBLEM - Check systemd state on mw2273 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:11:46] (03PS1) 10Jcrespo: orchestrator: Change poweruser jcrespo to use the shell name: jynus [puppet] - 10https://gerrit.wikimedia.org/r/868376 [12:12:05] (03CR) 10Marostegui: [C: 03+1] orchestrator: Change poweruser jcrespo to use the shell name: jynus [puppet] - 10https://gerrit.wikimedia.org/r/868376 (owner: 10Jcrespo) [12:12:09] PROBLEM - Check systemd state on mw1436 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:12:28] (03PS1) 10Volans: spicerack: update config for v6.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/868377 (https://phabricator.wikimedia.org/T325168) [12:12:30] (03PS1) 10Volans: [DRAFT] spicerack: refactor puppetization [puppet] - 10https://gerrit.wikimedia.org/r/868378 (https://phabricator.wikimedia.org/T325168) [12:12:32] (03CR) 10Jcrespo: "I added the puppet comment so we don't get 
confused again :-)" [puppet] - 10https://gerrit.wikimedia.org/r/868376 (owner: 10Jcrespo) [12:12:34] (03PS4) 10Jbond: P:installserver::proxy: add ability to proxy ssh ports [puppet] - 10https://gerrit.wikimedia.org/r/868372 (https://phabricator.wikimedia.org/T319401) [12:12:45] PROBLEM - Check systemd state on parse1018 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:12:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1002.wikimedia.org [12:13:02] (03CR) 10Jcrespo: [C: 03+2] orchestrator: Change poweruser jcrespo to use the shell name: jynus [puppet] - 10https://gerrit.wikimedia.org/r/868376 (owner: 10Jcrespo) [12:13:32] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38806/console" [puppet] - 10https://gerrit.wikimedia.org/r/868372 (https://phabricator.wikimedia.org/T319401) (owner: 10Jbond) [12:14:40] (03PS5) 10Jbond: P:installserver::proxy: add ability to proxy ssh ports [puppet] - 10https://gerrit.wikimedia.org/r/868372 (https://phabricator.wikimedia.org/T319401) [12:15:25] (03CR) 10CI reject: [V: 04-1] Localisation updates from https://translatewiki.net. 
[phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/868379 (owner: 10L10n-bot) [12:15:33] (KeyholderUnarmed) firing: (2) 1 unarmed Keyholder key(s) on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [12:16:02] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/868377 (https://phabricator.wikimedia.org/T325168) (owner: 10Volans) [12:16:13] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/868378 (https://phabricator.wikimedia.org/T325168) (owner: 10Volans) [12:16:33] (03CR) 10CI reject: [V: 04-1] P:installserver::proxy: add ability to proxy ssh ports [puppet] - 10https://gerrit.wikimedia.org/r/868372 (https://phabricator.wikimedia.org/T319401) (owner: 10Jbond) [12:16:52] (03CR) 10Volans: [C: 04-1] "This must be deployed in conjunction with the deploy of Spicerack v6.0.0 on the fleet." [puppet] - 10https://gerrit.wikimedia.org/r/868377 (https://phabricator.wikimedia.org/T325168) (owner: 10Volans) [12:18:19] (03PS6) 10Jbond: P:installserver::proxy: add ability to proxy ssh ports [puppet] - 10https://gerrit.wikimedia.org/r/868372 (https://phabricator.wikimedia.org/T319401) [12:19:33] !log jiji@cumin1001 conftool action : set/pooled=yes; selector: dc=eqiad,service=kartotherian-ssl,name=maps1010.eqiad.wmnet [12:19:33] (03CR) 10Btullis: [C: 03+2] Increase max_connections on analytics_meta MariaDB (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868373 (https://phabricator.wikimedia.org/T325278) (owner: 10Btullis) [12:19:39] !log jiji@cumin1001 conftool action : set/pooled=yes; selector: dc=eqiad,service=kartotherian,name=maps1010.eqiad.wmnet [12:19:57] (03CR) 10Volans: "This is a draft proposal to set up the two spicerack/cookbooks environments in the different hosts." 
[puppet] - 10https://gerrit.wikimedia.org/r/868378 (https://phabricator.wikimedia.org/T325168) (owner: 10Volans) [12:20:32] !log jiji@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=codfw [12:20:46] (03PS2) 10Volans: spicerack: update config for v6.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/868377 (https://phabricator.wikimedia.org/T325168) [12:20:53] (03CR) 10Volans: "check_experimental" [puppet] - 10https://gerrit.wikimedia.org/r/868377 (https://phabricator.wikimedia.org/T325168) (owner: 10Volans) [12:23:14] (03PS2) 10JMeybohm: If-guard admin_ng objects no longer relevant for k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/868232 (https://phabricator.wikimedia.org/T299236) [12:23:53] (03CR) 10Kevin Bazira: [C: 03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/867601 (https://phabricator.wikimedia.org/T324567) (owner: 10AikoChou) [12:23:55] (03CR) 10Jbond: P:installserver::proxy: add ability to proxy ssh ports (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868372 (https://phabricator.wikimedia.org/T319401) (owner: 10Jbond) [12:25:54] (03PS1) 10JMeybohm: WIP: Update staging-codfw to k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/868389 (https://phabricator.wikimedia.org/T307943) [12:27:05] PROBLEM - Check systemd state on mw1450 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:28:38] ottomata: any hints as to who should be added as approver for "analytics-product-users" ? 
See https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/admin/data/data.yaml#918 [12:29:25] PROBLEM - puppet last run on idp-test1002 is CRITICAL: CRITICAL: Puppet last ran 2 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:30:27] (03PS4) 10Jbond: sre.hosts.reboot-single: add ability to enable host on reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/868077 (https://phabricator.wikimedia.org/T325153) [12:32:52] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/868377 (https://phabricator.wikimedia.org/T325168) (owner: 10Volans) [12:34:59] RECOVERY - puppet last run on idp-test1002 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:36:23] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-stretch1001.eqiad.wmnet with OS bullseye [12:36:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host kafka-stretch1001.eqiad.wmnet with OS bullseye [12:36:37] PROBLEM - Check systemd state on mw2356 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:39:40] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/868377 (https://phabricator.wikimedia.org/T325168) (owner: 10Volans) [12:40:37] (03CR) 10Volans: [C: 04-1] "This must be deployed in conjunction with the deploy of Spicerack v6.0.0 on the fleet." 
[puppet] - 10https://gerrit.wikimedia.org/r/868377 (https://phabricator.wikimedia.org/T325168) (owner: 10Volans) [12:42:41] (03PS10) 10Slyngshede: sre.ganeti.reimage: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) [12:43:10] (03PS1) 10Jcrespo: mariadb: Update production mysql grants with unix_socket & heartbeat [puppet] - 10https://gerrit.wikimedia.org/r/868392 [12:43:53] (03PS1) 10Muehlenhoff: profile::spicerack: Stop writing Redis sessions data [puppet] - 10https://gerrit.wikimedia.org/r/868393 (https://phabricator.wikimedia.org/T267581) [12:44:07] (03Abandoned) 10Jbond: P:installserver::proxy: allow production to use squid to proxy ssh [puppet] - 10https://gerrit.wikimedia.org/r/868370 (owner: 10Jbond) [12:44:21] (03PS2) 10Jcrespo: mariadb: Update production mysql grants with unix_socket & heartbeat [puppet] - 10https://gerrit.wikimedia.org/r/868392 [12:44:27] (03CR) 10Jaime Nuche: mwdebug_deploy: remove configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/867221 (owner: 10Jaime Nuche) [12:46:03] (03CR) 10Jcrespo: "Sending you the issues I had to setup the new backup databases from 0, in the form of the patch. 
Feel free to alter the patch in any way y" [puppet] - 10https://gerrit.wikimedia.org/r/868392 (owner: 10Jcrespo) [12:46:16] (03CR) 10Jbond: [C: 03+1] tools-webservice: create /etc/toolforge/webservice.yaml with puppet [puppet] - 10https://gerrit.wikimedia.org/r/867911 (https://phabricator.wikimedia.org/T323689) (owner: 10Raymond Ndibe) [12:47:14] (03PS3) 10Hnowlan: maps: remove redis [puppet] - 10https://gerrit.wikimedia.org/r/865056 (https://phabricator.wikimedia.org/T298246) [12:47:16] (03CR) 10Volans: [C: 03+1] "LGTM, thanks for the patch Moritz, we were chatting about this in -serviceops" [puppet] - 10https://gerrit.wikimedia.org/r/868393 (https://phabricator.wikimedia.org/T267581) (owner: 10Muehlenhoff) [12:47:17] PROBLEM - Check systemd state on parse2001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:47:39] PROBLEM - Check systemd state on mw1404 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:47:48] (03PS2) 10Jbond: rsyslog: use ensure_resource for package_from_component [puppet] - 10https://gerrit.wikimedia.org/r/868148 (https://phabricator.wikimedia.org/T324623) (owner: 10Southparkfan) [12:48:11] (03PS3) 10Jcrespo: mariadb: Update production mysql grants with unix_socket & heartbeat [puppet] - 10https://gerrit.wikimedia.org/r/868392 [12:48:36] (03PS4) 10Jcrespo: mariadb: Update production mysql grants with unix_socket & heartbeat [puppet] - 10https://gerrit.wikimedia.org/r/868392 [12:48:47] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-stretch1001.eqiad.wmnet with reason: host reimage [12:49:27] (03PS5) 10Jcrespo: mariadb: Update production mysql grants with unix_socket & heartbeat [puppet] - 10https://gerrit.wikimedia.org/r/868392 [12:49:54] 
(03PS3) 10Jbond: rsyslog: use ensure_resource for package_from_component [puppet] - 10https://gerrit.wikimedia.org/r/868148 (https://phabricator.wikimedia.org/T324623) (owner: 10Southparkfan) [12:50:07] (03PS1) 10Effie Mouzeli: P:spicerack: Remove redis_sessions leftovers [puppet] - 10https://gerrit.wikimedia.org/r/868395 [12:50:44] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38808/console" [puppet] - 10https://gerrit.wikimedia.org/r/868148 (https://phabricator.wikimedia.org/T324623) (owner: 10Southparkfan) [12:50:57] (03CR) 10Effie Mouzeli: [C: 03+1] profile::spicerack: Stop writing Redis sessions data [puppet] - 10https://gerrit.wikimedia.org/r/868393 (https://phabricator.wikimedia.org/T267581) (owner: 10Muehlenhoff) [12:51:12] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38809/console" [puppet] - 10https://gerrit.wikimedia.org/r/868148 (https://phabricator.wikimedia.org/T324623) (owner: 10Southparkfan) [12:51:17] (03Abandoned) 10Effie Mouzeli: P:spicerack: Remove redis_sessions leftovers [puppet] - 10https://gerrit.wikimedia.org/r/868395 (owner: 10Effie Mouzeli) [12:51:48] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-stretch1001.eqiad.wmnet with reason: host reimage [12:52:08] (03CR) 10Muehlenhoff: [C: 03+2] profile::spicerack: Stop writing Redis sessions data [puppet] - 10https://gerrit.wikimedia.org/r/868393 (https://phabricator.wikimedia.org/T267581) (owner: 10Muehlenhoff) [12:52:19] (03PS1) 10Volans: cumin: fix email address for insetup role audit [puppet] - 10https://gerrit.wikimedia.org/r/868396 [12:53:49] (03PS1) 10Aqu: Fix systemd syntax in hadoop-namenode-backup-fetchimage [puppet] - 10https://gerrit.wikimedia.org/r/868397 (https://phabricator.wikimedia.org/T324850) [12:56:33] PROBLEM - Check systemd state on mw1390 is CRITICAL: CRITICAL - 
degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:00:53] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38810/console" [puppet] - 10https://gerrit.wikimedia.org/r/865056 (https://phabricator.wikimedia.org/T298246) (owner: 10Hnowlan) [13:02:06] <_joe_> effie: ^^ see the alert, we need to remove nutcracker from auto-restarts [13:02:13] <_joe_> it *always* gets me [13:03:11] (03CR) 10Jbond: [V: 03+1 C: 03+2] "lgtm will merge" [puppet] - 10https://gerrit.wikimedia.org/r/868148 (https://phabricator.wikimedia.org/T324623) (owner: 10Southparkfan) [13:03:24] (03CR) 10Volans: "See comments inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/868077 (https://phabricator.wikimedia.org/T325153) (owner: 10Jbond) [13:03:43] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/868396 (owner: 10Volans) [13:04:08] _joe_: I have seen the alert, I wrote earlier that I need to mend something else first [13:04:20] we are in the middle of a maps thing [13:04:23] thank you ! 
[13:04:31] <_joe_> oh sorry [13:04:33] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/868088 (owner: 10Volans) [13:04:46] too much noise in this channel :/ [13:05:15] PROBLEM - Check systemd state on mw1414 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:05:28] (03PS1) 10Arturo Borrero Gonzalez: puppetmaster: git-sync-upstream: use the gitpuppet user for git operations [puppet] - 10https://gerrit.wikimedia.org/r/868400 (https://phabricator.wikimedia.org/T325280) [13:06:03] (03CR) 10David Caro: [C: 03+2] tools-webservice: create /etc/toolforge/webservice.yaml with puppet [puppet] - 10https://gerrit.wikimedia.org/r/867911 (https://phabricator.wikimedia.org/T323689) (owner: 10Raymond Ndibe) [13:06:22] !log btullis@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1001" [13:07:53] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/868212 (https://phabricator.wikimedia.org/T290536) (owner: 10RLazarus) [13:07:59] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [13:09:13] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [13:09:24] (03CR) 10Elukey: [C: 03+1] If-guard admin_ng objects no longer relevant for k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/868232 (https://phabricator.wikimedia.org/T299236) (owner: 10JMeybohm) [13:11:20] hashar: vgutierrez, fyi about the latency alarm https://phabricator.wikimedia.org/T325277 I won't be making anymore headway on it before next week tho, I'm ooo starting about now until monday [13:11:33] (03PS2) 10Arturo Borrero Gonzalez: puppetmaster: git-sync-upstream: use the gitpuppet user for git operations [puppet] - 10https://gerrit.wikimedia.org/r/868400 (https://phabricator.wikimedia.org/T325280) [13:13:29] PROBLEM - Check systemd state on mw1394 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:14:15] RECOVERY - Check systemd state on mw1414 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:14:29] PROBLEM - Check systemd state on mw1462 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:15:22] (03CR) 10Muehlenhoff: [C: 03+2] Add initial support for webauthn [puppet] - 10https://gerrit.wikimedia.org/r/867576 (https://phabricator.wikimedia.org/T305518) (owner: 10Muehlenhoff) [13:15:24] (03CR) 10JMeybohm: [C: 03+2] If-guard admin_ng objects no longer relevant for k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/868232 (https://phabricator.wikimedia.org/T299236) (owner: 10JMeybohm) [13:19:11] PROBLEM - Check 
systemd state on mw1356 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:19:24] claime: nice ;) enjoy the week-end! [13:19:37] Thanks, I will :D [13:20:34] (03Merged) 10jenkins-bot: If-guard admin_ng objects no longer relevant for k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/868232 (https://phabricator.wikimedia.org/T299236) (owner: 10JMeybohm) [13:20:50] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/868396 (owner: 10Volans) [13:22:16] (03CR) 10Jbond: [C: 03+1] "not tested but lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/868400 (https://phabricator.wikimedia.org/T325280) (owner: 10Arturo Borrero Gonzalez) [13:23:55] PROBLEM - Check systemd state on mw2331 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:25:17] PROBLEM - Check systemd state on mw2372 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:25:19] PROBLEM - Check systemd state on parse2003 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:27:23] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] puppetmaster: git-sync-upstream: use the gitpuppet user for git operations (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868400 (https://phabricator.wikimedia.org/T325280) (owner: 10Arturo Borrero Gonzalez) [13:33:20] (03PS1) 10Effie Mouzeli: mediawiki: remove prometheus-nutcracker-exporter auto restart [puppet] - 10https://gerrit.wikimedia.org/r/868403 [13:34:39] PROBLEM - Check systemd state on mw1494 is 
CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:34:42] !log btullis@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1001" [13:34:47] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-stretch1001.eqiad.wmnet with OS bullseye [13:34:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host kafka-stretch1001.eqiad.wmnet with OS bullseye co... [13:35:00] (03CR) 10CI reject: [V: 04-1] mediawiki: remove prometheus-nutcracker-exporter auto restart [puppet] - 10https://gerrit.wikimedia.org/r/868403 (owner: 10Effie Mouzeli) [13:35:42] (03CR) 10Clément Goubert: "Sorry, I should have caught the package and docker::credential removal issue when reviewing I37bd41014b77" [puppet] - 10https://gerrit.wikimedia.org/r/867221 (owner: 10Jaime Nuche) [13:38:35] (03CR) 10Ottomata: [C: 03+2] Backing up HDFS FSImage to HDFS on Monday morning (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/866650 (https://phabricator.wikimedia.org/T324850) (owner: 10Aqu) [13:39:24] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Grant ssh access to analytics-admins to mnz - https://phabricator.wikimedia.org/T325072 (10Ottomata) Approved. [13:40:21] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & analytics-product-users for Hxi-ctr - https://phabricator.wikimedia.org/T325004 (10Ottomata) Approved for analytics-privatedata-users. analytics-product-users approver could be @mpopov? 
[13:41:35] PROBLEM - Check systemd state on mw2376 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:42:44] (03CR) 10Ottomata: [C: 03+2] Fix systemd syntax in hadoop-namenode-backup-fetchimage [puppet] - 10https://gerrit.wikimedia.org/r/868397 (https://phabricator.wikimedia.org/T324850) (owner: 10Aqu) [13:44:25] PROBLEM - Check systemd state on mw2310 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:45:47] (03PS2) 10Effie Mouzeli: mediawiki: remove prometheus-nutcracker-exporter auto restart [puppet] - 10https://gerrit.wikimedia.org/r/868403 [13:48:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10BTullis) kafka-stretch1001 worked ok with the new raid config. I'm just going to rebuild kafka-stretch1002 because although the drives are in t... 
[13:49:11] PROBLEM - Check systemd state on mw2295 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:49:31] (03CR) 10Aqu: Backing up HDFS FSImage to HDFS on Monday morning (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/866650 (https://phabricator.wikimedia.org/T324850) (owner: 10Aqu) [13:49:39] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/868403 (owner: 10Effie Mouzeli) [13:49:55] (03PS5) 10Muehlenhoff: Make puppetdb[12]003 puppetdb nodes [puppet] - 10https://gerrit.wikimedia.org/r/863255 [13:50:31] PROBLEM - Check systemd state on mw1361 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:51:09] (03CR) 10Nikerabbit: [V: 03+2] Localisation updates from https://translatewiki.net. 
[phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/868379 (owner: 10L10n-bot) [13:51:25] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-stretch1002.eqiad.wmnet with OS bullseye [13:51:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host kafka-stretch1002.eqiad.wmnet with OS bullseye [13:53:19] (03PS1) 10Ilias Sarantopoulos: ml-services: use the same image for revscoring models [deployment-charts] - 10https://gerrit.wikimedia.org/r/868407 (https://phabricator.wikimedia.org/T323586) [13:56:14] (03PS2) 10Ilias Sarantopoulos: ml-services: use the same image for revscoring models [deployment-charts] - 10https://gerrit.wikimedia.org/r/868407 (https://phabricator.wikimedia.org/T323586) [14:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221215T1400) [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: OwO what's this, a deployment window?? UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221215T1400). nyaa~ [14:00:05] arlolra, PleaseStand, and sergi0: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[14:00:17] hello [14:00:23] I'm also here [14:01:07] I’m here but only for the first half hour [14:01:19] (03PS2) 10Matthias Mullie: [SearchVue] Change protocol-less urls to https [mediawiki-config] - 10https://gerrit.wikimedia.org/r/868224 (https://phabricator.wikimedia.org/T324446) [14:01:26] (03PS3) 10Matthias Mullie: [beta] [SearchVue] Change protocol-less urls to https [mediawiki-config] - 10https://gerrit.wikimedia.org/r/868224 (https://phabricator.wikimedia.org/T324446) [14:02:13] arlolra doesn’t seem to be online yet, so I’ll start with PleaseStand [14:03:20] Lucas_WMDE: do you mind me merging this beta-only config patch? (LMK when appropriate so I don't interfere - don't want to cause confusion with an undeployed patch :p) [14:03:32] *this: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/868224 [14:03:44] idk how scap backport will handle that so I’d prefer if you could wait ^^ [14:03:53] o/ [14:04:12] Lucas_WMDE: I can wait; LMK when you're done :p [14:04:12] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/867307 (https://phabricator.wikimedia.org/T231412) (owner: 10PleaseStand) [14:04:17] matthiasmullie: ok, thanks [14:04:34] (03PS4) 10Lucas Werkmeister (WMDE): Remove obsolete setting $wgAutoloadAttemptLowercase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/867307 (https://phabricator.wikimedia.org/T231412) (owner: 10PleaseStand) [14:04:40] (03CR) 10TrainBranchBot: "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/867307 (https://phabricator.wikimedia.org/T231412) (owner: 10PleaseStand) [14:04:40] FYI - I encountered one last week; scap backport actually warns you before deploying that there's another patch [14:04:48] Lucas_WMDE: scap backport will work properly with beta only patches, so just merge and fetch but not sync [14:05:25] Ah, I can wait until 
the rest of the deployment work is done :) [14:05:51] (03PS3) 10Effie Mouzeli: mediawiki: remove prometheus-nutcracker-exporter auto restart [puppet] - 10https://gerrit.wikimedia.org/r/868403 [14:05:53] (03Merged) 10jenkins-bot: Remove obsolete setting $wgAutoloadAttemptLowercase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/867307 (https://phabricator.wikimedia.org/T231412) (owner: 10PleaseStand) [14:06:08] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:867307|Remove obsolete setting $wgAutoloadAttemptLowercase (T231412)]] [14:06:13] T231412: Deprecate and remove $wgAutoloadAttemptLowercase - https://phabricator.wikimedia.org/T231412 [14:06:23] RECOVERY - Check systemd state on an-master1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:57] taavi: I assume “just merge and fetch but not sync” is what would happen if I ran `scap backport`? [14:07:11] whereas matthiasmullie just wanted to +2 the patch himself IIUC ^^ [14:07:15] PROBLEM - Check systemd state on mw2298 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:49] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and ki: Backport for [[gerrit:867307|Remove obsolete setting $wgAutoloadAttemptLowercase (T231412)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [14:08:14] PleaseStand: is it possible to test this change? 
[14:08:25] (03CR) 10Muehlenhoff: [C: 03+2] Make puppetdb[12]003 puppetdb nodes [puppet] - 10https://gerrit.wikimedia.org/r/863255 (owner: 10Muehlenhoff) [14:08:33] * Lucas_WMDE quickly checks that https://en.wikipedia.org/wiki/Special:Version doesn’t explode on mwdebug [14:09:52] Lucas_WMDE: The removed config setting is obsolete, so there should be no visible change. [14:10:57] ok, thanks [14:11:01] syncing [14:12:05] (03CR) 10Elukey: [C: 03+2] "\o/" [deployment-charts] - 10https://gerrit.wikimedia.org/r/868407 (https://phabricator.wikimedia.org/T323586) (owner: 10Ilias Sarantopoulos) [14:12:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10Jclark-ctr) @Cmjohnson reseated [14:12:20] (03CR) 10Klausman: [C: 03+1] ml-services: remove translatewiki and frwikisource isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/867601 (https://phabricator.wikimedia.org/T324567) (owner: 10AikoChou) [14:12:48] (03CR) 10Effie Mouzeli: "pcc https://puppet-compiler.wmflabs.org/output/868403/38813/" [puppet] - 10https://gerrit.wikimedia.org/r/868403 (owner: 10Effie Mouzeli) [14:12:53] (03CR) 10Effie Mouzeli: [C: 03+2] mediawiki: remove prometheus-nutcracker-exporter auto restart [puppet] - 10https://gerrit.wikimedia.org/r/868403 (owner: 10Effie Mouzeli) [14:14:01] (03PS1) 10Jgiannelos: Exclude OSM tag that causes a failing import [puppet] - 10https://gerrit.wikimedia.org/r/868415 (https://phabricator.wikimedia.org/T325293) [14:14:20] (03CR) 10CI reject: [V: 04-1] Exclude OSM tag that causes a failing import [puppet] - 10https://gerrit.wikimedia.org/r/868415 (https://phabricator.wikimedia.org/T325293) (owner: 10Jgiannelos) [14:14:50] (03PS2) 10Jgiannelos: Exclude OSM tag that causes a failing import [puppet] - 10https://gerrit.wikimedia.org/r/868415 (https://phabricator.wikimedia.org/T325293) [14:14:51] PROBLEM - Check systemd state on mw1440 is CRITICAL: CRITICAL - degraded: The 
following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:36] (03PS2) 10Lucas Werkmeister (WMDE): Disable wgParserEnableLegacyMediaDOM on cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866653 (https://phabricator.wikimedia.org/T297984) (owner: 10Arlolra) [14:16:19] (03CR) 10Jgiannelos: "MSantos:" [puppet] - 10https://gerrit.wikimedia.org/r/868415 (https://phabricator.wikimedia.org/T325293) (owner: 10Jgiannelos) [14:17:06] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:867307|Remove obsolete setting $wgAutoloadAttemptLowercase (T231412)]] (duration: 10m 57s) [14:17:10] T231412: Deprecate and remove $wgAutoloadAttemptLowercase - https://phabricator.wikimedia.org/T231412 [14:17:28] (03PS1) 10Jbond: Wmflib: add port range type validator [puppet] - 10https://gerrit.wikimedia.org/r/868416 [14:17:39] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866653 (https://phabricator.wikimedia.org/T297984) (owner: 10Arlolra) [14:17:47] (03CR) 10Jbond: [C: 03+2] Wmflib: add port range type validator [puppet] - 10https://gerrit.wikimedia.org/r/868416 (owner: 10Jbond) [14:17:51] (03CR) 10CI reject: [V: 04-1] Wmflib: add port range type validator [puppet] - 10https://gerrit.wikimedia.org/r/868416 (owner: 10Jbond) [14:18:05] sergi0: I don’t think I’ll have time to deploy your backport before I have to go into a meeting, sorry [14:18:11] maybe someone else is around who can deploy them [14:18:16] (*backports, plural) [14:18:18] (03Merged) 10jenkins-bot: Disable wgParserEnableLegacyMediaDOM on cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866653 (https://phabricator.wikimedia.org/T297984) (owner: 10Arlolra) [14:18:24] (03PS2) 10Jbond: Wmflib: add port range type validator [puppet] - 10https://gerrit.wikimedia.org/r/868416 
[14:18:31] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:866653|Disable wgParserEnableLegacyMediaDOM on cawiki (T297984 T314318)]] [14:18:36] T297984: Media html read view considerations - https://phabricator.wikimedia.org/T297984 [14:18:36] T314318: Disable wgParserEnableLegacyMediaDOM on all wikis - https://phabricator.wikimedia.org/T314318 [14:18:53] (03CR) 10CI reject: [V: 04-1] Wmflib: add port range type validator [puppet] - 10https://gerrit.wikimedia.org/r/868416 (owner: 10Jbond) [14:20:04] (03PS3) 10Jbond: Wmflib: add port range type validator [puppet] - 10https://gerrit.wikimedia.org/r/868416 [14:20:15] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and arlolra: Backport for [[gerrit:866653|Disable wgParserEnableLegacyMediaDOM on cawiki (T297984 T314318)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [14:20:20] (03CR) 10Jbond: [C: 03+2] Wmflib: add port range type validator [puppet] - 10https://gerrit.wikimedia.org/r/868416 (owner: 10Jbond) [14:20:32] arlolra_: can you test the cawiki change on mwdebug? 
[14:20:40] Yup, one sec [14:21:55] Looks good [14:21:59] ok thanks [14:23:25] PROBLEM - Check systemd state on mw2399 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:23:30] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudbackup1003.eqiad.wmnet [14:24:28] Lucas_WMDE: no worries, I'll wait see if someone is around and re-schedule for later if not, ty [14:24:43] PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:25:29] hmm, the deployment calendar still has most of the non-train windows for the next week, afaict [14:25:30] What is left to deploy? [14:25:38] but the yearly calendar says “no train or deploys” [14:25:43] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudweb2002-dev.wikimedia.org [14:25:47] Reedy: sergi0’s two backports for GrowthExperiments [14:26:01] also matthiasmullie still needs to merge a config-only change [14:26:10] (php-fpm-restart at 40% on my end btw) [14:26:25] !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [14:26:52] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-stretch1002.eqiad.wmnet with reason: host reimage [14:27:15] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
[14:27:49] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:866653|Disable wgParserEnableLegacyMediaDOM on cawiki (T297984 T314318)]] (duration: 09m 18s) [14:27:54] T297984: Media html read view considerations - https://phabricator.wikimedia.org/T297984 [14:27:54] T314318: Disable wgParserEnableLegacyMediaDOM on all wikis - https://phabricator.wikimedia.org/T314318 [14:28:06] matthiasmullie: I’m done, you can merge your change [14:28:25] (03PS4) 10Matthias Mullie: [beta] [SearchVue] Change protocol-less urls to https [mediawiki-config] - 10https://gerrit.wikimedia.org/r/868224 (https://phabricator.wikimedia.org/T324446) [14:28:29] thanks [14:29:09] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by mlitn@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/868224 (https://phabricator.wikimedia.org/T324446) (owner: 10Matthias Mullie) [14:29:58] (03PS1) 10Bking: wdqs: add nofail to NFS mount options [puppet] - 10https://gerrit.wikimedia.org/r/868418 (https://phabricator.wikimedia.org/T325114) [14:30:00] (03Merged) 10jenkins-bot: [beta] [SearchVue] Change protocol-less urls to https [mediawiki-config] - 10https://gerrit.wikimedia.org/r/868224 (https://phabricator.wikimedia.org/T324446) (owner: 10Matthias Mullie) [14:30:27] PROBLEM - Check systemd state on mw1389 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:29] I'm done; anyone else? 
[14:30:33] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudbackup1003.eqiad.wmnet [14:30:38] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudbackup1004.eqiad.wmnet [14:31:03] o/ [14:31:25] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-stretch1002.eqiad.wmnet with reason: host reimage [14:31:46] Anyone around can help with the remaining GrowthExperiments patches? [14:32:40] jouncebot: next [14:32:40] In 2 hour(s) and 27 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221215T1700) [14:32:55] Yeah, I'm just having a look [14:32:56] I might be able to deploy later (since there’ll be some time before the puppet window) [14:33:00] ok [14:33:24] sergi0: I presume both can go out together? [14:33:38] that's right [14:33:40] (makes it quicker for everyone) [14:33:49] Lucas_WMDE: thanks [14:34:23] Hmm [14:34:32] They've not actually merged into master (but Gergo has +2'd them) [14:35:40] (03PS2) 10Reedy: User impact: instrument info tooltip clicks [extensions/GrowthExperiments] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868064 (https://phabricator.wikimedia.org/T325041) (owner: 10Sergio Gimeno) [14:35:44] (03CR) 10Reedy: [C: 03+2] User impact: instrument info tooltip clicks [extensions/GrowthExperiments] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868064 (https://phabricator.wikimedia.org/T325041) (owner: 10Sergio Gimeno) [14:35:48] (03CR) 10Reedy: [C: 03+2] Vue components: react to binding updates of v-click-outside directive [extensions/GrowthExperiments] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868063 (https://phabricator.wikimedia.org/T325041) (owner: 10Sergio Gimeno) [14:36:01] just rebase to stack them to get rid of the merge commit chance [14:37:27] PROBLEM - Check systemd state on mw1379 is CRITICAL: CRITICAL - degraded: The following units failed: 
wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:38:02] (03PS1) 10JMeybohm: Update cert-manager to 1.10.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/868422 (https://phabricator.wikimedia.org/T325292) [14:38:03] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudbackup1004.eqiad.wmnet [14:38:31] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [14:39:25] Reedy: sorry, I didn't get your last comment. They show up to date with wmf.14 branch. Do you mean I need to rebase the master changes? [14:39:28] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & analytics-product-users for Hxi-ctr - https://phabricator.wikimedia.org/T325004 (10mpopov) Yes, please list myself and @kzimmerman as approvers for analytics-product-users. And approved :) [14:40:53] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [14:42:11] PROBLEM - Disk space on puppetdb2003 is CRITICAL: DISK CRITICAL - /run/credentials/systemd-sysctl.service is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=puppetdb2003&var-datasource=codfw+prometheus/ops [14:43:59] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudweb1004.wikimedia.org [14:45:30] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-stretch1002.eqiad.wmnet with OS bullseye [14:45:30] sergi0: The changes they are cherry picked from (ie in master) weren't merged into master [14:45:35] Even though Gergo +2'd them hours ago [14:45:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host kafka-stretch1002.eqiad.wmnet with OS bullseye co... [14:45:47] I've just rebased them and reapplied +2s and they seem to be going through the gate [14:46:13] (While it's not always the case, it's a usual expectation that cherry picks to deployment branches would have been reviewed and merged into master first) [14:46:14] oh, got you. 
Thank you [14:46:54] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] OpenStack nova: add a default for profile::openstack::base::nova::instance_dev [puppet] - 10https://gerrit.wikimedia.org/r/868167 (owner: 10Andrew Bogott) [14:47:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10BTullis) [14:47:38] (03CR) 10Andrew Bogott: "This breaks puppet runs on wikitech hosts" [puppet] - 10https://gerrit.wikimedia.org/r/868403 (owner: 10Effie Mouzeli) [14:47:47] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack nova: add a default for profile::openstack::base::nova::instance_dev [puppet] - 10https://gerrit.wikimedia.org/r/868167 (owner: 10Andrew Bogott) [14:51:37] (03PS1) 10Andrew Bogott: Revert "mediawiki: remove prometheus-nutcracker-exporter auto restart" [puppet] - 10https://gerrit.wikimedia.org/r/868354 [14:51:45] (JobUnavailable) firing: Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:53:23] (03CR) 10CI reject: [V: 04-1] Revert "mediawiki: remove prometheus-nutcracker-exporter auto restart" [puppet] - 10https://gerrit.wikimedia.org/r/868354 (owner: 10Andrew Bogott) [14:54:04] (03PS1) 10Hashar: wm-checks-api: add support for Puppet Catalogue Compiler [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/868424 [14:54:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10BTullis) 05Open→03Resolved I think that there are all done now. [14:56:06] (03PS2) 10Andrew Bogott: Revert "mediawiki: remove prometheus-nutcracker-exporter auto restart" [puppet] - 10https://gerrit.wikimedia.org/r/868354 [14:57:17] CI is nearly done... 
[14:58:10] (03CR) 10Hashar: "Hi John, this patch is for the Gerrit Checks API UI enhancement which I have deployed this week. The code should be able to recognize mess" [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/868424 (owner: 10Hashar) [14:58:38] (03Merged) 10jenkins-bot: Vue components: react to binding updates of v-click-outside directive [extensions/GrowthExperiments] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868063 (https://phabricator.wikimedia.org/T325041) (owner: 10Sergio Gimeno) [14:58:41] (03Merged) 10jenkins-bot: User impact: instrument info tooltip clicks [extensions/GrowthExperiments] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868064 (https://phabricator.wikimedia.org/T325041) (owner: 10Sergio Gimeno) [14:58:55] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (10BTullis) Now kafka-stretch2001 is the only one of these four kafka-stretch hosts left with the drive order reversed. ` btullis@cumin1001:~$ sudo... 
[14:59:36] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-stretch2001.codfw.wmnet with OS bullseye [14:59:43] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host kafka-stretch2001.codfw.wmnet with OS bullseye [14:59:48] (03PS3) 10Arturo Borrero Gonzalez: puppetmaster: git-sync-upstream: use the gitpuppet user for git operations [puppet] - 10https://gerrit.wikimedia.org/r/868400 (https://phabricator.wikimedia.org/T325280) [15:00:13] (03PS4) 10Arturo Borrero Gonzalez: puppetmaster: git-sync-upstream: use the gitpuppet user for git operations [puppet] - 10https://gerrit.wikimedia.org/r/868400 (https://phabricator.wikimedia.org/T325280) [15:00:27] yay [15:01:27] sergi0: Do you need/want to test them on a staging host first? Or not fussed if they just go straight out? [15:02:10] (03PS2) 10JMeybohm: Update cert-manager to 1.10.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/868422 (https://phabricator.wikimedia.org/T325292) [15:02:18] I think they can go straight out, I'll check there aren't invalid events request after [15:02:37] (03CR) 10Jbond: "lgtm see comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney) [15:05:02] !log imported prometheus-jmx-exporter for bookworm-wikimedia T321783 [15:05:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:06] T321783: Setup an initial bookworm host pair with Puppetdb 7 - https://phabricator.wikimedia.org/T321783 [15:05:11] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
[15:06:14] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [15:06:39] !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [15:06:45] (JobUnavailable) resolved: Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:10:39] (03PS1) 10Giuseppe Lavagetto: php-multiversion-base: add sendmail [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/868428 (https://phabricator.wikimedia.org/T325131) [15:12:11] (03PS5) 10Jbond: sre.hosts.reboot-single: add ability to enable host on reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/868077 (https://phabricator.wikimedia.org/T325153) [15:12:15] (03CR) 10Elukey: "LGTM! Left a little nit for the build's changelog, the rest looks good." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/868422 (https://phabricator.wikimedia.org/T325292) (owner: 10JMeybohm) [15:12:23] !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [15:12:45] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [15:15:03] !log reedy@deploy1002 Synchronized php-1.40.0-wmf.14/extensions/GrowthExperiments/: Two backports (duration: 06m 57s) [15:15:11] sergi0: ^^ [15:15:30] !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [15:15:44] Reedy: is it synced? [15:15:57] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [15:16:14] yeah [15:16:47] Reedy: All good. 
I don't see errors. [15:17:01] (03PS6) 10Jbond: sre.hosts.reboot-single: add ability to enable host on reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/868077 (https://phabricator.wikimedia.org/T325153) [15:17:03] (03PS1) 10Jbond: sre.hosts.reboot-single: Simplify icinga logic [cookbooks] - 10https://gerrit.wikimedia.org/r/868430 (https://phabricator.wikimedia.org/T325153) [15:17:48] Reedy: thank you very much! [15:17:53] (03CR) 10Jbond: "thanks" [cookbooks] - 10https://gerrit.wikimedia.org/r/868077 (https://phabricator.wikimedia.org/T325153) (owner: 10Jbond) [15:18:00] * Lucas_WMDE back [15:18:03] just too late, it seems ^^ [15:18:08] thanks Reedy! [15:18:52] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Setup an initial bookworm host pair with Puppetdb 7 - https://phabricator.wikimedia.org/T321783 (10jbond) [[ https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1026163 | upstream bug re java 11 ]] [15:18:59] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:22:30] (03CR) 10Ottomata: flink-kubernetes-operator - modify for WMF and add an admin_ng helmfile (036 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [15:22:30] !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[15:23:16] (03CR) 10Ottomata: flink-kubernetes-operator - modify for WMF and add an admin_ng helmfile (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [15:23:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:24:03] (03PS19) 10Ottomata: flink-kubernetes-operator - modify for WMF and add an admin_ng helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) [15:24:05] !log sukhe@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM durum1001.eqiad.wmnet [15:24:48] (03CR) 10JMeybohm: "Another thing I forgot (sorry): For prod, you currently need support for overwriting k8s apiserver env variables, like https://gerrit.wiki" [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [15:24:52] (03PS9) 10Ottomata: flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) [15:24:57] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [15:25:41] (03CR) 10CI reject: [V: 04-1] flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [15:28:07] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:28:43] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:29:53] ^ please ignore these, durum reboots in progress. 
they should resolve. [15:29:56] 10.64.48.95 Down 0.900 2.000 3 [15:29:59] this is durum1001 [15:30:39] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [15:31:43] !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [15:32:55] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/868087 (https://phabricator.wikimedia.org/T325168) (owner: 10Volans) [15:34:40] !log temporary disabling puppet on A:cumin-all to deploy spicerack v6.0.0 [15:34:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:28] !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [15:37:56] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[15:38:03] (03CR) 10Volans: [C: 04-1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/868377 (https://phabricator.wikimedia.org/T325168) (owner: 10Volans) [15:38:12] (03PS3) 10Volans: cookbooks.sre: add title for the group [cookbooks] - 10https://gerrit.wikimedia.org/r/868088 [15:39:16] (03PS1) 10Jbond: nginx: let puppet pick the correct provider [puppet] - 10https://gerrit.wikimedia.org/r/868431 (https://phabricator.wikimedia.org/T321783) [15:40:08] (03CR) 10Volans: [C: 03+2] cumin: fix email address for insetup role audit [puppet] - 10https://gerrit.wikimedia.org/r/868396 (owner: 10Volans) [15:41:16] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs2009.codfw.wmnet [15:41:57] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/868431 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond) [15:42:20] (03CR) 10DCausse: [C: 03+1] wdqs: add nofail to NFS mount options [puppet] - 10https://gerrit.wikimedia.org/r/868418 (https://phabricator.wikimedia.org/T325114) (owner: 10Bking) [15:43:34] RECOVERY - Disk space on puppetdb2003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=puppetdb2003&var-datasource=codfw+prometheus/ops [15:44:08] PROBLEM - Host durum1001 is DOWN: PING CRITICAL - Packet loss = 100% [15:44:53] it gave up! [15:45:54] nic renamed? [15:47:10] not sure, but unlikely I guess given I have had no issues in the previous reboots. looking! 
[15:48:09] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38815/console" [puppet] - 10https://gerrit.wikimedia.org/r/868431 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond) [15:51:22] (03PS1) 10Giuseppe Lavagetto: mediawiki: add configuration for sendmail [deployment-charts] - 10https://gerrit.wikimedia.org/r/868432 (https://phabricator.wikimedia.org/T325131) [15:52:22] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/868430 (https://phabricator.wikimedia.org/T325153) (owner: 10Jbond) [15:52:37] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/868418 (https://phabricator.wikimedia.org/T325114) (owner: 10Bking) [15:54:24] (03PS10) 10Ottomata: flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) [15:54:45] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-stretch2001.codfw.wmnet with reason: host reimage [15:55:13] (03CR) 10Ottomata: flink-kubernetes-operator - modify for WMF and add an admin_ng helmfile (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [15:55:15] (03CR) 10CI reject: [V: 04-1] flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [15:55:52] (03CR) 10Volans: [C: 03+1] "LGTM, modulo the decision on where to put the icinga check." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/868077 (https://phabricator.wikimedia.org/T325153) (owner: 10Jbond) [15:57:06] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/868378 (https://phabricator.wikimedia.org/T325168) (owner: 10Volans) [15:57:34] !log installing openexr security updates [15:57:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:47] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-stretch2001.codfw.wmnet with reason: host reimage [16:00:13] (03PS1) 10RobH: updating for gen 15 r650xs [software] - 10https://gerrit.wikimedia.org/r/868435 [16:00:16] (03CR) 10Bking: [C: 03+2] wdqs: add nofail to NFS mount options [puppet] - 10https://gerrit.wikimedia.org/r/868418 (https://phabricator.wikimedia.org/T325114) (owner: 10Bking) [16:00:19] (03CR) 10CI reject: [V: 04-1] updating for gen 15 r650xs [software] - 10https://gerrit.wikimedia.org/r/868435 (owner: 10RobH) [16:01:16] (03Abandoned) 10RobH: updating for gen 15 r650xs [software] - 10https://gerrit.wikimedia.org/r/868435 (owner: 10RobH) [16:02:09] !log switching maps/kartotherian back to codfw [16:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:51] !log installing glibc security updates on bullseye [16:02:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:59] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-stretch2001.codfw.wmnet with OS bullseye [16:03:06] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host kafka-stretch2001.codfw.wmnet with OS bullseye ex... 
[16:03:54] (03PS1) 10RobH: updating R650xs skus [software] - 10https://gerrit.wikimedia.org/r/868436 [16:04:02] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, 10Platform Team Workboards (Platform Engineering Reliability): byte/str TypeError during svg conversion - https://phabricator.wikimedia.org/T325150 (10Vlad.shapik) a:05hnowlan→03Vlad.shapik [16:04:09] (03CR) 10RobH: [C: 03+2] updating R650xs skus [software] - 10https://gerrit.wikimedia.org/r/868436 (owner: 10RobH) [16:04:37] (03Merged) 10jenkins-bot: updating R650xs skus [software] - 10https://gerrit.wikimedia.org/r/868436 (owner: 10RobH) [16:04:56] (03PS20) 10Ottomata: flink-kubernetes-operator - modify for WMF and add an admin_ng helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) [16:06:57] (03PS11) 10Ottomata: flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) [16:07:56] (03CR) 10Andrew Bogott: [C: 03+2] Revert "mediawiki: remove prometheus-nutcracker-exporter auto restart" [puppet] - 10https://gerrit.wikimedia.org/r/868354 (owner: 10Andrew Bogott) [16:07:58] (03CR) 10CI reject: [V: 04-1] flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [16:12:35] (03PS1) 10Muehlenhoff: elasticsearch: Enable profile::auto_restarts::service for Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/868438 (https://phabricator.wikimedia.org/T135991) [16:14:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10Ottomata) WOWWW THANK YOU! 
[16:15:33] (KeyholderUnarmed) firing: (2) 1 unarmed Keyholder key(s) on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [16:16:09] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/868438 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [16:17:32] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host wdqs2009.codfw.wmnet [16:18:01] RECOVERY - Check systemd state on mw1351 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:01] RECOVERY - Check systemd state on mw1475 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:05] RECOVERY - Check systemd state on mw1402 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:05] RECOVERY - Check systemd state on mw1379 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:07] RECOVERY - Check systemd state on mw1423 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:09] RECOVERY - Check systemd state on mw1464 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:09] RECOVERY - Check systemd state on mw1494 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:09] RECOVERY - Check systemd state on mw1361 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:13] RECOVERY - Check systemd state on mw1447 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:13] RECOVERY - Check systemd state on mw1482 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:15] RECOVERY - Check systemd state on mw1415 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:15] RECOVERY - Check systemd state on parse1018 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:19] RECOVERY - Check systemd state on mw1435 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:19] RECOVERY - Check systemd state on mw1449 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:19] RECOVERY - Check systemd state on mw1439 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:25] RECOVERY - Check systemd state on mwdebug1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:25] 10SRE, 10SRE-Access-Requests: Requesting access to Turnilo for USER:wfan - https://phabricator.wikimedia.org/T324057 (10jhathaway) @AnnWF sorry I missed adding you to the wmf group as well, try now please! 
[16:18:29] RECOVERY - Check systemd state on mw2382 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:29] RECOVERY - Check systemd state on mw2374 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:31] RECOVERY - Check systemd state on mw1390 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:31] RECOVERY - Check systemd state on mw1440 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:31] RECOVERY - Check systemd state on parse1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:33] RECOVERY - Check systemd state on mw2295 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:34] RECOVERY - Check systemd state on parse2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:35] RECOVERY - Check systemd state on mw2286 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:39] RECOVERY - Check systemd state on mw2371 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:39] RECOVERY - Check systemd state on mw1394 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:39] RECOVERY - Check systemd state on mw1497 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:41] RECOVERY - Check systemd state on mw1450 is OK: OK - running: The system is fully 
operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:43] RECOVERY - Check systemd state on mw2269 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:45] RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:45] RECOVERY - Check systemd state on mw2387 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:49] RECOVERY - Check systemd state on mw1448 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:49] RECOVERY - Check systemd state on mwdebug2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:49] RECOVERY - Check systemd state on mw1474 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:49] RECOVERY - Check systemd state on mw2388 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:49] RECOVERY - Check systemd state on mw1401 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:51] RECOVERY - Check systemd state on mw1418 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:51] RECOVERY - Check systemd state on mw1416 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:53] RECOVERY - Check systemd state on mw1443 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:53] RECOVERY 
- Check systemd state on parse1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:55] RECOVERY - Check systemd state on mw2372 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:59] RECOVERY - Check systemd state on parse2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:59] RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:59] RECOVERY - Check systemd state on mw2310 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:19:01] RECOVERY - Check systemd state on mw2331 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:19:01] RECOVERY - Check systemd state on mw2350 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:19:05] RECOVERY - Check systemd state on mw1454 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:19:06] RECOVERY - Check systemd state on parse2018 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:19:07] RECOVERY - Check systemd state on mw1404 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:19:07] RECOVERY - Check systemd state on mw1436 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:19:07] RECOVERY - Check systemd state on mw1378 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:19:07] RECOVERY - Check systemd state on mw2272 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:19:07] RECOVERY - Check systemd state on mw2311 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:19:15] RECOVERY - Check systemd state on mw2316 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:19:15] RECOVERY - Check systemd state on mw1356 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:19:15] RECOVERY - Check systemd state on mw2271 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:19:15] RECOVERY - Check systemd state on mw2406 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:19:19] RECOVERY - Check systemd state on mw1389 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:19:25] RECOVERY - Check systemd state on mw1489 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:19:25] RECOVERY - Check systemd state on parse2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:19:29] RECOVERY - Check systemd state on mw2259 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:19:33] RECOVERY - Check systemd state on mwdebug2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:19:35] RECOVERY - Check 
systemd state on mw2399 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:19:37] RECOVERY - Check systemd state on mw2273 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:19:43] RECOVERY - Check systemd state on mw2298 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:19:43] RECOVERY - Check systemd state on mw2276 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:19:45] RECOVERY - Check systemd state on mw2367 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:19:45] RECOVERY - Check systemd state on mw2376 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:21:55] RECOVERY - Check systemd state on mw1417 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:21:55] RECOVERY - Check systemd state on mw2385 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:22:33] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudweb1004.wikimedia.org [16:22:36] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.ganeti.reboot-vm (exit_code=97) for VM durum1001.eqiad.wmnet [16:23:27] RECOVERY - Check systemd state on mw1441 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:23:37] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudweb1003.wikimedia.org [16:23:51] (03CR) 10Volans: [C: 03+2] spicerack: update config for v6.0.0 [puppet] - 
10https://gerrit.wikimedia.org/r/868377 (https://phabricator.wikimedia.org/T325168) (owner: 10Volans) [16:24:34] !log upgrading spicerack to v6.0.0 on cumin2002 [16:24:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:42] (03PS1) 10David Caro: toolforge: add missing /etc/toolforge dir [puppet] - 10https://gerrit.wikimedia.org/r/868439 [16:25:03] (03CR) 10CI reject: [V: 04-1] toolforge: add missing /etc/toolforge dir [puppet] - 10https://gerrit.wikimedia.org/r/868439 (owner: 10David Caro) [16:25:18] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudweb2002-dev.wikimedia.org [16:26:16] (03CR) 10Volans: [C: 03+2] cookbooks: remove top-level __init__.py [cookbooks] - 10https://gerrit.wikimedia.org/r/868087 (https://phabricator.wikimedia.org/T325168) (owner: 10Volans) [16:26:31] RECOVERY - Check systemd state on mw1431 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:26:32] (03CR) 10Andrew Bogott: [C: 03+2] puppetmasters: cache cleanup [puppet] - 10https://gerrit.wikimedia.org/r/866644 (owner: 10Andrew Bogott) [16:26:33] RECOVERY - Check systemd state on mwdebug1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:26:39] (03PS1) 10Bking: wdqs: allow mounting clouddumps share from wdqs2009 [puppet] - 10https://gerrit.wikimedia.org/r/868440 (https://phabricator.wikimedia.org/T323096) [16:27:41] (03PS2) 10David Caro: toolforge: add missing /etc/toolforge dir [puppet] - 10https://gerrit.wikimedia.org/r/868439 [16:27:59] RECOVERY - Check systemd state on mw1462 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:28:00] (03Merged) 10jenkins-bot: cookbooks: remove top-level __init__.py [cookbooks] - 10https://gerrit.wikimedia.org/r/868087 
(https://phabricator.wikimedia.org/T325168) (owner: 10Volans) [16:28:01] RECOVERY - Check systemd state on mw2356 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:28:01] RECOVERY - Check systemd state on mw2401 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:28:16] (03PS2) 10Bking: wdqs: allow mounting clouddumps share from wdqs2009 [puppet] - 10https://gerrit.wikimedia.org/r/868440 (https://phabricator.wikimedia.org/T323096) [16:29:11] (03CR) 10Volans: [C: 03+2] cookbooks.sre: add title for the group [cookbooks] - 10https://gerrit.wikimedia.org/r/868088 (owner: 10Volans) [16:30:45] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudweb1003.wikimedia.org [16:31:28] (03Merged) 10jenkins-bot: cookbooks.sre: add title for the group [cookbooks] - 10https://gerrit.wikimedia.org/r/868088 (owner: 10Volans) [16:34:22] !log volans@cumin2002 START - Cookbook sre.hosts.downtime for 0:05:00 on cumin2002.codfw.wmnet with reason: test spicerack v6.0.0 [16:34:37] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:05:00 on cumin2002.codfw.wmnet with reason: test spicerack v6.0.0 [16:35:57] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] toolforge: add missing /etc/toolforge dir [puppet] - 10https://gerrit.wikimedia.org/r/868439 (owner: 10David Caro) [16:36:37] (03PS1) 10DLynch: Release new DiscussionTools reply button enhancement to Arabic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/868441 (https://phabricator.wikimedia.org/T323537) [16:36:48] !log upgrading spicerack to v6.0.0 on cumin1001 [16:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:49] 10SRE, 10serviceops: kubernetes102[34] implemetation tracking - https://phabricator.wikimedia.org/T313874 (10akosiaris) [16:40:12] (03CR) 10Cathal Mooney: [C: 04-1] 
"Thanks for the review! I'll get working on those bits and upload a new patchset cheers." [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney) [16:42:02] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@15e6aa7] (codfw): Revert "codfw: Disable traffic mirroring" [16:43:46] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@15e6aa7] (codfw): Revert "codfw: Disable traffic mirroring" (duration: 01m 44s) [16:43:56] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (10BTullis) OK, I think that both of these two hosts are set up correctly now. The failure in the cookbook above was only a delayed install of `per... [16:44:36] (03CR) 10David Caro: [C: 03+2] toolforge: add missing /etc/toolforge dir [puppet] - 10https://gerrit.wikimedia.org/r/868439 (owner: 10David Caro) [16:45:55] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-jumbo1010.eqiad.wmnet with OS bullseye [16:46:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host kafka-jumbo1010... 
[16:49:32] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/868440 (https://phabricator.wikimedia.org/T323096) (owner: 10Bking) [16:50:29] !log denisse@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus3001.esams.wmnet [16:50:29] (03PS1) 10AikoChou: ml-services: update revertrisk docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/868442 (https://phabricator.wikimedia.org/T325218) [16:51:58] (03PS2) 10AikoChou: ml-services: update revertrisk docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/868442 (https://phabricator.wikimedia.org/T325218) [16:52:45] !log aphlict1001 - rebooting - this could mean for a minute there are dropped notifications about Phabricator tickets [16:52:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:26] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash2002.codfw.wmnet with OS bullseye [16:55:03] PROBLEM - Check systemd state on aphlict1001 is CRITICAL: CRITICAL - degraded: The following units failed: aphlict.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:55:57] well, machine is back, service is not :p as g.odog would say: sadtrombone.wav [16:56:28] !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus3001.esams.wmnet [16:57:40] ok, had to delete pid file manually and restart it [16:57:51] did not properly shut down on reboot [16:58:03] (03PS1) 10Jbond: [DRAFT] spicerack: refactor puppetization [puppet] - 10https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168) [16:58:09] RECOVERY - Check systemd state on aphlict1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:58:22] (03CR) 10CI reject: [V: 04-1] [DRAFT] spicerack: refactor puppetization [puppet] - 10https://gerrit.wikimedia.org/r/868443 
(https://phabricator.wikimedia.org/T325168) (owner: 10Jbond) [16:58:39] !log aphlict1001 - aphlict service did not come back after rebooting machine. fix was to manually 'rm /var/run/aphlict/aphlict.pid' and 'systemctl start aphlict' [16:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:05] jbond and rzl: Your horoscope predicts another unfortunate Puppet request window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221215T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. [17:02:16] (03PS2) 10Jbond: [DRAFT] spicerack: refactor puppetization [puppet] - 10https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168) [17:02:36] (03CR) 10CI reject: [V: 04-1] [DRAFT] spicerack: refactor puppetization [puppet] - 10https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168) (owner: 10Jbond) [17:04:10] !log phab2002 (non active phabricator server) - rebooting [17:04:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:35] (03CR) 10DCausse: [C: 03+1] wdqs: allow mounting clouddumps share from wdqs2009 [puppet] - 10https://gerrit.wikimedia.org/r/868440 (https://phabricator.wikimedia.org/T323096) (owner: 10Bking) [17:05:17] PROBLEM - Host phab2002 is DOWN: PING CRITICAL - Packet loss = 100% [17:06:32] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 0:10:00 on phab2002.codfw.wmnet with reason: reboot [17:06:38] (03PS3) 10Jbond: [DRAFT] spicerack: refactor puppetization [puppet] - 10https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168) [17:06:47] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on phab2002.codfw.wmnet with reason: reboot [17:06:51] RECOVERY - Host phab2002 is UP: PING OK - Packet loss = 0%, RTA = 33.25 ms [17:07:05] (03CR) 10Bking: [C: 03+2] wdqs: allow mounting clouddumps share from wdqs2009 
[puppet] - 10https://gerrit.wikimedia.org/r/868440 (https://phabricator.wikimedia.org/T323096) (owner: 10Bking) [17:07:29] (03PS8) 10Volans: base::cloud_production: introduce new profile [puppet] - 10https://gerrit.wikimedia.org/r/867169 (https://phabricator.wikimedia.org/T319401) [17:08:45] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/867169 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [17:10:29] (03CR) 10Arturo Borrero Gonzalez: puppetmaster: git-sync-upstream: use the gitpuppet user for git operations (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868400 (https://phabricator.wikimedia.org/T325280) (owner: 10Arturo Borrero Gonzalez) [17:11:43] (03PS4) 10Jbond: [DRAFT] spicerack: refactor puppetization [puppet] - 10https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168) [17:13:35] (03PS5) 10Jbond: [DRAFT] spicerack: refactor puppetization [puppet] - 10https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168) [17:14:16] (03CR) 10Ottomata: flink-kubernetes-operator - modify for WMF and add an admin_ng helmfile (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [17:15:09] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (10Ottomata) YAYYY [17:17:29] !log manually removed /etc/spicerack/redis_cluster/sessions.yaml from the cumin hosts [17:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:35] cc moritzm effie ^^^ [17:21:35] !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash2002.codfw.wmnet with reason: host reimage [17:21:45] (03PS1) 10Volans: cookbook: improve help message [software/spicerack] - 10https://gerrit.wikimedia.org/r/868445 [17:24:21] !log dzahn@cumin2002 START - Cookbook 
sre.hosts.downtime for 0:10:00 on miscweb2002.codfw.wmnet with reason: reboot [17:24:42] !log miscweb2002 - passive host, rebooting [17:24:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:47] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on miscweb2002.codfw.wmnet with reason: reboot [17:24:59] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash2002.codfw.wmnet with reason: host reimage [17:25:17] !log denisse@cumin1001 START - Cookbook sre.kafka.reboot-workers for Kafka logging-codfw cluster: Reboot kafka nodes [17:26:37] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:26:39] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [17:27:01] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw2313.codfw.wmnet, mw2271.codfw.wmnet, mw2338.codfw.wmnet, mw2409.codfw.wmnet, mw2415.codfw.wmnet, mw2389.codfw.wmnet, mw2274.codfw.wmnet, mw2392.codfw.wmnet, mw2333.codfw.wmnet, mw2414.codfw.wmnet, mw2312.codfw.wmnet, mw2375.codfw.wmnet, mw2371.codfw.wmnet, mw2310.codfw.wmnet, mw2273.codfw.wmnet, mw2413.codfw.wmnet, mw 
[17:27:01] fw.wmnet, mw2303.codfw.wmnet, mw2325.codfw.wmnet, mw2393.codfw.wmnet, mw2314.codfw.wmnet, mw2386.codfw.wmnet, mw2275.codfw.wmnet, mw2361.codfw.wmnet, mw2369.codfw.wmnet, mw2269.codfw.wmnet, mw2365.codfw.wmnet, mw2412.codfw.wmnet, mw2406.codfw.wmnet, mw2408.codfw.wmnet, mw2315.codfw.wmnet, mw2327.codfw.wmnet, mw2373.codfw.wmnet, mw2270.codfw.wmnet, mw2335.codfw.wmnet, mw2339.codfw.wmnet, mw2316.codfw.wmnet, mw2377.codfw.wmnet, mw2385.codfw [17:27:01] mw2331.codfw.wmnet, mw2277.codfw.wmnet, mw2384.codfw.wmnet, mw2305.codfw.wmnet, mw2388.codfw.wmnet, mw2337.codfw.wmnet, mw2272.codfw.wmnet, mw2307.codfw.wmnet, mw2379.codfw.wmnet, mw238 https://wikitech.wikimedia.org/wiki/PyBal [17:27:18] (ProbeDown) firing: (6) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:27:19] (ProbeDown) firing: (9) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:27:41] ugh [17:27:41] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:27:50] err. [17:27:51] is this the kafka server reboot? 
[17:27:59] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at codfw on alert1001 is CRITICAL: 0.9697 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [17:28:13] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method= [17:28:17] (PHPFPMTooBusy) firing: Not enough idle php7.4-fpm.service workers for Mediawiki appserver at codfw #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:28:21] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 1401 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:28:45] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [17:28:55] PROBLEM - PHP7 rendering on mw2384 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:28:55] PROBLEM - PHP7 rendering on mw2337 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:28:55] PROBLEM - PHP7 rendering on mw2392 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:28:57] PROBLEM - PHP7 rendering on mw2271 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:28:57] PROBLEM - PHP7 rendering on mw2301 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:28:57] PROBLEM - PHP7 rendering on mw2313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:28:57] PROBLEM - PHP7 rendering on mw2380 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:28:59] PROBLEM - PHP7 rendering on mw2393 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:28:59] PROBLEM - PHP7 rendering on mw2335 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:29:01] PROBLEM - PHP7 rendering on mw2315 is CRITICAL: CRITICAL - Socket timeout 
after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:29:01] PROBLEM - PHP7 rendering on mw2274 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:29:03] PROBLEM - PHP7 rendering on mw2371 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:29:03] PROBLEM - PHP7 rendering on mw2272 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:29:05] PROBLEM - PHP7 rendering on mw2269 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:29:07] PROBLEM - PHP7 rendering on mw2305 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:29:07] PROBLEM - PHP7 rendering on mw2273 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:29:11] mutante: Possibly yes, I'm looking at it. 
[17:29:11] PROBLEM - PHP7 rendering on mw2378 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:29:11] PROBLEM - PHP7 rendering on mw2383 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:29:11] PROBLEM - PHP7 rendering on mw2385 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:29:11] PROBLEM - PHP7 rendering on mw2390 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:29:11] PROBLEM - PHP7 rendering on mw2409 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:29:12] PROBLEM - PHP7 rendering on mw2275 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:29:12] PROBLEM - PHP7 rendering on mw2414 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:29:13] PROBLEM - PHP7 rendering on mw2276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:29:15] PROBLEM - PHP7 rendering on mw2339 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:29:15] PROBLEM - PHP7 rendering on mw2413 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:29:23] PROBLEM - PHP7 rendering on mw2369 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:29:23] PROBLEM - PHP7 rendering on mw2329 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:29:27] PROBLEM - PHP7 rendering on mw2311 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:29:27] PROBLEM - PHP7 rendering on mw2391 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:29:27] PROBLEM - PHP7 rendering on mw2331 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:29:29] PROBLEM - PHP7 rendering on mw2379 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:29:31] PROBLEM - PHP7 rendering on mw2303 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:29:33] PROBLEM - PHP7 rendering on mw2270 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:29:35] (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [17:29:37] PROBLEM - PHP7 rendering on mw2388 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:29:41] PROBLEM - PHP7 rendering on mw2314 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:29:41] PROBLEM - PHP7 
rendering on mw2363 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:29:43] PROBLEM - PHP7 rendering on mw2386 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:29:43] PROBLEM - PHP7 rendering on mw2408 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:29:45] PROBLEM - PHP7 rendering on mw2307 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:29:51] PROBLEM - PHP7 rendering on mw2312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:29:51] PROBLEM - PHP7 rendering on mw2373 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:29:51] PROBLEM - PHP7 rendering on mw2361 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:29:53] PROBLEM - PHP7 rendering on mw2316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:29:53] PROBLEM - PHP7 rendering on mw2359 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:29:53] PROBLEM - PHP7 rendering on mw2367 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:30:17] PROBLEM - PHP7 rendering on mw2309 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:31:37] 
PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [17:32:05] (03PS1) 10BryanDavis: developer-portal: Bump container to 2022-12-12-165842-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/868448 [17:32:18] (ProbeDown) firing: (13) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:32:19] (ProbeDown) firing: (13) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:33:17] (AppserversUnreachable) firing: Appserver unavailable for cluster appserver at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [17:33:29] PROBLEM - PHP7 rendering on mw2325 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:33:29] PROBLEM - PHP7 rendering on mw2336 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:33:29] PROBLEM - PHP7 rendering on mw2389 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:33:29] PROBLEM - PHP7 rendering on mw2412 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:34:21] (03PS6) 10Jbond: [DRAFT] spicerack: refactor puppetization [puppet] - 10https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168) [17:35:06] (03PS3) 10AikoChou: ml-services: update revertrisk docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/868442 (https://phabricator.wikimedia.org/T325218) [17:36:13] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw2393.codfw.wmnet, mw2313.codfw.wmnet, mw2271.codfw.wmnet, mw2338.codfw.wmnet, mw2409.codfw.wmnet, mw2415.codfw.wmnet, mw2274.codfw.wmnet, mw2392.codfw.wmnet, mw2333.codfw.wmnet, mw2414.codfw.wmnet, mw2312.codfw.wmnet, mw2375.codfw.wmnet, mw2316.codfw.wmnet, mw2303.codfw.wmnet, mw2325.codfw.wmnet, mw2379.codfw.wmnet, mw [17:36:14] fw.wmnet, mw2386.codfw.wmnet, mw2275.codfw.wmnet, mw2361.codfw.wmnet, mw2369.codfw.wmnet, mw2269.codfw.wmnet, mw2365.codfw.wmnet, mw2406.codfw.wmnet, mw2408.codfw.wmnet, mw2315.codfw.wmnet, mw2327.codfw.wmnet, mw2373.codfw.wmnet, mw2335.codfw.wmnet, mw2339.codfw.wmnet, mw2272.codfw.wmnet, mw2377.codfw.wmnet, mw2385.codfw.wmnet, mw2331.codfw.wmnet, mw2277.codfw.wmnet, mw2384.codfw.wmnet, mw2305.codfw.wmnet, mw2337.codfw.wmnet, mw2407.codfw [17:36:14] mw2268.codfw.wmnet, mw2301.codfw.wmnet, mw2273.codfw.wmnet, mw2276.codfw.wmnet, mw2363.codfw.wmnet, mw2329.codfw.wmnet, mw2391.codfw.wmnet, mw2309.codfw.wmnet, mw2387.codfw.wmnet, mw231 https://wikitech.wikimedia.org/wiki/PyBal [17:36:33] (03PS7) 10Jbond: [DRAFT] spicerack: refactor puppetization [puppet] - 10https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168) [17:37:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db2129', 
diff saved to https://phabricator.wikimedia.org/P42713 and previous config saved to /var/cache/conftool/dbconfig/20221215-173713-ladsgroup.json [17:37:16] (03PS1) 10BryanDavis: striker: Bump container version to 2022-12-01-223001-production [puppet] - 10https://gerrit.wikimedia.org/r/868449 [17:37:45] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38824/console" [puppet] - 10https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168) (owner: 10Jbond) [17:38:17] (AppserversUnreachable) resolved: Appserver unavailable for cluster appserver at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [17:38:35] PROBLEM - PHP7 rendering on mw2415 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:38:59] (03CR) 10CI reject: [V: 04-1] [DRAFT] spicerack: refactor puppetization [puppet] - 10https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168) (owner: 10Jbond) [17:39:01] RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [17:39:35] (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [17:39:41] Hmm looks like the sign-in page on en-wiki isn't working for me. Tried different browsers and workstations. [17:40:11] Oshwah: ack, is known currently ^^ [17:40:41] TheresNoTime: Yeah I figured such. I just thought I'd mention it just in case. Thanks for letting me know. 
[17:41:33] the wikis have been down for me for 15 mins or so. some people are reporting they've been working fine, others also reporting them down. [17:41:39] PROBLEM - PHP7 rendering on mw2277 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:41:43] RECOVERY - PHP7 rendering on mw2316 is OK: HTTP OK: HTTP/1.1 302 Found - 518 bytes in 9.894 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:41:50] <_joe_> ragesoss: still down now? [17:41:51] Looks like it's coming back for me. [17:42:21] _joe_: yes, still just spinning for me [17:42:31] PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [17:42:40] okay, maybe finally coming back, although extremely slow page loads [17:42:40] Worked just fine for me. Make sure to clear cookies, cache and test again. 
[17:42:52] (03PS8) 10Jbond: [DRAFT] spicerack: refactor puppetization [puppet] - 10https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168) [17:42:57] RECOVERY - PHP7 rendering on mw2276 is OK: HTTP OK: HTTP/1.1 302 Found - 516 bytes in 0.095 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:42:57] RECOVERY - PHP7 rendering on mw2309 is OK: HTTP OK: HTTP/1.1 302 Found - 516 bytes in 0.084 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:42:57] RECOVERY - PHP7 rendering on mw2275 is OK: HTTP OK: HTTP/1.1 302 Found - 516 bytes in 0.093 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:42:57] RECOVERY - PHP7 rendering on mw2305 is OK: HTTP OK: HTTP/1.1 302 Found - 516 bytes in 0.087 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:42:57] RECOVERY - PHP7 rendering on mw2307 is OK: HTTP OK: HTTP/1.1 302 Found - 516 bytes in 0.085 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:42:58] RECOVERY - PHP7 rendering on mw2273 is OK: HTTP OK: HTTP/1.1 302 Found - 516 bytes in 0.103 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:42:58] RECOVERY - PHP7 rendering on mw2325 is OK: HTTP OK: HTTP/1.1 302 Found - 516 bytes in 0.087 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:42:59] RECOVERY - PHP7 rendering on mw2277 is OK: HTTP OK: HTTP/1.1 302 Found - 518 bytes in 6.147 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:42:59] RECOVERY - PHP7 rendering on mw2270 is OK: HTTP OK: HTTP/1.1 302 Found - 517 bytes in 0.803 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:44:05] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38825/console" [puppet] - 10https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168) (owner: 10Jbond) [17:44:17] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:44:21] RECOVERY - PHP7 rendering on mw2386 is OK: HTTP OK: HTTP/1.1 302 Found - 516 bytes in 0.094 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:44:38] (03CR) 10BryanDavis: [C: 03+1] "PCC output: https://puppet-compiler.wmflabs.org/output/868449/3/" [puppet] - 10https://gerrit.wikimedia.org/r/868449 (owner: 10BryanDavis) [17:45:15] (03CR) 10CI reject: [V: 04-1] [DRAFT] spicerack: refactor puppetization [puppet] - 10https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168) (owner: 10Jbond) [17:45:37] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at codfw on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.09091 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [17:45:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repool db2129', diff saved to https://phabricator.wikimedia.org/P42714 and previous config saved to /var/cache/conftool/dbconfig/20221215-174537-ladsgroup.json [17:46:39] RECOVERY - High average GET latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [17:47:01] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:47:03] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [17:47:14] (03PS5) 10Arturo Borrero Gonzalez: puppetmaster: git-sync-upstream: use the gitpuppet user for git operations [puppet] - 10https://gerrit.wikimedia.org/r/868400 (https://phabricator.wikimedia.org/T325280) [17:47:18] (ProbeDown) resolved: (13) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:47:19] (ProbeDown) resolved: (13) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:48:11] RECOVERY - Host durum1001 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [17:48:18] (PHPFPMTooBusy) resolved: Not enough idle php7.4-fpm.service workers for Mediawiki appserver at codfw #page - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:48:33] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-jumbo1010.eqiad.wmnet with reason: host reimage [17:49:38] (03PS9) 10Jbond: [DRAFT] spicerack: refactor puppetization [puppet] - 10https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168) [17:49:40] (03PS1) 10Jbond: O:cluster::cloud_managment: remove unneeded profiles [puppet] - 10https://gerrit.wikimedia.org/r/868451 [17:49:45] PROBLEM - Check systemd state on durum1001 is CRITICAL: CRITICAL - degraded: The following units failed: networking.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:49:59] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:50:01] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:51:16] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38826/console" [puppet] - 10https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168) (owner: 10Jbond) [17:51:38] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-jumbo1010.eqiad.wmnet with reason: host reimage [17:51:57] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash2002.codfw.wmnet with OS bullseye [17:52:04] (03CR) 10CI reject: [V: 04-1] [DRAFT] spicerack: refactor puppetization [puppet] - 10https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168) (owner: 10Jbond) [17:53:56] (03PS10) 10Jbond: [DRAFT] spicerack: refactor 
puppetization [puppet] - 10https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168) [17:55:42] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38827/console" [puppet] - 10https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168) (owner: 10Jbond) [17:56:23] (03CR) 10CI reject: [V: 04-1] [DRAFT] spicerack: refactor puppetization [puppet] - 10https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168) (owner: 10Jbond) [17:57:42] (03CR) 10BryanDavis: [C: 03+2] developer-portal: Bump container to 2022-12-12-165842-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/868448 (owner: 10BryanDavis) [18:00:05] bd808: #bothumor I � Unicode. All rise for Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221215T1800). [18:01:08] o/ I will be pushing out a new build of developer portal [18:02:45] (03Merged) 10jenkins-bot: developer-portal: Bump container to 2022-12-12-165842-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/868448 (owner: 10BryanDavis) [18:03:20] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38828/console" [puppet] - 10https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168) (owner: 10Jbond) [18:03:56] !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/developer-portal: apply [18:04:19] !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [18:04:45] (03PS1) 10Andrew Bogott: cloud-vps puppet: allow multiple users to access our puppet git checkout [puppet] - 10https://gerrit.wikimedia.org/r/868454 (https://phabricator.wikimedia.org/T325280) [18:05:07] !log btullis@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: 
"Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1001" [18:05:11] !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [18:05:53] !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [18:07:32] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38830/console" [puppet] - 10https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168) (owner: 10Jbond) [18:07:36] !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [18:08:09] !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [18:08:44] !log btullis@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1001" [18:08:49] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-jumbo1010.eqiad.wmnet with OS bullseye [18:08:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host kafka-jumbo1010.eqi... 
[18:12:12] (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps puppet: allow multiple users to access our puppet git checkout [puppet] - 10https://gerrit.wikimedia.org/r/868454 (https://phabricator.wikimedia.org/T325280) (owner: 10Andrew Bogott) [18:19:10] (03PS11) 10Jbond: [DRAFT] spicerack: refactor puppetization [puppet] - 10https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168) [18:27:46] (03CR) 10Jbond: [C: 03+2] sre.hosts.reboot-single: Simplify icinga logic [cookbooks] - 10https://gerrit.wikimedia.org/r/868430 (https://phabricator.wikimedia.org/T325153) (owner: 10Jbond) [18:28:51] (03CR) 10Herron: [C: 03+1] elasticsearch: Enable profile::auto_restarts::service for Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/868438 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [18:29:18] (03CR) 10Herron: [C: 03+1] netmon: Remove netmon1002 from DSH node group [puppet] - 10https://gerrit.wikimedia.org/r/868207 (https://phabricator.wikimedia.org/T322321) (owner: 10Andrea Denisse) [18:29:35] (03CR) 10Herron: [C: 03+1] netmon: Remove netmon2001 from DSH node group. [puppet] - 10https://gerrit.wikimedia.org/r/868204 (https://phabricator.wikimedia.org/T322695) (owner: 10Andrea Denisse) [18:33:25] (03CR) 10Ahmon Dancy: [C: 03+1] "I assume this will cause /usr/share/GeoIPInfo to be populated on the deploy server." 
[puppet] - 10https://gerrit.wikimedia.org/r/868199 (https://phabricator.wikimedia.org/T288375) (owner: 10Dzahn) [18:36:10] (03PS1) 10Vlad.shapik: Fix TypeError of SVG conversion [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/868456 (https://phabricator.wikimedia.org/T325150) [18:36:14] (03CR) 10Dzahn: "yea, it definitely creates the resource /usr/share/GeoIPInfo https://puppet-compiler.wmflabs.org/output/868199/38794/deploy2002.codfw.wmn" [puppet] - 10https://gerrit.wikimedia.org/r/868199 (https://phabricator.wikimedia.org/T288375) (owner: 10Dzahn) [18:37:39] !log denisse@cumin1001 END (PASS) - Cookbook sre.kafka.reboot-workers (exit_code=0) for Kafka logging-codfw cluster: Reboot kafka nodes [18:38:19] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, and 2 others: byte/str TypeError during svg conversion - https://phabricator.wikimedia.org/T325150 (10Vlad.shapik) It seems that I found where is the trick. As it turned out the failed SVG file has a small body, as a result, the source in the prepare_sou... 
[18:42:10] (03CR) 10MSantos: Exclude OSM tag that causes a failing import (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868415 (https://phabricator.wikimedia.org/T325293) (owner: 10Jgiannelos) [18:43:15] (03PS2) 10Vlad.shapik: Fix TypeError of SVG conversion [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/868456 (https://phabricator.wikimedia.org/T325150) [18:43:52] Please note we're going to see a lot of .mgmt.ulsfo.wmnet flaps shortly when i go to swap the msw in rack .22 [18:44:16] thanks for the heads-up robh [18:45:34] sorry rack .23 [18:45:37] but yeah, same thing [18:45:44] shouldnt have user impact and only affects traffic [18:48:19] !log denisse@cumin1001 START - Cookbook sre.kafka.reboot-workers for Kafka logging-eqiad cluster: Reboot kafka nodes [18:50:40] !log starting msw2-ulsfo swap for rack .23, mgmt will flap with no expected user impact [18:50:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:43] half done old msw out [18:53:45] new one going in [18:54:33] (03CR) 10Jbond: First stab at possible ferm::qos resource for DSCP marking (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney) [18:54:37] PROBLEM - Host ps1-23-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [18:55:01] PROBLEM - Router interfaces on mr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.194, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:55:45] PROBLEM - Host ripe-atlas-ulsfo IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [18:55:45] PROBLEM - Host ripe-atlas-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [18:56:19] PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - Certificate toolserver.org valid until 2022-12-18 12:02:52 +0000 (expires in 2 days) https://phabricator.wikimedia.org/tag/toolforge/ [18:56:39] 
PROBLEM - Host cr4-ulsfo.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:57:38] expected for ulsfo mgmt [18:57:46] itll be back in less than 5, done with network adding power [18:59:14] (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:59:40] ok new msw2 in place [18:59:44] those should all start clearing [19:00:03] (03CR) 10Jbond: "Sorry missed one" [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney) [19:00:03] PROBLEM - Maps - OSM synchronization lag - eqiad on alert1001 is CRITICAL: 2.592e+05 ge 2.592e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&viewPanel=11 [19:00:04] hashar and ^demon: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221215T1900). 
[19:00:31] RECOVERY - Host ps1-23-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 71.89 ms [19:00:49] RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate toolserver.org valid until 2023-02-16 11:40:18 +0000 (expires in 62 days) https://phabricator.wikimedia.org/tag/toolforge/ [19:00:49] PROBLEM - Maps - OSM synchronization lag - codfw on alert1001 is CRITICAL: 2.592e+05 ge 2.592e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&viewPanel=12 [19:01:01] RECOVERY - Router interfaces on mr1-ulsfo is OK: OK: host 198.35.26.194, interfaces up: 38, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:01:18] (03CR) 10Andrea Denisse: [C: 03+2] netmon: Remove netmon1002 from DSH node group [puppet] - 10https://gerrit.wikimedia.org/r/868207 (https://phabricator.wikimedia.org/T322321) (owner: 10Andrea Denisse) [19:02:09] RECOVERY - Host cr4-ulsfo.mgmt is UP: PING OK - Packet loss = 0%, RTA = 71.25 ms [19:04:14] (JobUnavailable) resolved: Reduced availability for job pdu_sentry4 in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:07:01] (03PS2) 10Andrea Denisse: netmon: Remove netmon2001 from DSH node group. [puppet] - 10https://gerrit.wikimedia.org/r/868204 (https://phabricator.wikimedia.org/T322695) [19:07:58] !log xcollazo@deploy1002 Started deploy [airflow-dags/platform_eng@d23127b]: Deploying Spark3 upgrade of image_suggestions job to the platform_eng Airflow instance. 
[19:07:59] bleh didnt mean to dc during recoveries [19:08:02] (03PS1) 10Ssingh: P:cumin: set per-site aliases for Wikidough/durum [puppet] - 10https://gerrit.wikimedia.org/r/868458 [19:08:08] !log xcollazo@deploy1002 Finished deploy [airflow-dags/platform_eng@d23127b]: Deploying Spark3 upgrade of image_suggestions job to the platform_eng Airflow instance. (duration: 00m 10s) [19:08:11] no more down mgmt though so yay [19:08:15] atlas still is so thats odd.... [19:09:09] (03PS2) 10Ssingh: P:cumin: set per-site aliases for Wikidough/durum [puppet] - 10https://gerrit.wikimedia.org/r/868458 [19:09:11] (03CR) 10Andrea Denisse: [C: 03+2] netmon: Remove netmon2001 from DSH node group. [puppet] - 10https://gerrit.wikimedia.org/r/868204 (https://phabricator.wikimedia.org/T322695) (owner: 10Andrea Denisse) [19:09:25] ok, itll come back it has a shitty power cable [19:09:37] and was unseated, its appliance so has non standard 2 pin ungrounded cable [19:10:03] (03PS3) 10Ssingh: P:cumin: set per-site aliases for Wikidough/durum [puppet] - 10https://gerrit.wikimedia.org/r/868458 [19:11:31] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38834/console" [puppet] - 10https://gerrit.wikimedia.org/r/868458 (owner: 10Ssingh) [19:16:13] (03PS1) 10Andrew Bogott: cloud-vps puppet: rework git config safe.dir definition [puppet] - 10https://gerrit.wikimedia.org/r/868460 (https://phabricator.wikimedia.org/T325280) [19:17:58] (03CR) 10CI reject: [V: 04-1] cloud-vps puppet: rework git config safe.dir definition [puppet] - 10https://gerrit.wikimedia.org/r/868460 (https://phabricator.wikimedia.org/T325280) (owner: 10Andrew Bogott) [19:20:05] PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - Certificate toolserver.org valid until 2022-12-18 12:02:52 +0000 (expires in 2 days) https://phabricator.wikimedia.org/tag/toolforge/ [19:21:25] andrewbogott: ^ is this why you had that certificate discussion 
yesterday? [19:21:33] RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate toolserver.org valid until 2023-02-16 11:40:18 +0000 (expires in 62 days) https://phabricator.wikimedia.org/tag/toolforge/ [19:21:55] andrewbogott: guess not (anymore) ^ :) [19:21:58] That was syslog related [19:22:11] (03PS2) 10Andrew Bogott: cloud-vps puppet: rework git config safe.dir definition [puppet] - 10https://gerrit.wikimedia.org/r/868460 (https://phabricator.wikimedia.org/T325280) [19:22:20] vgutierrez: ok, thanks [19:22:57] ok the remainder of ulsfo work is non impacting and decomed cruft [19:23:05] so ulsfo maint mostly done but cabinets still open [19:24:20] (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps puppet: rework git config safe.dir definition [puppet] - 10https://gerrit.wikimedia.org/r/868460 (https://phabricator.wikimedia.org/T325280) (owner: 10Andrew Bogott) [19:24:27] robh: you can get power cables in the US with an actual ground? :P [19:24:46] for consumer grade appliances its all too common [19:24:55] but atlas in datacenter is outlier [19:25:02] heh [19:25:35] The fact that walgreens et al sell "adapters" to plug plugs that have a ground into wall sockets that don't... [19:26:58] mutante: no, that flapping alert means that some apache2 threads are still hanging to the old certificate for whatever reason. an apache restart ({{done}}) fixes it [19:29:15] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1746 MB (3% inode=84%): /tmp 1746 MB (3% inode=84%): /var/tmp 1746 MB (3% inode=84%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-coord1001&var-datasource=eqiad+prometheus/ops [19:30:24] taavi: makes sense, thanks. 
I had not seen the recovery yet at first [19:30:43] (03PS3) 10Bking: [WIP] wdqs: extract and validate kafka timestamp [cookbooks] - 10https://gerrit.wikimedia.org/r/868198 (https://phabricator.wikimedia.org/T325114) [19:32:24] (03CR) 10CI reject: [V: 04-1] [WIP] wdqs: extract and validate kafka timestamp [cookbooks] - 10https://gerrit.wikimedia.org/r/868198 (https://phabricator.wikimedia.org/T325114) (owner: 10Bking) [19:35:48] !log short downtime for misc websites, iegreview, racktables, transparency.wm, annual.wm, design.wm, sitemaps.wm, research.wm, bienvenida.wm, wikiworkshop.org [19:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:53] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for vpoundstone - WMF - https://phabricator.wikimedia.org/T314676 (10VirginiaPoundstone) 05Open→03Resolved @akosiaris I tried Virginia Poundstone Luke pointed me to https://phabricator.wikimedia.org/T293241#7436893 and I got it... [19:51:04] (03PS1) 10Hashar: deploy_artifacts: add dry run mode [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/868461 [19:51:06] (03PS1) 10Hashar: deploy_artifacts: --version is a required option [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/868462 [19:51:55] 10SRE-OnFire, 10Discovery-Search, 10Sustainability (Incident Followup): Evaluate options to soften wdqs paging - https://phabricator.wikimedia.org/T325324 (10RKemper) [19:52:50] (03CR) 10BPirkle: "lgtm, but I don't have +2 permissions on this repo." 
[software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/868456 (https://phabricator.wikimedia.org/T325150) (owner: 10Vlad.shapik) [19:53:01] (03CR) 10BPirkle: [C: 03+1] Fix TypeError of SVG conversion [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/868456 (https://phabricator.wikimedia.org/T325150) (owner: 10Vlad.shapik) [20:02:09] !log denisse@cumin1001 END (PASS) - Cookbook sre.kafka.reboot-workers (exit_code=0) for Kafka logging-eqiad cluster: Reboot kafka nodes [20:06:45] (03CR) 10Dzahn: [C: 03+2] docker_registry_ha: add contint2002 to image builder hosts [puppet] - 10https://gerrit.wikimedia.org/r/867708 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn) [20:15:01] working on a new cookbook, apologize in advance for the spam [20:15:33] (KeyholderUnarmed) firing: (2) 1 unarmed Keyholder key(s) on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [20:17:20] (03PS1) 10Ryan Kemper: [WIP] wdqs: switch to using NFS for dump files [cookbooks] - 10https://gerrit.wikimedia.org/r/868465 [20:19:02] (03CR) 10CI reject: [V: 04-1] [WIP] wdqs: switch to using NFS for dump files [cookbooks] - 10https://gerrit.wikimedia.org/r/868465 (owner: 10Ryan Kemper) [20:19:54] !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload [20:20:46] !log bking@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-reload (exit_code=97) [20:21:55] !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload [20:22:09] !log bking@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-reload (exit_code=97) [20:22:16] (03PS1) 10AOkoth: vrts: add vrts2001 values and add database port in config [puppet] - 10https://gerrit.wikimedia.org/r/868467 (https://phabricator.wikimedia.org/T323515) [20:23:21] (03CR) 10MSantos: [C: 04-1] "I'm on the fence about this one, the docs states [1] that changing the mapping config will require a fresh re-import of the DB otherwise O" [puppet] - 
10https://gerrit.wikimedia.org/r/868415 (https://phabricator.wikimedia.org/T325293) (owner: 10Jgiannelos) [20:24:38] jouncebot: nowandnext [20:24:38] For the next 0 hour(s) and 35 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221215T1900) [20:24:38] In 0 hour(s) and 35 minute(s): UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221215T2100) [20:25:07] (03PS2) 10AOkoth: vrts: add vrts2001 values and add database port in config [puppet] - 10https://gerrit.wikimedia.org/r/868467 (https://phabricator.wikimedia.org/T323515) [20:26:32] (03CR) 10Andrew Bogott: [C: 03+2] striker: Bump container version to 2022-12-01-223001-production [puppet] - 10https://gerrit.wikimedia.org/r/868449 (owner: 10BryanDavis) [20:27:03] 10SRE, 10Thumbor, 10serviceops, 10User-jijiki: Upgrade Thumbor to Bullseye - https://phabricator.wikimedia.org/T216815 (10Andrew) Huh, is anyone tasked with this? This is one of the few cases that's keeping Stretch alive in cloud-vps and prod. 
[20:27:40] (PS3) AOkoth: vrts: add vrts2001 values and add database port in config [puppet] - https://gerrit.wikimedia.org/r/868467 (https://phabricator.wikimedia.org/T323515)
[20:29:06] (CR) Dzahn: "looks pretty good to me, one inline nitpick about repeating the values in the host yaml" [puppet] - https://gerrit.wikimedia.org/r/868467 (https://phabricator.wikimedia.org/T323515) (owner: AOkoth)
[20:30:45] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1741 MB (3% inode=84%): /tmp 1741 MB (3% inode=84%): /var/tmp 1741 MB (3% inode=84%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-coord1001&var-datasource=eqiad+prometheus/ops
[20:31:15] (CR) AOkoth: "https://puppet-compiler.wmflabs.org/output/868467/38839/otrs1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/868467 (https://phabricator.wikimedia.org/T323515) (owner: AOkoth)
[20:31:41] (PS1) Gergő Tisza: NewImpact: Use "View all edits" in footer [extensions/GrowthExperiments] (wmf/1.40.0-wmf.14) - https://gerrit.wikimedia.org/r/868364 (https://phabricator.wikimedia.org/T325216)
[20:34:00] (CR) Cwhite: [C: +2] install_server: set codfw logstash vms to install bullseye [puppet] - https://gerrit.wikimedia.org/r/861872 (https://phabricator.wikimedia.org/T321410) (owner: Cwhite)
[20:34:24] (PS4) AOkoth: vrts: add vrts2001 values and add database port in config [puppet] - https://gerrit.wikimedia.org/r/868467 (https://phabricator.wikimedia.org/T323515)
[20:42:42] (PS2) RLazarus: httpbb: Add tests for test2wiki [puppet] - https://gerrit.wikimedia.org/r/868211 (https://phabricator.wikimedia.org/T290536)
[20:42:44] (PS2) RLazarus: httpbb: Run hourly tests from the cumin hosts against mw-on-k8s [puppet] - https://gerrit.wikimedia.org/r/868212 (https://phabricator.wikimedia.org/T290536)
[20:52:51] (PS2) Gergő Tisza: NewImpact: Use
"View all edits" in footer [extensions/GrowthExperiments] (wmf/1.40.0-wmf.14) - https://gerrit.wikimedia.org/r/868364 (https://phabricator.wikimedia.org/T325216)
[20:56:28] (PS3) Gergő Tisza: NewImpact: Use "View all edits" in footer [extensions/GrowthExperiments] (wmf/1.40.0-wmf.14) - https://gerrit.wikimedia.org/r/868364 (https://phabricator.wikimedia.org/T325216)
[20:57:30] (CR) RLazarus: [C: +2] httpbb: Add tests for test2wiki [puppet] - https://gerrit.wikimedia.org/r/868211 (https://phabricator.wikimedia.org/T290536) (owner: RLazarus)
[20:57:43] (CR) RLazarus: [C: +2] httpbb: Run hourly tests from the cumin hosts against mw-on-k8s [puppet] - https://gerrit.wikimedia.org/r/868212 (https://phabricator.wikimedia.org/T290536) (owner: RLazarus)
[21:00:04] brennen and TheresNoTime: Your horoscope predicts another unfortunate UTC late backport and config training deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221215T2100).
[21:00:05] zabe and tgr: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:15] o/
[21:00:16] o/
[21:00:38] I can deploy :) (did you want to self-serve tgr?)
[21:01:13] TheresNoTime: if you don't mind doing it, I'm happy with that
[21:01:44] sure :)
[21:01:50] o/ no one for training today, so i'll let you all proceed.
[21:02:08] (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/863434 (https://phabricator.wikimedia.org/T184782) (owner: Zabe)
[21:02:50] (Merged) jenkins-bot: Update reference to CommandLineInc [mediawiki-config] - https://gerrit.wikimedia.org/r/863434 (https://phabricator.wikimedia.org/T184782) (owner: Zabe)
[21:03:07] !log samtar@deploy1002 Started scap: Backport for [[gerrit:863434|Update reference to CommandLineInc (T184782)]]
[21:03:18] T184782: Get rid of `.inc` files in MediaWiki, using .php instead (was: Test coverage missing for .inc files) - https://phabricator.wikimedia.org/T184782
[21:04:14] (CR) Vlad.shapik: Fix TypeError of SVG conversion (1 comment) [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/868456 (https://phabricator.wikimedia.org/T325150) (owner: Vlad.shapik)
[21:04:26] (CR) Samtar: [C: +2] "start merge for deploy" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.14) - https://gerrit.wikimedia.org/r/868364 (https://phabricator.wikimedia.org/T325216) (owner: Gergő Tisza)
[21:04:49] !log samtar@deploy1002 samtar and zabe: Backport for [[gerrit:863434|Update reference to CommandLineInc (T184782)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet
[21:04:51] zabe: that's live on mwdebug, can you test?
[21:04:56] no
[21:05:19] oh yeah
[21:05:20] :D
[21:05:31] (syncing, apologies)
[21:05:54] lol
[21:06:23] * TheresNoTime definitely doesn't have a script to automate all the IRC interactions too /sarcasm
[21:11:26] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:863434|Update reference to CommandLineInc (T184782)]] (duration: 08m 18s)
[21:11:30] T184782: Get rid of `.inc` files in MediaWiki, using .php instead (was: Test coverage missing for .inc files) - https://phabricator.wikimedia.org/T184782
[21:11:50] zabe: done :)
[21:12:07] thanks :)
[21:13:09] (just waiting on 868364 to merge)
[21:15:47] !log cwhite@cumin2002 conftool action : set/pooled=no; selector: name=logstash2024.codfw.wmnet,service=kibana7
[21:15:47] (PS14) Cathal Mooney: First stab at possible ferm::qos resource for DSCP marking [puppet] - https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358)
[21:16:59] !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload
[21:17:10] !log bking@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-reload (exit_code=97)
[21:22:25] (Merged) jenkins-bot: NewImpact: Use "View all edits" in footer [extensions/GrowthExperiments] (wmf/1.40.0-wmf.14) - https://gerrit.wikimedia.org/r/868364 (https://phabricator.wikimedia.org/T325216) (owner: Gergő Tisza)
[21:22:32] (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.14) - https://gerrit.wikimedia.org/r/868364 (https://phabricator.wikimedia.org/T325216) (owner: Gergő Tisza)
[21:22:46] !log samtar@deploy1002 Started scap: Backport for [[gerrit:868364|NewImpact: Use "View all edits" in footer (T325216)]]
[21:22:50] T325216: Impact Module: "View all edits" - https://phabricator.wikimedia.org/T325216
[21:23:39] !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload
[21:23:40] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
[21:24:27] !log bking@cumin1001 START
- Cookbook sre.wdqs.data-reload
[21:25:15] !log bking@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-reload (exit_code=97)
[21:27:50] `build-and-push-container-images` step taking a bit longer than normal
[21:31:51] (^ has moved on, but each step taking "longer than normal")
[21:35:59] !log samtar@deploy1002 samtar and tgr: Backport for [[gerrit:868364|NewImpact: Use "View all edits" in footer (T325216)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet
[21:36:03] T325216: Impact Module: "View all edits" - https://phabricator.wikimedia.org/T325216
[21:36:13] tgr: that's live on mwdebug, can you test? :)
[21:37:55] TheresNoTime: looks good, thanks!
[21:38:01] ack
[21:38:13] (CR) Dzahn: [C: +2] cloud: allow VMs to connect to contint1002 and contint2002 [puppet] - https://gerrit.wikimedia.org/r/867675 (https://phabricator.wikimedia.org/T313832) (owner: Dzahn)
[21:40:50] (PS1) Jbond: monitoring: update monitoring files to dynamically discover config [puppet] - https://gerrit.wikimedia.org/r/868471 (https://phabricator.wikimedia.org/T321783)
[21:41:20] (CR) CI reject: [V: -1] monitoring: update monitoring files to dynamically discover config [puppet] - https://gerrit.wikimedia.org/r/868471 (https://phabricator.wikimedia.org/T321783) (owner: Jbond)
[21:42:57] !log logging to note that this deploy is unusually slow, P42715
[21:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:44:21] (CR) Cathal Mooney: "Thanks again John for the input. Just a couple of comments in response."
[puppet] - https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358) (owner: Cathal Mooney)
[21:47:25] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:868364|NewImpact: Use "View all edits" in footer (T325216)]] (duration: 24m 38s)
[21:47:29] T325216: Impact Module: "View all edits" - https://phabricator.wikimedia.org/T325216
[21:47:34] and live in prod tgr :)
[21:51:57] thx
[21:52:16] !log done UTC late backport and config training
[21:52:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:56:34] (PS1) Andrew Bogott: profile::openstack::base::nutcracker: merge in profile::mediawiki::nutcracker [puppet] - https://gerrit.wikimedia.org/r/868472 (https://phabricator.wikimedia.org/T325244)
[21:59:12] (CR) Andrew Bogott: [C: +2] profile::openstack::base::nutcracker: merge in profile::mediawiki::nutcracker [puppet] - https://gerrit.wikimedia.org/r/868472 (https://phabricator.wikimedia.org/T325244) (owner: Andrew Bogott)
[21:59:19] !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload
[21:59:21] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
[22:02:19] !log denisse@cumin1001 START - Cookbook sre.hosts.decommission for hosts netmon2001.wikimedia.org
[22:07:55] (CR) Andrew Bogott: [C: +1] "For reference, here's a similar patch I merged for cloud-vps puppet servers" [puppet] - https://gerrit.wikimedia.org/r/868002 (https://phabricator.wikimedia.org/T325128) (owner: Hashar)
[22:10:03] !log denisse@cumin1001 START - Cookbook sre.dns.netbox
[22:11:12] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:11:13] !log denisse@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts netmon2001.wikimedia.org
[22:19:46] (PS15) Cathal Mooney: First stab at possible ferm::qos resource for DSCP marking [puppet] - https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358)
[22:20:06] (CR) CI reject: [V: -1] First stab at possible ferm::qos resource for DSCP marking [puppet] - https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358) (owner: Cathal Mooney)
[22:26:08] PROBLEM - Disk space on dumpsdata1003 is CRITICAL: DISK CRITICAL - free space: /data 877130 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=dumpsdata1003&var-datasource=eqiad+prometheus/ops
[22:33:14] RECOVERY - Disk space on an-coord1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-coord1001&var-datasource=eqiad+prometheus/ops
[22:42:36] !log denisse@cumin1001 START - Cookbook sre.hosts.decommission for hosts netmon2001.wikimedia.org
[22:47:04] !log denisse@cumin1001 START - Cookbook sre.dns.netbox
[22:48:15] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:48:16] !log denisse@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts netmon2001.wikimedia.org
[22:49:27] !log `[samtar@mwmaint1002 imports]$ mwscript importImages.php --wiki=commonswiki --comment-ext=txt --user=Coffeeandcrumbs /home/samtar/imports` T325330
[22:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:49:31] T325330: Server-side upload request for Coffeeandcrumbs - https://phabricator.wikimedia.org/T325330
[23:00:47] !log denisse@cumin1001 START - Cookbook sre.hosts.decommission for hosts netmon1002.wikimedia.org
[23:05:32] !log denisse@cumin1001 START - Cookbook sre.dns.netbox
[23:06:45] (JobUnavailable) firing: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:07:33] (PS16) Cathal
Mooney: First stab at possible ferm::qos resource for DSCP marking [puppet] - https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358)
[23:09:17] (CR) CI reject: [V: -1] First stab at possible ferm::qos resource for DSCP marking [puppet] - https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358) (owner: Cathal Mooney)
[23:09:26] !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netmon1002.wikimedia.org decommissioned, removing all IPs except the asset tag one - denisse@cumin1001"
[23:10:48] !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netmon1002.wikimedia.org decommissioned, removing all IPs except the asset tag one - denisse@cumin1001"
[23:10:48] !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[23:10:49] !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netmon1002.wikimedia.org
[23:21:45] (JobUnavailable) resolved: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:32:40] PROBLEM - Check size of conntrack table on logstash2030 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.136: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[23:32:44] PROBLEM - Check that envoy is running on logstash2030 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.136: Connection reset by peer https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[23:32:50] PROBLEM - Check systemd state on logstash2030 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.136:
Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:32:58] PROBLEM - OpenSearch health check for shards on 9200 on logstash2030 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.136: Connection reset by peer https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:33:57] ^^ that's me
[23:41:28] RECOVERY - Check systemd state on logstash2030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:41:38] RECOVERY - OpenSearch health check for shards on 9200 on logstash2030 is OK: OK - elasticsearch status production-elk7-codfw: cluster_name: production-elk7-codfw, status: yellow, timed_out: False, number_of_nodes: 16, number_of_data_nodes: 10, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 663, active_shards: 1408, relocating_shards: 7, initializing_shards: 3, unassigned_shards: 68, delayed_unassigned_sh
[23:41:38] number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 95.19945909398243 https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:42:28] RECOVERY - Check size of conntrack table on logstash2030 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[23:42:30] RECOVERY - Check that envoy is running on logstash2030 is OK: OK - envoyproxy.service is active https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy