[00:01:50] <wikibugs>	 (PS1) Andrea Denisse: netmon: Remove netmon2001 from DSH node group. [puppet] - https://gerrit.wikimedia.org/r/868204 (https://phabricator.wikimedia.org/T322695)
[00:02:26] <icinga-wm>	 PROBLEM - Check systemd state on an-master1002 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-namenode-backup-fetchimage.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:03:30] <icinga-wm>	 RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:04:07] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:05:06] <logmsgbot>	 !log tgr@deploy1002 Synchronized php-1.40.0-wmf.14/extensions/GrowthExperiments/: Backport: [[gerrit:868052|User impact: read edit count from primary db in save complete hook (T324930)]] (duration: 07m 03s)
[00:05:10] <stashbot>	 T324930: NewImpact: Cannot read properties of undefined (reading 'days') - https://phabricator.wikimedia.org/T324930
[00:05:36] <wikibugs>	 (CR) Andrea Denisse: [V: +1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38795/console" [puppet] - https://gerrit.wikimedia.org/r/868204 (https://phabricator.wikimedia.org/T322695) (owner: Andrea Denisse)
[00:05:44] <tgr>	 !log EU late backports done
[00:05:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:06:39] <wikibugs>	 (CR) Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/output/868204/38795/" [puppet] - https://gerrit.wikimedia.org/r/868204 (https://phabricator.wikimedia.org/T322695) (owner: Andrea Denisse)
[00:09:00] <icinga-wm>	 PROBLEM - Check systemd state on aphlict1001 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:09:11] <wikibugs>	 (PS1) Andrea Denisse: netmon: Remove netmon1002 from DSH node group [puppet] - https://gerrit.wikimedia.org/r/868207 (https://phabricator.wikimedia.org/T322321)
[00:10:10] <icinga-wm>	 PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 83, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:10:36] <wikibugs>	 (CR) Andrea Denisse: [V: +1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38796/console" [puppet] - https://gerrit.wikimedia.org/r/868207 (https://phabricator.wikimedia.org/T322321) (owner: Andrea Denisse)
[00:12:34] <icinga-wm>	 RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:14:15] <logmsgbot>	 !log cwhite@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host logstash2026.codfw.wmnet with OS bullseye
[00:15:08] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash2026.codfw.wmnet with OS bullseye
[00:16:31] <wikibugs>	 (CR) Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/output/868207/38796/" [puppet] - https://gerrit.wikimedia.org/r/868207 (https://phabricator.wikimedia.org/T322321) (owner: Andrea Denisse)
[00:17:28] <jinxer-wm>	 (KubernetesAPILatency) firing: (10) High Kubernetes API latency (LIST authorizationpolicies) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[00:17:44] <icinga-wm>	 PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 83, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:19:30] <logmsgbot>	 !log cwhite@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host logstash2026.codfw.wmnet with OS bullseye
[00:19:44] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash2026.codfw.wmnet with OS bullseye
[00:20:06] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: hw troubleshooting: CPU alerts for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T324949 (Dzahn) Thanks for adding docs! That's the perfect reaction. I just wanted to create awareness originally.  Your edit https://wikitech.wikimedia.org/w/index.php...
[00:20:52] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Turnilo for USER:wfan - https://phabricator.wikimedia.org/T324057 (AnnWF) @jhathaway  Hi there, sorry for the late reply, I am still not able to login to the https://turnilo.wikimedia.org/ as getting "Service access denied due to missing privileges." when re...
[00:27:44] <icinga-wm>	 RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:30:05] <mutante>	 !log releases2002 - rebooting
[00:30:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:32:27] <mutante>	 !log releases1002 - rebooting
[00:32:28] <icinga-wm>	 PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 83, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:32:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:38:46] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 200 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:40:16] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:41:46] <icinga-wm>	 RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:42:20] <icinga-wm>	 RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 61, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:49:58] <icinga-wm>	 PROBLEM - Check systemd state on parse1003 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:55:48] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash2026.codfw.wmnet with reason: host reimage
[00:58:53] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash2026.codfw.wmnet with reason: host reimage
[01:05:10] <icinga-wm>	 PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 60, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:05:12] <icinga-wm>	 PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 83, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:06:16] <icinga-wm>	 RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 61, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:06:18] <icinga-wm>	 RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:13:16] <icinga-wm>	 PROBLEM - Check systemd state on mwdebug1002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:30:28] <icinga-wm>	 PROBLEM - Check systemd state on mw2271 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:37:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job workhorse in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:42:45] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:46:00] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash2026.codfw.wmnet with OS bullseye
[01:57:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:59:48] <rzl>	 cccccclrjgugfkgfegifdcgblcgbntukllfiegikivlt
[01:59:54] <rzl>	 uh I mean, hi
[02:01:51] <TheresNoTime>	 didn't expect to see keysmashing in -operations
[02:01:54] <TheresNoTime>	 >:D
[02:02:06] <Reedy>	 yubikey smashing
[02:05:46] <icinga-wm>	 PROBLEM - Check systemd state on mw1416 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:07:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:17:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:22:45] <jinxer-wm>	 (JobUnavailable) resolved: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:18:04] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:25:33] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[03:44:38] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:00:18] <icinga-wm>	 PROBLEM - Check systemd state on mwdebug2001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:17:28] <jinxer-wm>	 (KubernetesAPILatency) firing: (10) High Kubernetes API latency (LIST authorizationpolicies) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[04:29:05] <wikibugs>	 (PS1) RLazarus: httpbb: Add tests for test2wiki [puppet] - https://gerrit.wikimedia.org/r/868211 (https://phabricator.wikimedia.org/T290536)
[04:29:07] <wikibugs>	 (PS1) RLazarus: httpbb: Run hourly tests from the cumin hosts against mw-on-k8s [puppet] - https://gerrit.wikimedia.org/r/868212 (https://phabricator.wikimedia.org/T290536)
[04:31:22] <wikibugs>	 (CR) RLazarus: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38797/console" [puppet] - https://gerrit.wikimedia.org/r/868212 (https://phabricator.wikimedia.org/T290536) (owner: RLazarus)
[05:08:38] <icinga-wm>	 RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 89, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:17:01] <wikibugs>	 (PS3) PleaseStand: Remove obsolete setting $wgAutoloadAttemptLowercase [mediawiki-config] - https://gerrit.wikimedia.org/r/867307 (https://phabricator.wikimedia.org/T231412)
[05:42:16] <icinga-wm>	 PROBLEM - Check systemd state on mw1447 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:52:26] <icinga-wm>	 PROBLEM - Check systemd state on mw1448 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:54:24] <icinga-wm>	 PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:13:08] <wikibugs>	 (PS1) Marostegui: Revert "production-m2.sql.erb: Add new user" [puppet] - https://gerrit.wikimedia.org/r/868062
[06:17:13] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "production-m2.sql.erb: Add new user" [puppet] - https://gerrit.wikimedia.org/r/868062 (owner: Marostegui)
[06:30:41] <wikibugs>	 (CR) Marostegui: [C: -1] mariadb: Setup new definitive databases for mediabackups & bacula (1 comment) [puppet] - https://gerrit.wikimedia.org/r/868072 (https://phabricator.wikimedia.org/T313582) (owner: Jcrespo)
[06:36:42] <icinga-wm>	 PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[06:38:12] <icinga-wm>	 RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[06:44:02] <icinga-wm>	 PROBLEM - Check systemd state on mw1417 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:00:04] <jouncebot>	 kormat, marostegui, and Amir1: #bothumor I � Unicode. All rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221215T0700).
[07:00:30] <wikibugs>	 (PS1) Marostegui: phabricator.my.cnf.erb: Remove unix_socket mention [puppet] - https://gerrit.wikimedia.org/r/868213 (https://phabricator.wikimedia.org/T325154)
[07:00:56] <wikibugs>	 (CR) Marostegui: [C: +2] phabricator.my.cnf.erb: Remove unix_socket mention [puppet] - https://gerrit.wikimedia.org/r/868213 (https://phabricator.wikimedia.org/T325154) (owner: Marostegui)
[07:05:18] <wikibugs>	 (PS1) KartikMistry: Enable Section Translation on 6 WPs [mediawiki-config] - https://gerrit.wikimedia.org/r/868215 (https://phabricator.wikimedia.org/T319177)
[07:24:19] <wikibugs>	 (PS5) Jcrespo: mariadb: Setup new definitive databases for mediabackups & bacula [puppet] - https://gerrit.wikimedia.org/r/868072 (https://phabricator.wikimedia.org/T313582)
[07:24:24] <wikibugs>	 (CR) Jcrespo: mariadb: Setup new definitive databases for mediabackups & bacula (1 comment) [puppet] - https://gerrit.wikimedia.org/r/868072 (https://phabricator.wikimedia.org/T313582) (owner: Jcrespo)
[07:24:49] <wikibugs>	 (PS6) Jcrespo: mariadb: Setup new definitive databases for mediabackups & bacula [puppet] - https://gerrit.wikimedia.org/r/868072 (https://phabricator.wikimedia.org/T313582)
[07:25:33] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[07:34:46] <wikibugs>	 (PS2) Ryan Kemper: [WIP] wdqs: extract and validate kafka timestamp [cookbooks] - https://gerrit.wikimedia.org/r/868198 (https://phabricator.wikimedia.org/T325114) (owner: Bking)
[07:36:22] <wikibugs>	 (CR) CI reject: [V: -1] [WIP] wdqs: extract and validate kafka timestamp [cookbooks] - https://gerrit.wikimedia.org/r/868198 (https://phabricator.wikimedia.org/T325114) (owner: Bking)
[07:42:48] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[07:47:48] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[07:52:56] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Don't install quickstack on Bookworm, revisit later [puppet] - https://gerrit.wikimedia.org/r/868078 (https://phabricator.wikimedia.org/T321783) (owner: Muehlenhoff)
[07:56:46] <icinga-wm>	 PROBLEM - Check systemd state on mw1415 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:57:19] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetdb2003.codfw.wmnet
[08:00:04] <jouncebot>	 Amir1, apergos, and jnuche: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221215T0800).
[08:00:05] <jouncebot>	 kart_: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:13] <apergos>	 morning! there are no trainees signed up for the window, and one patch scheduled for deployment. kart_ I assume you will self-deploy?
[08:00:14] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:00:22] <Amir1>	 you can self serve?
[08:00:30] <kart_>	 Yeah. I can self deploy.
[08:00:40] <kart_>	 apergos: Amir1 ^^
[08:00:48] <apergos>	 it's all you, take it away, kart_!
[08:00:53] <kart_>	 :)
[08:00:53] <Amir1>	 awesome, less work for me :P_
[08:01:00] <kart_>	 :D
[08:01:54] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/868215 (https://phabricator.wikimedia.org/T319177) (owner: KartikMistry)
[08:02:37] <wikibugs>	 (Merged) jenkins-bot: Enable Section Translation on 6 WPs [mediawiki-config] - https://gerrit.wikimedia.org/r/868215 (https://phabricator.wikimedia.org/T319177) (owner: KartikMistry)
[08:03:03] <logmsgbot>	 !log kartik@deploy1002 Started scap: Backport for [[gerrit:868215|Enable Section Translation on 6 WPs (T319177)]]
[08:03:07] <stashbot>	 T319177: Enable Section Translation on 6 Wikipedias where Content Translation is available by default - https://phabricator.wikimedia.org/T319177
[08:04:57] <logmsgbot>	 !log kartik@deploy1002 kartik and kartik: Backport for [[gerrit:868215|Enable Section Translation on 6 WPs (T319177)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet
[08:08:30] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host puppetdb2003.codfw.wmnet
[08:13:59] <logmsgbot>	 !log kartik@deploy1002 Finished scap: Backport for [[gerrit:868215|Enable Section Translation on 6 WPs (T319177)]] (duration: 10m 55s)
[08:14:03] <stashbot>	 T319177: Enable Section Translation on 6 Wikipedias where Content Translation is available by default - https://phabricator.wikimedia.org/T319177
[08:17:18] <icinga-wm>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:17:28] <jinxer-wm>	 (KubernetesAPILatency) firing: (10) High Kubernetes API latency (LIST authorizationpolicies) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:18:37] <wikibugs>	 (PS1) Majavah: base: puppet_alert: don't advertise the disable file [puppet] - https://gerrit.wikimedia.org/r/868221
[08:20:36] <icinga-wm>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:21:34] <wikibugs>	 SRE, SRE-Access-Requests, Product-Analytics, Search-Console-access-request: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (akosiaris) Open→Resolved >>! In T238090#8469500, @mpopov wrote: > I just updated @Fuzzy's permissions for he.m.wikisource. U...
[08:22:07] <apergos>	 kart_:  how's it looking? still testing?
[08:24:24] <wikibugs>	 (CR) Marostegui: [C: +1] mariadb: Setup new definitive databases for mediabackups & bacula [puppet] - https://gerrit.wikimedia.org/r/868072 (https://phabricator.wikimedia.org/T313582) (owner: Jcrespo)
[08:27:29] <akosiaris>	 heads up, I am rebooting rdb1009 for kernel upgrades.
[08:28:19] <akosiaris>	 !log reboot rdb1009 for kernel upgrades. possibly (but probably not) affected applications: changeprop, cpjobqueue, api-gateway, redisLockManager
[08:28:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:29] <hashar>	 good morning
[08:30:47] <wikibugs>	 (CR) David Caro: "I really think that it's as useful not advertising it, with the downside that then people will start sending those emails to spam/trash au" [puppet] - https://gerrit.wikimedia.org/r/868221 (owner: Majavah)
[08:30:54] <icinga-wm>	 PROBLEM - Host rdb1009 is DOWN: PING CRITICAL - Packet loss = 100%
[08:32:10] <icinga-wm>	 RECOVERY - Host rdb1009 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[08:33:04] <apergos>	 morning
[08:38:23] <wikibugs>	 (CR) Elukey: [C: +1] "Let's wait for Chris' approval before proceeding but it looks good!" [deployment-charts] - https://gerrit.wikimedia.org/r/867601 (https://phabricator.wikimedia.org/T324567) (owner: AikoChou)
[08:43:11] <wikibugs>	 (CR) JMeybohm: [C: +1] echostore: Tighten egress to explit host/port list [deployment-charts] - https://gerrit.wikimedia.org/r/868146 (owner: Eevans)
[08:44:29] <wikibugs>	 (PS1) Matthias Mullie: [SearchVue] Change protocol-less urls to https [mediawiki-config] - https://gerrit.wikimedia.org/r/868224 (https://phabricator.wikimedia.org/T324446)
[08:47:55] <matthiasmullie>	 kart_: have you completed deployment, or still working? (I don't want to rush you - just want to merge a non-urgent beta-only config patch & want to make sure I'm staying out of your way!)
[08:50:34] <apergos>	 since no reply 20 minutes after I pinged for a check-in, I am assuming they completed and forgot to mention it here
[08:51:10] <akosiaris>	 !log nothing noticed with rdb1007 reboot for mw, jobqueue, api-gateway. changeprop had a minor backlog increase, but everything appears fine now.
[08:51:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:22] <akosiaris>	 !log reboot rdb1007 for kernel upgrades
[08:51:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:33] <akosiaris>	 !log correction: reboot rdb1011 for kernel upgrades
[08:52:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:25] <akosiaris>	 !log reboot rdb2009 for kernel upgrades
[08:53:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:48] <wikibugs>	 (PS1) Marostegui: parsercache.my.cnf.erb: Remove unix_socket mention [puppet] - https://gerrit.wikimedia.org/r/868225 (https://phabricator.wikimedia.org/T325154)
[08:54:54] <icinga-wm>	 PROBLEM - Host rdb1011 is DOWN: PING CRITICAL - Packet loss = 100%
[08:55:06] <icinga-wm>	 RECOVERY - Host rdb1011 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[08:55:18] <jinxer-wm>	 (ProbeDown) firing: Service docker-registry:443 has failed probes (http_docker-registry_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#docker-registry:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:55:32] <wikibugs>	 (CR) Marostegui: [C: +2] parsercache.my.cnf.erb: Remove unix_socket mention [puppet] - https://gerrit.wikimedia.org/r/868225 (https://phabricator.wikimedia.org/T325154) (owner: Marostegui)
[08:55:50] <icinga-wm>	 PROBLEM - Host rdb2009 is DOWN: PING CRITICAL - Packet loss = 100%
[08:56:18] <akosiaris>	 docker registry is probably the rdb2009 reboot, it should resolve quickly
[08:56:18] <icinga-wm>	 RECOVERY - Host rdb2009 is UP: PING OK - Packet loss = 0%, RTA = 33.21 ms
[08:57:18] <jinxer-wm>	 (ProbeDown) firing: Service docker-registry:443 has failed probes (http_docker-registry_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#docker-registry:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:58:35] <vgutierrez>	 akosiaris: ack
[08:58:47] <elukey>	 akosiaris: I got an ORES alert for workers down, now resolved IIUC
[08:58:55] <hashar>	 I will promote all wikis to 1.40.0-wmf.14 in a few minutes
[08:59:26] <vgutierrez>	 acked the alert in VOps
[08:59:40] <elukey>	 thanks :)
[08:59:45] <vgutierrez>	 are you getting those topranks?
[09:00:05] <jouncebot>	 hashar and ^demon: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221215T0900).
[09:00:13] <wikibugs>	 (PS1) TrainBranchBot: all wikis to 1.40.0-wmf.14 [mediawiki-config] - https://gerrit.wikimedia.org/r/868226 (https://phabricator.wikimedia.org/T320519)
[09:00:16] <wikibugs>	 (CR) TrainBranchBot: [C: +2] all wikis to 1.40.0-wmf.14 [mediawiki-config] - https://gerrit.wikimedia.org/r/868226 (https://phabricator.wikimedia.org/T320519) (owner: TrainBranchBot)
[09:00:18] <jinxer-wm>	 (ProbeDown) resolved: (2) Service docker-registry:443 has failed probes (http_docker-registry_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#docker-registry:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:00:52] <wikibugs>	 (Merged) jenkins-bot: all wikis to 1.40.0-wmf.14 [mediawiki-config] - https://gerrit.wikimedia.org/r/868226 (https://phabricator.wikimedia.org/T320519) (owner: TrainBranchBot)
[09:02:18] <jinxer-wm>	 (ProbeDown) resolved: Service docker-registry:443 has failed probes (http_docker-registry_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#docker-registry:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:06:21] <elukey>	 we had a little outage for ORES, nothing big though:
[09:06:22] <elukey>	 https://grafana.wikimedia.org/d/HIRrxQ6mk/ores?orgId=1&from=1671094279394&to=1671094818210
[09:08:18] <logmsgbot>	 !log hashar@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.40.0-wmf.14  refs T320519
[09:08:22] <stashbot>	 T320519: 1.40.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T320519
[09:10:39] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host urldownloader1001.wikimedia.org
[09:11:01] <akosiaris>	 elukey: I'll have to repeat it, got more hosts I need to reboot
[09:11:21] <akosiaris>	 actually only 1 is left
[09:12:21] <akosiaris>	 !log reboot rdb2007 for kernel upgrades
[09:12:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:09] <elukey>	 akosiaris: next time we can do a failover if you want, I can prep the code reviews in advance etc..
[09:13:12] <icinga-wm>	 RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:13:12] <icinga-wm>	 PROBLEM - Host rdb2007 is DOWN: PING CRITICAL - Packet loss = 100%
[09:13:28] <akosiaris>	 elukey: we could, but is it worth it? 
[09:13:32] <icinga-wm>	 RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:14:20] <akosiaris>	 my understanding from the last time we did this was that no, but maybe I misremember
[09:15:04] <icinga-wm>	 RECOVERY - Host rdb2007 is UP: PING OK - Packet loss = 0%, RTA = 31.62 ms
[09:15:06] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader1001.wikimedia.org
[09:15:09] <elukey>	 I am trying to remember as well, did it break the same way? IIRC no, this time an alert fired (never seen it to be honest)
[09:17:20] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host urldownloader2002.wikimedia.org
[09:17:36] <akosiaris>	 I remember something similar in the graphs, but I don't remember if an alert fired or not.
[09:18:08] <akosiaris>	 uptime on the redis hosts was 155 days, so we can pin it down and figure out if it fired or not
[09:21:45] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader2002.wikimedia.org
[09:27:59] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.kafka.reboot-workers for Kafka test-eqiad cluster: Reboot kafka nodes
[09:30:14] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host acmechief-test1001.eqiad.wmnet
[09:30:38] <logmsgbot>	 !log vgutierrez@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host acmechief-test1001.eqiad.wmnet
[09:31:28] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[09:31:50] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host acmechief-test1001.eqiad.wmnet
[09:32:40] <vgutierrez>	 hmmm is that dashboard broken ^^?
[09:33:35] <hashar>	 vgutierrez: the URL is cut after some amount of bytes, probably by the IRC bot
[09:34:01] <hashar>	 that alarm also keeps triggering but there is no real spike showing up in the graph
[09:34:16] <hashar>	 IIRC claime said he is aware of it / investigating
[09:34:40] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[09:34:42] <vgutierrez>	 we have a few spikes on POSTs
[09:34:45] <vgutierrez>	 https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?from=now-3h&orgId=1&to=now&var-cluster=api_appserver&var-datasource=codfw%20prometheus%2Fops&var-method=POST&viewPanel=9&var-site=codfw&var-code=200&var-php_version=All
[09:34:49] <hashar>	 and when it recovers, it has the proper URL
[09:35:02] <hashar>	 the PROBLEM alarm lacks var-method=POST ;)
[09:35:17] <vgutierrez>	 especially if you compare it against eqiad
[09:36:28] <hashar>	 and the alarm triggers 6 minutes after the initial spike (I guess because the observed window is a few minutes wide AND Icinga might recheck it 3 times before turning the alarm into a hard state, which triggers the notification)
[09:36:36] <hashar>	 but that is a different issue ;)
[09:37:32] <claime>	 So we really need to do something about this one
[09:37:41] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief-test1001.eqiad.wmnet
[09:37:41] <claime>	 Because there's like 2 POST/s
[09:38:02] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host acmechief-test2001.codfw.wmnet
[09:38:09] <wikibugs>	 (03PS1) 10Muehlenhoff: Failover URL downloaders [dns] - 10https://gerrit.wikimedia.org/r/868229
[09:38:18] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host apt1001.wikimedia.org
[09:38:33] <wikibugs>	 (03PS1) 10Marostegui: db1206: Testing RAID controller [puppet] - 10https://gerrit.wikimedia.org/r/868230 (https://phabricator.wikimedia.org/T324181)
[09:38:56] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db1206: Testing RAID controller [puppet] - 10https://gerrit.wikimedia.org/r/868230 (https://phabricator.wikimedia.org/T324181) (owner: 10Marostegui)
[09:40:14] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief-test2001.codfw.wmnet
[09:41:26] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host acmechief2001.codfw.wmnet
[09:42:06] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt1001.wikimedia.org
[09:42:53] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C: 03+1] ml-services: remove translatewiki and frwikisource isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/867601 (https://phabricator.wikimedia.org/T324567) (owner: 10AikoChou)
[09:43:37] <wikibugs>	 (03PS7) 10Jcrespo: mariadb: Setup new definitive databases for mediabackups & bacula [puppet] - 10https://gerrit.wikimedia.org/r/868072 (https://phabricator.wikimedia.org/T313582)
[09:43:56] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] mariadb: Setup new definitive databases for mediabackups & bacula [puppet] - 10https://gerrit.wikimedia.org/r/868072 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo)
[09:45:08] <wikibugs>	 (03CR) 10Btullis: Backing up HDFS FSImage to HDFS on Monday morning (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/866650 (https://phabricator.wikimedia.org/T324850) (owner: 10Aqu)
[09:45:33] <jinxer-wm>	 (KeyholderUnarmed) firing: (2) 1 unarmed Keyholder key(s) on acmechief-test2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[09:45:40] <wikibugs>	 (03CR) 10Jcrespo: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/868072 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo)
[09:48:08] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Failover URL downloaders [dns] - 10https://gerrit.wikimedia.org/r/868229 (owner: 10Muehlenhoff)
[09:49:49] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] mariadb: Setup new definitive databases for mediabackups & bacula [puppet] - 10https://gerrit.wikimedia.org/r/868072 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo)
[09:51:23] <logmsgbot>	 !log vgutierrez@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host acmechief2001.codfw.wmnet
[09:52:34] <wikibugs>	 (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38799/console" [puppet] - 10https://gerrit.wikimedia.org/r/868035 (https://phabricator.wikimedia.org/T325069) (owner: 10Jelto)
[09:53:07] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host acmechief1001.eqiad.wmnet
[09:54:03] <effie>	 !log stopping and masking nutcracker on mw servers - T277183 
[09:54:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:07] <stashbot>	 T277183: Phase out nutcracker from mediawiki servers - https://phabricator.wikimedia.org/T277183
[09:55:33] <jinxer-wm>	 (KeyholderUnarmed) firing: (3) 1 unarmed Keyholder key(s) on acmechief-test1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[09:56:50] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief1001.eqiad.wmnet
[10:01:52] <icinga-wm>	 PROBLEM - Host acmechief2001 is DOWN: PING CRITICAL - Packet loss = 100%
[10:04:36] <wikibugs>	 (03PS2) 10Effie Mouzeli: Redis sessions: Goodbye [puppet] - 10https://gerrit.wikimedia.org/r/864830 (https://phabricator.wikimedia.org/T267581)
[10:05:36] <icinga-wm>	 PROBLEM - Check systemd state on acmechief1001 is CRITICAL: CRITICAL - degraded: The following units failed: acme-chief-certs-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:06:17] <wikibugs>	 (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab_runner: add trusted tag to Trusted Runners [puppet] - 10https://gerrit.wikimedia.org/r/868035 (https://phabricator.wikimedia.org/T325069) (owner: 10Jelto)
[10:06:37] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] gitlab_runner: add wmcs tag to Shared Runners [puppet] - 10https://gerrit.wikimedia.org/r/868036 (https://phabricator.wikimedia.org/T325069) (owner: 10Jelto)
[10:06:41] <vgutierrez>	 ^^ acmechief2001 is down
[10:07:28] <jinxer-wm>	 (KubernetesAPILatency) firing: (11) High Kubernetes API latency (LIST authorizationpolicies) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:07:34] <wikibugs>	 (03PS1) 10Sergio Gimeno: Vue components: react to binding updates of v-click-outside directive [extensions/GrowthExperiments] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868063 (https://phabricator.wikimedia.org/T325041)
[10:08:14] <wikibugs>	 (03PS1) 10Sergio Gimeno: User impact: instrument info tooltip clicks [extensions/GrowthExperiments] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868064 (https://phabricator.wikimedia.org/T325041)
[10:08:18] <wikibugs>	 (03PS2) 10Jelto: gitlab_runner: add trusted tag to Trusted Runners [puppet] - 10https://gerrit.wikimedia.org/r/868035 (https://phabricator.wikimedia.org/T325069)
[10:08:22] <wikibugs>	 (03PS2) 10Jelto: gitlab_runner: add wmcs tag to Shared Runners [puppet] - 10https://gerrit.wikimedia.org/r/868036 (https://phabricator.wikimedia.org/T325069)
[10:12:26] <icinga-wm>	 PROBLEM - Check systemd state on mw1351 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:12:28] <jinxer-wm>	 (KubernetesAPILatency) firing: (11) High Kubernetes API latency (LIST authorizationpolicies) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:13:28] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Enhance account handling (meta bug) - https://phabricator.wikimedia.org/T142815 (10akosiaris)
[10:13:32] <wikibugs>	 10SRE, 10Striker, 10LDAP: Store Wikimedia unified account name (SUL) in LDAP directory - https://phabricator.wikimedia.org/T148048 (10akosiaris) 05Open→03Stalled This is open since 2016 with minimal updates, probably inaccurate now (as far as I know the corp LDAP doesn't exist now) and it is unclear, at...
[10:17:52] <icinga-wm>	 PROBLEM - Check systemd state on mw2286 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:18:34] <wikibugs>	 (03PS1) 10JMeybohm: If-guard admin_ng objects no longer relevant for k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/868232 (https://phabricator.wikimedia.org/T299236)
[10:19:22] <icinga-wm>	 PROBLEM - Check systemd state on mw2367 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:20:48] <icinga-wm>	 PROBLEM - Check systemd state on mw2382 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:21:00] <wikibugs>	 10SRE, 10Striker, 10LDAP: Store Wikimedia unified account name (SUL) in LDAP directory - https://phabricator.wikimedia.org/T148048 (10MoritzMuehlenhoff) >>! In T148048#8470183, @akosiaris wrote: > This is open since 2016 with minimal updates, probably inaccurate now (as far as I know the corp LDAP doesn't ex...
[10:21:56] <wikibugs>	 10SRE, 10Infrastructure-Foundations: IDM milestone 2 "Initial limited deployment" - https://phabricator.wikimedia.org/T320603 (10MoritzMuehlenhoff)
[10:22:01] <wikibugs>	 10SRE, 10Striker, 10LDAP: Store Wikimedia unified account name (SUL) in LDAP directory - https://phabricator.wikimedia.org/T148048 (10MoritzMuehlenhoff)
[10:22:32] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] If-guard admin_ng objects no longer relevant for k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/868232 (https://phabricator.wikimedia.org/T299236) (owner: 10JMeybohm)
[10:23:22] <icinga-wm>	 PROBLEM - Check systemd state on mw1441 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:26:34] <wikibugs>	 (03PS1) 10DCausse: team-search-platform: relax kafka burrow check [alerts] - 10https://gerrit.wikimedia.org/r/868234
[10:27:34] <icinga-wm>	 PROBLEM - Check systemd state on mw2259 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:28:11] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] team-search-platform: relax kafka burrow check [alerts] - 10https://gerrit.wikimedia.org/r/868234 (owner: 10DCausse)
[10:28:50] <icinga-wm>	 PROBLEM - Check systemd state on mw2371 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:30:34] <wikibugs>	 (03PS1) 10Marostegui: orchestrator.conf.json.erb: Add Jaime to powerusers [puppet] - 10https://gerrit.wikimedia.org/r/868235
[10:31:24] <wikibugs>	 (03CR) 10Marostegui: orchestrator.conf.json.erb: Add Jaime to powerusers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868235 (owner: 10Marostegui)
[10:31:52] <icinga-wm>	 PROBLEM - Check systemd state on mw1401 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:32:28] <jinxer-wm>	 (KubernetesAPILatency) resolved: (10) High Kubernetes API latency (LIST authorizationpolicies) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:32:30] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host urldownloader1002.wikimedia.org
[10:34:08] <jayme>	 !log restarted istiod pods in aux-k8s because of T303184
[10:34:08] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: backup1-eqiad on db1205 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:34:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:34:11] <stashbot>	 T303184: High API server request latencies (LIST)  - https://phabricator.wikimedia.org/T303184
[10:34:25] <marostegui>	 jynus: ^ should I downtime those?
[10:34:47] <jynus>	 in theory they should have notifications disabled
[10:35:00] <icinga-wm>	 PROBLEM - Check systemd state on mw1475 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:35:24] <marostegui>	 jynus: I just noticed you used profile::base::notifications: disabled and I always use profile::monitoring::notifications_enabled: false
[10:35:26] <icinga-wm>	 PROBLEM - Check systemd state on mw1489 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:35:44] <icinga-wm>	 PROBLEM - MariaDB read only backup1-codfw on db2184 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[10:35:55] <jynus>	 I see
[10:36:02] <jynus>	 I used the old syntax
[10:36:13] <marostegui>	 I missed that in the review :(
[10:36:48] <icinga-wm>	 PROBLEM - Check systemd state on mw2406 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:36:57] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader1002.wikimedia.org
[10:37:15] <jynus>	 I disabled manually on icinga
[10:37:24] <marostegui>	 do you want me to send a patch?
[10:37:25] <jynus>	 but alertmanager will complain, I guess
[10:37:50] <jynus>	 I have a meeting, but I will do a proper patch after that
[10:37:59] <marostegui>	 don't worry, I will do it
[10:38:01] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host urldownloader2001.wikimedia.org
[10:39:40] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Disable notifications on new databases [puppet] - 10https://gerrit.wikimedia.org/r/868337 (https://phabricator.wikimedia.org/T313582)
[10:39:48] <marostegui>	 jynus: going to merge ^
[10:40:18] <icinga-wm>	 RECOVERY - Host acmechief2001 is UP: PING OK - Packet loss = 0%, RTA = 31.85 ms
[10:41:06] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Disable notifications on new databases [puppet] - 10https://gerrit.wikimedia.org/r/868337 (https://phabricator.wikimedia.org/T313582) (owner: 10Marostegui)
[10:41:14] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] mariadb: Disable notifications on new databases [puppet] - 10https://gerrit.wikimedia.org/r/868337 (https://phabricator.wikimedia.org/T313582) (owner: 10Marostegui)
[10:41:29] <marostegui>	 jynus: https://gerrit.wikimedia.org/r/c/operations/puppet/+/868235/1 is that jcrespo or jynus?
[10:41:31] <jynus>	 thank you, I have too many things on my plate right now
[10:41:38] <marostegui>	 jynus: don't worry, I will take care of it
[10:42:26] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader2001.wikimedia.org
[10:43:27] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host krb2001.codfw.wmnet
[10:43:38] <wikibugs>	 (03PS2) 10Marostegui: mariadb: Disable notifications on new databases [puppet] - 10https://gerrit.wikimedia.org/r/868337 (https://phabricator.wikimedia.org/T313582)
[10:44:14] <wikibugs>	 (03PS3) 10Marostegui: mariadb: Disable notifications on new databases [puppet] - 10https://gerrit.wikimedia.org/r/868337 (https://phabricator.wikimedia.org/T313582)
[10:45:16] <icinga-wm>	 RECOVERY - Check systemd state on acmechief1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:45:33] <jinxer-wm>	 (KeyholderUnarmed) firing: (2) 1 unarmed Keyholder key(s) on acmechief2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[10:46:16] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Disable notifications on new databases [puppet] - 10https://gerrit.wikimedia.org/r/868338 (https://phabricator.wikimedia.org/T313582)
[10:46:24] <wikibugs>	 (03Abandoned) 10Marostegui: mariadb: Disable notifications on new databases [puppet] - 10https://gerrit.wikimedia.org/r/868337 (https://phabricator.wikimedia.org/T313582) (owner: 10Marostegui)
[10:47:00] <XioNoX>	 !log disable ping offload in eqiad
[10:47:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:04] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Disable notifications on new databases [puppet] - 10https://gerrit.wikimedia.org/r/868338 (https://phabricator.wikimedia.org/T313582) (owner: 10Marostegui)
[10:48:40] <icinga-wm>	 PROBLEM - Check systemd state on mw1497 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:49:33] <wikibugs>	 (03PS1) 10Marostegui: orchestrator.conf.json.erb: Add Jaime to powerusers [puppet] - 10https://gerrit.wikimedia.org/r/868339
[10:49:37] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host krb2001.codfw.wmnet
[10:49:39] <wikibugs>	 (03Abandoned) 10Marostegui: orchestrator.conf.json.erb: Add Jaime to powerusers [puppet] - 10https://gerrit.wikimedia.org/r/868235 (owner: 10Marostegui)
[10:50:10] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ping1002.eqiad.wmnet
[10:50:15] <wikibugs>	 (03CR) 10Marostegui: "Let me know if it is jcrespo or jynus" [puppet] - 10https://gerrit.wikimedia.org/r/868339 (owner: 10Marostegui)
[10:50:33] <jinxer-wm>	 (KeyholderUnarmed) firing: (2) 1 unarmed Keyholder key(s) on acmechief2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[10:51:20] <vgutierrez>	 :?
[10:51:44] <icinga-wm>	 PROBLEM - Check systemd state on mw1402 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:52:23] <wikibugs>	 (03PS2) 10DCausse: team-search-platform: relax kafka burrow check [alerts] - 10https://gerrit.wikimedia.org/r/868234
[10:53:56] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1002.eqiad.wmnet
[10:53:57] <wikibugs>	 (03PS2) 10Daniel Kinzler: Increase PC writes from parsoid API to 10% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/868127
[10:54:02] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Grant ssh access to analytics-admins to mnz - https://phabricator.wikimedia.org/T325072 (10akosiaris) @odimitrijevic, @Ottomata, we need the approval of one of you on this one.
[10:55:18] <icinga-wm>	 PROBLEM - Check systemd state on mw2388 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:56:40] <icinga-wm>	 PROBLEM - Check systemd state on parse2018 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:58:07] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & analytics-product-users for Hxi-ctr - https://phabricator.wikimedia.org/T325004 (10akosiaris) @odimitrijevic, @Ottomata, we need the approval of one of you on this one.
[10:59:02] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & analytics-product-users for Hxi-ctr - https://phabricator.wikimedia.org/T325004 (10akosiaris) analytics-product-users doesn't have an approver listed, let me chase that one down.
[10:59:34] <icinga-wm>	 PROBLEM - Check systemd state on mw1439 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:59:58] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & analytics-product-users for Hxi-ctr - https://phabricator.wikimedia.org/T325004 (10akosiaris)
[11:00:04] <jouncebot>	 mvolz: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221215T1100).
[11:00:38] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ping2002.codfw.wmnet
[11:01:32] <icinga-wm>	 PROBLEM - Check systemd state on mw1454 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:04:32] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2002.codfw.wmnet
[11:08:19] <wikibugs>	 (03PS1) 10Effie Mouzeli: tegola-vector-tiles: add more replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/868343 (https://phabricator.wikimedia.org/T314472)
[11:10:02] <icinga-wm>	 PROBLEM - Check systemd state on mw1443 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:10:22] <icinga-wm>	 PROBLEM - Check systemd state on mw2401 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:11:03] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ping3002.esams.wmnet
[11:14:04] <icinga-wm>	 PROBLEM - Check systemd state on mw2387 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:14:25] <wikibugs>	 (03PS1) 10Slyngshede: P:IDM Fix minor configuration issues. [puppet] - 10https://gerrit.wikimedia.org/r/868345
[11:14:44] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:IDM Fix minor configuration issues. [puppet] - 10https://gerrit.wikimedia.org/r/868345 (owner: 10Slyngshede)
[11:15:22] <wikibugs>	 (03PS2) 10Effie Mouzeli: tegola-vector-tiles: add more replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/868343 (https://phabricator.wikimedia.org/T314472)
[11:15:45] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.kafka.reboot-workers (exit_code=0) for Kafka test-eqiad cluster: Reboot kafka nodes
[11:15:55] <wikibugs>	 (03PS2) 10Slyngshede: P:IDM Fix minor configuration issues. [puppet] - 10https://gerrit.wikimedia.org/r/868345
[11:16:19] <wikibugs>	 (03CR) 10Jgiannelos: [C: 03+1] tegola-vector-tiles: add more replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/868343 (https://phabricator.wikimedia.org/T314472) (owner: 10Effie Mouzeli)
[11:16:24] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping3002.esams.wmnet
[11:18:17] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] orchestrator.conf.json.erb: Add Jaime to powerusers [puppet] - 10https://gerrit.wikimedia.org/r/868339 (owner: 10Marostegui)
[11:19:28] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:19:49] <wikibugs>	 (03PS3) 10Slyngshede: P:IDM Fix minor configuration issues. [puppet] - 10https://gerrit.wikimedia.org/r/868345
[11:20:38] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] tegola-vector-tiles: add more replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/868343 (https://phabricator.wikimedia.org/T314472) (owner: 10Effie Mouzeli)
[11:21:09] <wikibugs>	 (03PS4) 10Slyngshede: P:IDM Fix minor configuration issues. [puppet] - 10https://gerrit.wikimedia.org/r/868345
[11:21:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host flowspec1001.eqiad.wmnet
[11:21:54] <icinga-wm>	 PROBLEM - Check systemd state on mw1435 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:22:12] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:22:24] <icinga-wm>	 PROBLEM - Check systemd state on mw1474 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:23:20] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+1] maps: remove redis [puppet] - 10https://gerrit.wikimedia.org/r/865056 (https://phabricator.wikimedia.org/T298246) (owner: 10Hnowlan)
[11:23:41] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] Redis sessions: Goodbye [puppet] - 10https://gerrit.wikimedia.org/r/864830 (https://phabricator.wikimedia.org/T267581) (owner: 10Effie Mouzeli)
[11:24:52] <icinga-wm>	 PROBLEM - Check systemd state on mw1423 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:25:43] <wikibugs>	 (03Merged) 10jenkins-bot: tegola-vector-tiles: add more replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/868343 (https://phabricator.wikimedia.org/T314472) (owner: 10Effie Mouzeli)
[11:26:07] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] P:IDM Fix minor configuration issues. [puppet] - 10https://gerrit.wikimedia.org/r/868345 (owner: 10Slyngshede)
[11:27:40] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host flowspec1001.eqiad.wmnet
[11:27:56] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49121 bytes in 0.104 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:27:58] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.249 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:30:16] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] orchestrator.conf.json.erb: Add Jaime to powerusers [puppet] - 10https://gerrit.wikimedia.org/r/868339 (owner: 10Marostegui)
[11:34:09] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host krb1001.eqiad.wmnet
[11:34:15] <icinga-wm>	 PROBLEM - Check systemd state on mw2385 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:37:36] <logmsgbot>	 !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@00c9a16] (codfw): codfw: Disable traffic mirroring
[11:39:20] <logmsgbot>	 !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@00c9a16] (codfw): codfw: Disable traffic mirroring (duration: 01m 43s)
[11:39:48] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host krb1001.eqiad.wmnet
[11:42:03] <effie>	 !log switching maps/kartotherian from codfw to eqiad 
[11:42:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:42:43] <icinga-wm>	 PROBLEM - Check systemd state on mw1431 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:43:06] <logmsgbot>	 !log jiji@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=eqiad
[11:44:06] <effie>	 the failed nutcracker things are mine, I will deal with them in a bit 
[11:44:23] <icinga-wm>	 PROBLEM - Check systemd state on mw2311 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:45:05] <wikibugs>	 (03PS1) 10Jbond: P:installserver::proxy: allow production to use squid to proxy ssh [puppet] - 10https://gerrit.wikimedia.org/r/868370
[11:50:53] <icinga-wm>	 PROBLEM - Check systemd state on mw1464 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:53:40] <wikibugs>	 (03CR) 10Hashar: [C: 04-1] "Should be done after the role has been applied on the host since scap will run scripts upon deployment (such as restarting php-fpm when de" [puppet] - 10https://gerrit.wikimedia.org/r/867670 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn)
[11:53:55] <icinga-wm>	 PROBLEM - Check systemd state on mw2316 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:53:55] <icinga-wm>	 PROBLEM - Check systemd state on mw2350 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:54:45] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host seaborgium.wikimedia.org
[11:55:28] <wikibugs>	 (03Abandoned) 10Hashar: build: manage dependencies with rules_jvm_external [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/816172 (owner: 10Hashar)
[11:55:30] <wikibugs>	 (03Abandoned) 10Hashar: Json schema from Gerrit Java event classes [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/830654 (https://phabricator.wikimedia.org/T304947) (owner: 10Hashar)
[11:55:31] <icinga-wm>	 PROBLEM - Check systemd state on mw1378 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:55:33] <wikibugs>	 (03Abandoned) 10Hashar: Implement REST API and Ssh commands [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/814725 (https://phabricator.wikimedia.org/T304947) (owner: 10Hashar)
[11:55:36] <wikibugs>	 (03Abandoned) 10Hashar: Send events to Wikimedia EventGate [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/814807 (owner: 10Hashar)
[11:58:37] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host seaborgium.wikimedia.org
[11:58:47] <icinga-wm>	 PROBLEM - Check systemd state on mw2276 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:00:09] <icinga-wm>	 PROBLEM - Check systemd state on mw2269 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:02:49] <wikibugs>	 (03PS1) 10Jbond: P:installserver::proxy: add ability to proxy ssh ports [puppet] - 10https://gerrit.wikimedia.org/r/868372 (https://phabricator.wikimedia.org/T319401)
[12:03:30] <wikibugs>	 (03PS1) 10Btullis: Increase max_connections on analytics_meta MariaDB [puppet] - 10https://gerrit.wikimedia.org/r/868373 (https://phabricator.wikimedia.org/T325278)
[12:07:05] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netmon1002.wikimedia.org
[12:07:12] <logmsgbot>	 !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@00c9a16] (eqiad): codfw: Disable traffic mirroring
[12:08:12] <logmsgbot>	 !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@00c9a16] (eqiad): codfw: Disable traffic mirroring (duration: 01m 00s)
[12:08:22] <wikibugs>	 (03PS2) 10Jbond: P:installserver::proxy: add ability to proxy ssh ports [puppet] - 10https://gerrit.wikimedia.org/r/868372 (https://phabricator.wikimedia.org/T319401)
[12:10:07] <wikibugs>	 (03PS3) 10Jbond: P:installserver::proxy: add ability to proxy ssh ports [puppet] - 10https://gerrit.wikimedia.org/r/868372 (https://phabricator.wikimedia.org/T319401)
[12:10:09] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] "I have not much to say here as we do not maintain this DB. My recommendation would be to closely monitor connections and if needing more a" [puppet] - 10https://gerrit.wikimedia.org/r/868373 (https://phabricator.wikimedia.org/T325278) (owner: 10Btullis)
[12:10:39] <icinga-wm>	 PROBLEM - Check systemd state on mw1482 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:11:17] <icinga-wm>	 PROBLEM - Check systemd state on mw2273 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:11:46] <wikibugs>	 (03PS1) 10Jcrespo: orchestrator: Change poweruser jcrespo to use the shell name: jynus [puppet] - 10https://gerrit.wikimedia.org/r/868376
[12:12:05] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] orchestrator: Change poweruser jcrespo to use the shell name: jynus [puppet] - 10https://gerrit.wikimedia.org/r/868376 (owner: 10Jcrespo)
[12:12:09] <icinga-wm>	 PROBLEM - Check systemd state on mw1436 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:12:28] <wikibugs>	 (03PS1) 10Volans: spicerack: update config for v6.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/868377 (https://phabricator.wikimedia.org/T325168)
[12:12:30] <wikibugs>	 (03PS1) 10Volans: [DRAFT] spicerack: refactor puppetization [puppet] - 10https://gerrit.wikimedia.org/r/868378 (https://phabricator.wikimedia.org/T325168)
[12:12:32] <wikibugs>	 (03CR) 10Jcrespo: "I added the puppet comment so we don't get confused again :-)" [puppet] - 10https://gerrit.wikimedia.org/r/868376 (owner: 10Jcrespo)
[12:12:34] <wikibugs>	 (03PS4) 10Jbond: P:installserver::proxy: add ability to proxy ssh ports [puppet] - 10https://gerrit.wikimedia.org/r/868372 (https://phabricator.wikimedia.org/T319401)
[12:12:45] <icinga-wm>	 PROBLEM - Check systemd state on parse1018 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:12:58] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1002.wikimedia.org
[12:13:02] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] orchestrator: Change poweruser jcrespo to use the shell name: jynus [puppet] - 10https://gerrit.wikimedia.org/r/868376 (owner: 10Jcrespo)
[12:13:32] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38806/console" [puppet] - 10https://gerrit.wikimedia.org/r/868372 (https://phabricator.wikimedia.org/T319401) (owner: 10Jbond)
[12:14:40] <wikibugs>	 (03PS5) 10Jbond: P:installserver::proxy: add ability to proxy ssh ports [puppet] - 10https://gerrit.wikimedia.org/r/868372 (https://phabricator.wikimedia.org/T319401)
[12:15:25] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/868379 (owner: 10L10n-bot)
[12:15:33] <jinxer-wm>	 (KeyholderUnarmed) firing: (2) 1 unarmed Keyholder key(s) on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[12:16:02] <wikibugs>	 (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/868377 (https://phabricator.wikimedia.org/T325168) (owner: 10Volans)
[12:16:13] <wikibugs>	 (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/868378 (https://phabricator.wikimedia.org/T325168) (owner: 10Volans)
[12:16:33] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:installserver::proxy: add ability to proxy ssh ports [puppet] - 10https://gerrit.wikimedia.org/r/868372 (https://phabricator.wikimedia.org/T319401) (owner: 10Jbond)
[12:16:52] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "This must be deployed in conjunction with the deploy of Spicerack v6.0.0 on the fleet." [puppet] - 10https://gerrit.wikimedia.org/r/868377 (https://phabricator.wikimedia.org/T325168) (owner: 10Volans)
[12:18:19] <wikibugs>	 (03PS6) 10Jbond: P:installserver::proxy: add ability to proxy ssh ports [puppet] - 10https://gerrit.wikimedia.org/r/868372 (https://phabricator.wikimedia.org/T319401)
[12:19:33] <logmsgbot>	 !log jiji@cumin1001 conftool action : set/pooled=yes; selector: dc=eqiad,service=kartotherian-ssl,name=maps1010.eqiad.wmnet
[12:19:33] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Increase max_connections on analytics_meta MariaDB (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868373 (https://phabricator.wikimedia.org/T325278) (owner: 10Btullis)
[12:19:39] <logmsgbot>	 !log jiji@cumin1001 conftool action : set/pooled=yes; selector: dc=eqiad,service=kartotherian,name=maps1010.eqiad.wmnet
[12:19:57] <wikibugs>	 (03CR) 10Volans: "This is a draft proposal to setup the two spicerack/cookbooks environments in the different hosts." [puppet] - 10https://gerrit.wikimedia.org/r/868378 (https://phabricator.wikimedia.org/T325168) (owner: 10Volans)
[12:20:32] <logmsgbot>	 !log jiji@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=codfw
[12:20:46] <wikibugs>	 (03PS2) 10Volans: spicerack: update config for v6.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/868377 (https://phabricator.wikimedia.org/T325168)
[12:20:53] <wikibugs>	 (03CR) 10Volans: "check_experimental" [puppet] - 10https://gerrit.wikimedia.org/r/868377 (https://phabricator.wikimedia.org/T325168) (owner: 10Volans)
[12:23:14] <wikibugs>	 (03PS2) 10JMeybohm: If-guard admin_ng objects no longer relevant for k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/868232 (https://phabricator.wikimedia.org/T299236)
[12:23:53] <wikibugs>	 (03CR) 10Kevin Bazira: [C: 03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/867601 (https://phabricator.wikimedia.org/T324567) (owner: 10AikoChou)
[12:23:55] <wikibugs>	 (03CR) 10Jbond: P:installserver::proxy: add ability to proxy ssh ports (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868372 (https://phabricator.wikimedia.org/T319401) (owner: 10Jbond)
[12:25:54] <wikibugs>	 (03PS1) 10JMeybohm: WIP: Update staging-codfw to k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/868389 (https://phabricator.wikimedia.org/T307943)
[12:27:05] <icinga-wm>	 PROBLEM - Check systemd state on mw1450 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:28:38] <akosiaris>	 ottomata: any hints as to who should be added as approver for "analytics-product-users" ? See https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/admin/data/data.yaml#918
[12:29:25] <icinga-wm>	 PROBLEM - puppet last run on idp-test1002 is CRITICAL: CRITICAL: Puppet last ran 2 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[12:30:27] <wikibugs>	 (03PS4) 10Jbond: sre.hosts.reboot-single: add ability to enable host on reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/868077 (https://phabricator.wikimedia.org/T325153)
[12:32:52] <wikibugs>	 (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/868377 (https://phabricator.wikimedia.org/T325168) (owner: 10Volans)
[12:34:59] <icinga-wm>	 RECOVERY - puppet last run on idp-test1002 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[12:36:23] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-stretch1001.eqiad.wmnet with OS bullseye
[12:36:31] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host kafka-stretch1001.eqiad.wmnet with OS bullseye
[12:36:37] <icinga-wm>	 PROBLEM - Check systemd state on mw2356 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:39:40] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/868377 (https://phabricator.wikimedia.org/T325168) (owner: 10Volans)
[12:40:37] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "This must be deployed in conjunction with the deploy of Spicerack v6.0.0 on the fleet." [puppet] - 10https://gerrit.wikimedia.org/r/868377 (https://phabricator.wikimedia.org/T325168) (owner: 10Volans)
[12:42:41] <wikibugs>	 (03PS10) 10Slyngshede: sre.ganeti.reimage: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661)
[12:43:10] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Update production mysql grants with unix_socket & heartbeat [puppet] - 10https://gerrit.wikimedia.org/r/868392
[12:43:53] <wikibugs>	 (03PS1) 10Muehlenhoff: profile::spicerack: Stop writing Redis sessions data [puppet] - 10https://gerrit.wikimedia.org/r/868393 (https://phabricator.wikimedia.org/T267581)
[12:44:07] <wikibugs>	 (03Abandoned) 10Jbond: P:installserver::proxy: allow production to use squid to proxy ssh [puppet] - 10https://gerrit.wikimedia.org/r/868370 (owner: 10Jbond)
[12:44:21] <wikibugs>	 (03PS2) 10Jcrespo: mariadb: Update production mysql grants with unix_socket & heartbeat [puppet] - 10https://gerrit.wikimedia.org/r/868392
[12:44:27] <wikibugs>	 (03CR) 10Jaime Nuche: mwdebug_deploy: remove configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/867221 (owner: 10Jaime Nuche)
[12:46:03] <wikibugs>	 (03CR) 10Jcrespo: "Sending you the issues I had to setup the new backup databases from 0, in the form of the patch. Feel free to alter the patch in any way y" [puppet] - 10https://gerrit.wikimedia.org/r/868392 (owner: 10Jcrespo)
[12:46:16] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] tools-webservice: create /etc/toolforge/webservice.yaml with puppet [puppet] - 10https://gerrit.wikimedia.org/r/867911 (https://phabricator.wikimedia.org/T323689) (owner: 10Raymond Ndibe)
[12:47:14] <wikibugs>	 (03PS3) 10Hnowlan: maps: remove redis [puppet] - 10https://gerrit.wikimedia.org/r/865056 (https://phabricator.wikimedia.org/T298246)
[12:47:16] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, thanks for the patch Moritz, we were chatting about this in -serviceops" [puppet] - 10https://gerrit.wikimedia.org/r/868393 (https://phabricator.wikimedia.org/T267581) (owner: 10Muehlenhoff)
[12:47:17] <icinga-wm>	 PROBLEM - Check systemd state on parse2001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:47:39] <icinga-wm>	 PROBLEM - Check systemd state on mw1404 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:47:48] <wikibugs>	 (03PS2) 10Jbond: rsyslog: use ensure_resource for package_from_component [puppet] - 10https://gerrit.wikimedia.org/r/868148 (https://phabricator.wikimedia.org/T324623) (owner: 10Southparkfan)
[12:48:11] <wikibugs>	 (03PS3) 10Jcrespo: mariadb: Update production mysql grants with unix_socket & heartbeat [puppet] - 10https://gerrit.wikimedia.org/r/868392
[12:48:36] <wikibugs>	 (03PS4) 10Jcrespo: mariadb: Update production mysql grants with unix_socket & heartbeat [puppet] - 10https://gerrit.wikimedia.org/r/868392
[12:48:47] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-stretch1001.eqiad.wmnet with reason: host reimage
[12:49:27] <wikibugs>	 (03PS5) 10Jcrespo: mariadb: Update production mysql grants with unix_socket & heartbeat [puppet] - 10https://gerrit.wikimedia.org/r/868392
[12:49:54] <wikibugs>	 (03PS3) 10Jbond: rsyslog: use ensure_resource for package_from_component [puppet] - 10https://gerrit.wikimedia.org/r/868148 (https://phabricator.wikimedia.org/T324623) (owner: 10Southparkfan)
[12:50:07] <wikibugs>	 (03PS1) 10Effie Mouzeli: P:spicerack: Remove redis_sessions leftovers [puppet] - 10https://gerrit.wikimedia.org/r/868395
[12:50:44] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38808/console" [puppet] - 10https://gerrit.wikimedia.org/r/868148 (https://phabricator.wikimedia.org/T324623) (owner: 10Southparkfan)
[12:50:57] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+1] profile::spicerack: Stop writing Redis sessions data [puppet] - 10https://gerrit.wikimedia.org/r/868393 (https://phabricator.wikimedia.org/T267581) (owner: 10Muehlenhoff)
[12:51:12] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38809/console" [puppet] - 10https://gerrit.wikimedia.org/r/868148 (https://phabricator.wikimedia.org/T324623) (owner: 10Southparkfan)
[12:51:17] <wikibugs>	 (03Abandoned) 10Effie Mouzeli: P:spicerack: Remove redis_sessions leftovers [puppet] - 10https://gerrit.wikimedia.org/r/868395 (owner: 10Effie Mouzeli)
[12:51:48] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-stretch1001.eqiad.wmnet with reason: host reimage
[12:52:08] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] profile::spicerack: Stop writing Redis sessions data [puppet] - 10https://gerrit.wikimedia.org/r/868393 (https://phabricator.wikimedia.org/T267581) (owner: 10Muehlenhoff)
[12:52:19] <wikibugs>	 (03PS1) 10Volans: cumin: fix email address for insetup role audit [puppet] - 10https://gerrit.wikimedia.org/r/868396
[12:53:49] <wikibugs>	 (03PS1) 10Aqu: Fix systemd syntax in hadoop-namenode-backup-fetchimage [puppet] - 10https://gerrit.wikimedia.org/r/868397 (https://phabricator.wikimedia.org/T324850)
[12:56:33] <icinga-wm>	 PROBLEM - Check systemd state on mw1390 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:00:53] <wikibugs>	 (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38810/console" [puppet] - 10https://gerrit.wikimedia.org/r/865056 (https://phabricator.wikimedia.org/T298246) (owner: 10Hnowlan)
[13:02:06] <_joe_>	 effie: ^^ see the alert, we need to remove nutcracker from auto-restarts
[13:02:13] <_joe_>	 it *always* gets me
[13:03:11] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] "lgtm will merge" [puppet] - 10https://gerrit.wikimedia.org/r/868148 (https://phabricator.wikimedia.org/T324623) (owner: 10Southparkfan)
[13:03:24] <wikibugs>	 (03CR) 10Volans: "See comments inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/868077 (https://phabricator.wikimedia.org/T325153) (owner: 10Jbond)
[13:03:43] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/868396 (owner: 10Volans)
[13:04:08] <effie>	 _joe_: I have seen the alert, I wrote earlier that I need to mend something else first
[13:04:20] <effie>	 we are in the middle of a maps thing
[13:04:23] <effie>	 thank you !
[13:04:31] <_joe_>	 oh sorry 
[13:04:33] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/868088 (owner: 10Volans)
[13:04:46] <effie>	 too much noise in this channel :/
[13:05:15] <icinga-wm>	 PROBLEM - Check systemd state on mw1414 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:05:28] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: puppetmaster: git-sync-upstream: use the gitpuppet user for git operations [puppet] - 10https://gerrit.wikimedia.org/r/868400 (https://phabricator.wikimedia.org/T325280)
[13:06:03] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] tools-webservice: create /etc/toolforge/webservice.yaml with puppet [puppet] - 10https://gerrit.wikimedia.org/r/867911 (https://phabricator.wikimedia.org/T323689) (owner: 10Raymond Ndibe)
[13:06:22] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1001"
[13:07:53] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/868212 (https://phabricator.wikimedia.org/T290536) (owner: 10RLazarus)
[13:07:59] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[13:09:13] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[13:09:24] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] If-guard admin_ng objects no longer relevant for k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/868232 (https://phabricator.wikimedia.org/T299236) (owner: 10JMeybohm)
[13:11:20] <claime>	 hashar: vgutierrez, fyi about the latency alarm https://phabricator.wikimedia.org/T325277 I won't be making any more headway on it before next week tho, I'm ooo starting about now until monday
[13:11:33] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: puppetmaster: git-sync-upstream: use the gitpuppet user for git operations [puppet] - 10https://gerrit.wikimedia.org/r/868400 (https://phabricator.wikimedia.org/T325280)
[13:13:29] <icinga-wm>	 PROBLEM - Check systemd state on mw1394 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:14:15] <icinga-wm>	 RECOVERY - Check systemd state on mw1414 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:14:29] <icinga-wm>	 PROBLEM - Check systemd state on mw1462 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:15:22] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add initial support for webauthn [puppet] - 10https://gerrit.wikimedia.org/r/867576 (https://phabricator.wikimedia.org/T305518) (owner: 10Muehlenhoff)
[13:15:24] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] If-guard admin_ng objects no longer relevant for k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/868232 (https://phabricator.wikimedia.org/T299236) (owner: 10JMeybohm)
[13:19:11] <icinga-wm>	 PROBLEM - Check systemd state on mw1356 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:19:24] <hashar>	 claime: nice ;) enjoy the week-end!
[13:19:37] <claime>	 Thanks, I will :D
[13:20:34] <wikibugs>	 (03Merged) 10jenkins-bot: If-guard admin_ng objects no longer relevant for k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/868232 (https://phabricator.wikimedia.org/T299236) (owner: 10JMeybohm)
[13:20:50] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/868396 (owner: 10Volans)
[13:22:16] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "not tested but lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/868400 (https://phabricator.wikimedia.org/T325280) (owner: 10Arturo Borrero Gonzalez)
[13:23:55] <icinga-wm>	 PROBLEM - Check systemd state on mw2331 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:25:17] <icinga-wm>	 PROBLEM - Check systemd state on mw2372 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:25:19] <icinga-wm>	 PROBLEM - Check systemd state on parse2003 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:27:23] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] puppetmaster: git-sync-upstream: use the gitpuppet user for git operations (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868400 (https://phabricator.wikimedia.org/T325280) (owner: 10Arturo Borrero Gonzalez)
[13:33:20] <wikibugs>	 (03PS1) 10Effie Mouzeli: mediawiki: remove prometheus-nutcracker-exporter auto restart [puppet] - 10https://gerrit.wikimedia.org/r/868403
[13:34:39] <icinga-wm>	 PROBLEM - Check systemd state on mw1494 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:34:42] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1001"
[13:34:47] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-stretch1001.eqiad.wmnet with OS bullseye
[13:34:54] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host kafka-stretch1001.eqiad.wmnet with OS bullseye co...
[13:35:00] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] mediawiki: remove prometheus-nutcracker-exporter auto restart [puppet] - 10https://gerrit.wikimedia.org/r/868403 (owner: 10Effie Mouzeli)
[13:35:42] <wikibugs>	 (03CR) 10Clément Goubert: "Sorry, I should have caught the package and docker::credential removal issue when reviewing I37bd41014b77" [puppet] - 10https://gerrit.wikimedia.org/r/867221 (owner: 10Jaime Nuche)
[13:38:35] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Backing up HDFS FSImage to HDFS on Monday morning (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/866650 (https://phabricator.wikimedia.org/T324850) (owner: 10Aqu)
[13:39:24] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Grant ssh access to analytics-admins to mnz - https://phabricator.wikimedia.org/T325072 (10Ottomata) Approved.
[13:40:21] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & analytics-product-users for Hxi-ctr - https://phabricator.wikimedia.org/T325004 (10Ottomata) Approved for analytics-privatedata-users.  analytics-product-users approver could be @mpopov?
[13:41:35] <icinga-wm>	 PROBLEM - Check systemd state on mw2376 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:42:44] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Fix systemd syntax in hadoop-namenode-backup-fetchimage [puppet] - 10https://gerrit.wikimedia.org/r/868397 (https://phabricator.wikimedia.org/T324850) (owner: 10Aqu)
[13:44:25] <icinga-wm>	 PROBLEM - Check systemd state on mw2310 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:45:47] <wikibugs>	 (03PS2) 10Effie Mouzeli: mediawiki: remove prometheus-nutcracker-exporter auto restart [puppet] - 10https://gerrit.wikimedia.org/r/868403
[13:48:45] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10BTullis) kafka-stretch1001 worked ok with the new raid config.  I'm just going to rebuild kafka-stretch1002 because although the drives are in t...
[13:49:11] <icinga-wm>	 PROBLEM - Check systemd state on mw2295 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:49:31] <wikibugs>	 (03CR) 10Aqu: Backing up HDFS FSImage to HDFS on Monday morning (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/866650 (https://phabricator.wikimedia.org/T324850) (owner: 10Aqu)
[13:49:39] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/868403 (owner: 10Effie Mouzeli)
[13:49:55] <wikibugs>	 (03PS5) 10Muehlenhoff: Make puppetdb[12]003 puppetdb nodes [puppet] - 10https://gerrit.wikimedia.org/r/863255
[13:50:31] <icinga-wm>	 PROBLEM - Check systemd state on mw1361 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:51:09] <wikibugs>	 (03CR) 10Nikerabbit: [V: 03+2] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/868379 (owner: 10L10n-bot)
[13:51:25] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-stretch1002.eqiad.wmnet with OS bullseye
[13:51:32] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host kafka-stretch1002.eqiad.wmnet with OS bullseye
[13:53:19] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: ml-services: use the same image for revscoring models [deployment-charts] - 10https://gerrit.wikimedia.org/r/868407 (https://phabricator.wikimedia.org/T323586)
[13:56:14] <wikibugs>	 (03PS2) 10Ilias Sarantopoulos: ml-services: use the same image for revscoring models [deployment-charts] - 10https://gerrit.wikimedia.org/r/868407 (https://phabricator.wikimedia.org/T323586)
[14:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221215T1400)
[14:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, and awight: OwO what's this, a deployment window?? UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221215T1400). nyaa~
[14:00:05] <jouncebot>	 arlolra, PleaseStand, and sergi0: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:17] <sergi0>	 hello
[14:00:23] <PleaseStand>	 I'm also here
[14:01:07] <Lucas_WMDE>	 I’m here but only for the first half hour
[14:01:19] <wikibugs>	 (03PS2) 10Matthias Mullie: [SearchVue] Change protocol-less urls to https [mediawiki-config] - 10https://gerrit.wikimedia.org/r/868224 (https://phabricator.wikimedia.org/T324446)
[14:01:26] <wikibugs>	 (03PS3) 10Matthias Mullie: [beta] [SearchVue] Change protocol-less urls to https [mediawiki-config] - 10https://gerrit.wikimedia.org/r/868224 (https://phabricator.wikimedia.org/T324446)
[14:02:13] <Lucas_WMDE>	 arlolra doesn’t seem to be online yet, so I’ll start with PleaseStand
[14:03:20] <matthiasmullie>	 Lucas_WMDE: do you mind me merging this beta-only config patch? (LMK when appropriate so I don't interfere - don't want to cause confusion with an undeployed patch :p)
[14:03:32] <matthiasmullie>	 *this: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/868224
[14:03:44] <Lucas_WMDE>	 idk how scap backport will handle that so I’d prefer if you could wait ^^
[14:03:53] <arlolra_>	 o/
[14:04:12] <matthiasmullie>	 Lucas_WMDE: I can wait; LMK when you're done :p
[14:04:12] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/867307 (https://phabricator.wikimedia.org/T231412) (owner: 10PleaseStand)
[14:04:17] <Lucas_WMDE>	 matthiasmullie: ok, thanks
[14:04:34] <wikibugs>	 (03PS4) 10Lucas Werkmeister (WMDE): Remove obsolete setting $wgAutoloadAttemptLowercase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/867307 (https://phabricator.wikimedia.org/T231412) (owner: 10PleaseStand)
[14:04:40] <wikibugs>	 (03CR) 10TrainBranchBot: "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/867307 (https://phabricator.wikimedia.org/T231412) (owner: 10PleaseStand)
[14:04:40] <matthiasmullie>	 FYI - I encountered one last week; scap backport actually warns you before deploying that there's another patch
[14:04:48] <taavi>	 Lucas_WMDE: scap backport will work properly with beta only patches, so just merge and fetch but not sync
[14:05:25] <matthiasmullie>	 Ah, I can wait until the rest of the deployment work is done :)
[14:05:51] <wikibugs>	 (03PS3) 10Effie Mouzeli: mediawiki: remove prometheus-nutcracker-exporter auto restart [puppet] - 10https://gerrit.wikimedia.org/r/868403
[14:05:53] <wikibugs>	 (03Merged) 10jenkins-bot: Remove obsolete setting $wgAutoloadAttemptLowercase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/867307 (https://phabricator.wikimedia.org/T231412) (owner: 10PleaseStand)
[14:06:08] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:867307|Remove obsolete setting $wgAutoloadAttemptLowercase (T231412)]]
[14:06:13] <stashbot>	 T231412: Deprecate and remove $wgAutoloadAttemptLowercase - https://phabricator.wikimedia.org/T231412
[14:06:23] <icinga-wm>	 RECOVERY - Check systemd state on an-master1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:06:57] <Lucas_WMDE>	 taavi: I assume “just merge and fetch but not sync” is what would happen if I ran `scap backport`?
[14:07:11] <Lucas_WMDE>	 whereas matthiasmullie just wanted to +2 the patch himself IIUC ^^
[14:07:15] <icinga-wm>	 PROBLEM - Check systemd state on mw2298 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:07:49] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and ki: Backport for [[gerrit:867307|Remove obsolete setting $wgAutoloadAttemptLowercase (T231412)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
[14:08:14] <Lucas_WMDE>	 PleaseStand: is it possible to test this change?
[14:08:25] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Make puppetdb[12]003 puppetdb nodes [puppet] - 10https://gerrit.wikimedia.org/r/863255 (owner: 10Muehlenhoff)
[14:08:33] * Lucas_WMDE quickly checks that https://en.wikipedia.org/wiki/Special:Version doesn’t explode on mwdebug
[14:09:52] <PleaseStand>	 Lucas_WMDE: The removed config setting is obsolete, so there should be no visible change.
[14:10:57] <Lucas_WMDE>	 ok, thanks
[14:11:01] <Lucas_WMDE>	 syncing
[14:12:05] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] "\o/" [deployment-charts] - 10https://gerrit.wikimedia.org/r/868407 (https://phabricator.wikimedia.org/T323586) (owner: 10Ilias Sarantopoulos)
[14:12:10] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation: Q3: rack/setup/install dumpsdata100[67] - https://phabricator.wikimedia.org/T299443 (10Jclark-ctr) @Cmjohnson  reseated
[14:12:20] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] ml-services: remove translatewiki and frwikisource isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/867601 (https://phabricator.wikimedia.org/T324567) (owner: 10AikoChou)
[14:12:48] <wikibugs>	 (03CR) 10Effie Mouzeli: "pcc https://puppet-compiler.wmflabs.org/output/868403/38813/" [puppet] - 10https://gerrit.wikimedia.org/r/868403 (owner: 10Effie Mouzeli)
[14:12:53] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] mediawiki: remove prometheus-nutcracker-exporter auto restart [puppet] - 10https://gerrit.wikimedia.org/r/868403 (owner: 10Effie Mouzeli)
[14:14:01] <wikibugs>	 (03PS1) 10Jgiannelos: Exclude OSM tag that causes a failing import [puppet] - 10https://gerrit.wikimedia.org/r/868415 (https://phabricator.wikimedia.org/T325293)
[14:14:20] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Exclude OSM tag that causes a failing import [puppet] - 10https://gerrit.wikimedia.org/r/868415 (https://phabricator.wikimedia.org/T325293) (owner: 10Jgiannelos)
[14:14:50] <wikibugs>	 (03PS2) 10Jgiannelos: Exclude OSM tag that causes a failing import [puppet] - 10https://gerrit.wikimedia.org/r/868415 (https://phabricator.wikimedia.org/T325293)
[14:14:51] <icinga-wm>	 PROBLEM - Check systemd state on mw1440 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:15:36] <wikibugs>	 (03PS2) 10Lucas Werkmeister (WMDE): Disable wgParserEnableLegacyMediaDOM on cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866653 (https://phabricator.wikimedia.org/T297984) (owner: 10Arlolra)
[14:16:19] <wikibugs>	 (03CR) 10Jgiannelos: "MSantos:" [puppet] - 10https://gerrit.wikimedia.org/r/868415 (https://phabricator.wikimedia.org/T325293) (owner: 10Jgiannelos)
[14:17:06] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:867307|Remove obsolete setting $wgAutoloadAttemptLowercase (T231412)]] (duration: 10m 57s)
[14:17:10] <stashbot>	 T231412: Deprecate and remove $wgAutoloadAttemptLowercase - https://phabricator.wikimedia.org/T231412
[14:17:28] <wikibugs>	 (03PS1) 10Jbond: Wmflib: add port range type validator [puppet] - 10https://gerrit.wikimedia.org/r/868416
[14:17:39] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866653 (https://phabricator.wikimedia.org/T297984) (owner: 10Arlolra)
[14:17:47] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] Wmflib: add port range type validator [puppet] - 10https://gerrit.wikimedia.org/r/868416 (owner: 10Jbond)
[14:17:51] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Wmflib: add port range type validator [puppet] - 10https://gerrit.wikimedia.org/r/868416 (owner: 10Jbond)
[14:18:05] <Lucas_WMDE>	 sergi0: I don’t think I’ll have time to deploy your backport before I have to go into a meeting, sorry
[14:18:11] <Lucas_WMDE>	 maybe someone else is around who can deploy them
[14:18:16] <Lucas_WMDE>	 (*backports, plural)
[14:18:18] <wikibugs>	 (03Merged) 10jenkins-bot: Disable wgParserEnableLegacyMediaDOM on cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866653 (https://phabricator.wikimedia.org/T297984) (owner: 10Arlolra)
[14:18:24] <wikibugs>	 (03PS2) 10Jbond: Wmflib: add port range type validator [puppet] - 10https://gerrit.wikimedia.org/r/868416
[14:18:31] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:866653|Disable wgParserEnableLegacyMediaDOM on cawiki (T297984 T314318)]]
[14:18:36] <stashbot>	 T297984: Media html read view considerations - https://phabricator.wikimedia.org/T297984
[14:18:36] <stashbot>	 T314318: Disable wgParserEnableLegacyMediaDOM on all wikis - https://phabricator.wikimedia.org/T314318
[14:18:53] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Wmflib: add port range type validator [puppet] - 10https://gerrit.wikimedia.org/r/868416 (owner: 10Jbond)
[14:20:04] <wikibugs>	 (03PS3) 10Jbond: Wmflib: add port range type validator [puppet] - 10https://gerrit.wikimedia.org/r/868416
[14:20:15] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and arlolra: Backport for [[gerrit:866653|Disable wgParserEnableLegacyMediaDOM on cawiki (T297984 T314318)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet
[14:20:20] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] Wmflib: add port range type validator [puppet] - 10https://gerrit.wikimedia.org/r/868416 (owner: 10Jbond)
[14:20:32] <Lucas_WMDE>	 arlolra_: can you test the cawiki change on mwdebug?
[14:20:40] <arlolra_>	 Yup, one sec
[14:21:55] <arlolra_>	 Looks good
[14:21:59] <Lucas_WMDE>	 ok thanks
[14:23:25] <icinga-wm>	 PROBLEM - Check systemd state on mw2399 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:23:30] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudbackup1003.eqiad.wmnet
[14:24:28] <sergi0>	 Lucas_WMDE: no worries, I'll wait and see if someone is around, and re-schedule for later if not, ty
[14:24:43] <icinga-wm>	 PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:25:29] <Lucas_WMDE>	 hmm, the deployment calendar still has most of the non-train windows for the next week, afaict
[14:25:30] <Reedy>	 What is left to deploy?
[14:25:38] <Lucas_WMDE>	 but the yearly calendar says “no train or deploys”
[14:25:43] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudweb2002-dev.wikimedia.org
[14:25:47] <Lucas_WMDE>	 Reedy: sergi0’s two backports for GrowthExperiments
[14:26:01] <Lucas_WMDE>	 also matthiasmullie still needs to merge a config-only change
[14:26:10] <Lucas_WMDE>	 (php-fpm-restart at 40% on my end btw)
[14:26:25] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[14:26:52] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-stretch1002.eqiad.wmnet with reason: host reimage
[14:27:15] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[14:27:49] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:866653|Disable wgParserEnableLegacyMediaDOM on cawiki (T297984 T314318)]] (duration: 09m 18s)
[14:27:54] <stashbot>	 T297984: Media html read view considerations - https://phabricator.wikimedia.org/T297984
[14:27:54] <stashbot>	 T314318: Disable wgParserEnableLegacyMediaDOM on all wikis - https://phabricator.wikimedia.org/T314318
[14:28:06] <Lucas_WMDE>	 matthiasmullie: I’m done, you can merge your change
[14:28:25] <wikibugs>	 (03PS4) 10Matthias Mullie: [beta] [SearchVue] Change protocol-less urls to https [mediawiki-config] - 10https://gerrit.wikimedia.org/r/868224 (https://phabricator.wikimedia.org/T324446)
[14:28:29] <matthiasmullie>	 thanks
[14:29:09] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by mlitn@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/868224 (https://phabricator.wikimedia.org/T324446) (owner: 10Matthias Mullie)
[14:29:58] <wikibugs>	 (03PS1) 10Bking: wdqs: add nofail to NFS mount options [puppet] - 10https://gerrit.wikimedia.org/r/868418 (https://phabricator.wikimedia.org/T325114)
[14:30:00] <wikibugs>	 (03Merged) 10jenkins-bot: [beta] [SearchVue] Change protocol-less urls to https [mediawiki-config] - 10https://gerrit.wikimedia.org/r/868224 (https://phabricator.wikimedia.org/T324446) (owner: 10Matthias Mullie)
[14:30:27] <icinga-wm>	 PROBLEM - Check systemd state on mw1389 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:30:29] <matthiasmullie>	 I'm done; anyone else?
[14:30:33] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudbackup1003.eqiad.wmnet
[14:30:38] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudbackup1004.eqiad.wmnet
[14:31:03] <sergi0>	 o/
[14:31:25] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-stretch1002.eqiad.wmnet with reason: host reimage
[14:31:46] <sergi0>	 Is anyone around who can help with the remaining GrowthExperiments patches?
[14:32:40] <Lucas_WMDE>	 jouncebot: next
[14:32:40] <jouncebot>	 In 2 hour(s) and 27 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221215T1700)
[14:32:55] <Reedy>	 Yeah, I'm just having a look
[14:32:56] <Lucas_WMDE>	 I might be able to deploy later (since there’ll be some time before the puppet window)
[14:33:00] <Lucas_WMDE>	 ok
[14:33:24] <Reedy>	 sergi0: I presume both can go out together?
[14:33:38] <sergi0>	 that's right
[14:33:40] <Reedy>	 (makes it quicker for everyone)
[14:33:49] <arlolra_>	 Lucas_WMDE: thanks
[14:34:23] <Reedy>	 Hmm
[14:34:32] <Reedy>	 They've not actually merged into master (but Gergo has +2'd them)
[14:35:40] <wikibugs>	 (03PS2) 10Reedy: User impact: instrument info tooltip clicks [extensions/GrowthExperiments] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868064 (https://phabricator.wikimedia.org/T325041) (owner: 10Sergio Gimeno)
[14:35:44] <wikibugs>	 (03CR) 10Reedy: [C: 03+2] User impact: instrument info tooltip clicks [extensions/GrowthExperiments] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868064 (https://phabricator.wikimedia.org/T325041) (owner: 10Sergio Gimeno)
[14:35:48] <wikibugs>	 (03CR) 10Reedy: [C: 03+2] Vue components: react to binding updates of v-click-outside directive [extensions/GrowthExperiments] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868063 (https://phabricator.wikimedia.org/T325041) (owner: 10Sergio Gimeno)
[14:36:01] <Reedy>	 just rebase to stack them to get rid of the merge commit chance
[14:37:27] <icinga-wm>	 PROBLEM - Check systemd state on mw1379 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-nutcracker-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:38:02] <wikibugs>	 (03PS1) 10JMeybohm: Update cert-manager to 1.10.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/868422 (https://phabricator.wikimedia.org/T325292)
[14:38:03] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudbackup1004.eqiad.wmnet
[14:38:31] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[14:39:25] <sergi0>	 Reedy: sorry, I didn't get your last comment. They show up to date with wmf.14 branch. Do you mean I need to rebase the master changes?
[14:39:28] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & analytics-product-users for Hxi-ctr - https://phabricator.wikimedia.org/T325004 (10mpopov) Yes, please list myself and @kzimmerman as approvers for analytics-product-users.  And approved :)
[14:40:53] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[14:42:11] <icinga-wm>	 PROBLEM - Disk space on puppetdb2003 is CRITICAL: DISK CRITICAL - /run/credentials/systemd-sysctl.service is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=puppetdb2003&var-datasource=codfw+prometheus/ops
[14:43:59] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudweb1004.wikimedia.org
[14:45:30] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-stretch1002.eqiad.wmnet with OS bullseye
[14:45:30] <Reedy>	 sergi0: The changes they are cherry picked from (ie in master) weren't merged into master
[14:45:35] <Reedy>	 Even though Gergo +2'd them hours ago
[14:45:37] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host kafka-stretch1002.eqiad.wmnet with OS bullseye co...
[14:45:47] <Reedy>	 I've just rebased them and reapplied +2s and they seem to be going through the gate
[14:46:13] <Reedy>	 (While it's not always the case, it's a usual expectation that cherry picks to deployment branches would have been reviewed and merged into master first)
[14:46:14] <sergi0>	 oh, got you. Thank you
[14:46:54] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] OpenStack nova: add a default for profile::openstack::base::nova::instance_dev [puppet] - 10https://gerrit.wikimedia.org/r/868167 (owner: 10Andrew Bogott)
[14:47:05] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10BTullis)
[14:47:38] <wikibugs>	 (03CR) 10Andrew Bogott: "This breaks puppet runs on wikitech hosts" [puppet] - 10https://gerrit.wikimedia.org/r/868403 (owner: 10Effie Mouzeli)
[14:47:47] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] OpenStack nova: add a default for profile::openstack::base::nova::instance_dev [puppet] - 10https://gerrit.wikimedia.org/r/868167 (owner: 10Andrew Bogott)
[14:51:37] <wikibugs>	 (03PS1) 10Andrew Bogott: Revert "mediawiki: remove prometheus-nutcracker-exporter auto restart" [puppet] - 10https://gerrit.wikimedia.org/r/868354
[14:51:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:53:23] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Revert "mediawiki: remove prometheus-nutcracker-exporter auto restart" [puppet] - 10https://gerrit.wikimedia.org/r/868354 (owner: 10Andrew Bogott)
[14:54:04] <wikibugs>	 (03PS1) 10Hashar: wm-checks-api: add support for Puppet Catalogue Compiler [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/868424
[14:54:14] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10BTullis) 05Open→03Resolved I think that these are all done now.
[14:56:06] <wikibugs>	 (03PS2) 10Andrew Bogott: Revert "mediawiki: remove prometheus-nutcracker-exporter auto restart" [puppet] - 10https://gerrit.wikimedia.org/r/868354
[14:57:17] <Reedy>	 CI is nearly done...
[14:58:10] <wikibugs>	 (03CR) 10Hashar: "Hi John, this patch is for the Gerrit Checks API UI enhancement which I have deployed this week. The code should be able to recognize mess" [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/868424 (owner: 10Hashar)
[14:58:38] <wikibugs>	 (03Merged) 10jenkins-bot: Vue components: react to binding updates of v-click-outside directive [extensions/GrowthExperiments] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868063 (https://phabricator.wikimedia.org/T325041) (owner: 10Sergio Gimeno)
[14:58:41] <wikibugs>	 (03Merged) 10jenkins-bot: User impact: instrument info tooltip clicks [extensions/GrowthExperiments] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868064 (https://phabricator.wikimedia.org/T325041) (owner: 10Sergio Gimeno)
[14:58:55] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (10BTullis) Now kafka-stretch2001 is the only one of these four kafka-stretch hosts left with the drive order reversed. ` btullis@cumin1001:~$ sudo...
[14:59:36] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-stretch2001.codfw.wmnet with OS bullseye
[14:59:43] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host kafka-stretch2001.codfw.wmnet with OS bullseye
[14:59:48] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: puppetmaster: git-sync-upstream: use the gitpuppet user for git operations [puppet] - 10https://gerrit.wikimedia.org/r/868400 (https://phabricator.wikimedia.org/T325280)
[15:00:13] <wikibugs>	 (03PS4) 10Arturo Borrero Gonzalez: puppetmaster: git-sync-upstream: use the gitpuppet user for git operations [puppet] - 10https://gerrit.wikimedia.org/r/868400 (https://phabricator.wikimedia.org/T325280)
[15:00:27] <Reedy>	 yay
[15:01:27] <Reedy>	 sergi0: Do you need/want to test them on a staging host first? Or not fussed if they just go straight out?
[15:02:10] <wikibugs>	 (03PS2) 10JMeybohm: Update cert-manager to 1.10.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/868422 (https://phabricator.wikimedia.org/T325292)
[15:02:18] <sergi0>	 I think they can go straight out, I'll check that there aren't any invalid event requests afterwards
[15:02:37] <wikibugs>	 (03CR) 10Jbond: "lgtm see comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney)
[15:05:02] <moritzm>	 !log imported prometheus-jmx-exporter for bookworm-wikimedia T321783
[15:05:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:05:06] <stashbot>	 T321783: Setup an initial bookworm host pair with Puppetdb 7 - https://phabricator.wikimedia.org/T321783
[15:05:11] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[15:06:14] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[15:06:39] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[15:06:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:10:39] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: php-multiversion-base: add sendmail [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/868428 (https://phabricator.wikimedia.org/T325131)
[15:12:11] <wikibugs>	 (03PS5) 10Jbond: sre.hosts.reboot-single: add ability to enable host on reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/868077 (https://phabricator.wikimedia.org/T325153)
[15:12:15] <wikibugs>	 (03CR) 10Elukey: "LGTM! Left a little nit for the build's changelog, the rest looks good." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/868422 (https://phabricator.wikimedia.org/T325292) (owner: 10JMeybohm)
[15:12:23] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[15:12:45] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[15:15:03] <logmsgbot>	 !log reedy@deploy1002 Synchronized php-1.40.0-wmf.14/extensions/GrowthExperiments/: Two backports (duration: 06m 57s)
[15:15:11] <Reedy>	 sergi0: ^^
[15:15:30] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[15:15:44] <sergi0>	 Reedy: is it synced?
[15:15:57] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[15:16:14] <Reedy>	 yeah
[15:16:47] <sergi0>	 Reedy: All good. I don't see errors.
[15:17:01] <wikibugs>	 (03PS6) 10Jbond: sre.hosts.reboot-single: add ability to enable host on reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/868077 (https://phabricator.wikimedia.org/T325153)
[15:17:03] <wikibugs>	 (03PS1) 10Jbond: sre.hosts.reboot-single: Simplify icinga logic [cookbooks] - 10https://gerrit.wikimedia.org/r/868430 (https://phabricator.wikimedia.org/T325153)
[15:17:48] <sergi0>	 Reedy: thank you very much!
[15:17:53] <wikibugs>	 (03CR) 10Jbond: "thanks" [cookbooks] - 10https://gerrit.wikimedia.org/r/868077 (https://phabricator.wikimedia.org/T325153) (owner: 10Jbond)
[15:18:00] * Lucas_WMDE back
[15:18:03] <Lucas_WMDE>	 just too late, it seems ^^
[15:18:08] <Lucas_WMDE>	 thanks Reedy!
[15:18:52] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Setup an initial bookworm host pair with Puppetdb 7 - https://phabricator.wikimedia.org/T321783 (10jbond) [[ https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1026163 | upstream bug re java 11 ]]
[15:18:59] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:22:30] <wikibugs>	 (03CR) 10Ottomata: flink-kubernetes-operator - modify for WMF and add an admin_ng helmfile (036 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[15:22:30] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[15:23:16] <wikibugs>	 (03CR) 10Ottomata: flink-kubernetes-operator - modify for WMF and add an admin_ng helmfile (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[15:23:59] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:24:03] <wikibugs>	 (03PS19) 10Ottomata: flink-kubernetes-operator - modify for WMF and add an admin_ng helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576)
[15:24:05] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM durum1001.eqiad.wmnet
[15:24:48] <wikibugs>	 (03CR) 10JMeybohm: "Another thing I forgot (sorry): For prod, you currently need support for overwriting k8s apiserver env variables, like https://gerrit.wiki" [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[15:24:52] <wikibugs>	 (03PS9) 10Ottomata: flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576)
[15:24:57] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[15:25:41] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[15:28:07] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:28:43] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:29:53] <sukhe>	 ^ please ignore these, durum reboots in progress. they should resolve.
[15:29:56] <sukhe>	 10.64.48.95              Down                     0.900     2.000        3   
[15:29:59] <sukhe>	 this is durum1001
[15:30:39] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[15:31:43] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[15:32:55] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/868087 (https://phabricator.wikimedia.org/T325168) (owner: 10Volans)
[15:34:40] <volans>	 !log temporarily disabling puppet on A:cumin-all to deploy spicerack v6.0.0
[15:34:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:37:28] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[15:37:56] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[15:38:03] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/868377 (https://phabricator.wikimedia.org/T325168) (owner: 10Volans)
[15:38:12] <wikibugs>	 (03PS3) 10Volans: cookbooks.sre: add title for the group [cookbooks] - 10https://gerrit.wikimedia.org/r/868088
[15:39:16] <wikibugs>	 (03PS1) 10Jbond: nginx: let puppet pick the correct provider [puppet] - 10https://gerrit.wikimedia.org/r/868431 (https://phabricator.wikimedia.org/T321783)
[15:40:08] <wikibugs>	 (03CR) 10Volans: [C: 03+2] cumin: fix email address for insetup role audit [puppet] - 10https://gerrit.wikimedia.org/r/868396 (owner: 10Volans)
[15:41:16] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs2009.codfw.wmnet
[15:41:57] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/868431 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond)
[15:42:20] <wikibugs>	 (03CR) 10DCausse: [C: 03+1] wdqs: add nofail to NFS mount options [puppet] - 10https://gerrit.wikimedia.org/r/868418 (https://phabricator.wikimedia.org/T325114) (owner: 10Bking)
[15:43:34] <icinga-wm>	 RECOVERY - Disk space on puppetdb2003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=puppetdb2003&var-datasource=codfw+prometheus/ops
[15:44:08] <icinga-wm>	 PROBLEM - Host durum1001 is DOWN: PING CRITICAL - Packet loss = 100%
[15:44:53] <sukhe>	 it gave up!
[15:45:54] <volans>	 nic renamed?
[15:47:10] <sukhe>	 not sure, but unlikely I guess given I have had no issues in the previous reboots. looking!
[15:48:09] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38815/console" [puppet] - 10https://gerrit.wikimedia.org/r/868431 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond)
[15:51:22] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: mediawiki: add configuration for sendmail [deployment-charts] - 10https://gerrit.wikimedia.org/r/868432 (https://phabricator.wikimedia.org/T325131)
[15:52:22] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/868430 (https://phabricator.wikimedia.org/T325153) (owner: 10Jbond)
[15:52:37] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/868418 (https://phabricator.wikimedia.org/T325114) (owner: 10Bking)
[15:54:24] <wikibugs>	 (03PS10) 10Ottomata: flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576)
[15:54:45] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-stretch2001.codfw.wmnet with reason: host reimage
[15:55:13] <wikibugs>	 (03CR) 10Ottomata: flink-kubernetes-operator - modify for WMF and add an admin_ng helmfile (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[15:55:15] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[15:55:52] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, modulo the decision on where to put the icinga check." [cookbooks] - 10https://gerrit.wikimedia.org/r/868077 (https://phabricator.wikimedia.org/T325153) (owner: 10Jbond)
[15:57:06] <wikibugs>	 (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/868378 (https://phabricator.wikimedia.org/T325168) (owner: 10Volans)
[15:57:34] <moritzm>	 !log installing openexr security updates
[15:57:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:57:47] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-stretch2001.codfw.wmnet with reason: host reimage
[16:00:13] <wikibugs>	 (03PS1) 10RobH: updating for gen 15 r650xs [software] - 10https://gerrit.wikimedia.org/r/868435
[16:00:16] <wikibugs>	 (03CR) 10Bking: [C: 03+2] wdqs: add nofail to NFS mount options [puppet] - 10https://gerrit.wikimedia.org/r/868418 (https://phabricator.wikimedia.org/T325114) (owner: 10Bking)
[16:00:19] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] updating for gen 15 r650xs [software] - 10https://gerrit.wikimedia.org/r/868435 (owner: 10RobH)
[16:01:16] <wikibugs>	 (03Abandoned) 10RobH: updating for gen 15 r650xs [software] - 10https://gerrit.wikimedia.org/r/868435 (owner: 10RobH)
[16:02:09] <nemo-yiannis>	 !log switching maps/kartotherian back to codfw 
[16:02:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:02:51] <moritzm>	 !log installing glibc security updates on bullseye
[16:02:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:02:59] <logmsgbot>	 !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-stretch2001.codfw.wmnet with OS bullseye
[16:03:06] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host kafka-stretch2001.codfw.wmnet with OS bullseye ex...
[16:03:54] <wikibugs>	 (03PS1) 10RobH: updating R650xs skus [software] - 10https://gerrit.wikimedia.org/r/868436
[16:04:02] <wikibugs>	 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, 10Platform Team Workboards (Platform Engineering Reliability): byte/str TypeError during svg conversion - https://phabricator.wikimedia.org/T325150 (10Vlad.shapik) a:05hnowlan→03Vlad.shapik
[16:04:09] <wikibugs>	 (03CR) 10RobH: [C: 03+2] updating R650xs skus [software] - 10https://gerrit.wikimedia.org/r/868436 (owner: 10RobH)
[16:04:37] <wikibugs>	 (03Merged) 10jenkins-bot: updating R650xs skus [software] - 10https://gerrit.wikimedia.org/r/868436 (owner: 10RobH)
[16:04:56] <wikibugs>	 (03PS20) 10Ottomata: flink-kubernetes-operator - modify for WMF and add an admin_ng helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576)
[16:06:57] <wikibugs>	 (03PS11) 10Ottomata: flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576)
[16:07:56] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Revert "mediawiki: remove prometheus-nutcracker-exporter auto restart" [puppet] - 10https://gerrit.wikimedia.org/r/868354 (owner: 10Andrew Bogott)
[16:07:58] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[16:12:35] <wikibugs>	 (03PS1) 10Muehlenhoff: elasticsearch: Enable profile::auto_restarts::service for Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/868438 (https://phabricator.wikimedia.org/T135991)
[16:14:25] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10Ottomata) WOWWW THANK YOU!
[16:15:33] <jinxer-wm>	 (KeyholderUnarmed) firing: (2) 1 unarmed Keyholder key(s) on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[16:16:09] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/868438 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[16:17:32] <logmsgbot>	 !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host wdqs2009.codfw.wmnet
[16:18:01] <icinga-wm>	 RECOVERY - Check systemd state on mw1351 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:01] <icinga-wm>	 RECOVERY - Check systemd state on mw1475 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:05] <icinga-wm>	 RECOVERY - Check systemd state on mw1402 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:05] <icinga-wm>	 RECOVERY - Check systemd state on mw1379 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:07] <icinga-wm>	 RECOVERY - Check systemd state on mw1423 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:09] <icinga-wm>	 RECOVERY - Check systemd state on mw1464 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:09] <icinga-wm>	 RECOVERY - Check systemd state on mw1494 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:09] <icinga-wm>	 RECOVERY - Check systemd state on mw1361 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:13] <icinga-wm>	 RECOVERY - Check systemd state on mw1447 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:13] <icinga-wm>	 RECOVERY - Check systemd state on mw1482 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:15] <icinga-wm>	 RECOVERY - Check systemd state on mw1415 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:15] <icinga-wm>	 RECOVERY - Check systemd state on parse1018 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:19] <icinga-wm>	 RECOVERY - Check systemd state on mw1435 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:19] <icinga-wm>	 RECOVERY - Check systemd state on mw1449 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:19] <icinga-wm>	 RECOVERY - Check systemd state on mw1439 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:25] <icinga-wm>	 RECOVERY - Check systemd state on mwdebug1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:25] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to Turnilo for USER:wfan - https://phabricator.wikimedia.org/T324057 (10jhathaway) @AnnWF sorry I missed adding you to the wmf group as well, try now please!
[16:18:29] <icinga-wm>	 RECOVERY - Check systemd state on mw2382 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:29] <icinga-wm>	 RECOVERY - Check systemd state on mw2374 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:31] <icinga-wm>	 RECOVERY - Check systemd state on mw1390 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:31] <icinga-wm>	 RECOVERY - Check systemd state on mw1440 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:31] <icinga-wm>	 RECOVERY - Check systemd state on parse1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:33] <icinga-wm>	 RECOVERY - Check systemd state on mw2295 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:34] <icinga-wm>	 RECOVERY - Check systemd state on parse2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:35] <icinga-wm>	 RECOVERY - Check systemd state on mw2286 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:39] <icinga-wm>	 RECOVERY - Check systemd state on mw2371 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:39] <icinga-wm>	 RECOVERY - Check systemd state on mw1394 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:39] <icinga-wm>	 RECOVERY - Check systemd state on mw1497 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:41] <icinga-wm>	 RECOVERY - Check systemd state on mw1450 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:43] <icinga-wm>	 RECOVERY - Check systemd state on mw2269 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:45] <icinga-wm>	 RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:45] <icinga-wm>	 RECOVERY - Check systemd state on mw2387 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:49] <icinga-wm>	 RECOVERY - Check systemd state on mw1448 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:49] <icinga-wm>	 RECOVERY - Check systemd state on mwdebug2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:49] <icinga-wm>	 RECOVERY - Check systemd state on mw1474 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:49] <icinga-wm>	 RECOVERY - Check systemd state on mw2388 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:49] <icinga-wm>	 RECOVERY - Check systemd state on mw1401 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:51] <icinga-wm>	 RECOVERY - Check systemd state on mw1418 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:51] <icinga-wm>	 RECOVERY - Check systemd state on mw1416 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:53] <icinga-wm>	 RECOVERY - Check systemd state on mw1443 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:53] <icinga-wm>	 RECOVERY - Check systemd state on parse1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:55] <icinga-wm>	 RECOVERY - Check systemd state on mw2372 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:59] <icinga-wm>	 RECOVERY - Check systemd state on parse2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:59] <icinga-wm>	 RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:59] <icinga-wm>	 RECOVERY - Check systemd state on mw2310 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:19:01] <icinga-wm>	 RECOVERY - Check systemd state on mw2331 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:19:01] <icinga-wm>	 RECOVERY - Check systemd state on mw2350 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:19:05] <icinga-wm>	 RECOVERY - Check systemd state on mw1454 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:19:06] <icinga-wm>	 RECOVERY - Check systemd state on parse2018 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:19:07] <icinga-wm>	 RECOVERY - Check systemd state on mw1404 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:19:07] <icinga-wm>	 RECOVERY - Check systemd state on mw1436 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:19:07] <icinga-wm>	 RECOVERY - Check systemd state on mw1378 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:19:07] <icinga-wm>	 RECOVERY - Check systemd state on mw2272 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:19:07] <icinga-wm>	 RECOVERY - Check systemd state on mw2311 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:19:15] <icinga-wm>	 RECOVERY - Check systemd state on mw2316 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:19:15] <icinga-wm>	 RECOVERY - Check systemd state on mw1356 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:19:15] <icinga-wm>	 RECOVERY - Check systemd state on mw2271 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:19:15] <icinga-wm>	 RECOVERY - Check systemd state on mw2406 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:19:19] <icinga-wm>	 RECOVERY - Check systemd state on mw1389 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:19:25] <icinga-wm>	 RECOVERY - Check systemd state on mw1489 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:19:25] <icinga-wm>	 RECOVERY - Check systemd state on parse2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:19:29] <icinga-wm>	 RECOVERY - Check systemd state on mw2259 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:19:33] <icinga-wm>	 RECOVERY - Check systemd state on mwdebug2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:19:35] <icinga-wm>	 RECOVERY - Check systemd state on mw2399 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:19:37] <icinga-wm>	 RECOVERY - Check systemd state on mw2273 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:19:43] <icinga-wm>	 RECOVERY - Check systemd state on mw2298 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:19:43] <icinga-wm>	 RECOVERY - Check systemd state on mw2276 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:19:45] <icinga-wm>	 RECOVERY - Check systemd state on mw2367 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:19:45] <icinga-wm>	 RECOVERY - Check systemd state on mw2376 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:21:55] <icinga-wm>	 RECOVERY - Check systemd state on mw1417 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:21:55] <icinga-wm>	 RECOVERY - Check systemd state on mw2385 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:22:33] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudweb1004.wikimedia.org
[16:22:36] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.ganeti.reboot-vm (exit_code=97) for VM durum1001.eqiad.wmnet
[16:23:27] <icinga-wm>	 RECOVERY - Check systemd state on mw1441 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:23:37] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudweb1003.wikimedia.org
[16:23:51] <wikibugs>	 (03CR) 10Volans: [C: 03+2] spicerack: update config for v6.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/868377 (https://phabricator.wikimedia.org/T325168) (owner: 10Volans)
[16:24:34] <volans>	 !log upgrading spicerack to v6.0.0 on cumin2002
[16:24:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:24:42] <wikibugs>	 (03PS1) 10David Caro: toolforge: add missing /etc/toolforge dir [puppet] - 10https://gerrit.wikimedia.org/r/868439
[16:25:03] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] toolforge: add missing /etc/toolforge dir [puppet] - 10https://gerrit.wikimedia.org/r/868439 (owner: 10David Caro)
[16:25:18] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudweb2002-dev.wikimedia.org
[16:26:16] <wikibugs>	 (03CR) 10Volans: [C: 03+2] cookbooks: remove top-level __init__.py [cookbooks] - 10https://gerrit.wikimedia.org/r/868087 (https://phabricator.wikimedia.org/T325168) (owner: 10Volans)
[16:26:31] <icinga-wm>	 RECOVERY - Check systemd state on mw1431 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:26:32] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] puppetmasters: cache cleanup [puppet] - 10https://gerrit.wikimedia.org/r/866644 (owner: 10Andrew Bogott)
[16:26:33] <icinga-wm>	 RECOVERY - Check systemd state on mwdebug1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:26:39] <wikibugs>	 (03PS1) 10Bking: wdqs: allow mounting clouddumps share from wdqs2009 [puppet] - 10https://gerrit.wikimedia.org/r/868440 (https://phabricator.wikimedia.org/T323096)
[16:27:41] <wikibugs>	 (03PS2) 10David Caro: toolforge: add missing /etc/toolforge dir [puppet] - 10https://gerrit.wikimedia.org/r/868439
[16:27:59] <icinga-wm>	 RECOVERY - Check systemd state on mw1462 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:28:00] <wikibugs>	 (03Merged) 10jenkins-bot: cookbooks: remove top-level __init__.py [cookbooks] - 10https://gerrit.wikimedia.org/r/868087 (https://phabricator.wikimedia.org/T325168) (owner: 10Volans)
[16:28:01] <icinga-wm>	 RECOVERY - Check systemd state on mw2356 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:28:01] <icinga-wm>	 RECOVERY - Check systemd state on mw2401 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:28:16] <wikibugs>	 (03PS2) 10Bking: wdqs: allow mounting clouddumps share from wdqs2009 [puppet] - 10https://gerrit.wikimedia.org/r/868440 (https://phabricator.wikimedia.org/T323096)
[16:29:11] <wikibugs>	 (03CR) 10Volans: [C: 03+2] cookbooks.sre: add title for the group [cookbooks] - 10https://gerrit.wikimedia.org/r/868088 (owner: 10Volans)
[16:30:45] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudweb1003.wikimedia.org
[16:31:28] <wikibugs>	 (03Merged) 10jenkins-bot: cookbooks.sre: add title for the group [cookbooks] - 10https://gerrit.wikimedia.org/r/868088 (owner: 10Volans)
[16:34:22] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.hosts.downtime for 0:05:00 on cumin2002.codfw.wmnet with reason: test spicerack v6.0.0
[16:34:37] <logmsgbot>	 !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:05:00 on cumin2002.codfw.wmnet with reason: test spicerack v6.0.0
[16:35:57] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] toolforge: add missing /etc/toolforge dir [puppet] - 10https://gerrit.wikimedia.org/r/868439 (owner: 10David Caro)
[16:36:37] <wikibugs>	 (03PS1) 10DLynch: Release new DiscussionTools reply button enhancement to Arabic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/868441 (https://phabricator.wikimedia.org/T323537)
[16:36:48] <volans>	 !log upgrading spicerack to v6.0.0 on cumin1001
[16:36:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:37:49] <wikibugs>	 10SRE, 10serviceops: kubernetes102[34] implemetation tracking - https://phabricator.wikimedia.org/T313874 (10akosiaris)
[16:40:12] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 04-1] "Thanks for the review!  I'll get working on those bits and upload a new patchset cheers." [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney)
[16:42:02] <logmsgbot>	 !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@15e6aa7] (codfw): Revert "codfw: Disable traffic mirroring"
[16:43:46] <logmsgbot>	 !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@15e6aa7] (codfw): Revert "codfw: Disable traffic mirroring" (duration: 01m 44s)
[16:43:56] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (10BTullis) OK, I think that both of these two hosts are set up correctly now. The failure in the cookbook above was only a delayed install of `per...
[16:44:36] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] toolforge: add missing /etc/toolforge dir [puppet] - 10https://gerrit.wikimedia.org/r/868439 (owner: 10David Caro)
[16:45:55] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-jumbo1010.eqiad.wmnet with OS bullseye
[16:46:02] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host kafka-jumbo1010...
[16:49:32] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/868440 (https://phabricator.wikimedia.org/T323096) (owner: 10Bking)
[16:50:29] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus3001.esams.wmnet
[16:50:29] <wikibugs>	 (03PS1) 10AikoChou: ml-services: update revertrisk docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/868442 (https://phabricator.wikimedia.org/T325218)
[16:51:58] <wikibugs>	 (03PS2) 10AikoChou: ml-services: update revertrisk docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/868442 (https://phabricator.wikimedia.org/T325218)
[16:52:45] <mutante>	 !log aphlict1001 - rebooting - this could mean for a minute there are dropped notifications about Phabricator tickets
[16:52:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:54:26] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash2002.codfw.wmnet with OS bullseye
[16:55:03] <icinga-wm>	 PROBLEM - Check systemd state on aphlict1001 is CRITICAL: CRITICAL - degraded: The following units failed: aphlict.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:55:57] <mutante>	 well, machine is back, service is not :p as g.odog would say: sadtrombone.wav
[16:56:28] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus3001.esams.wmnet
[16:57:40] <mutante>	 ok, had to delete pid file manually and restart it
[16:57:51] <mutante>	 did not properly shut down on reboot
[16:58:03] <wikibugs>	 (03PS1) 10Jbond: [DRAFT] spicerack: refactor puppetization [puppet] - 10https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168)
[16:58:09] <icinga-wm>	 RECOVERY - Check systemd state on aphlict1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:58:22] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] [DRAFT] spicerack: refactor puppetization [puppet] - 10https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168) (owner: 10Jbond)
[16:58:39] <mutante>	 !log aphlict1001 - aphlict service did not come back after rebooting machine. fix was to manually 'rm /var/run/aphlict/aphlict.pid' and 'systemctl start aphlict'
[16:58:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:00:05] <jouncebot>	 jbond and rzl: Your horoscope predicts another unfortunate Puppet request window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221215T1700).
[17:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[17:02:16] <wikibugs>	 (03PS2) 10Jbond: [DRAFT] spicerack: refactor puppetization [puppet] - 10https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168)
[17:02:36] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] [DRAFT] spicerack: refactor puppetization [puppet] - 10https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168) (owner: 10Jbond)
[17:04:10] <mutante>	 !log phab2002 (non active phabricator server) - rebooting
[17:04:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:04:35] <wikibugs>	 (03CR) 10DCausse: [C: 03+1] wdqs: allow mounting clouddumps share from wdqs2009 [puppet] - 10https://gerrit.wikimedia.org/r/868440 (https://phabricator.wikimedia.org/T323096) (owner: 10Bking)
[17:05:17] <icinga-wm>	 PROBLEM - Host phab2002 is DOWN: PING CRITICAL - Packet loss = 100%
[17:06:32] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 0:10:00 on phab2002.codfw.wmnet with reason: reboot
[17:06:38] <wikibugs>	 (03PS3) 10Jbond: [DRAFT] spicerack: refactor puppetization [puppet] - 10https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168)
[17:06:47] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on phab2002.codfw.wmnet with reason: reboot
[17:06:51] <icinga-wm>	 RECOVERY - Host phab2002 is UP: PING OK - Packet loss = 0%, RTA = 33.25 ms
[17:07:05] <wikibugs>	 (CR) Bking: [C: +2] wdqs: allow mounting clouddumps share from wdqs2009 [puppet] - https://gerrit.wikimedia.org/r/868440 (https://phabricator.wikimedia.org/T323096) (owner: Bking)
[17:07:29] <wikibugs>	 (PS8) Volans: base::cloud_production: introduce new profile [puppet] - https://gerrit.wikimedia.org/r/867169 (https://phabricator.wikimedia.org/T319401)
[17:08:45] <wikibugs>	 (CR) Volans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/867169 (https://phabricator.wikimedia.org/T319401) (owner: Volans)
[17:10:29] <wikibugs>	 (CR) Arturo Borrero Gonzalez: puppetmaster: git-sync-upstream: use the gitpuppet user for git operations (1 comment) [puppet] - https://gerrit.wikimedia.org/r/868400 (https://phabricator.wikimedia.org/T325280) (owner: Arturo Borrero Gonzalez)
[17:11:43] <wikibugs>	 (PS4) Jbond: [DRAFT] spicerack: refactor puppetization [puppet] - https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168)
[17:13:35] <wikibugs>	 (PS5) Jbond: [DRAFT] spicerack: refactor puppetization [puppet] - https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168)
[17:14:16] <wikibugs>	 (CR) Ottomata: flink-kubernetes-operator - modify for WMF and add an admin_ng helmfile (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) (owner: Ottomata)
[17:15:09] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch200[12] - https://phabricator.wikimedia.org/T314160 (Ottomata) YAYYY
[17:17:29] <volans>	 !log manually removed /etc/spicerack/redis_cluster/sessions.yaml from the cumin hosts
[17:17:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:17:35] <volans>	 cc moritzm effie ^^^
[17:21:35] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash2002.codfw.wmnet with reason: host reimage
[17:21:45] <wikibugs>	 (PS1) Volans: cookbook: improve help message [software/spicerack] - https://gerrit.wikimedia.org/r/868445
[17:24:21] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 0:10:00 on miscweb2002.codfw.wmnet with reason: reboot
[17:24:42] <mutante>	 !log miscweb2002 - passive host, rebooting
[17:24:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:24:47] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on miscweb2002.codfw.wmnet with reason: reboot
[17:24:59] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash2002.codfw.wmnet with reason: host reimage
[17:25:17] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.kafka.reboot-workers for Kafka logging-codfw cluster: Reboot kafka nodes
[17:26:37] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[17:26:39] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[17:27:01] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw2313.codfw.wmnet, mw2271.codfw.wmnet, mw2338.codfw.wmnet, mw2409.codfw.wmnet, mw2415.codfw.wmnet, mw2389.codfw.wmnet, mw2274.codfw.wmnet, mw2392.codfw.wmnet, mw2333.codfw.wmnet, mw2414.codfw.wmnet, mw2312.codfw.wmnet, mw2375.codfw.wmnet, mw2371.codfw.wmnet, mw2310.codfw.wmnet, mw2273.codfw.wmnet, mw2413.codfw.wmnet, mw
[17:27:01] <icinga-wm>	 fw.wmnet, mw2303.codfw.wmnet, mw2325.codfw.wmnet, mw2393.codfw.wmnet, mw2314.codfw.wmnet, mw2386.codfw.wmnet, mw2275.codfw.wmnet, mw2361.codfw.wmnet, mw2369.codfw.wmnet, mw2269.codfw.wmnet, mw2365.codfw.wmnet, mw2412.codfw.wmnet, mw2406.codfw.wmnet, mw2408.codfw.wmnet, mw2315.codfw.wmnet, mw2327.codfw.wmnet, mw2373.codfw.wmnet, mw2270.codfw.wmnet, mw2335.codfw.wmnet, mw2339.codfw.wmnet, mw2316.codfw.wmnet, mw2377.codfw.wmnet, mw2385.codfw
[17:27:01] <icinga-wm>	 mw2331.codfw.wmnet, mw2277.codfw.wmnet, mw2384.codfw.wmnet, mw2305.codfw.wmnet, mw2388.codfw.wmnet, mw2337.codfw.wmnet, mw2272.codfw.wmnet, mw2307.codfw.wmnet, mw2379.codfw.wmnet, mw238 https://wikitech.wikimedia.org/wiki/PyBal
[17:27:18] <jinxer-wm>	 (ProbeDown) firing: (6) Service appservers-https:443 has failed probes (http_appservers-https_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:27:19] <jinxer-wm>	 (ProbeDown) firing: (9) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:27:41] <mutante>	 ugh
[17:27:41] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[17:27:50] <TheresNoTime>	 err.
[17:27:51] <mutante>	 is this the kafka server reboot?
[17:27:59] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at codfw on alert1001 is CRITICAL: 0.9697 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[17:28:13] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=
[17:28:17] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle php7.4-fpm.service workers for Mediawiki appserver at codfw #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[17:28:21] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 1401 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[17:28:45] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[17:28:55] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2384 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:28:55] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2337 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:28:55] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2392 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:28:57] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2271 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:28:57] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2301 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:28:57] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:28:57] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2380 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:28:59] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2393 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:28:59] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2335 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:29:01] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2315 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:29:01] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2274 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:29:03] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2371 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:29:03] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2272 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:29:05] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2269 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:29:07] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2305 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:29:07] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2273 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:29:11] <denisse>	 mutante: Possibly yes, I'm looking at it.
[17:29:11] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2378 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:29:11] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2383 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:29:11] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2385 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:29:11] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2390 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:29:11] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2409 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:29:12] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2275 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:29:12] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2414 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:29:13] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:29:15] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2339 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:29:15] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2413 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:29:23] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2369 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:29:23] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2329 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:29:27] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2311 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:29:27] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2391 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:29:27] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2331 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:29:29] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2379 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:29:31] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2303 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:29:33] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2270 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:29:35] <jinxer-wm>	 (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[17:29:37] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2388 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:29:41] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2314 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:29:41] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2363 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:29:43] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2386 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:29:43] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2408 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:29:45] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2307 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:29:51] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:29:51] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2373 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:29:51] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2361 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:29:53] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:29:53] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2359 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:29:53] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2367 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:30:17] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2309 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:31:37] <icinga-wm>	 PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton
[17:32:05] <wikibugs>	 (PS1) BryanDavis: developer-portal: Bump container to 2022-12-12-165842-production [deployment-charts] - https://gerrit.wikimedia.org/r/868448
[17:32:18] <jinxer-wm>	 (ProbeDown) firing: (13) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:32:19] <jinxer-wm>	 (ProbeDown) firing: (13) Service appservers-https:443 has failed probes (http_appservers-https_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:33:17] <jinxer-wm>	 (AppserversUnreachable) firing: Appserver unavailable for cluster appserver at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[17:33:29] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2325 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:33:29] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2336 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:33:29] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2389 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:33:29] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2412 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:34:21] <wikibugs>	 (PS6) Jbond: [DRAFT] spicerack: refactor puppetization [puppet] - https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168)
[17:35:06] <wikibugs>	 (PS3) AikoChou: ml-services: update revertrisk docker images [deployment-charts] - https://gerrit.wikimedia.org/r/868442 (https://phabricator.wikimedia.org/T325218)
[17:36:13] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw2393.codfw.wmnet, mw2313.codfw.wmnet, mw2271.codfw.wmnet, mw2338.codfw.wmnet, mw2409.codfw.wmnet, mw2415.codfw.wmnet, mw2274.codfw.wmnet, mw2392.codfw.wmnet, mw2333.codfw.wmnet, mw2414.codfw.wmnet, mw2312.codfw.wmnet, mw2375.codfw.wmnet, mw2316.codfw.wmnet, mw2303.codfw.wmnet, mw2325.codfw.wmnet, mw2379.codfw.wmnet, mw
[17:36:14] <icinga-wm>	 fw.wmnet, mw2386.codfw.wmnet, mw2275.codfw.wmnet, mw2361.codfw.wmnet, mw2369.codfw.wmnet, mw2269.codfw.wmnet, mw2365.codfw.wmnet, mw2406.codfw.wmnet, mw2408.codfw.wmnet, mw2315.codfw.wmnet, mw2327.codfw.wmnet, mw2373.codfw.wmnet, mw2335.codfw.wmnet, mw2339.codfw.wmnet, mw2272.codfw.wmnet, mw2377.codfw.wmnet, mw2385.codfw.wmnet, mw2331.codfw.wmnet, mw2277.codfw.wmnet, mw2384.codfw.wmnet, mw2305.codfw.wmnet, mw2337.codfw.wmnet, mw2407.codfw
[17:36:14] <icinga-wm>	 mw2268.codfw.wmnet, mw2301.codfw.wmnet, mw2273.codfw.wmnet, mw2276.codfw.wmnet, mw2363.codfw.wmnet, mw2329.codfw.wmnet, mw2391.codfw.wmnet, mw2309.codfw.wmnet, mw2387.codfw.wmnet, mw231 https://wikitech.wikimedia.org/wiki/PyBal
[17:36:33] <wikibugs>	 (PS7) Jbond: [DRAFT] spicerack: refactor puppetization [puppet] - https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168)
[17:37:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db2129', diff saved to https://phabricator.wikimedia.org/P42713 and previous config saved to /var/cache/conftool/dbconfig/20221215-173713-ladsgroup.json
[17:37:16] <wikibugs>	 (PS1) BryanDavis: striker: Bump container version to 2022-12-01-223001-production [puppet] - https://gerrit.wikimedia.org/r/868449
[17:37:45] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38824/console" [puppet] - https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168) (owner: Jbond)
[17:38:17] <jinxer-wm>	 (AppserversUnreachable) resolved: Appserver unavailable for cluster appserver at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[17:38:35] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2415 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:38:59] <wikibugs>	 (CR) CI reject: [V: -1] [DRAFT] spicerack: refactor puppetization [puppet] - https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168) (owner: Jbond)
[17:39:01] <icinga-wm>	 RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton
[17:39:35] <jinxer-wm>	 (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[17:39:41] <Oshwah>	 Hmm looks like the sign-in page on en-wiki isn't working for me. Tried different browsers and workstations.
[17:40:11] <TheresNoTime>	 Oshwah: ack, is known currently ^^
[17:40:41] <Oshwah>	 TheresNoTime: Yeah I figured such. I just thought I'd mention it just in case. Thanks for letting me know.
[17:41:33] <ragesoss>	 the wikis have been down for me for 15 mins or so. some people are reporting they've been working fine, others also reporting them down.
[17:41:39] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2277 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:41:43] <icinga-wm>	 RECOVERY - PHP7 rendering on mw2316 is OK: HTTP OK: HTTP/1.1 302 Found - 518 bytes in 9.894 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:41:50] <_joe_>	 ragesoss: still down now?
[17:41:51] <Oshwah>	 Looks like it's coming back for me.
[17:42:21] <ragesoss>	 _joe_: yes, still just spinning for me
[17:42:31] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[17:42:40] <ragesoss>	 okay, maybe finally coming back, although extremely slow page loads
[17:42:40] <Oshwah>	 Worked just fine for me. Make sure to clear cookies, cache and test again.
[17:42:52] <wikibugs>	 (PS8) Jbond: [DRAFT] spicerack: refactor puppetization [puppet] - https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168)
[17:42:57] <icinga-wm>	 RECOVERY - PHP7 rendering on mw2276 is OK: HTTP OK: HTTP/1.1 302 Found - 516 bytes in 0.095 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:42:57] <icinga-wm>	 RECOVERY - PHP7 rendering on mw2309 is OK: HTTP OK: HTTP/1.1 302 Found - 516 bytes in 0.084 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:42:57] <icinga-wm>	 RECOVERY - PHP7 rendering on mw2275 is OK: HTTP OK: HTTP/1.1 302 Found - 516 bytes in 0.093 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:42:57] <icinga-wm>	 RECOVERY - PHP7 rendering on mw2305 is OK: HTTP OK: HTTP/1.1 302 Found - 516 bytes in 0.087 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:42:57] <icinga-wm>	 RECOVERY - PHP7 rendering on mw2307 is OK: HTTP OK: HTTP/1.1 302 Found - 516 bytes in 0.085 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:42:58] <icinga-wm>	 RECOVERY - PHP7 rendering on mw2273 is OK: HTTP OK: HTTP/1.1 302 Found - 516 bytes in 0.103 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:42:58] <icinga-wm>	 RECOVERY - PHP7 rendering on mw2325 is OK: HTTP OK: HTTP/1.1 302 Found - 516 bytes in 0.087 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:42:59] <icinga-wm>	 RECOVERY - PHP7 rendering on mw2277 is OK: HTTP OK: HTTP/1.1 302 Found - 518 bytes in 6.147 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:42:59] <icinga-wm>	 RECOVERY - PHP7 rendering on mw2270 is OK: HTTP OK: HTTP/1.1 302 Found - 517 bytes in 0.803 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:44:05] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38825/console" [puppet] - https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168) (owner: Jbond)
[17:44:17] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[17:44:21] <icinga-wm>	 RECOVERY - PHP7 rendering on mw2386 is OK: HTTP OK: HTTP/1.1 302 Found - 516 bytes in 0.094 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[17:44:38] <wikibugs>	 (CR) BryanDavis: [C: +1] "PCC output: https://puppet-compiler.wmflabs.org/output/868449/3/" [puppet] - https://gerrit.wikimedia.org/r/868449 (owner: BryanDavis)
[17:45:15] <wikibugs>	 (CR) CI reject: [V: -1] [DRAFT] spicerack: refactor puppetization [puppet] - https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168) (owner: Jbond)
[17:45:37] <icinga-wm>	 RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in appserver at codfw on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.09091 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[17:45:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repool db2129', diff saved to https://phabricator.wikimedia.org/P42714 and previous config saved to /var/cache/conftool/dbconfig/20221215-174537-ladsgroup.json
[17:46:39] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[17:47:01] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[17:47:03] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[17:47:14] <wikibugs>	 (PS5) Arturo Borrero Gonzalez: puppetmaster: git-sync-upstream: use the gitpuppet user for git operations [puppet] - https://gerrit.wikimedia.org/r/868400 (https://phabricator.wikimedia.org/T325280)
[17:47:18] <jinxer-wm>	 (ProbeDown) resolved: (13) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:47:19] <jinxer-wm>	 (ProbeDown) resolved: (13) Service appservers-https:443 has failed probes (http_appservers-https_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:48:11] <icinga-wm>	 RECOVERY - Host durum1001 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms
[17:48:18] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle php7.4-fpm.service workers for Mediawiki appserver at codfw #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[17:48:33] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-jumbo1010.eqiad.wmnet with reason: host reimage
[17:49:38] <wikibugs>	 (PS9) Jbond: [DRAFT] spicerack: refactor puppetization [puppet] - https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168)
[17:49:40] <wikibugs>	 (PS1) Jbond: O:cluster::cloud_managment: remove unneeded profiles [puppet] - https://gerrit.wikimedia.org/r/868451
[17:49:45] <icinga-wm>	 PROBLEM - Check systemd state on durum1001 is CRITICAL: CRITICAL - degraded: The following units failed: networking.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:49:59] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:50:01] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:51:16] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38826/console" [puppet] - https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168) (owner: Jbond)
[17:51:38] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-jumbo1010.eqiad.wmnet with reason: host reimage
[17:51:57] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash2002.codfw.wmnet with OS bullseye
[17:52:04] <wikibugs>	 (CR) CI reject: [V: -1] [DRAFT] spicerack: refactor puppetization [puppet] - https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168) (owner: Jbond)
[17:53:56] <wikibugs>	 (PS10) Jbond: [DRAFT] spicerack: refactor puppetization [puppet] - https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168)
[17:55:42] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38827/console" [puppet] - https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168) (owner: Jbond)
[17:56:23] <wikibugs>	 (CR) CI reject: [V: -1] [DRAFT] spicerack: refactor puppetization [puppet] - https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168) (owner: Jbond)
[17:57:42] <wikibugs>	 (CR) BryanDavis: [C: +2] developer-portal: Bump container to 2022-12-12-165842-production [deployment-charts] - https://gerrit.wikimedia.org/r/868448 (owner: BryanDavis)
[18:00:05] <jouncebot>	 bd808: #bothumor I � Unicode. All rise for Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221215T1800).
[18:01:08] <bd808>	 o/ I will be pushing out a new build of developer portal
[18:02:45] <wikibugs>	 (03Merged) 10jenkins-bot: developer-portal: Bump container to 2022-12-12-165842-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/868448 (owner: 10BryanDavis)
[18:03:20] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38828/console" [puppet] - 10https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168) (owner: 10Jbond)
[18:03:56] <logmsgbot>	 !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/developer-portal: apply
[18:04:19] <logmsgbot>	 !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply
[18:04:45] <wikibugs>	 (03PS1) 10Andrew Bogott: cloud-vps puppet: allow multiple users to access our puppet git checkout [puppet] - 10https://gerrit.wikimedia.org/r/868454 (https://phabricator.wikimedia.org/T325280)
[18:05:07] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1001"
[18:05:11] <logmsgbot>	 !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/developer-portal: apply
[18:05:53] <logmsgbot>	 !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply
[18:07:32] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38830/console" [puppet] - 10https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168) (owner: 10Jbond)
[18:07:36] <logmsgbot>	 !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply
[18:08:09] <logmsgbot>	 !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply
[18:08:44] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1001"
[18:08:49] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-jumbo1010.eqiad.wmnet with OS bullseye
[18:08:56] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host kafka-jumbo1010.eqi...
[18:12:12] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps puppet: allow multiple users to access our puppet git checkout [puppet] - 10https://gerrit.wikimedia.org/r/868454 (https://phabricator.wikimedia.org/T325280) (owner: 10Andrew Bogott)
[18:19:10] <wikibugs>	 (03PS11) 10Jbond: [DRAFT] spicerack: refactor puppetization [puppet] - 10https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168)
[18:27:46] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] sre.hosts.reboot-single: Simplify icinga logic [cookbooks] - 10https://gerrit.wikimedia.org/r/868430 (https://phabricator.wikimedia.org/T325153) (owner: 10Jbond)
[18:28:51] <wikibugs>	 (03CR) 10Herron: [C: 03+1] elasticsearch: Enable profile::auto_restarts::service for Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/868438 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[18:29:18] <wikibugs>	 (03CR) 10Herron: [C: 03+1] netmon: Remove netmon1002 from DSH node group [puppet] - 10https://gerrit.wikimedia.org/r/868207 (https://phabricator.wikimedia.org/T322321) (owner: 10Andrea Denisse)
[18:29:35] <wikibugs>	 (03CR) 10Herron: [C: 03+1] netmon: Remove netmon2001 from DSH node group. [puppet] - 10https://gerrit.wikimedia.org/r/868204 (https://phabricator.wikimedia.org/T322695) (owner: 10Andrea Denisse)
[18:33:25] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+1] "I assume this will cause /usr/share/GeoIPInfo to be populated on the deploy server." [puppet] - 10https://gerrit.wikimedia.org/r/868199 (https://phabricator.wikimedia.org/T288375) (owner: 10Dzahn)
[18:36:10] <wikibugs>	 (03PS1) 10Vlad.shapik: Fix TypeError of SVG conversion [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/868456 (https://phabricator.wikimedia.org/T325150)
[18:36:14] <wikibugs>	 (03CR) 10Dzahn: "yea, it definitely creates the resource /usr/share/GeoIPInfo  https://puppet-compiler.wmflabs.org/output/868199/38794/deploy2002.codfw.wmn" [puppet] - 10https://gerrit.wikimedia.org/r/868199 (https://phabricator.wikimedia.org/T288375) (owner: 10Dzahn)
[18:37:39] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.kafka.reboot-workers (exit_code=0) for Kafka logging-codfw cluster: Reboot kafka nodes
[18:38:19] <wikibugs>	 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, and 2 others: byte/str TypeError during svg conversion - https://phabricator.wikimedia.org/T325150 (10Vlad.shapik) It seems that I found where is the trick. As it turned out the failed SVG file has a small body, as a result, the source in the prepare_sou...
[18:42:10] <wikibugs>	 (03CR) 10MSantos: Exclude OSM tag that causes a failing import (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868415 (https://phabricator.wikimedia.org/T325293) (owner: 10Jgiannelos)
[18:43:15] <wikibugs>	 (03PS2) 10Vlad.shapik: Fix TypeError of SVG conversion [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/868456 (https://phabricator.wikimedia.org/T325150)
[18:43:52] <robh>	 Please note we're going to see a lot of .mgmt.ulsfo.wmnet flaps shortly when i go to swap the msw in rack .22
[18:44:16] <mutante>	 thanks for the heads-up robh 
[18:45:34] <robh>	 sorry rack .23
[18:45:37] <robh>	 but yeah, same thing
[18:45:44] <robh>	 shouldnt have user impact and only affects traffic
[18:48:19] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.kafka.reboot-workers for Kafka logging-eqiad cluster: Reboot kafka nodes
[18:50:40] <robh>	 !log starting msw2-ulsfo swap for rack .23, mgmt will flap with no expected user impact
[18:50:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:53:43] <robh>	 half done old msw out
[18:53:45] <robh>	 new one going in
[18:54:33] <wikibugs>	 (03CR) 10Jbond: First stab at possible ferm::qos resource for DSCP marking (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney)
[18:54:37] <icinga-wm>	 PROBLEM - Host ps1-23-ulsfo is DOWN: PING CRITICAL - Packet loss = 100%
[18:55:01] <icinga-wm>	 PROBLEM - Router interfaces on mr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.194, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:55:45] <icinga-wm>	 PROBLEM - Host ripe-atlas-ulsfo IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[18:55:45] <icinga-wm>	 PROBLEM - Host ripe-atlas-ulsfo is DOWN: PING CRITICAL - Packet loss = 100%
[18:56:19] <icinga-wm>	 PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - Certificate toolserver.org valid until 2022-12-18 12:02:52 +0000 (expires in 2 days) https://phabricator.wikimedia.org/tag/toolforge/
[18:56:39] <icinga-wm>	 PROBLEM - Host cr4-ulsfo.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:57:38] <robh>	 expected for ulsfo mgmt
[18:57:46] <robh>	 it'll be back in less than 5, done with network, adding power
[18:59:14] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:59:40] <robh>	 ok new msw2 in place
[18:59:44] <robh>	 those should all start clearing
[19:00:03] <wikibugs>	 (03CR) 10Jbond: "sorry missed one" [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney)
[19:00:03] <icinga-wm>	 PROBLEM - Maps - OSM synchronization lag - eqiad on alert1001 is CRITICAL: 2.592e+05 ge 2.592e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&viewPanel=11
[19:00:04] <jouncebot>	 hashar and ^demon: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221215T1900).
[19:00:31] <icinga-wm>	 RECOVERY - Host ps1-23-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 71.89 ms
[19:00:49] <icinga-wm>	 RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate toolserver.org valid until 2023-02-16 11:40:18 +0000 (expires in 62 days) https://phabricator.wikimedia.org/tag/toolforge/
[19:00:49] <icinga-wm>	 PROBLEM - Maps - OSM synchronization lag - codfw on alert1001 is CRITICAL: 2.592e+05 ge 2.592e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&viewPanel=12
[19:01:01] <icinga-wm>	 RECOVERY - Router interfaces on mr1-ulsfo is OK: OK: host 198.35.26.194, interfaces up: 38, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:01:18] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+2] netmon: Remove netmon1002 from DSH node group [puppet] - 10https://gerrit.wikimedia.org/r/868207 (https://phabricator.wikimedia.org/T322321) (owner: 10Andrea Denisse)
[19:02:09] <icinga-wm>	 RECOVERY - Host cr4-ulsfo.mgmt is UP: PING OK - Packet loss = 0%, RTA = 71.25 ms
[19:04:14] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job pdu_sentry4 in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:07:01] <wikibugs>	 (03PS2) 10Andrea Denisse: netmon: Remove netmon2001 from DSH node group. [puppet] - 10https://gerrit.wikimedia.org/r/868204 (https://phabricator.wikimedia.org/T322695)
[19:07:58] <logmsgbot>	 !log xcollazo@deploy1002 Started deploy [airflow-dags/platform_eng@d23127b]: Deploying Spark3 upgrade of image_suggestions job to the platform_eng Airflow instance.
[19:07:59] <robh>	 bleh didnt mean to dc during recoveries
[19:08:02] <wikibugs>	 (03PS1) 10Ssingh: P:cumin: set per-site aliases for Wikidough/durum [puppet] - 10https://gerrit.wikimedia.org/r/868458
[19:08:08] <logmsgbot>	 !log xcollazo@deploy1002 Finished deploy [airflow-dags/platform_eng@d23127b]: Deploying Spark3 upgrade of image_suggestions job to the platform_eng Airflow instance. (duration: 00m 10s)
[19:08:11] <robh>	 no more down mgmt though so yay
[19:08:15] <robh>	 atlas still is so thats odd....
[19:09:09] <wikibugs>	 (03PS2) 10Ssingh: P:cumin: set per-site aliases for Wikidough/durum [puppet] - 10https://gerrit.wikimedia.org/r/868458
[19:09:11] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+2] netmon: Remove netmon2001 from DSH node group. [puppet] - 10https://gerrit.wikimedia.org/r/868204 (https://phabricator.wikimedia.org/T322695) (owner: 10Andrea Denisse)
[19:09:25] <robh>	 ok, it'll come back, it has a shitty power cable
[19:09:37] <robh>	 and was unseated, it's an appliance so has a non-standard 2 pin ungrounded cable
[19:10:03] <wikibugs>	 (03PS3) 10Ssingh: P:cumin: set per-site aliases for Wikidough/durum [puppet] - 10https://gerrit.wikimedia.org/r/868458
[19:11:31] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38834/console" [puppet] - 10https://gerrit.wikimedia.org/r/868458 (owner: 10Ssingh)
[19:16:13] <wikibugs>	 (03PS1) 10Andrew Bogott: cloud-vps puppet: rework git config safe.dir definition [puppet] - 10https://gerrit.wikimedia.org/r/868460 (https://phabricator.wikimedia.org/T325280)
[19:17:58] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] cloud-vps puppet: rework git config safe.dir definition [puppet] - 10https://gerrit.wikimedia.org/r/868460 (https://phabricator.wikimedia.org/T325280) (owner: 10Andrew Bogott)
[19:20:05] <icinga-wm>	 PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - Certificate toolserver.org valid until 2022-12-18 12:02:52 +0000 (expires in 2 days) https://phabricator.wikimedia.org/tag/toolforge/
[19:21:25] <mutante>	 andrewbogott: ^ is this why you had that certificate discussion yesterday?
[19:21:33] <icinga-wm>	 RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate toolserver.org valid until 2023-02-16 11:40:18 +0000 (expires in 62 days) https://phabricator.wikimedia.org/tag/toolforge/
[19:21:55] <mutante>	 andrewbogott: guess not (anymore) ^ :)
[19:21:58] <vgutierrez>	 That was syslog related 
[19:22:11] <wikibugs>	 (03PS2) 10Andrew Bogott: cloud-vps puppet: rework git config safe.dir definition [puppet] - 10https://gerrit.wikimedia.org/r/868460 (https://phabricator.wikimedia.org/T325280)
[19:22:20] <mutante>	 vgutierrez: ok, thanks
[19:22:57] <robh>	 ok the remainder of ulsfo work is non impacting and decomed cruft
[19:23:05] <robh>	 so ulsfo maint mostly done but cabinets still open
[19:24:20] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps puppet: rework git config safe.dir definition [puppet] - 10https://gerrit.wikimedia.org/r/868460 (https://phabricator.wikimedia.org/T325280) (owner: 10Andrew Bogott)
[19:24:27] <Reedy>	 robh: you can get power cables in the US with an actual ground? :P
[19:24:46] <robh>	 for consumer grade appliances its all too common
[19:24:55] <robh>	 but atlas in datacenter is outlier
[19:25:02] <Reedy>	 heh
[19:25:35] <Reedy>	 The fact that walgreens et al sell "adapters" to plug plugs that have a ground into wall sockets that don't...
[19:26:58] <taavi>	 mutante: no, that flapping alert means that some apache2 threads are still hanging on to the old certificate for whatever reason. an apache restart ({{done}}) fixes it
[19:29:15] <icinga-wm>	 PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1746 MB (3% inode=84%): /tmp 1746 MB (3% inode=84%): /var/tmp 1746 MB (3% inode=84%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-coord1001&var-datasource=eqiad+prometheus/ops
[19:30:24] <mutante>	 taavi: makes sense, thanks. I had not seen the recovery yet  at first
[19:30:43] <wikibugs>	 (03PS3) 10Bking: [WIP] wdqs: extract and validate kafka timestamp [cookbooks] - 10https://gerrit.wikimedia.org/r/868198 (https://phabricator.wikimedia.org/T325114)
[19:32:24] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] [WIP] wdqs: extract and validate kafka timestamp [cookbooks] - 10https://gerrit.wikimedia.org/r/868198 (https://phabricator.wikimedia.org/T325114) (owner: 10Bking)
[19:35:48] <mutante>	 !log short downtime for misc websites, iegreview, racktables, transparency.wm, annual.wm, design.wm, sitemaps.wm, research.wm, bienvenida.wm, wikiworkshop.org 
[19:35:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:43:53] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for vpoundstone - WMF - https://phabricator.wikimedia.org/T314676 (10VirginiaPoundstone) 05Open→03Resolved @akosiaris I tried Virginia Poundstone  Luke pointed me to https://phabricator.wikimedia.org/T293241#7436893 and I got it...
[19:51:04] <wikibugs>	 (03PS1) 10Hashar: deploy_artifacts: add dry run mode [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/868461
[19:51:06] <wikibugs>	 (03PS1) 10Hashar: deploy_artifacts: --version is a required option [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/868462
[19:51:55] <wikibugs>	 10SRE-OnFire, 10Discovery-Search, 10Sustainability (Incident Followup): Evaluate options to soften wdqs paging - https://phabricator.wikimedia.org/T325324 (10RKemper)
[19:52:50] <wikibugs>	 (03CR) 10BPirkle: "lgtm, but I don't have +2 permissions on this repo." [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/868456 (https://phabricator.wikimedia.org/T325150) (owner: 10Vlad.shapik)
[19:53:01] <wikibugs>	 (03CR) 10BPirkle: [C: 03+1] Fix TypeError of SVG conversion [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/868456 (https://phabricator.wikimedia.org/T325150) (owner: 10Vlad.shapik)
[20:02:09] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.kafka.reboot-workers (exit_code=0) for Kafka logging-eqiad cluster: Reboot kafka nodes
[20:06:45] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] docker_registry_ha: add contint2002 to image builder hosts [puppet] - 10https://gerrit.wikimedia.org/r/867708 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn)
[20:15:01] <inflatador>	 working on a new cookbook, apologize in advance for the spam
[20:15:33] <jinxer-wm>	 (KeyholderUnarmed) firing: (2) 1 unarmed Keyholder key(s) on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[20:17:20] <wikibugs>	 (03PS1) 10Ryan Kemper: [WIP] wdqs: switch to using NFS for dump files [cookbooks] - 10https://gerrit.wikimedia.org/r/868465
[20:19:02] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] [WIP] wdqs: switch to using NFS for dump files [cookbooks] - 10https://gerrit.wikimedia.org/r/868465 (owner: 10Ryan Kemper)
[20:19:54] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload
[20:20:46] <logmsgbot>	 !log bking@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-reload (exit_code=97)
[20:21:55] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload
[20:22:09] <logmsgbot>	 !log bking@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-reload (exit_code=97)
[20:22:16] <wikibugs>	 (03PS1) 10AOkoth: vrts: add vrts2001 values and add database port in config [puppet] - 10https://gerrit.wikimedia.org/r/868467 (https://phabricator.wikimedia.org/T323515)
[20:23:21] <wikibugs>	 (03CR) 10MSantos: [C: 04-1] "I'm on the fence about this one, the docs states [1] that changing the mapping config will require a fresh re-import of the DB otherwise O" [puppet] - 10https://gerrit.wikimedia.org/r/868415 (https://phabricator.wikimedia.org/T325293) (owner: 10Jgiannelos)
[20:24:38] <TheresNoTime>	 jouncebot: nowandnext
[20:24:38] <jouncebot>	 For the next 0 hour(s) and 35 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221215T1900)
[20:24:38] <jouncebot>	 In 0 hour(s) and 35 minute(s): UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221215T2100)
[20:25:07] <wikibugs>	 (03PS2) 10AOkoth: vrts: add vrts2001 values and add database port in config [puppet] - 10https://gerrit.wikimedia.org/r/868467 (https://phabricator.wikimedia.org/T323515)
[20:26:32] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] striker: Bump container version to 2022-12-01-223001-production [puppet] - 10https://gerrit.wikimedia.org/r/868449 (owner: 10BryanDavis)
[20:27:03] <wikibugs>	 10SRE, 10Thumbor, 10serviceops, 10User-jijiki: Upgrade Thumbor to Bullseye - https://phabricator.wikimedia.org/T216815 (10Andrew) Huh, is anyone tasked with this? This is one of the few cases that's keeping Stretch alive in cloud-vps and prod.
[20:27:40] <wikibugs>	 (03PS3) 10AOkoth: vrts: add vrts2001 values and add database port in config [puppet] - 10https://gerrit.wikimedia.org/r/868467 (https://phabricator.wikimedia.org/T323515)
[20:29:06] <wikibugs>	 (03CR) 10Dzahn: "looks pretty good to me, one inline nitpick about repeating the values in the host yaml" [puppet] - 10https://gerrit.wikimedia.org/r/868467 (https://phabricator.wikimedia.org/T323515) (owner: 10AOkoth)
[20:30:45] <icinga-wm>	 PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1741 MB (3% inode=84%): /tmp 1741 MB (3% inode=84%): /var/tmp 1741 MB (3% inode=84%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-coord1001&var-datasource=eqiad+prometheus/ops
[20:31:15] <wikibugs>	 (03CR) 10AOkoth: "https://puppet-compiler.wmflabs.org/output/868467/38839/otrs1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/868467 (https://phabricator.wikimedia.org/T323515) (owner: 10AOkoth)
[20:31:41] <wikibugs>	 (03PS1) 10Gergő Tisza: NewImpact: Use "View all edits" in footer [extensions/GrowthExperiments] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868364 (https://phabricator.wikimedia.org/T325216)
[20:34:00] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] install_server: set codfw logstash vms to install bullseye [puppet] - 10https://gerrit.wikimedia.org/r/861872 (https://phabricator.wikimedia.org/T321410) (owner: 10Cwhite)
[20:34:24] <wikibugs>	 (03PS4) 10AOkoth: vrts: add vrts2001 values and add database port in config [puppet] - 10https://gerrit.wikimedia.org/r/868467 (https://phabricator.wikimedia.org/T323515)
[20:42:42] <wikibugs>	 (03PS2) 10RLazarus: httpbb: Add tests for test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/868211 (https://phabricator.wikimedia.org/T290536)
[20:42:44] <wikibugs>	 (03PS2) 10RLazarus: httpbb: Run hourly tests from the cumin hosts against mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/868212 (https://phabricator.wikimedia.org/T290536)
[20:52:51] <wikibugs>	 (03PS2) 10Gergő Tisza: NewImpact: Use "View all edits" in footer [extensions/GrowthExperiments] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868364 (https://phabricator.wikimedia.org/T325216)
[20:56:28] <wikibugs>	 (03PS3) 10Gergő Tisza: NewImpact: Use "View all edits" in footer [extensions/GrowthExperiments] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868364 (https://phabricator.wikimedia.org/T325216)
[20:57:30] <wikibugs>	 (03CR) 10RLazarus: [C: 03+2] httpbb: Add tests for test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/868211 (https://phabricator.wikimedia.org/T290536) (owner: 10RLazarus)
[20:57:43] <wikibugs>	 (03CR) 10RLazarus: [C: 03+2] httpbb: Run hourly tests from the cumin hosts against mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/868212 (https://phabricator.wikimedia.org/T290536) (owner: 10RLazarus)
[21:00:04] <jouncebot>	 brennen and TheresNoTime: Your horoscope predicts another unfortunate UTC late backport and config training deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221215T2100).
[21:00:05] <jouncebot>	 zabe and tgr: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:15] <zabe>	 o/
[21:00:16] <tgr>	 o/
[21:00:38] <TheresNoTime>	 I can deploy :) (did you want to self-serve tgr?)
[21:01:13] <tgr>	 TheresNoTime: if you don't mind doing it, I'm happy with that
[21:01:44] <TheresNoTime>	 sure :)
[21:01:50] <brennen>	 o/ no one for training today, so i'll let you all proceed.
[21:02:08] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/863434 (https://phabricator.wikimedia.org/T184782) (owner: 10Zabe)
[21:02:50] <wikibugs>	 (03Merged) 10jenkins-bot: Update reference to CommandLineInc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/863434 (https://phabricator.wikimedia.org/T184782) (owner: 10Zabe)
[21:03:07] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:863434|Update reference to CommandLineInc (T184782)]]
[21:03:18] <stashbot>	 T184782: Get rid of `.inc` files in MediaWiki, using .php instead (was: Test coverage missing for .inc files) - https://phabricator.wikimedia.org/T184782
[21:04:14] <wikibugs>	 (03CR) 10Vlad.shapik: Fix TypeError of SVG conversion (031 comment) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/868456 (https://phabricator.wikimedia.org/T325150) (owner: 10Vlad.shapik)
[21:04:26] <wikibugs>	 (03CR) 10Samtar: [C: 03+2] "start merge for deploy" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868364 (https://phabricator.wikimedia.org/T325216) (owner: 10Gergő Tisza)
[21:04:49] <logmsgbot>	 !log samtar@deploy1002 samtar and zabe: Backport for [[gerrit:863434|Update reference to CommandLineInc (T184782)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet
[21:04:51] <TheresNoTime>	 zabe: that's live on mwdebug, can you test?
[21:04:56] <zabe>	 no
[21:05:19] <TheresNoTime>	 oh yeah
[21:05:20] <TheresNoTime>	 :D
[21:05:31] <TheresNoTime>	 (syncing, apologies)
[21:05:54] <Reedy>	 lol
[21:06:23] * TheresNoTime definitely doesn't have a script to automate all the IRC interactions too /sarcasm
[21:11:26] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:863434|Update reference to CommandLineInc (T184782)]] (duration: 08m 18s)
[21:11:30] <stashbot>	 T184782: Get rid of `.inc` files in MediaWiki, using .php instead (was: Test coverage missing for .inc files) - https://phabricator.wikimedia.org/T184782
[21:11:50] <TheresNoTime>	 zabe: done :)
[21:12:07] <zabe>	 thanks :)
[21:13:09] <TheresNoTime>	 (just waiting on 868364 to merge)
[21:15:47] <logmsgbot>	 !log cwhite@cumin2002 conftool action : set/pooled=no; selector: name=logstash2024.codfw.wmnet,service=kibana7
[21:15:47] <wikibugs>	 (03PS14) 10Cathal Mooney: First stab at possible ferm::qos resource for DSCP marking [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358)
[21:16:59] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload
[21:17:10] <logmsgbot>	 !log bking@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-reload (exit_code=97)
[21:22:25] <wikibugs>	 (03Merged) 10jenkins-bot: NewImpact: Use "View all edits" in footer [extensions/GrowthExperiments] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868364 (https://phabricator.wikimedia.org/T325216) (owner: 10Gergő Tisza)
[21:22:32] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868364 (https://phabricator.wikimedia.org/T325216) (owner: 10Gergő Tisza)
[21:22:46] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:868364|NewImpact: Use "View all edits" in footer (T325216)]]
[21:22:50] <stashbot>	 T325216: Impact Module: "View all edits"  - https://phabricator.wikimedia.org/T325216
[21:23:39] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload
[21:23:40] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
[21:24:27] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload
[21:25:15] <logmsgbot>	 !log bking@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-reload (exit_code=97)
[21:27:50] <TheresNoTime>	 `build-and-push-container-images` step taking a bit longer than normal
[21:31:51] <TheresNoTime>	 (^ has moved on, but each step taking "longer than normal")
[21:35:59] <logmsgbot>	 !log samtar@deploy1002 samtar and tgr: Backport for [[gerrit:868364|NewImpact: Use "View all edits" in footer (T325216)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet
[21:36:03] <stashbot>	 T325216: Impact Module: "View all edits"  - https://phabricator.wikimedia.org/T325216
[21:36:13] <TheresNoTime>	 tgr: that's live on mwdebug, can you test? :)
[21:37:55] <tgr>	 TheresNoTime: looks good, thanks!
[21:38:01] <TheresNoTime>	 ack
[21:38:13] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] cloud: allow VMs to connect to contint1002 and contint2002 [puppet] - 10https://gerrit.wikimedia.org/r/867675 (https://phabricator.wikimedia.org/T313832) (owner: 10Dzahn)
[21:40:50] <wikibugs>	 (03PS1) 10Jbond: monitoring: update monitoring files to dynamically discover config [puppet] - 10https://gerrit.wikimedia.org/r/868471 (https://phabricator.wikimedia.org/T321783)
[21:41:20] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] monitoring: update monitoring files to dynamically discover config [puppet] - 10https://gerrit.wikimedia.org/r/868471 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond)
[21:42:57] <TheresNoTime>	 !log logging to note that this deploy is unusually slow, P42715
[21:42:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:44:21] <wikibugs>	 (03CR) 10Cathal Mooney: "Thanks again John for the input.  Just a couple of comments in response." [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney)
[21:47:25] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:868364|NewImpact: Use "View all edits" in footer (T325216)]] (duration: 24m 38s)
[21:47:29] <stashbot>	 T325216: Impact Module: "View all edits"  - https://phabricator.wikimedia.org/T325216
[21:47:34] <TheresNoTime>	 and live in prod tgr :)
[21:51:57] <tgr>	 thx
[21:52:16] <TheresNoTime>	 !log done UTC late backport and config training
[21:52:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:56:34] <wikibugs>	 (03PS1) 10Andrew Bogott: profile::openstack::base::nutcracker: merge in profile::mediawiki::nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/868472 (https://phabricator.wikimedia.org/T325244)
[21:59:12] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] profile::openstack::base::nutcracker: merge in profile::mediawiki::nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/868472 (https://phabricator.wikimedia.org/T325244) (owner: 10Andrew Bogott)
[21:59:19] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload
[21:59:21] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
[22:02:19] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.hosts.decommission for hosts netmon2001.wikimedia.org
[22:07:55] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] "For reference, here's a similar patch I merged for cloud-vps puppet servers" [puppet] - 10https://gerrit.wikimedia.org/r/868002 (https://phabricator.wikimedia.org/T325128) (owner: 10Hashar)
[22:10:03] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.dns.netbox
[22:11:12] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:11:13] <logmsgbot>	 !log denisse@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts netmon2001.wikimedia.org
[22:19:46] <wikibugs>	 (PS15) Cathal Mooney: First stab at possible ferm::qos resource for DSCP marking [puppet] - https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358)
[22:20:06] <wikibugs>	 (CR) CI reject: [V: -1] First stab at possible ferm::qos resource for DSCP marking [puppet] - https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358) (owner: Cathal Mooney)
[22:26:08] <icinga-wm>	 PROBLEM - Disk space on dumpsdata1003 is CRITICAL: DISK CRITICAL - free space: /data 877130 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=dumpsdata1003&var-datasource=eqiad+prometheus/ops
[22:33:14] <icinga-wm>	 RECOVERY - Disk space on an-coord1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-coord1001&var-datasource=eqiad+prometheus/ops
[22:42:36] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.hosts.decommission for hosts netmon2001.wikimedia.org
[22:47:04] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.dns.netbox
[22:48:15] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:48:16] <logmsgbot>	 !log denisse@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts netmon2001.wikimedia.org
[22:49:27] <TheresNoTime>	 !log `[samtar@mwmaint1002 imports]$ mwscript importImages.php --wiki=commonswiki --comment-ext=txt --user=Coffeeandcrumbs /home/samtar/imports` T325330
[22:49:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:49:31] <stashbot>	 T325330: Server-side upload request for Coffeeandcrumbs - https://phabricator.wikimedia.org/T325330
[23:00:47] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.hosts.decommission for hosts netmon1002.wikimedia.org
[23:05:32] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.dns.netbox
[23:06:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:07:33] <wikibugs>	 (PS16) Cathal Mooney: First stab at possible ferm::qos resource for DSCP marking [puppet] - https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358)
[23:09:17] <wikibugs>	 (CR) CI reject: [V: -1] First stab at possible ferm::qos resource for DSCP marking [puppet] - https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358) (owner: Cathal Mooney)
[23:09:26] <logmsgbot>	 !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netmon1002.wikimedia.org decommissioned, removing all IPs except the asset tag one - denisse@cumin1001"
[23:10:48] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netmon1002.wikimedia.org decommissioned, removing all IPs except the asset tag one - denisse@cumin1001"
[23:10:48] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[23:10:49] <logmsgbot>	 !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netmon1002.wikimedia.org
[23:21:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:32:40] <icinga-wm>	 PROBLEM - Check size of conntrack table on logstash2030 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.136: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[23:32:44] <icinga-wm>	 PROBLEM - Check that envoy is running on logstash2030 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.136: Connection reset by peer https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[23:32:50] <icinga-wm>	 PROBLEM - Check systemd state on logstash2030 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.136: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:32:58] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on logstash2030 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.136: Connection reset by peer https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:33:57] <cwhite>	 ^^ that's me
[23:41:28] <icinga-wm>	 RECOVERY - Check systemd state on logstash2030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:41:38] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on logstash2030 is OK: OK - elasticsearch status production-elk7-codfw: cluster_name: production-elk7-codfw, status: yellow, timed_out: False, number_of_nodes: 16, number_of_data_nodes: 10, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 663, active_shards: 1408, relocating_shards: 7, initializing_shards: 3, unassigned_shards: 68, delayed_unassigned_sh
[23:41:38] <icinga-wm>	  number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 95.19945909398243 https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:42:28] <icinga-wm>	 RECOVERY - Check size of conntrack table on logstash2030 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[23:42:30] <icinga-wm>	 RECOVERY - Check that envoy is running on logstash2030 is OK: OK - envoyproxy.service is active https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy