[00:06:44] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:35:54] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_legacy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:23:24] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:06:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:26:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:38:48] (PS1) KartikMistry: Remove akwiki from CX config [mediawiki-config] - https://gerrit.wikimedia.org/r/904952
[05:41:45] (CR) Santhosh: WIP: Add new self hosted machinetranslation service (MinT) (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) (owner: KartikMistry)
[06:14:30] (Traffic bill over quota) firing: Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota got acknowledged - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[06:23:07] (CR) Ayounsi: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/904883 (https://phabricator.wikimedia.org/T332028) (owner: Jameel Kaisar)
[06:32:30] (CR) Ayounsi: [V: +1 C: +1] "Minor adjustments might be needed, as well as configure BGP on the router side but for the scope of this patch it looks good to me!" [puppet] - https://gerrit.wikimedia.org/r/868731 (https://phabricator.wikimedia.org/T324992) (owner: Arturo Borrero Gonzalez)
[06:34:30] (Traffic bill over quota) resolved: Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota got acknowledged - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[06:52:43] !log move kafka-jumbo1005's kafka broker cert to PKI - T296064
[06:52:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:52:48] T296064: Move Kafka Jumbo's TLS clients to the new bundle - https://phabricator.wikimedia.org/T296064
[06:52:58] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on kafka-jumbo1005.eqiad.wmnet with reason: restart kafka, switch to PKI
[06:53:11] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on kafka-jumbo1005.eqiad.wmnet with reason: restart kafka, switch to PKI
[06:55:34] (PS3) Elukey: Switch kafka-main1001 broker's TLS cert to PKI [puppet] - https://gerrit.wikimedia.org/r/904667 (https://phabricator.wikimedia.org/T319372)
[07:00:05] Amir1 and Urbanecm: May I have your attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230403T0700)
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:04:41] (PS1) Slyngshede: Partman: Add /var pratition to urldownloader hosts. [puppet] - https://gerrit.wikimedia.org/r/905040 (https://phabricator.wikimedia.org/T333676)
[07:05:08] (CR) CI reject: [V: -1] Partman: Add /var pratition to urldownloader hosts. [puppet] - https://gerrit.wikimedia.org/r/905040 (https://phabricator.wikimedia.org/T333676) (owner: Slyngshede)
[07:07:00] (PS2) Slyngshede: Partman: Add /var pratition to urldownloader hosts. [puppet] - https://gerrit.wikimedia.org/r/905040 (https://phabricator.wikimedia.org/T333676)
[07:43:02] !log move kafka-jumbo1006's kafka broker cert to PKI - T296064
[07:43:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:06] T296064: Move Kafka Jumbo's TLS clients to the new bundle - https://phabricator.wikimedia.org/T296064
[07:43:12] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on kafka-jumbo1007.eqiad.wmnet with reason: restart kafka, switch to PKI
[07:43:25] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on kafka-jumbo1007.eqiad.wmnet with reason: restart kafka, switch to PKI
[07:43:53] (fixed s/1006/1007 in the SAL)
[07:44:14] (1006 is a controller right now, so I'll do it for last)
[07:45:00] (CR) Arturo Borrero Gonzalez: [C: +2] cloudlb: introduce BGP setup by means of bird [puppet] - https://gerrit.wikimedia.org/r/868731 (https://phabricator.wikimedia.org/T324992) (owner: Arturo Borrero Gonzalez)
[07:45:27] (CR) Filippo Giunchedi: alertmanager: create receiver for both sre-collab and releng combined (3 comments) [puppet] - https://gerrit.wikimedia.org/r/903796 (https://phabricator.wikimedia.org/T329587) (owner: Dzahn)
[07:47:21] (CR) Filippo Giunchedi: [C: +2] profile: let check_dpkg write prometheus stats [puppet] - https://gerrit.wikimedia.org/r/904792 (https://phabricator.wikimedia.org/T332764) (owner: Filippo Giunchedi)
[07:56:38] (PS1) Elukey: profile::hue: restart envoy only when CAS is enabled [puppet] - https://gerrit.wikimedia.org/r/905149
[07:58:52] (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40495/console" [puppet] - https://gerrit.wikimedia.org/r/905149 (owner: Elukey)
[08:00:14] (CR) Elukey: [V: +1 C: +2] profile::hue: restart envoy only when CAS is enabled [puppet] - https://gerrit.wikimedia.org/r/905149 (owner: Elukey)
[08:01:44] (CR) Volans: [C: +2] superset: requestctl-generator error handling [puppet] - https://gerrit.wikimedia.org/r/904550 (owner: Volans)
[08:02:50] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on kafka-jumbo1008.eqiad.wmnet with reason: restart kafka, switch to PKI
[08:03:03] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on kafka-jumbo1008.eqiad.wmnet with reason: restart kafka, switch to PKI
[08:03:24] !log move kafka-jumbo1008's kafka broker cert to PKI - T296064
[08:03:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:27] T296064: Move Kafka Jumbo's TLS clients to the new bundle - https://phabricator.wikimedia.org/T296064
[08:07:54] (CR) Filippo Giunchedi: [C: +2] eventgate: add EventgateLoggingExternalErrors alert [alerts] - https://gerrit.wikimedia.org/r/901623 (https://phabricator.wikimedia.org/T309009) (owner: AOkoth)
[08:08:03] (CR) CI reject: [V: -1] eventgate: add EventgateLoggingExternalErrors alert [alerts] - https://gerrit.wikimedia.org/r/901623 (https://phabricator.wikimedia.org/T309009) (owner: AOkoth)
[08:09:39] PROBLEM - Check if anycast-healthchecker and all configured threads are running on cloudlb2002-dev is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[08:12:23] RECOVERY - Check systemd state on ms-be2069 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:14:10] (CR) David Caro: [C: +2] toolforge: add k8s bastion with toolforge config (1 comment) [puppet] - https://gerrit.wikimedia.org/r/901235 (owner: David Caro)
[08:15:03] (CR) Elukey: [C: +2] Switch kafka-main1001 broker's TLS cert to PKI [puppet] - https://gerrit.wikimedia.org/r/904667 (https://phabricator.wikimedia.org/T319372) (owner: Elukey)
[08:15:38] dcaro: o/ merge anytime
[08:16:07] elukey: ack
[08:16:08] merging
[08:23:49] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on kafka-main1001.eqiad.wmnet with reason: restart kafka, switch to PKI
[08:24:02] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on kafka-main1001.eqiad.wmnet with reason: restart kafka, switch to PKI
[08:26:15] !log fetch HAProxy 2.6.12 on thirdparty/haproxy26 for bullseye (apt.wm.o)
[08:26:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:35] RECOVERY - Check if anycast-healthchecker and all configured threads are running on cloudlb2002-dev is OK: OK: UP (pid=3933775) and all threads (1) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[08:29:04] !log move kafka-main1001's kafka broker to PKI - T319372
[08:29:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:08] T319372: Move Kafka main to the new intermediate PKI CA - https://phabricator.wikimedia.org/T319372
[08:29:41] (CR) David Caro: maintain-dbusers: add prometheus metrics (1 comment) [puppet] - https://gerrit.wikimedia.org/r/902819 (https://phabricator.wikimedia.org/T332955) (owner: David Caro)
[08:31:15] !log rolling upgrade to HAProxy 2.6.12 in A:cp-ulsfo
[08:31:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:52] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_ulsfo
[08:32:14] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_ulsfo
[08:34:48] (PS1) Slyngshede: Django 3.2 support [software/debmonitor] - https://gerrit.wikimedia.org/r/905158
[08:37:42] (PS2) Slyngshede: Django 3.2 support [software/debmonitor] - https://gerrit.wikimedia.org/r/905158
[08:42:14] (CR) Majavah: [C: -1] "Since this is a continuously running daemon, I think it should be running a HTTP server that Prometheus can scrape directly instead of rel" [puppet] - https://gerrit.wikimedia.org/r/902819 (https://phabricator.wikimedia.org/T332955) (owner: David Caro)
[08:49:50] (PS3) David Caro: ceph: Allow setting a crush location hook for the rack [puppet] - https://gerrit.wikimedia.org/r/904787 (https://phabricator.wikimedia.org/T297083)
[08:49:52] (PS3) David Caro: p:cloudceph::osd: enable location hook [puppet] - https://gerrit.wikimedia.org/r/904788 (https://phabricator.wikimedia.org/T297083)
[08:49:54] (PS1) David Caro: cloud.yaml: pass a yaml formatter to it [puppet] - https://gerrit.wikimedia.org/r/905159
[08:49:56] (CR) David Caro: ceph: Allow setting a crush location hook for the rack (1 comment) [puppet] - https://gerrit.wikimedia.org/r/904787 (https://phabricator.wikimedia.org/T297083) (owner: David Caro)
[08:50:33] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_ulsfo
[08:52:21] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_ulsfo
[08:52:33] (CR) Volans: Django 3.2 support (4 comments) [software/debmonitor] - https://gerrit.wikimedia.org/r/905158 (owner: Slyngshede)
[08:53:08] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on kafka-jumbo1009.eqiad.wmnet with reason: restart kafka, switch to PKI
[08:53:33] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on kafka-jumbo1009.eqiad.wmnet with reason: restart kafka, switch to PKI
[08:54:30] !log move kafka-jumbo1009's kafka broker cert to PKI - T296064
[08:54:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:33] T296064: Move Kafka Jumbo's TLS clients to the new bundle - https://phabricator.wikimedia.org/T296064
[08:54:53] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 90, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:55:25] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:58:45] (PS3) Slyngshede: Django 3.2 support [software/debmonitor] - https://gerrit.wikimedia.org/r/905158
[09:06:14] (PS4) Slyngshede: Django 3.2 support [software/debmonitor] - https://gerrit.wikimedia.org/r/905158
[09:18:03] (PS1) Jbond: partman: for virtual disks use the whole disk for the root partition [puppet] - https://gerrit.wikimedia.org/r/905160
[09:19:25] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 0:15:00 on kafka-jumbo1006.eqiad.wmnet with reason: restart kafka, switch to PKI
[09:19:39] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on kafka-jumbo1006.eqiad.wmnet with reason: restart kafka, switch to PKI
[09:19:44] !log move kafka-jumbo1006's kafka broker cert to PKI - T296064
[09:19:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:48] T296064: Move Kafka Jumbo's TLS clients to the new bundle - https://phabricator.wikimedia.org/T296064
[09:20:16] (PS2) Jbond: partman: for virtual disks use the whole disk for the root partition [puppet] - https://gerrit.wikimedia.org/r/905160
[09:21:12] (CR) Slyngshede: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/905160 (owner: Jbond)
[09:26:24] (CR) Slyngshede: partman: for virtual disks use the whole disk for the root partition [puppet] - https://gerrit.wikimedia.org/r/905160 (owner: Jbond)
[09:26:41] (CR) Slyngshede: [C: -1] partman: for virtual disks use the whole disk for the root partition (1 comment) [puppet] - https://gerrit.wikimedia.org/r/905160 (owner: Jbond)
[09:27:48] (CR) Slyngshede: [C: -1] partman: for virtual disks use the whole disk for the root partition (1 comment) [puppet] - https://gerrit.wikimedia.org/r/905160 (owner: Jbond)
[09:28:11] (CR) Jbond: partman: for virtual disks use the whole disk for the root partition (1 comment) [puppet] - https://gerrit.wikimedia.org/r/905160 (owner: Jbond)
[09:28:22] (PS3) Jbond: partman: for virtual disks use the whole disk for the root partition [puppet] - https://gerrit.wikimedia.org/r/905160
[09:31:52] (CR) Ayounsi: Bird: remove anycast subnet filter (2 comments) [puppet] - https://gerrit.wikimedia.org/r/904754 (https://phabricator.wikimedia.org/T324992) (owner: Ayounsi)
[09:33:30] (CR) Slyngshede: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/905160 (owner: Jbond)
[09:34:23] (PS5) Slyngshede: Django 3.2 support [software/debmonitor] - https://gerrit.wikimedia.org/r/905158
[09:36:41] (CR) CI reject: [V: -1] Django 3.2 support [software/debmonitor] - https://gerrit.wikimedia.org/r/905158 (owner: Slyngshede)
[09:38:48] (CR) Ayounsi: [C: +2] Kubestage: don't set next-hop self on exported prefixes [homer/public] - https://gerrit.wikimedia.org/r/904544 (https://phabricator.wikimedia.org/T328523) (owner: Ayounsi)
[09:39:28] (Merged) jenkins-bot: Kubestage: don't set next-hop self on exported prefixes [homer/public] - https://gerrit.wikimedia.org/r/904544 (https://phabricator.wikimedia.org/T328523) (owner: Ayounsi)
[09:40:13] (CR) Jbond: [C: +2] partman: for virtual disks use the whole disk for the root partition [puppet] - https://gerrit.wikimedia.org/r/905160 (owner: Jbond)
[09:46:28] (PS6) Slyngshede: Django 3.2 support [software/debmonitor] - https://gerrit.wikimedia.org/r/905158
[09:50:21] (PS7) Slyngshede: Django 3.2 support [software/debmonitor] - https://gerrit.wikimedia.org/r/905158
[09:55:23] (CR) Slyngshede: Django 3.2 support (4 comments) [software/debmonitor] - https://gerrit.wikimedia.org/r/905158 (owner: Slyngshede)
[09:57:19] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: / (spec from root) is CRITICAL: Test spec from root returned the unexpected status 503 (expecting: 200): /_info (retrieve service info) is CRITICAL: Test retrieve service info returned the unexpected status 503 (expecting: 200): /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 503 (expecting: 200) https://wikitech.wikim
[09:57:19] /wiki/Citoid
[09:58:37] hmm expected?
[09:59:13] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[10:00:04] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230403T1000)
[10:08:24] (PS1) Ayounsi: [k8s mlstage/aux] Add policy to export prefixes to notes [homer/public] - https://gerrit.wikimedia.org/r/905170 (https://phabricator.wikimedia.org/T328523)
[10:08:42] A few Maximum call stack size exceeded errors in citoid
[10:09:15] (PS2) Ayounsi: [k8s mlstage/aux] Add policy to export prefixes to nodes [homer/public] - https://gerrit.wikimedia.org/r/905170 (https://phabricator.wikimedia.org/T328523)
[10:09:38] zotero looking not very healthy at the same time
[10:12:15] (PS1) Ayounsi: [k8s ml/dse/wiki] Add policy to export prefixes to nodes [homer/public] - https://gerrit.wikimedia.org/r/905171 (https://phabricator.wikimedia.org/T328523)
[10:12:54] (CR) Ayounsi: [C: +2] Remove redundant or outdated prefixes from aggregate_networks -> labs [puppet] - https://gerrit.wikimedia.org/r/904529 (https://phabricator.wikimedia.org/T329669) (owner: Ayounsi)
[10:14:24] "Zotero, despite being a nodejs service, isn't owned by WMF and is not service-runner compliant unfortunately. Which means that it does not emit any metrics we currently capture." That'll help
[10:14:52] (PS1) Vgutierrez: hiera: Enable esitest on text@eqsin [puppet] - https://gerrit.wikimedia.org/r/905173 (https://phabricator.wikimedia.org/T308799)
[10:15:28] (PS1) Volans: netbox: set http proxies for netbox-next [puppet] - https://gerrit.wikimedia.org/r/905174
[10:16:45] (CR) Ayounsi: [C: +1] netbox: set http proxies for netbox-next [puppet] - https://gerrit.wikimedia.org/r/905174 (owner: Volans)
[10:16:57] (CR) Volans: [C: +2] netbox: set http proxies for netbox-next [puppet] - https://gerrit.wikimedia.org/r/905174 (owner: Volans)
[10:17:16] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40496/console" [puppet] - https://gerrit.wikimedia.org/r/905173 (https://phabricator.wikimedia.org/T308799) (owner: Vgutierrez)
[10:17:33] (PS1) Jbond: logrotate: add coumentations and fix up spec tests [puppet] - https://gerrit.wikimedia.org/r/905175
[10:19:58] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be1072.eqiad.wmnet']
[10:19:58] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be1072.eqiad.wmnet']
[10:20:08] jbond: typo "coumentations" --^
[10:20:15] (saw it passing by)
[10:21:19] PROBLEM - Check systemd state on ms-be2069 is CRITICAL: CRITICAL - degraded: The following units failed: swift_rclone_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:23:44] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host ms-be1072.mgmt.eqiad.wmnet with reboot policy FORCED
[10:23:49] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host ms-be1073.mgmt.eqiad.wmnet with reboot policy FORCED
[10:26:17] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1073.mgmt.eqiad.wmnet with reboot policy FORCED
[10:26:24] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1072.mgmt.eqiad.wmnet with reboot policy FORCED
[10:28:11] (PS1) David Caro: toolforge: fix included profile class path [puppet] - https://gerrit.wikimedia.org/r/905177
[10:30:16] (CR) David Caro: [C: +2] toolforge: fix included profile class path [puppet] - https://gerrit.wikimedia.org/r/905177 (owner: David Caro)
[10:30:44] (PS1) Clément Goubert: noc: Fix alertmanager severity [puppet] - https://gerrit.wikimedia.org/r/905178 (https://phabricator.wikimedia.org/T331901)
[10:31:45] akosiaris: Can I get a +1 for this ? Former patch broke puppet on mwmaint hosts
[10:32:22] (CR) Alexandros Kosiaris: [C: +1] noc: Fix alertmanager severity [puppet] - https://gerrit.wikimedia.org/r/905178 (https://phabricator.wikimedia.org/T331901) (owner: Clément Goubert)
[10:32:28] tyvm
[10:32:38] (CR) Vgutierrez: [V: +1 C: +2] hiera: Enable esitest on text@eqsin [puppet] - https://gerrit.wikimedia.org/r/905173 (https://phabricator.wikimedia.org/T308799) (owner: Vgutierrez)
[10:32:42] (CR) Clément Goubert: [C: +2] noc: Fix alertmanager severity [puppet] - https://gerrit.wikimedia.org/r/905178 (https://phabricator.wikimedia.org/T331901) (owner: Clément Goubert)
[10:34:18] (PS4) Hnowlan: rest-gateway: add helmfile, enable mobileapps [deployment-charts] - https://gerrit.wikimedia.org/r/895327 (https://phabricator.wikimedia.org/T329074)
[10:35:01] !log Extend the ESI test to text@eqsin, revert https://gerrit.wikimedia.org/r/c/operations/puppet/+/905173/ if this gives any issue - T308799
[10:35:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:06] T308799: Test ESI feasibility with current Varnish installation - https://phabricator.wikimedia.org/T308799
[10:35:09] (CR) Hnowlan: rest-gateway: add helmfile, enable mobileapps (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/895327 (https://phabricator.wikimedia.org/T329074) (owner: Hnowlan)
[10:37:39] (PS1) Jbond: logrotate: add logrotate profile [puppet] - https://gerrit.wikimedia.org/r/905181
[10:39:04] (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40498/console" [puppet] - https://gerrit.wikimedia.org/r/905181 (owner: Jbond)
[10:44:18] (PS8) Hnowlan: api-gateway: add REST gateway Lua CSP handler [deployment-charts] - https://gerrit.wikimedia.org/r/890887 (https://phabricator.wikimedia.org/T326321)
[10:44:29] (CR) Hnowlan: api-gateway: add REST gateway Lua CSP handler (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/890887 (https://phabricator.wikimedia.org/T326321) (owner: Hnowlan)
[10:44:46] (CR) Jelto: [C: +1] "lgtm, I agree monitoring the http service makes more sense than monitoring the process" [puppet] - https://gerrit.wikimedia.org/r/904856 (https://phabricator.wikimedia.org/T331901) (owner: Dzahn)
[10:46:01] (PS2) Jbond: logrotate: add coumentations and fix up spec tests [puppet] - https://gerrit.wikimedia.org/r/905175
[10:46:03] (PS2) Jbond: logrotate: add logrotate profile [puppet] - https://gerrit.wikimedia.org/r/905181
[10:46:05] (PS1) Jbond: O:aphlict: update to use profile::logrotate to configure hourly [puppet] - https://gerrit.wikimedia.org/r/905182
[10:47:21] (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40499/console" [puppet] - https://gerrit.wikimedia.org/r/905182 (owner: Jbond)
[10:47:31] (PS3) Jbond: logrotate: add logrotate profile [puppet] - https://gerrit.wikimedia.org/r/905181
[10:51:01] (CR) Jbond: O:aphlict: update to use profile::logrotate to configure hourly [puppet] - https://gerrit.wikimedia.org/r/905182 (owner: Jbond)
[10:51:06] (CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 232k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[10:52:27] (CR) Jbond: [C: -1] "I ended up needed to set a different server to use logrotate hourly so created a more standard way of setting this and have created a Cr t" [puppet] - https://gerrit.wikimedia.org/r/904498 (https://phabricator.wikimedia.org/T332869) (owner: EoghanGaffney)
[10:54:58] (PS1) Volans: reports: exclude recycled devices from accounting [software/netbox-extras] - https://gerrit.wikimedia.org/r/905183 (https://phabricator.wikimedia.org/T320955)
[10:55:53] (CR) Jelto: [V: +1 C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/902132 (https://phabricator.wikimedia.org/T332804) (owner: Dduvall)
[10:56:06] (CirrusSearchJobQueueBacklogTooBig) resolved: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 205.7k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[10:56:22] (CR) Volans: "Tested on netbox-next" [software/netbox-extras] - https://gerrit.wikimedia.org/r/905183 (https://phabricator.wikimedia.org/T320955) (owner: Volans)
[10:58:57] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host ms-be1072.mgmt.eqiad.wmnet with reboot policy FORCED
[11:01:16] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1072.mgmt.eqiad.wmnet with reboot policy FORCED
[11:04:37] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host ms-be1072.mgmt.eqiad.wmnet with reboot policy FORCED
[11:06:06] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-be1072.mgmt.eqiad.wmnet with reboot policy FORCED
[11:14:51] (Abandoned) Slyngshede: Partman: Add /var pratition to urldownloader hosts. [puppet] - https://gerrit.wikimedia.org/r/905040 (https://phabricator.wikimedia.org/T333676) (owner: Slyngshede)
[11:19:27] (PS1) Sergio Gimeno: GrowthExperiments: add link backend amends [mediawiki-config] - https://gerrit.wikimedia.org/r/905193 (https://phabricator.wikimedia.org/T308133)
[11:19:38] (Abandoned) Jbond: 6.4.0-RC2: test to see if issue is still present [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/661721 (https://phabricator.wikimedia.org/T273867) (owner: Jbond)
[11:23:20] (PS16) Jbond: sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services [cookbooks] - https://gerrit.wikimedia.org/r/849130
[11:23:28] (PS25) Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - https://gerrit.wikimedia.org/r/849093
[11:23:33] (PS14) Jbond: sre.netbox.restart-reboot: create a reboot book for netbox [cookbooks] - https://gerrit.wikimedia.org/r/849135
[11:29:44] (CR) Jelto: "this change affects scap configs fleet-wide. Someone with more scap knowledge should review this." [puppet] - https://gerrit.wikimedia.org/r/904502 (https://phabricator.wikimedia.org/T330756) (owner: Jaime Nuche)
[11:29:45] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: apply
[11:29:52] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/thumbor: apply
[11:31:13] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply
[11:31:16] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[11:31:22] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[11:31:29] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[11:35:47] (CR) David Caro: osd: Add osd on new ceph cluster (3 comments) [puppet] - https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: Nicolas Fraison)
[11:39:04] (CR) Jaime Nuche: scap: block Scap execution on inactive deployment hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/904502 (https://phabricator.wikimedia.org/T330756) (owner: Jaime Nuche)
[11:51:09] (CR) Jaime Nuche: docker-gc: remove image from repository (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/899611 (owner: Jaime Nuche)
[11:56:13] (PS17) Jbond: sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services [cookbooks] - https://gerrit.wikimedia.org/r/849130
[11:56:15] (PS26) Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - https://gerrit.wikimedia.org/r/849093
[11:56:17] (PS15) Jbond: sre.netbox.restart-reboot: create a reboot book for netbox [cookbooks] - https://gerrit.wikimedia.org/r/849135
[11:56:19] (PS18) Jbond: sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services [cookbooks] - https://gerrit.wikimedia.org/r/849130
[11:56:47] (PS27) Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - https://gerrit.wikimedia.org/r/849093
[11:56:52] (PS16) Jbond: sre.netbox.restart-reboot: create a reboot book for netbox [cookbooks] - https://gerrit.wikimedia.org/r/849135
[11:58:21] (CR) CI reject: [V: -1] sre.netbox.restart-reboot: create a reboot book for netbox [cookbooks] - https://gerrit.wikimedia.org/r/849135 (owner: Jbond)
[11:58:29] (CR) CI reject: [V: -1] sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services [cookbooks] - https://gerrit.wikimedia.org/r/849130 (owner: Jbond)
[11:58:46] (CR) CI reject: [V: -1] sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - https://gerrit.wikimedia.org/r/849093 (owner: Jbond)
[11:58:55] (CR) jenkins-bot: sre.netbox.restart-reboot: create a reboot book for netbox [cookbooks] - https://gerrit.wikimedia.org/r/849135 (owner: Jbond)
[12:02:25] !log jbond@cumin1001 conftool action : set/pooled=false; selector: name=codfw,dnsdisc=netbox
[12:02:47] !log testing netbox failover cookbook
[12:02:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:05:06] (ConfdResourceFailed) firing: (2) confd resource _var_lib_gdnsd_discovery-netbox.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[12:05:40] ^^ this likley relates to my netbox testing
[12:06:23] * jbond confirmed
[12:10:05] (ConfdResourceFailed) resolved: (2) confd resource _var_lib_gdnsd_discovery-netbox.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[12:10:55] (PS1) Jbond: sre/confd: Increase the time before the alert triggeres [alerts] - https://gerrit.wikimedia.org/r/905212
[12:11:01] !log jbond@cumin2002 conftool action : set/pooled=false; selector: name=codfw,dnsdisc=netbox
[12:14:43] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:16:01] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:16:22] XioNoX: topranks: is this expected?
[12:17:15] jbond: dunno, let's check the calendar
[12:17:45] (PS17) KartikMistry: WIP: Add new self hosted machinetranslation service (MinT) [deployment-charts] - https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505)
[12:17:48] looks like not
[12:21:18] nothing in the emails neither, I'll email singtel if it doesn't recover shortly
[12:21:25] ack thanks
[12:25:27] (CR) Ayounsi: [C: +2] Add role_contacts to buster hosts [puppet] - https://gerrit.wikimedia.org/r/903686 (owner: Ayounsi)
[12:34:46] (CR) DCausse: [C: +2] rdf-streaming-updater: bump image version to flink-1.16-rc3 [deployment-charts] - https://gerrit.wikimedia.org/r/904813 (https://phabricator.wikimedia.org/T328675) (owner: DCausse)
[12:39:49] (Merged) jenkins-bot: rdf-streaming-updater: bump image version to flink-1.16-rc3 [deployment-charts] - https://gerrit.wikimedia.org/r/904813 (https://phabricator.wikimedia.org/T328675) (owner: DCausse)
[12:41:04] (PS1) Filippo Giunchedi: sre: move confd alerts to 'ops' instance [alerts] - https://gerrit.wikimedia.org/r/905215 (https://phabricator.wikimedia.org/T309182)
[12:41:06] (PS1) Filippo Giunchedi: sre: move k8s alerts to specific Prometheus instances [alerts] - https://gerrit.wikimedia.org/r/905216 (https://phabricator.wikimedia.org/T309182)
[12:41:08] (PS1) Filippo Giunchedi: sre: move hardware alerts to 'ops' instance [alerts] - https://gerrit.wikimedia.org/r/905217 (https://phabricator.wikimedia.org/T309182)
[12:41:10] (PS1) Filippo Giunchedi: sre: move keyholder alerts to 'ops' instance [alerts] - https://gerrit.wikimedia.org/r/905218 (https://phabricator.wikimedia.org/T309182)
[12:41:12] (PS1) Filippo Giunchedi: sre: move alerting puppet agent failure to 'ops' instance [alerts] - https://gerrit.wikimedia.org/r/905219 (https://phabricator.wikimedia.org/T309182)
[12:41:14] (PS1) Filippo Giunchedi: sre: move etcd alerts to 'ops' instance [alerts] - https://gerrit.wikimedia.org/r/905220 (https://phabricator.wikimedia.org/T309182)
[12:41:16] (PS1) Filippo Giunchedi: sre: move druid/webrequest alerts to 'analytics' instance [alerts] - https://gerrit.wikimedia.org/r/905221 (https://phabricator.wikimedia.org/T309182)
[12:41:18] (PS1) Filippo Giunchedi: warn on deploy-tag missing [alerts] - https://gerrit.wikimedia.org/r/905222 (https://phabricator.wikimedia.org/T309182)
[12:43:29] (CR) CI reject: [V: -1] warn on deploy-tag missing [alerts] - https://gerrit.wikimedia.org/r/905222 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi)
[12:45:25] (PS2) Filippo Giunchedi: warn on deploy-tag missing [alerts] - https://gerrit.wikimedia.org/r/905222 (https://phabricator.wikimedia.org/T309182)
[12:46:54] (CR) Filippo Giunchedi: [C: +2] warn on deploy-tag missing [alerts] - https://gerrit.wikimedia.org/r/905222 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi)
[12:46:57] (PS3) Filippo Giunchedi: warn on deploy-tag missing [alerts] - https://gerrit.wikimedia.org/r/905222 (https://phabricator.wikimedia.org/T309182)
[12:47:16] (CR) Filippo Giunchedi: [V: +2] warn on deploy-tag missing [alerts] - https://gerrit.wikimedia.org/r/905222 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi)
[12:48:00] (CR) Filippo Giunchedi: [C: +2] sre: move confd alerts to 'ops' instance [alerts] - https://gerrit.wikimedia.org/r/905215 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi)
[12:48:18] (CR) Filippo Giunchedi: [C: +2] sre: move keyholder alerts to 'ops' instance [alerts]
- 10https://gerrit.wikimedia.org/r/905218 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [12:49:07] (03Merged) 10jenkins-bot: sre: move confd alerts to 'ops' instance [alerts] - 10https://gerrit.wikimedia.org/r/905215 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [12:49:53] (03PS2) 10Ayounsi: BGP: remove local-as 14907 loops 2 for anycast peers [homer/public] - 10https://gerrit.wikimedia.org/r/827950 [12:50:13] (03CR) 10Andrew Bogott: [C: 03+1] "lgtm. I take the output is the same format for both commands?" [puppet] - 10https://gerrit.wikimedia.org/r/904747 (https://phabricator.wikimedia.org/T306354) (owner: 10David Caro) [12:51:11] RECOVERY - BFD status on cr2-eqsin is OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:51:42] (03CR) 10Ayounsi: "Reviving this as we're not going with "dynamic neighbor"." [homer/public] - 10https://gerrit.wikimedia.org/r/827950 (owner: 10Ayounsi) [12:52:33] RECOVERY - BFD status on cr4-ulsfo is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:54:12] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [12:54:47] (03CR) 10Andrew Bogott: [C: 03+1] ceph: Allow setting a crush location hook for the rack [puppet] - 10https://gerrit.wikimedia.org/r/904787 (https://phabricator.wikimedia.org/T297083) (owner: 10David Caro) [12:55:44] !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [12:57:23] (03CR) 10Andrew Bogott: "I'm slightly confused by the log message. You ran this file through a yaml formatter?" 
[puppet] - 10https://gerrit.wikimedia.org/r/905159 (owner: 10David Caro) [12:57:51] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: move etcd alerts to 'ops' instance [alerts] - 10https://gerrit.wikimedia.org/r/905220 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [12:58:12] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: move hardware alerts to 'ops' instance [alerts] - 10https://gerrit.wikimedia.org/r/905217 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [12:58:42] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: move druid/webrequest alerts to 'analytics' instance [alerts] - 10https://gerrit.wikimedia.org/r/905221 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [12:58:51] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: move alerting puppet agent failure to 'ops' instance [alerts] - 10https://gerrit.wikimedia.org/r/905219 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [12:59:54] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: move k8s alerts to specific Prometheus instances [alerts] - 10https://gerrit.wikimedia.org/r/905216 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Your horoscope predicts another unfortunate UTC afternoon backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230403T1300). [13:00:05] MatmaRex and sergi0: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:15] hello [13:00:16] hi [13:01:14] (03Merged) 10jenkins-bot: sre: move k8s alerts to specific Prometheus instances [alerts] - 10https://gerrit.wikimedia.org/r/905216 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [13:01:48] (03Merged) 10jenkins-bot: sre: move hardware alerts to 'ops' instance [alerts] - 10https://gerrit.wikimedia.org/r/905217 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [13:01:51] (03Merged) 10jenkins-bot: sre: move keyholder alerts to 'ops' instance [alerts] - 10https://gerrit.wikimedia.org/r/905218 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [13:02:13] PROBLEM - Check systemd state on mwmaint2002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_wikidata-updateQueryServiceLag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:02:24] (03Merged) 10jenkins-bot: sre: move alerting puppet agent failure to 'ops' instance [alerts] - 10https://gerrit.wikimedia.org/r/905219 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [13:02:45] (JobUnavailable) firing: Reduced availability for job k8s-pods-tls in k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:02:56] (03Merged) 10jenkins-bot: sre: move etcd alerts to 'ops' instance [alerts] - 10https://gerrit.wikimedia.org/r/905220 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [13:03:01] (03Merged) 10jenkins-bot: sre: move druid/webrequest alerts to 'analytics' instance [alerts] - 10https://gerrit.wikimedia.org/r/905221 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [13:03:49] (03CR) 10Ayounsi: [C: 03+1] "One comment, overall lgtm" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905183 
(https://phabricator.wikimedia.org/T320955) (owner: 10Volans) [13:04:01] RECOVERY - Check systemd state on mwmaint2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:06:28] (03PS18) 10KartikMistry: WIP: Add new self hosted machinetranslation service (MinT) [deployment-charts] - 10https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) [13:07:01] anyone deploying? [13:07:09] I can deploy I guess [13:07:13] (03CR) 10Volans: "reply inline" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905183 (https://phabricator.wikimedia.org/T320955) (owner: 10Volans) [13:07:21] I wonder why stashbot didn't ping me [13:07:42] (03PS2) 10Majavah: Enable visual enhancements on pages using __NEWSECTIONLINK__ on huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904631 (https://phabricator.wikimedia.org/T333570) (owner: 10Bartosz Dziewoński) [13:07:58] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904631 (https://phabricator.wikimedia.org/T333570) (owner: 10Bartosz Dziewoński) [13:08:48] (03Merged) 10jenkins-bot: Enable visual enhancements on pages using __NEWSECTIONLINK__ on huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904631 (https://phabricator.wikimedia.org/T333570) (owner: 10Bartosz Dziewoński) [13:09:20] (03CR) 10Andrew Bogott: [C: 03+2] Toolforge: move to new VM-hosted NFS server [puppet] - 10https://gerrit.wikimedia.org/r/904562 (https://phabricator.wikimedia.org/T333477) (owner: 10Andrew Bogott) [13:09:37] (03CR) 10Andrew Bogott: [C: 03+2] nfs traffic_shaping: replace labstore1004 rules with rules for tools-nfs.svc [puppet] - 10https://gerrit.wikimedia.org/r/904627 (https://phabricator.wikimedia.org/T333477) (owner: 10Andrew Bogott) [13:09:44] !log taavi@deploy2002 Started scap: Backport for [[gerrit:904631|Enable visual enhancements on pages 
using __NEWSECTIONLINK__ on huwiki (T333570)]] [13:09:49] T333570: Enable visual enhancements on pages using __NEWSECTIONLINK__ on hu.wiki - https://phabricator.wikimedia.org/T333570 [13:12:20] thanks taavi [13:16:47] PROBLEM - Check systemd state on mwmaint2002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_wikidata-updateQueryServiceLag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:18:18] !log taavi@deploy2002 matmarex and taavi: Backport for [[gerrit:904631|Enable visual enhancements on pages using __NEWSECTIONLINK__ on huwiki (T333570)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [13:18:22] T333570: Enable visual enhancements on pages using __NEWSECTIONLINK__ on hu.wiki - https://phabricator.wikimedia.org/T333570 [13:18:33] RECOVERY - Check systemd state on mwmaint2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:18:34] MatmaRex: please test [13:19:26] taavi:looking good at https://hu.wikipedia.org/wiki/Wikipédia:Kocsmafal_(műszaki) [13:19:38] syncing [13:19:52] (03CR) 10Ayounsi: [C: 03+2] [k8s mlstage/aux] Add policy to export prefixes to nodes [homer/public] - 10https://gerrit.wikimedia.org/r/905170 (https://phabricator.wikimedia.org/T328523) (owner: 10Ayounsi) [13:21:06] (03CR) 10David Caro: cloud.yaml: pass a yaml formatter to it (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/905159 (owner: 10David Caro) [13:23:26] (03CR) 10David Caro: smart_data_dump: adapt for newer ssacli (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904747 (https://phabricator.wikimedia.org/T306354) (owner: 10David Caro) [13:25:51] !log taavi@deploy2002 Finished scap: Backport for [[gerrit:904631|Enable visual enhancements on pages using __NEWSECTIONLINK__ on huwiki (T333570)]] (duration: 16m 06s) [13:25:55] T333570: Enable visual enhancements 
on pages using __NEWSECTIONLINK__ on hu.wiki - https://phabricator.wikimedia.org/T333570 [13:25:57] that was slow [13:26:15] (03PS2) 10Majavah: GrowthExperiments: add link backend amends [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905193 (https://phabricator.wikimedia.org/T308133) (owner: 10Sergio Gimeno) [13:26:40] (03PS19) 10Jbond: sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services [cookbooks] - 10https://gerrit.wikimedia.org/r/849130 [13:26:42] (03PS28) 10Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - 10https://gerrit.wikimedia.org/r/849093 [13:26:44] (03PS17) 10Jbond: sre.netbox.restart-reboot: create a reboot book for netbox [cookbooks] - 10https://gerrit.wikimedia.org/r/849135 [13:27:09] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905193 (https://phabricator.wikimedia.org/T308133) (owner: 10Sergio Gimeno) [13:27:56] (03Merged) 10jenkins-bot: GrowthExperiments: add link backend amends [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905193 (https://phabricator.wikimedia.org/T308133) (owner: 10Sergio Gimeno) [13:28:08] !log taavi@deploy2002 Started scap: Backport for [[gerrit:905193|GrowthExperiments: add link backend amends (T308133)]] [13:28:12] T308133: Deploy "add a link" to 8th round of wikis - https://phabricator.wikimedia.org/T308133 [13:28:53] PROBLEM - Check systemd state on mw2352 is CRITICAL: CRITICAL - degraded: The following units failed: php7.4-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:28:54] (03PS20) 10Jbond: sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services [cookbooks] - 10https://gerrit.wikimedia.org/r/849130 [13:29:04] (03PS29) 10Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - 10https://gerrit.wikimedia.org/r/849093 [13:29:09] (03PS18) 10Jbond: 
sre.netbox.restart-reboot: create a reboot book for netbox [cookbooks] - 10https://gerrit.wikimedia.org/r/849135 [13:29:28] !log taavi@deploy2002 sgimeno and taavi: Backport for [[gerrit:905193|GrowthExperiments: add link backend amends (T308133)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [13:29:48] taavi: we won't be able to test much since the flag is just used in a maintenace script triggered by a periodic job. Should be safe enough since we've been using it for some time already [13:29:59] ok, I'll just sync [13:30:25] oki, ty [13:30:36] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 11062 [13:30:51] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 11062 [13:32:48] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] "No images depend on it, path was changed in I279e4722c8aa1c5d308738eaef7f760ecb19cd35" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/899611 (owner: 10Jaime Nuche) [13:32:52] (03CR) 10Clément Goubert: [V: 03+2 C: 03+2] docker-gc: remove image from repository [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/899611 (owner: 10Jaime Nuche) [13:32:55] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1072'] [13:33:12] (03CR) 10Ssingh: [C: 03+1] BGP: remove local-as 14907 loops 2 for anycast peers [homer/public] - 10https://gerrit.wikimedia.org/r/827950 (owner: 10Ayounsi) [13:34:27] (03PS1) 10Andrew Bogott: profile::wmcs::nfsclient: move toolforge nodes to the new NFS server [puppet] - 10https://gerrit.wikimedia.org/r/905229 (https://phabricator.wikimedia.org/T333477) [13:34:45] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1073'] [13:35:23] !log taavi@deploy2002 Finished scap: Backport for [[gerrit:905193|GrowthExperiments: add 
link backend amends (T308133)]] (duration: 07m 15s) [13:35:32] T308133: Deploy "add a link" to 8th round of wikis - https://phabricator.wikimedia.org/T308133 [13:35:38] (03CR) 10CDanis: [C: 03+1] Varnish: prefix 403 and 429 with a unique ID (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/903284 (https://phabricator.wikimedia.org/T330973) (owner: 10Ayounsi) [13:35:39] sergi0: done [13:35:56] cool, thank you for the assistance [13:37:21] (03PS11) 10AOkoth: eventgate: add EventgateLoggingExternalErrors alert [alerts] - 10https://gerrit.wikimedia.org/r/901623 (https://phabricator.wikimedia.org/T309009) [13:38:03] (03CR) 10Majavah: [C: 03+1] profile::wmcs::nfsclient: move toolforge nodes to the new NFS server [puppet] - 10https://gerrit.wikimedia.org/r/905229 (https://phabricator.wikimedia.org/T333477) (owner: 10Andrew Bogott) [13:38:31] (03CR) 10CI reject: [V: 04-1] eventgate: add EventgateLoggingExternalErrors alert [alerts] - 10https://gerrit.wikimedia.org/r/901623 (https://phabricator.wikimedia.org/T309009) (owner: 10AOkoth) [13:39:38] (03CR) 10Andrew Bogott: [C: 03+2] profile::wmcs::nfsclient: move toolforge nodes to the new NFS server [puppet] - 10https://gerrit.wikimedia.org/r/905229 (https://phabricator.wikimedia.org/T333477) (owner: 10Andrew Bogott) [13:42:53] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1072.eqiad.wmnet'] [13:42:56] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1072.eqiad.wmnet'] [13:43:02] (03CR) 10Majavah: [V: 03+1 C: 03+1] "PCC SUCCESS (NOOP 56): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40501/console" [puppet] - 10https://gerrit.wikimedia.org/r/905229 (https://phabricator.wikimedia.org/T333477) (owner: 10Andrew Bogott) [13:44:05] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be1072.eqiad.wmnet'] [13:44:06] 
!log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be1072.eqiad.wmnet'] [13:44:21] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1072.eqiad.wmnet'] [13:44:34] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1072.eqiad.wmnet'] [13:45:11] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be1072.eqiad.wmnet'] [13:46:33] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['ms-be1072.eqiad.wmnet'] [13:47:02] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1074.eqiad.wmnet'] [13:47:09] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1075.eqiad.wmnet'] [13:47:17] o/ [13:47:27] anything left to deploy or did taavi do all the work while I was away? 
;) [13:47:49] I did it all I think [13:47:55] cool, thanks :) [13:47:58] jouncebot: nowandnext [13:47:58] For the next 0 hour(s) and 12 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230403T1300) [13:47:58] In 1 hour(s) and 42 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230403T1530) [13:49:00] If there's nothing else to deploy I have a test to run on deploy servers [13:49:04] Shouldn't take long [13:49:34] I think you can go ahead [13:49:38] #lastfamouswords [13:49:43] Heh [13:49:46] ;P [13:49:54] At worst it'll make a deploy fail, I think we should be ok :P [13:50:17] ^^ [13:50:26] (03PS12) 10AOkoth: eventgate: add EventgateLoggingExternalErrors alert [alerts] - 10https://gerrit.wikimedia.org/r/901623 (https://phabricator.wikimedia.org/T309009) [13:50:54] !log Testing deploy server dsh group inclusion - T329857 [13:50:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:58] T329857: MediaWiki deploy servers should not be mediawiki installation targets - https://phabricator.wikimedia.org/T329857 [13:51:22] (03CR) 10Clément Goubert: [C: 03+2] Experiment: Remove deploy1002/deploy2002 from mediawiki-installation dsh group [puppet] - 10https://gerrit.wikimedia.org/r/901676 (https://phabricator.wikimedia.org/T329857) (owner: 10Ahmon Dancy) [13:53:16] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10Vgutierrez) [13:56:10] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be1074.eqiad.wmnet'] [14:00:34] !log Finished testing deploy server dsh group inclusion - T329857 [14:01:10] !log beginning failover of alert1001 to alert2001 [14:01:27] Hmm stashboat is down [14:01:31] stashbot [14:03:42] !log aqu@deploy2002 Started deploy 
[airflow-dags/analytics_test@fabc2cf]: Deploy refine webrequest job on analytics_test to fix matching Oozie job [14:03:54] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics_test@fabc2cf]: Deploy refine webrequest job on analytics_test to fix matching Oozie job (duration: 00m 12s) [14:06:06] stashbot/wikibugs being down is https://lists.wikimedia.org/hyperkitty/list/cloud-announce@lists.wikimedia.org/thread/Q4C4QW5X4ATG3ANRO6CMDIWCG42YM6NJ/ [14:09:58] PROBLEM - Check systemd state on mw2352 is CRITICAL: CRITICAL - degraded: The following units failed: php7.4-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:41] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:12:11] RECOVERY - Check systemd state on mw2352 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:31] RECOVERY - BFD status on cr4-ulsfo is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:12:37] er, what's this about, nothing is down [14:12:37] ok [14:13:00] RECOVERY - Check systemd state on mw2352 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:28:15] (JobUnavailable) firing: (2) Reduced availability for job icinga-am in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:30:01] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on an-test-worker1001.eqiad.wmnet with reason: Investigate service failures from bullseye upgrade [14:30:17] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-test-worker1001.eqiad.wmnet with 
reason: Investigate service failures from bullseye upgrade [14:37:30] (JobUnavailable) firing: (2) Reduced availability for job icinga-am in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:47:30] (JobUnavailable) firing: (2) Reduced availability for job icinga-am in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:52:30] (JobUnavailable) firing: (2) Reduced availability for job icinga-am in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:04:59] !log aqu@deploy2002 Started deploy [airflow-dags/analytics_test@fabc2cf]: Deploy refine webrequest job on analytics_test to fix matching Oozie job [15:05:10] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics_test@fabc2cf]: Deploy refine webrequest job on analytics_test to fix matching Oozie job (duration: 00m 11s) [15:07:13] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_codfw [15:07:23] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_codfw [15:12:22] !log rolling restart of bird.service on doh* and not doh2002 [15:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:55] some BGP alerts expected [15:26:57] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_codfw [15:27:18] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of 
HAProxy on A:cp-text_codfw [15:30:04] jan_drewniak: May I have your attention please! Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230403T1530) [15:30:47] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_eqsin [15:30:54] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_eqsin [15:32:25] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905247 (https://phabricator.wikimedia.org/T128546) [15:34:29] (03PS2) 10Raymond Ndibe: maintain-dbusers: ensure get_global_wiki_user is only called when needed [puppet] - 10https://gerrit.wikimedia.org/r/905243 [15:36:25] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905247 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:36:33] !log mfossati@deploy2002 Started deploy [airflow-dags/platform_eng@04b4841]: (no justification provided) [15:36:46] !log mfossati@deploy2002 Finished deploy [airflow-dags/platform_eng@04b4841]: (no justification provided) (duration: 00m 12s) [15:37:10] !log restarted sirenbot (vopsbot) on alert2001 (msg="could not find the topic for this channel stored. Is the bot in the channel?") [15:37:10] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905247 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:37:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:21] cc godog / herron ^^^ (my log) [15:37:25] (03CR) 10BCornwall: [V: 03+1] lists: Disable access on port 80 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904854 (https://phabricator.wikimedia.org/T238720) (owner: 10BCornwall) [15:37:38] volans: ack, thank you! 
[15:37:56] (03CR) 10BCornwall: [V: 03+1] lists: Disable access on port 80 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904854 (https://phabricator.wikimedia.org/T238720) (owner: 10BCornwall) [15:38:02] might need to be added to the failover procedure, probably some detail missing [15:38:05] but I didn't dig [15:42:39] (03PS1) 10Herron: alertmanager: switch data.retention unit to hours [puppet] - 10https://gerrit.wikimedia.org/r/905248 [15:43:02] (03CR) 10CI reject: [V: 04-1] alertmanager: switch data.retention unit to hours [puppet] - 10https://gerrit.wikimedia.org/r/905248 (owner: 10Herron) [15:43:37] (03CR) 10Cwhite: [C: 03+1] "Good catch! Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/904784 (owner: 10Slyngshede) [15:43:39] (03PS2) 10Herron: alertmanager: switch data.retention unit to hours [puppet] - 10https://gerrit.wikimedia.org/r/905248 [15:45:32] (03PS8) 10Ilias Sarantopoulos: ml-services: FastAPI chart using sextant for ores-legacy service [deployment-charts] - 10https://gerrit.wikimedia.org/r/904777 (https://phabricator.wikimedia.org/T330414) [15:45:37] (03CR) 10Herron: [C: 03+2] alertmanager: switch data.retention unit to hours [puppet] - 10https://gerrit.wikimedia.org/r/905248 (owner: 10Herron) [15:46:40] (03CR) 10CI reject: [V: 04-1] ml-services: FastAPI chart using sextant for ores-legacy service [deployment-charts] - 10https://gerrit.wikimedia.org/r/904777 (https://phabricator.wikimedia.org/T330414) (owner: 10Ilias Sarantopoulos) [15:46:58] !log jdrewniak@deploy2002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:905247| Bumping portals to master (T128546)]] (duration: 06m 14s) [15:47:02] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [15:47:48] (03PS1) 10David Caro: maintain_dbusers: use the right webproxy url [puppet] - 10https://gerrit.wikimedia.org/r/905250 [15:48:34] (KubernetesAPILatency) firing: High Kubernetes API 
latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:49:03] (03CR) 10David Caro: [C: 03+2] maintain_dbusers: use the right webproxy url [puppet] - 10https://gerrit.wikimedia.org/r/905250 (owner: 10David Caro) [15:51:19] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:51:27] (03PS1) 10Elukey: Upgrade kafka-main to use PKI TLS certificates for brokers [puppet] - 10https://gerrit.wikimedia.org/r/905251 (https://phabricator.wikimedia.org/T332013) [15:52:31] !log jdrewniak@deploy2002 Synchronized portals: Wikimedia Portals Update: [[gerrit:905247| Bumping portals to master (T128546)]] (duration: 05m 33s) [15:52:35] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [15:52:42] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40503/console" [puppet] - 10https://gerrit.wikimedia.org/r/905251 (https://phabricator.wikimedia.org/T332013) (owner: 10Elukey) [15:53:34] (KubernetesAPILatency) firing: (5) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:56:19] (KubernetesAPILatency) resolved: (5) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:58:40] volans: thanks, was the unit in a failed state out of curiosity? 
I think what happened was two instances of vopsbot running at the same time, alert2001 started the service while alert1001 was still running [15:59:04] the journal looks like it kept running [15:59:23] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_eqsin [15:59:24] herron: no it was not, but I noticed that the bot quit from the old chan [15:59:28] *sre chan [15:59:45] Active: active (running) since Mon 2023-04-03 14:08:44 UTC; 1h 27min ago [15:59:52] that was the state I found it in [16:00:12] in the error message it was mentioning also error="sql: database is closed" [16:00:25] ok, good catch by the way [16:00:42] I'm guessing this was a nick collision manifesting in a weird way? [16:01:15] not sure, I didn't look deeply at the logs [16:01:40] ok, thx again I'll move to a task [16:02:27] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_eqsin [16:02:40] (03PS2) 10Elukey: Upgrade kafka-main to use PKI TLS certificates for brokers [puppet] - 10https://gerrit.wikimedia.org/r/905251 (https://phabricator.wikimedia.org/T319372) [16:02:56] (03PS9) 10Ilias Sarantopoulos: ml-services: FastAPI chart using sextant for ores-legacy service [deployment-charts] - 10https://gerrit.wikimedia.org/r/904777 (https://phabricator.wikimedia.org/T330414) [16:07:39] (03CR) 10Ryan Kemper: "Resolving comments so this disappears from my gerrit UI" [cookbooks] - 10https://gerrit.wikimedia.org/r/769109 (https://phabricator.wikimedia.org/T301955) (owner: 10Ryan Kemper) [16:10:11] (03CR) 10Ryan Kemper: [C: 03+2] "Resolving comments so this disappears from my gerrit UI" [puppet] - 10https://gerrit.wikimedia.org/r/757700 (owner: 10Ryan Kemper) [16:10:24] (03CR) 10Ryan Kemper: [C: 03+2] "Resolving comments so this disappears from my gerrit UI" [puppet] - 10https://gerrit.wikimedia.org/r/758908 (owner: 10Ryan Kemper) [16:12:13] (03PS1) 
David Caro: replica_cnf: update the tools paths [puppet] - https://gerrit.wikimedia.org/r/905252 (https://phabricator.wikimedia.org/T333477) [16:13:30] (CR) David Caro: [C: +2] replica_cnf: update the tools paths [puppet] - https://gerrit.wikimedia.org/r/905252 (https://phabricator.wikimedia.org/T333477) (owner: David Caro) [16:14:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:16:24] (CR) Aaron Schulz: [C: +1] objectcache: Disable cool-off bounce feature [core] (wmf/1.41.0-wmf.1) - https://gerrit.wikimedia.org/r/902376 (https://phabricator.wikimedia.org/T203786) (owner: Krinkle) [16:18:23] (PS3) Andrew Bogott: labstore1004: park in an 'insetup' role until we're ready to decom [puppet] - https://gerrit.wikimedia.org/r/904630 (https://phabricator.wikimedia.org/T333477) [16:19:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:24:36] (CR) Andrew Bogott: "https://puppet-compiler.wmflabs.org/output/904630/40505/labstore1004.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/904630 (https://phabricator.wikimedia.org/T333477) (owner: Andrew Bogott) [17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230403T1700) [17:00:05] ryankemper: I, the Bot under the Fountain, call upon thee, The Deployer, to do Wikidata Query Service weekly deploy deploy.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230403T1700). [17:03:05] (CR) Ladsgroup: [C: -1] lists: Disable access on port 80 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/904854 (https://phabricator.wikimedia.org/T238720) (owner: BCornwall) [17:04:57] (PS1) Ayounsi: Manage drmrs LVS/bird BGP with Homer [homer/public] - https://gerrit.wikimedia.org/r/905257 [17:10:24] (CR) BCornwall: [V: +1] lists: Disable access on port 80 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/904854 (https://phabricator.wikimedia.org/T238720) (owner: BCornwall) [17:12:06] SRE, Traffic, Patch-For-Review: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (BCornwall) @Ladsgroup You expressed opposition to removing :80 from lists. Is that to say that you don't see a way forward and that lists should be remo... [17:15:22] (CR) Ayounsi: "Mostly to start the conversation and not to be merged as it, there are no diffs for eqiad L3 nor L2 switches (as expected).
For drmrs the " [homer/public] - https://gerrit.wikimedia.org/r/905257 (owner: Ayounsi) [17:16:48] SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q3:rack/setup/install ms-be107[2-5] - https://phabricator.wikimedia.org/T326350 (Jclark-ctr) [17:25:40] (PS1) Elukey: admin_ng: add list-nodes ClusterRole and assign it to Prometheus [deployment-charts] - https://gerrit.wikimedia.org/r/905260 [17:27:05] (CR) Elukey: "This is what we see on the prometheus side:" [deployment-charts] - https://gerrit.wikimedia.org/r/905260 (owner: Elukey) [17:47:04] (PS1) Phuedx: mediawiki.edit_attempt: Ignore events from PHP MPC [mediawiki-config] - https://gerrit.wikimedia.org/r/905261 (https://phabricator.wikimedia.org/T309985) [17:52:16] (CR) Slyngshede: [C: +2] get_single_object - get modified timestamp [software/bitu-ldap] - https://gerrit.wikimedia.org/r/900304 (owner: Slyngshede) [17:54:36] (CR) Dzahn: [C: +2] etherpad: remove process monitoring [puppet] - https://gerrit.wikimedia.org/r/904856 (https://phabricator.wikimedia.org/T331901) (owner: Dzahn) [18:00:40] SRE, MediaWiki-extensions-OAuth, Datacenter-Switchover, Performance-Team (Radar): Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650 (MusikAnimal) I have verified that at least for CopyPatrol, the req...
[18:09:05] (PS1) Jdlrobson: [refactor] split out Minerva configuration from main config [mediawiki-config] - https://gerrit.wikimedia.org/r/905264 [18:09:09] (CR) Dzahn: [C: -1] alertmanager: create receiver for both sre-collab and releng combined (3 comments) [puppet] - https://gerrit.wikimedia.org/r/903796 (https://phabricator.wikimedia.org/T329587) (owner: Dzahn) [18:09:29] (PS6) Dzahn: alertmanager: create receiver for both sre-collab and releng combined [puppet] - https://gerrit.wikimedia.org/r/903796 (https://phabricator.wikimedia.org/T329587) [18:11:05] (PS7) Dzahn: alertmanager: create receiver for both sre-collab and releng combined [puppet] - https://gerrit.wikimedia.org/r/903796 (https://phabricator.wikimedia.org/T329587) [18:13:46] (PS3) Dzahn: gerrit: replace Icinga monitoring with Prometheus, ssh port 29418 [puppet] - https://gerrit.wikimedia.org/r/904857 (https://phabricator.wikimedia.org/T331901) [18:14:34] !log Disable Puppet/PyBal on lvs5006 in preparation for reimaging - T321309 [18:14:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:39] T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 [18:15:30] (PS5) Dzahn: gerrit: replace Icinga with Prometheus monitoring [puppet] - https://gerrit.wikimedia.org/r/902799 (https://phabricator.wikimedia.org/T329587) [18:21:18] PROBLEM - PyBal backends health check on lvs5006 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [18:21:27] ^ expected [18:21:35] (PS1) BCornwall: hiera: lvs/interfaces: update lvs5006 iface name [puppet] - https://gerrit.wikimedia.org/r/905265 (https://phabricator.wikimedia.org/T321309) [18:22:34] PROBLEM - pybal on lvs5006 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [18:22:35] (PS1)
Aklapper: phabricator weekly changes email: Include "In Progress" task status [puppet] - https://gerrit.wikimedia.org/r/905286 [18:23:09] ^Expected on lvs5006, it's been disabled for reimaging [18:23:40] PROBLEM - PyBal connections to etcd on lvs5006 is CRITICAL: CRITICAL: 0 connections established with conf2006.codfw.wmnet:4001 (min=16) https://wikitech.wikimedia.org/wiki/PyBal [18:24:25] (PS1) Jdlrobson: make "advanced mode" default on ptwikinews mobile [mediawiki-config] - https://gerrit.wikimedia.org/r/905287 (https://phabricator.wikimedia.org/T290812) [18:24:28] (PS2) BCornwall: hiera: lvs/interfaces: update lvs5006 iface name [puppet] - https://gerrit.wikimedia.org/r/905265 (https://phabricator.wikimedia.org/T321309) [18:25:08] (CR) CI reject: [V: -1] make "advanced mode" default on ptwikinews mobile [mediawiki-config] - https://gerrit.wikimedia.org/r/905287 (https://phabricator.wikimedia.org/T290812) (owner: Jdlrobson) [18:28:15] (CR) Ssingh: [C: +1] hiera: lvs/interfaces: update lvs5006 iface name [puppet] - https://gerrit.wikimedia.org/r/905265 (https://phabricator.wikimedia.org/T321309) (owner: BCornwall) [18:29:42] (CR) BCornwall: [C: +2] hiera: lvs/interfaces: update lvs5006 iface name [puppet] - https://gerrit.wikimedia.org/r/905265 (https://phabricator.wikimedia.org/T321309) (owner: BCornwall) [18:30:59] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host lvs5006.eqsin.wmnet with OS bullseye [18:31:06] SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host lvs5006.eqsin.wmnet with OS bullseye [18:43:28] SRE, SRE-Access-Requests: Add MarcoAurelio to #mediawiki_security - https://phabricator.wikimedia.org/T333870 (sbassett) [18:43:48] (CR) Dzahn: [C: +2] phabricator weekly changes email: Include "In Progress" task status [puppet] -
https://gerrit.wikimedia.org/r/905286 (owner: Aklapper) [18:45:00] (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/896936 (https://phabricator.wikimedia.org/T331718) (owner: TsepoThoabala) [18:50:31] SRE, LDAP-Access-Requests: Grant Access to analytics-privatedata-users for trizek - https://phabricator.wikimedia.org/T333863 (ssingh) Hi @Trizek-WMF: This requires approval from your manager. Thank you! From Analytics, adding @Ottomata @odimitrijevic for approval. [18:52:30] (JobUnavailable) firing: Reduced availability for job k8s-pods-tls in k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:55:58] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs5006.eqsin.wmnet with reason: host reimage [18:56:20] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/904915 [18:56:26] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/904915 (owner: TrainBranchBot) [18:56:27] (PS9) Ryan Kemper: wdqs: make sli uptime use pre-existing metric [grafana-grizzly] - https://gerrit.wikimedia.org/r/900430 (https://phabricator.wikimedia.org/T328306) [18:58:23] (Abandoned) Ryan Kemper: Add new sli_panel_title field [grafana-grizzly] - https://gerrit.wikimedia.org/r/903731 (owner: Ryan Kemper) [18:59:13] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs5006.eqsin.wmnet with reason: host reimage [18:59:31] (CR) Andrew Bogott: [C: +2] labstore1004: park in an 'insetup' role until we're ready to decom [puppet] - https://gerrit.wikimedia.org/r/904630 (https://phabricator.wikimedia.org/T333477) (owner: Andrew Bogott)
[19:03:29] (Abandoned) Ryan Kemper: elastic: use logger not print [cookbooks] - https://gerrit.wikimedia.org/r/823706 (owner: Ryan Kemper) [19:06:03] SRE, MediaWiki-extensions-TranslationNotifications: Requesting latest logs for ::maintenance::translationnotifications periodic job - https://phabricator.wikimedia.org/T333851 (Dzahn) a:Dzahn [19:06:08] (PS1) Dzahn: trafficserver/wdqs: switch query-preview.wikidata.org to new backend [puppet] - https://gerrit.wikimedia.org/r/905292 (https://phabricator.wikimedia.org/T331896) [19:06:34] SRE, MediaWiki-extensions-TranslationNotifications: Requesting latest logs for ::maintenance::translationnotifications periodic job - https://phabricator.wikimedia.org/T333851 (Dzahn) Open→In progress p:Triage→High [19:09:20] !log manually upgrade vopsbot on alert2001 to version 0.3.3 [19:09:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:36] (PS2) Dzahn: trafficserver/wdqs: switch query-preview.wikidata.org to new backend [puppet] - https://gerrit.wikimedia.org/r/905292 (https://phabricator.wikimedia.org/T331896) [19:13:09] ACKNOWLEDGEMENT - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is inactive Andrew Bogott these servers are on their way out https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:13:12] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/904915 (owner: TrainBranchBot) [19:13:22] SRE, MediaWiki-extensions-TranslationNotifications: Requesting latest logs for ::maintenance::translationnotifications periodic job - https://phabricator.wikimedia.org/T333851 (Dzahn) ` ● mediawiki_job_translationnotifications-metawiki.service - MediaWiki periodic job translationnotifications-metawiki... [19:13:52] (PS1) BCornwall: fixup!
hiera: lvs/interfaces: update lvs5006 iface name [puppet] - https://gerrit.wikimedia.org/r/905294 [19:14:03] (CR) CI reject: [V: -1] fixup! hiera: lvs/interfaces: update lvs5006 iface name [puppet] - https://gerrit.wikimedia.org/r/905294 (owner: BCornwall) [19:14:27] (PS2) BCornwall: fixup! hiera: lvs/interfaces: update lvs5006 iface name [puppet] - https://gerrit.wikimedia.org/r/905294 (https://phabricator.wikimedia.org/T321309) [19:14:38] (CR) CI reject: [V: -1] fixup! hiera: lvs/interfaces: update lvs5006 iface name [puppet] - https://gerrit.wikimedia.org/r/905294 (https://phabricator.wikimedia.org/T321309) (owner: BCornwall) [19:14:56] (PS3) BCornwall: fixup! hiera: lvs/interfaces: update lvs5006 iface name [puppet] - https://gerrit.wikimedia.org/r/905294 (https://phabricator.wikimedia.org/T321309) [19:16:53] (CR) Ssingh: [C: +1] fixup! hiera: lvs/interfaces: update lvs5006 iface name [puppet] - https://gerrit.wikimedia.org/r/905294 (https://phabricator.wikimedia.org/T321309) (owner: BCornwall) [19:17:07] (CR) BCornwall: [C: +2] fixup!
hiera: lvs/interfaces: update lvs5006 iface name [puppet] - https://gerrit.wikimedia.org/r/905294 (https://phabricator.wikimedia.org/T321309) (owner: BCornwall) [19:18:41] (PS1) Ottomata: admin_ng/flink-operator - fix prometheus reporting configuration [deployment-charts] - https://gerrit.wikimedia.org/r/905295 (https://phabricator.wikimedia.org/T333464) [19:23:01] ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (wiki_willy) a:Jclark-ctr [19:26:07] (CR) Ottomata: [C: +2] admin_ng/flink-operator - fix prometheus reporting configuration [deployment-charts] - https://gerrit.wikimedia.org/r/905295 (https://phabricator.wikimedia.org/T333464) (owner: Ottomata) [19:27:02] SRE, LDAP-Access-Requests: Grant Access to analytics-privatedata-users for trizek - https://phabricator.wikimedia.org/T333863 (ssingh) [19:30:54] (Merged) jenkins-bot: admin_ng/flink-operator - fix prometheus reporting configuration [deployment-charts] - https://gerrit.wikimedia.org/r/905295 (https://phabricator.wikimedia.org/T333464) (owner: Ottomata) [19:35:16] (PS1) Ahmon Dancy: beta: Enable /srv/mediawiki symlink on deployment-deploy03 [puppet] - https://gerrit.wikimedia.org/r/905297 (https://phabricator.wikimedia.org/T329857) [19:35:26] (Abandoned) Ryan Kemper: elasticsearch::curator: Switch to apt::package_from_component [puppet] - https://gerrit.wikimedia.org/r/565617 (owner: Muehlenhoff) [19:35:59] !log otto@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [19:36:07] !log otto@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[19:36:09] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:38:45] !log brett@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host lvs5006.eqsin.wmnet with OS bullseye [19:38:54] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host lvs5006.eqsin.wmnet with OS bullseye completed: - lvs5006 (**FAIL**) - Downtimed on Icinga/Aler... [19:38:57] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host lvs5006.eqsin.wmnet with OS bullseye executed with errors: - lvs5006 (**FAIL**) - Downtimed on... [19:41:56] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host lvs5006.eqsin.wmnet with OS bullseye [19:42:07] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host lvs5006.eqsin.wmnet with OS bullseye [19:47:17] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:48:09] SRE, Traffic, Patch-For-Review: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (Ladsgroup) Well, I don't opposite it, it's just that this port is used internally and it'll break, you might be able to see the traffic flowing if you d...
[19:53:30] (PS2) Jdlrobson: make "advanced mode" default on ptwikinews mobile [mediawiki-config] - https://gerrit.wikimedia.org/r/905287 (https://phabricator.wikimedia.org/T290812) [19:57:07] SRE, SRE-swift-storage, ops-codfw, DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T332983 (Jhancock.wm) @MatthewVernon Paul is on vacation this week but I can help with this. I signed into the idrac and I'm only seeing Foreign Disk status on drive 19 (dev/sdy I think).... [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230403T2000). [20:00:05] Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:15] I can deploy. [20:00:51] kindrobot present! [20:00:55] Jdlrobson: they look like they're all yours. Are they OK to go together? [20:01:17] um I'd suggest doing https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/905287 last after the other 2 have landed [20:01:31] (and been verified) [20:02:21] Sounds good. I'll do 905264 & 904284 together and then 905287 thereafter. [20:02:26] great [20:03:30] (CR) Ryan Kemper: "Patchset 9 should be ready for review."
[grafana-grizzly] - https://gerrit.wikimedia.org/r/900430 (https://phabricator.wikimedia.org/T328306) (owner: Ryan Kemper) [20:03:46] !log start UTC late backport window [20:03:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:19] (CR) TrainBranchBot: [C: +2] "Approved by kindrobot@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/905264 (owner: Jdlrobson) [20:04:21] (CR) TrainBranchBot: [C: +2] "Approved by kindrobot@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/904284 (https://phabricator.wikimedia.org/T332809) (owner: Jdlrobson) [20:04:24] (CR) Ryan Kemper: "I'll be working on a corresponding patchset to recording rules to store a 90d version of `1 - job_backend:trafficserver_backend_requests:" [grafana-grizzly] - https://gerrit.wikimedia.org/r/900430 (https://phabricator.wikimedia.org/T328306) (owner: Ryan Kemper) [20:04:40] SRE, LDAP-Access-Requests: Grant Access to analytics-privatedata-users for trizek - https://phabricator.wikimedia.org/T333863 (Ottomata) HI @nettrom_WMF, I think we need some more info. Is trizek a WMF staff or contractor? Usually, we need an official sponsor contact and a MOU date at which the access s...
[20:04:54] (PS3) Stef Dunlap: Disable Vector js/css sharing on pl.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/904284 (https://phabricator.wikimedia.org/T332809) (owner: Jdlrobson) [20:05:34] (Merged) jenkins-bot: [refactor] split out Minerva configuration from main config [mediawiki-config] - https://gerrit.wikimedia.org/r/905264 (owner: Jdlrobson) [20:05:50] (CR) TrainBranchBot: "Approved by kindrobot@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/904284 (https://phabricator.wikimedia.org/T332809) (owner: Jdlrobson) [20:06:21] SRE, LDAP-Access-Requests: Grant Access to analytics-privatedata-users for trizek - https://phabricator.wikimedia.org/T333863 (Ottomata) Oh, `wmf` group already hm, and I see a bevellin@wikimedia.org email associated. If so, then approved! Note to SRE clinic duty: this is ssh-less posix group members... [20:06:31] (PS4) Stef Dunlap: Disable Vector js/css sharing on pl.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/904284 (https://phabricator.wikimedia.org/T332809) (owner: Jdlrobson) [20:06:35] (CR) TrainBranchBot: "Approved by kindrobot@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/904284 (https://phabricator.wikimedia.org/T332809) (owner: Jdlrobson) [20:07:09] (Merged) jenkins-bot: Disable Vector js/css sharing on pl.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/904284 (https://phabricator.wikimedia.org/T332809) (owner: Jdlrobson) [20:07:17] SRE, MediaWiki-extensions-TranslationNotifications: Requesting latest logs for ::maintenance::translationnotifications periodic job - https://phabricator.wikimedia.org/T333851 (Dzahn) {P46012} [20:07:35] !log kindrobot@deploy2002 Started scap: Backport for [[gerrit:905264|[refactor] split out Minerva configuration from main config]], [[gerrit:904284|Disable Vector js/css sharing on pl.wikipedia (T332809)]] [20:07:35] !log
brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs5006.eqsin.wmnet with reason: host reimage [20:07:39] T332809: Please disable Vector js/css sharing on pl.wikipedia - https://phabricator.wikimedia.org/T332809 [20:08:53] !log kindrobot@deploy2002 kindrobot and jdlrobson: Backport for [[gerrit:905264|[refactor] split out Minerva configuration from main config]], [[gerrit:904284|Disable Vector js/css sharing on pl.wikipedia (T332809)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [20:09:13] Jdlrobson: ready to verify [20:09:28] looking [20:10:55] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs5006.eqsin.wmnet with reason: host reimage [20:11:13] (PS2) Ahmon Dancy: beta: Enable /srv/mediawiki symlink on deployment-deploy03 [puppet] - https://gerrit.wikimedia.org/r/905297 (https://phabricator.wikimedia.org/T329857) [20:11:15] (PS1) Ahmon Dancy: mediawiki::scap: Ensure Exec['fetch_mediawiki'] resource always exists [puppet] - https://gerrit.wikimedia.org/r/905304 (https://phabricator.wikimedia.org/T329857) [20:11:43] (CR) CI reject: [V: -1] mediawiki::scap: Ensure Exec['fetch_mediawiki'] resource always exists [puppet] - https://gerrit.wikimedia.org/r/905304 (https://phabricator.wikimedia.org/T329857) (owner: Ahmon Dancy) [20:12:53] (PS2) Ahmon Dancy: mediawiki::scap: Ensure Exec['fetch_mediawiki'] resource always exists [puppet] - https://gerrit.wikimedia.org/r/905304 (https://phabricator.wikimedia.org/T329857) [20:12:55] (PS3) Ahmon Dancy: beta: Enable /srv/mediawiki symlink on deployment-deploy03 [puppet] - https://gerrit.wikimedia.org/r/905297 (https://phabricator.wikimedia.org/T329857) [20:13:06] kindrobot: LGTM please sync [20:14:10] Syncing [20:19:40] !log kindrobot@deploy2002 Finished scap: Backport for [[gerrit:905264|[refactor] split out Minerva configuration from main config]],
[[gerrit:904284|Disable Vector js/css sharing on pl.wikipedia (T332809)]] (duration: 12m 05s) [20:19:45] T332809: Please disable Vector js/css sharing on pl.wikipedia - https://phabricator.wikimedia.org/T332809 [20:20:04] (PS3) Ahmon Dancy: mediawiki::scap: Ensure Exec['fetch_mediawiki'] resource always exists [puppet] - https://gerrit.wikimedia.org/r/905304 (https://phabricator.wikimedia.org/T329857) [20:20:06] (PS4) Ahmon Dancy: beta: Enable /srv/mediawiki symlink on deployment-deploy03 [puppet] - https://gerrit.wikimedia.org/r/905297 (https://phabricator.wikimedia.org/T329857) [20:20:34] (KubernetesAPILatency) firing: High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:21:57] Ok, moving on to 905287 [20:22:22] thanks kindrobot [20:24:14] (CR) TrainBranchBot: [C: +2] "Approved by kindrobot@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/905287 (https://phabricator.wikimedia.org/T290812) (owner: Jdlrobson) [20:24:20] (PS3) Stef Dunlap: make "advanced mode" default on ptwikinews mobile [mediawiki-config] - https://gerrit.wikimedia.org/r/905287 (https://phabricator.wikimedia.org/T290812) (owner: Jdlrobson) [20:24:31] (CR) TrainBranchBot: "Approved by kindrobot@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/905287 (https://phabricator.wikimedia.org/T290812) (owner: Jdlrobson) [20:25:01] (CR) Ahmon Dancy: "PCC results: https://puppet-compiler.wmflabs.org/output/905304/40508/" [puppet] - https://gerrit.wikimedia.org/r/905304 (https://phabricator.wikimedia.org/T329857) (owner: Ahmon Dancy) [20:25:16] (Merged) jenkins-bot: make "advanced mode" default on ptwikinews mobile [mediawiki-config] - https://gerrit.wikimedia.org/r/905287
(https://phabricator.wikimedia.org/T290812) (owner: Jdlrobson) [20:25:29] !log kindrobot@deploy2002 Started scap: Backport for [[gerrit:905287|make "advanced mode" default on ptwikinews mobile (T290812)]] [20:25:34] T290812: make "advanced mode" default on ptwikinews mobile - https://phabricator.wikimedia.org/T290812 [20:25:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:26:44] !log kindrobot@deploy2002 jdlrobson and kindrobot: Backport for [[gerrit:905287|make "advanced mode" default on ptwikinews mobile (T290812)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [20:27:46] Jdlrobson: ready for verification [20:27:48] (CR) Ahmon Dancy: "PCC results: https://puppet-compiler.wmflabs.org/output/905297/40509/" [puppet] - https://gerrit.wikimedia.org/r/905297 (https://phabricator.wikimedia.org/T329857) (owner: Ahmon Dancy) [20:27:56] kindrobot: thanks .. looking! [20:29:43] (PS3) Ryan Kemper: wdqs: improve reliability of reboots [cookbooks] - https://gerrit.wikimedia.org/r/869334 (https://phabricator.wikimedia.org/T325114) [20:29:50] SRE, LDAP-Access-Requests: Grant Access to analytics-privatedata-users for trizek - https://phabricator.wikimedia.org/T333863 (nettrom_WMF) @Ottomata : as you've discovered, @Trizek-WMF is already assigned to the `wmf` group and is WMF staff (and also subscribed to this task). [20:30:10] (PS4) Ryan Kemper: wdqs: improve reliability of reboots [cookbooks] - https://gerrit.wikimedia.org/r/869334 (https://phabricator.wikimedia.org/T325114) [20:30:52] kindrobot: LGTM! [20:31:23] syncing [20:31:29] (CR) Ryan Kemper: "@volans got back to this patch after [way too long of] a delay.
Patchset 3 should be ready to review." [cookbooks] - https://gerrit.wikimedia.org/r/869334 (https://phabricator.wikimedia.org/T325114) (owner: Ryan Kemper) [20:31:45] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs5006.eqsin.wmnet with OS bullseye [20:31:50] SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host lvs5006.eqsin.wmnet with OS bullseye completed: - lvs5006 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled... [20:32:15] (CR) CI reject: [V: -1] wdqs: improve reliability of reboots [cookbooks] - https://gerrit.wikimedia.org/r/869334 (https://phabricator.wikimedia.org/T325114) (owner: Ryan Kemper) [20:33:12] (PS5) Ryan Kemper: wdqs: improve reliability of reboots [cookbooks] - https://gerrit.wikimedia.org/r/869334 (https://phabricator.wikimedia.org/T325114) [20:33:35] (CR) Ryan Kemper: wdqs: improve reliability of reboots (4 comments) [cookbooks] - https://gerrit.wikimedia.org/r/869334 (https://phabricator.wikimedia.org/T325114) (owner: Ryan Kemper) [20:35:58] SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (BCornwall) [20:36:16] !log kindrobot@deploy2002 Finished scap: Backport for [[gerrit:905287|make "advanced mode" default on ptwikinews mobile (T290812)]] (duration: 10m 47s) [20:36:21] T290812: make "advanced mode" default on ptwikinews mobile - https://phabricator.wikimedia.org/T290812 [20:36:52] Sync complete.
:D [20:37:02] !log close UTC late backport window [20:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:37] SRE, MediaWiki-extensions-TranslationNotifications: Requesting latest logs for ::maintenance::translationnotifications periodic job - https://phabricator.wikimedia.org/T333851 (Dzahn) In progress→Open a:Dzahn→None [20:38:56] SRE, MediaWiki-extensions-TranslationNotifications: Requesting latest logs for ::maintenance::translationnotifications periodic job - https://phabricator.wikimedia.org/T333851 (Dzahn) Open→Resolved p:High→Medium a:Dzahn [20:39:09] Thanks kindrobot ! [20:39:10] SRE, MediaWiki-extensions-TranslationNotifications: Requesting latest logs for ::maintenance::translationnotifications periodic job - https://phabricator.wikimedia.org/T333851 (Dzahn) a:Dzahn→None [20:41:45] SRE, MediaWiki-extensions-TranslationNotifications: Requesting latest logs for ::maintenance::translationnotifications periodic job - https://phabricator.wikimedia.org/T333851 (MarcoAurelio) Thank you. So the script seems to work, but I didn't get any of the digests the logs claim I was sent so either th... [20:48:19] SRE, SRE-Access-Requests: Add MarcoAurelio to #mediawiki_security - https://phabricator.wikimedia.org/T333870 (Dzahn) requested to add them to LDAP nda group for logstash access, same day, different ping, but it fits so well...T333884 [20:49:49] SRE, LDAP-Access-Requests: add MarcoAurelio to LDAP nda group - https://phabricator.wikimedia.org/T333884 (Dzahn) [20:52:36] SRE, SRE-Access-Requests: Add MarcoAurelio to #mediawiki_security - https://phabricator.wikimedia.org/T333870 (sbassett) [20:56:10] SRE, LDAP-Access-Requests: add MarcoAurelio to LDAP nda group - https://phabricator.wikimedia.org/T333884 (Zabe) {L37} is not an old nda, it's the nda which is still used for stewards, checkusers, oversighters, etc.
:) [20:59:11] SRE, LDAP-Access-Requests: add MarcoAurelio to LDAP nda group - https://phabricator.wikimedia.org/T333884 (Dzahn) It would be a good question for legal whether that should still be used for stewards. Because I know that Legalpad (the app behind L links) was once developed with Legal but also years later... [21:00:06] Reedy, sbassett, Maryum, and manfredi: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230403T2100). [21:01:42] (PS3) Dzahn: trafficserver/wdqs: switch query-preview.wikidata.org to new backend [puppet] - https://gerrit.wikimedia.org/r/905292 (https://phabricator.wikimedia.org/T331896) [21:05:14] SRE, LDAP-Access-Requests: add MarcoAurelio to LDAP nda group - https://phabricator.wikimedia.org/T333884 (MarcoAurelio) Hmm //tempus fugit//. Looking back in time I did secure-info in 2009/2010, then {L4} in 2015 or so and now {L37} because L4 wording was updated.
[21:12:03] PROBLEM - IPMI Sensor Status on db2163 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[21:12:20] SRE, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (bking)
[21:12:33] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wcqs1003.eqiad.wmnet,wdqs[1010,1013-1014].eqiad.wmnet with reason: T331882 eqiad row C maint
[21:12:38] T331882: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882
[21:12:52] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wcqs1003.eqiad.wmnet,wdqs[1010,1013-1014].eqiad.wmnet with reason: T331882 eqiad row C maint
[21:15:50] Hey all - mstyles and I are deploying a quick mitigation to PrivateSettings.php
[21:15:52] (CR) Dzahn: "Here is some info about the tests.This is what we check:" [puppet] - https://gerrit.wikimedia.org/r/905292 (https://phabricator.wikimedia.org/T331896) (owner: Dzahn)
[21:16:48] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 10 hosts with reason: T331882 eqiad row C maint
[21:17:04] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 10 hosts with reason: T331882 eqiad row C maint
[21:22:04] !log deployed mitigation for T333140
[21:22:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:25:51] !log bking@cumin ban cloudelastic1003 from all cloudelastic clusters T331882
[21:25:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:25:56] T331882: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882
[21:31:56] (PS1) Dzahn: wdqs/wcqs: switch query.wikidata.org and 
wcqs to bullseye backends [puppet] - https://gerrit.wikimedia.org/r/905317 (https://phabricator.wikimedia.org/T331896)
[21:36:08] (CR) Bking: [C: +1] "I don't completely understand the ATS config, but as long as this can be rolled back if something goes wrong, LGTM." [puppet] - https://gerrit.wikimedia.org/r/905292 (https://phabricator.wikimedia.org/T331896) (owner: Dzahn)
[21:41:10] (CR) Dzahn: "Yes, this can be rolled back easily. And the ATS config part is already used in production by a bunch of other misc sites. thank you" [puppet] - https://gerrit.wikimedia.org/r/905292 (https://phabricator.wikimedia.org/T331896) (owner: Dzahn)
[21:42:19] !log undeployed mitigation for T333140
[21:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:45:04] (PS4) Dzahn: trafficserver/wdqs: switch query-preview.wikidata.org to new backend [puppet] - https://gerrit.wikimedia.org/r/905292 (https://phabricator.wikimedia.org/T331896)
[21:45:41] (CR) Dzahn: [C: +1] "I am removing the edit to internal.yaml so that this change stays limited to query-preview. 
I still count the review:)" [puppet] - https://gerrit.wikimedia.org/r/905292 (https://phabricator.wikimedia.org/T331896) (owner: Dzahn)
[21:46:29] (CR) Dzahn: [C: +2] trafficserver/wdqs: switch query-preview.wikidata.org to new backend [puppet] - https://gerrit.wikimedia.org/r/905292 (https://phabricator.wikimedia.org/T331896) (owner: Dzahn)
[21:47:28] (Abandoned) Ryan Kemper: wdqs: extract timestamp *after* fetch dumps [cookbooks] - https://gerrit.wikimedia.org/r/873782 (https://phabricator.wikimedia.org/T325114) (owner: Ryan Kemper)
[21:52:59] !log T331896 `sudo -E cumin -b 4 'wdqs*' 'sudo run-puppet-agent'`
[21:53:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:53:04] T331896: upgrade miscweb VMs to bullseye - https://phabricator.wikimedia.org/T331896
[21:55:40] (CR) Ryan Kemper: "For posterity:" [puppet] - https://gerrit.wikimedia.org/r/905292 (https://phabricator.wikimedia.org/T331896) (owner: Dzahn)
[21:58:55] (CR) Dzahn: [C: +2] "So.. 
you ran puppet on wdqs* and I ran puppet on cp4* (ATS in San Francisco) and I can already see in logfiles of miscweb2003 how I hit th" [puppet] - https://gerrit.wikimedia.org/r/905292 (https://phabricator.wikimedia.org/T331896) (owner: Dzahn)
[21:59:10] SRE, Traffic: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (KOfori)
[22:25:00] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on db2163:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=db2163 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply
[22:43:16] (PS2) Dzahn: planet: update the feed URLs getting 3xx [puppet] - https://gerrit.wikimedia.org/r/902515
[22:49:48] (CR) Dzahn: [C: +2] planet: update the feed URLs getting 3xx [puppet] - https://gerrit.wikimedia.org/r/902515 (owner: Dzahn)
[22:49:56] (PS3) Dzahn: planet: update the feed URLs getting 3xx [puppet] - https://gerrit.wikimedia.org/r/902515
[22:52:30] (JobUnavailable) firing: Reduced availability for job k8s-pods-tls in k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:52:45] (PS2) Dzahn: wdqs: add monitor for query. and query-preview.wikidata.org [puppet] - https://gerrit.wikimedia.org/r/900743 (https://phabricator.wikimedia.org/T329587)
[22:54:59] (CR) Dzahn: ""preview" site does not need monitoring and will be removed soon. and the other part is already done in https://gerrit.wikimedia.org/r/c/o" [puppet] - https://gerrit.wikimedia.org/r/900743 (https://phabricator.wikimedia.org/T329587) (owner: Dzahn)
[22:55:04] (Abandoned) Dzahn: wdqs: add monitor for query. 
and query-preview.wikidata.org [puppet] - https://gerrit.wikimedia.org/r/900743 (https://phabricator.wikimedia.org/T329587) (owner: Dzahn)
[23:07:23] (CR) Dzahn: gitlab: Disable listening on port 80 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/904843 (https://phabricator.wikimedia.org/T238720) (owner: BCornwall)
[23:13:30] (Storage /var over 50%) firing: Alert for device cloudsw1-b1-codfw.mgmt.codfw.wmnet - Storage /var over 50% - https://alerts.wikimedia.org/?q=alertname%3DStorage+%2Fvar+over+50%25
[23:30:08] (CR) Cwhite: [C: +1] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/904102 (owner: Slyngshede)
[23:46:06] SRE, Data-Persistence, serviceops, Datacenter-Switchover: Warmup script does not warm memcached enough - https://phabricator.wikimedia.org/T331981 (RLazarus) I think https://gerrit.wikimedia.org/r/c/operations/puppet/+/892570 would have smoothed this out, at least in part -- we just didn't get it...
[23:51:35] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:52:21] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:52:47] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:59:56] (PS1) Andrew Bogott: OpenStack: adopt new scoped tokens and policy rules in codfw1dev [puppet] - https://gerrit.wikimedia.org/r/905327 (https://phabricator.wikimedia.org/T330759)