[00:00:11] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 23:00:00 on 8 hosts with reason: T376150 non-prod hosts [00:06:45] (PS2) Eevans: cassandra: configurations merged from upstream 4.1.7 [puppet] - https://gerrit.wikimedia.org/r/1100549 (https://phabricator.wikimedia.org/T380420) [00:07:55] (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1100549 (https://phabricator.wikimedia.org/T380420) (owner: Eevans) [00:14:56] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Kernel error Server cloudvirt1061 may have kernel errors - https://phabricator.wikimedia.org/T380673#10381891 (Jclark-ctr) Followed up with Dell. Can you confirm that I can power down the server again tomorrow to inspect memory @aborrero [00:15:34] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [00:15:52] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [00:15:53] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1085.eqiad.wmnet with OS bullseye [00:16:05] ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q1:rack/setup/install ms-be10{83-91} - https://phabricator.wikimedia.org/T371389#10381892 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host ms-be1085.eqiad.wmnet with OS bullseye complete... 
[00:17:04] ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q1:rack/setup/install ms-be10{83-91} - https://phabricator.wikimedia.org/T371389#10381894 (VRiley-WMF) [00:17:07] (PS3) Eevans: cassandra: configurations merged from upstream 4.1.7 [puppet] - https://gerrit.wikimedia.org/r/1100549 (https://phabricator.wikimedia.org/T380420) [00:19:20] (PS4) Eevans: cassandra: configurations merged from upstream 4.1.7 [puppet] - https://gerrit.wikimedia.org/r/1100549 (https://phabricator.wikimedia.org/T380420) [00:20:56] ops-eqiad, SRE, DC-Ops: Inbound interface errors - https://phabricator.wikimedia.org/T380182#10381899 (Jclark-ctr) Open→Resolved a:Jclark-ctr [00:22:50] (PS1) Aude: Update chart renderer service with locale option support [deployment-charts] - https://gerrit.wikimedia.org/r/1100557 [00:25:52] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Kernel error Server cloudvirt1061 may have kernel errors - https://phabricator.wikimedia.org/T380673#10381908 (Jclark-ctr) a:Jhancock.wm→Jclark-ctr I did notice it looks like memory is missing from inventory report looks like slots... 
[00:28:03] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T381230#10381913 (Jclark-ctr) Open→Resolved a:Jclark-ctr Rebalanced PDU [00:28:20] ops-eqiad, SRE, DC-Ops: Inbound interface errors - msw1-eqiad.mgmt.eqiad.wmnet - https://phabricator.wikimedia.org/T376547#10381920 (Jclark-ctr) Open→Resolved a:Jclark-ctr No active alerts in LibreNMS [00:29:26] ops-eqiad, SRE, DC-Ops, fundraising-tech-ops, and 2 others: Q1:eqiad:frack network upgrade tracking task - https://phabricator.wikimedia.org/T371435#10381923 (Jclark-ctr) Open→Resolved [00:30:47] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es1043.eqiad.wmnet with OS bookworm [00:30:52] ops-eqiad, SRE, Data-Persistence, Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10381928 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host es1043.eqiad.wmnet with OS bookworm ex... [00:35:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T371742)', diff saved to https://phabricator.wikimedia.org/P71569 and previous config saved to /var/cache/conftool/dbconfig/20241205-003524-ladsgroup.json [00:35:28] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [00:35:55] ops-eqiad, SRE, Data-Persistence, Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10381935 (Jclark-ctr) Running into issues with the last two @ABran-WMF es1043 is imaged but will not pass certificate for puppet es1045 will... 
[00:37:05] ops-eqiad, SRE, DC-Ops, Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10381948 (Jclark-ctr) [00:38:12] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1100561 [00:38:12] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1100561 (owner: TrainBranchBot) [00:38:30] ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q1:rack/setup/install ms-be10{83-91} - https://phabricator.wikimedia.org/T371389#10381937 (Jclark-ctr) Open→Resolved a:Jclark-ctr [00:39:50] ops-eqiad, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T381540 (phaultfinder) NEW [00:42:07] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:42:15] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:43:26] ops-eqiad, SRE, DC-Ops, Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10381950 (Jclark-ctr) Open→Resolved a:bking→Jclark-ctr [00:47:43] (CR) Aude: "We should schedule a time sometime tomorrow to update the chart renderer service." 
[deployment-charts] - https://gerrit.wikimedia.org/r/1100557 (owner: Aude) [00:48:37] (PS5) Scott French: gateway-check: fix invalid config handling [puppet] - https://gerrit.wikimedia.org/r/1084247 [00:49:36] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [00:49:39] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [00:50:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P71570 and previous config saved to /var/cache/conftool/dbconfig/20241205-005031-ladsgroup.json [00:57:42] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [00:58:22] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [00:58:47] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1100561 (owner: TrainBranchBot) [01:03:37] !log re-enabling puppet on A:lvs [post-wdqs merge] [01:03:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:05:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P71571 and previous config saved to /var/cache/conftool/dbconfig/20241205-010539-ladsgroup.json [01:08:26] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1100562 [01:08:26] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1100562 (owner: TrainBranchBot) [01:19:40] ops-eqiad, DC-Ops: Inbound interface errors - https://phabricator.wikimedia.org/T381543 (phaultfinder) NEW [01:20:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 
(T371742)', diff saved to https://phabricator.wikimedia.org/P71572 and previous config saved to /var/cache/conftool/dbconfig/20241205-012046-ladsgroup.json [01:20:48] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1209.eqiad.wmnet with reason: Maintenance [01:20:50] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [01:21:02] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1209.eqiad.wmnet with reason: Maintenance [01:21:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1209 (T371742)', diff saved to https://phabricator.wikimedia.org/P71573 and previous config saved to /var/cache/conftool/dbconfig/20241205-012108-ladsgroup.json [01:33:28] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1100562 (owner: TrainBranchBot) [01:54:28] FIRING: [3x] SystemdUnitFailed: kafka-mirror-main-codfw_to_main-eqiad@0.service on kafka-main1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:06:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-logging-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [02:18:25] PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE, AS6939/IPv6: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:52:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T371742)', diff saved to https://phabricator.wikimedia.org/P71574 and previous config saved to /var/cache/conftool/dbconfig/20241205-025230-ladsgroup.json [02:52:39] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [02:56:31] (PS1) C. Scott Ananian: Enable Parsoid Fragment mode on Chart pilot wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1100570 (https://phabricator.wikimedia.org/T381436) [02:57:41] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - https://gerrit.wikimedia.org/r/1100570 (https://phabricator.wikimedia.org/T381436) (owner: C. Scott Ananian) [02:59:09] (CR) Seddon: [C:+1] Enable Parsoid Fragment mode on Chart pilot wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1100570 (https://phabricator.wikimedia.org/T381436) (owner: C. Scott Ananian) [03:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:02:27] (PS1) Kevin Bazira: ml-services: update article-country image in experimental ns [deployment-charts] - https://gerrit.wikimedia.org/r/1100571 (https://phabricator.wikimedia.org/T371897) [03:07:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P71575 and previous config saved to /var/cache/conftool/dbconfig/20241205-030737-ladsgroup.json [03:11:10] (CR) C. 
Scott Ananian: [C:-2] "Blocked by T380758" [mediawiki-config] - https://gerrit.wikimedia.org/r/1093399 (https://phabricator.wikimedia.org/T374661) (owner: C. Scott Ananian) [03:22:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P71576 and previous config saved to /var/cache/conftool/dbconfig/20241205-032245-ladsgroup.json [03:37:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T371742)', diff saved to https://phabricator.wikimedia.org/P71577 and previous config saved to /var/cache/conftool/dbconfig/20241205-033751-ladsgroup.json [03:37:54] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1211.eqiad.wmnet with reason: Maintenance [03:37:55] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [03:37:56] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1211.eqiad.wmnet with reason: Maintenance [03:38:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1211 (T371742)', diff saved to https://phabricator.wikimedia.org/P71578 and previous config saved to /var/cache/conftool/dbconfig/20241205-033803-ladsgroup.json [03:39:44] FIRING: [2x] IPv6AnchorUnreachable: ipv6 ping to codfw RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [03:44:44] RESOLVED: [2x] IPv6AnchorUnreachable: ipv6 ping to codfw RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [04:11:08] (CR) AikoChou: [C:+1] ml-services: 
update article-country image in experimental ns [deployment-charts] - https://gerrit.wikimedia.org/r/1100571 (https://phabricator.wikimedia.org/T371897) (owner: Kevin Bazira) [04:14:51] "Error: 502, Broken pipe at 2024-12-05 04:14:15 GMT" [04:15:06] "cp1108.eqiad.wmnet, ATS/9.2.6" :o [04:15:20] PROBLEM - MariaDB read only pc4 #page on pc2015 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [04:15:20] PROBLEM - MariaDB Event Scheduler pc4 on pc2015 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [04:15:43] PROBLEM - MariaDB Replica SQL: pc4 on pc1016 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:15:49] PROBLEM - MariaDB Replica IO: pc4 on pc1016 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:15:51] Getting reports of connection issues from people. 
[04:15:57] FIRING: [29x] ProbeDown: Service mw-web:4450 has failed probes (http_mw-web_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:16:05] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-web_4450: Servers wikikube-worker2141.codfw.wmnet, wikikube-worker2174.codfw.wmnet, wikikube-worker2086.codfw.wmnet, wikikube-worker2063.codfw.wmnet, wikikube-worker2102.codfw.wmnet, wikikube-worker2017.codfw.wmnet, wikikube-worker2026.codfw.wmnet, mw2338.codfw.wmnet, wikikube-worker2084.codfw.wmnet, wikikube-worker2155.codfw.wmnet, wikikube-worke [04:16:05] dfw.wmnet, kubernetes2014.codfw.wmnet, wikikube-worker2132.codfw.wmnet, kubernetes2050.codfw.wmnet, wikikube-worker2138.codfw.wmnet, wikikube-worker2092.codfw.wmnet, wikikube-worker2007.codfw.wmnet, wikikube-worker2161.codfw.wmnet, wikikube-worker2157.codfw.wmnet, wikikube-worker2030.codfw.wmnet, wikikube-worker2097.codfw.wmnet, wikikube-worker2008.codfw.wmnet, wikikube-worker2023.codfw.wmnet, wikikube-worker2151.codfw.wmnet, wikikube-wor [04:16:05] codfw.wmnet, wikikube-worker2041.codfw.wmnet, wikikube-worker2159.codfw.wmnet, wikikube-worker2055.codfw.wmnet, wikikube-worker2014.codfw.wmnet, kubernetes2039.codfw.wmnet, wikikube-wor https://wikitech.wikimedia.org/wiki/PyBal [04:16:05] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-web_4450: Servers wikikube-worker2021.codfw.wmnet, wikikube-worker2141.codfw.wmnet, wikikube-worker2174.codfw.wmnet, kubernetes2056.codfw.wmnet, wikikube-worker2102.codfw.wmnet, wikikube-worker2081.codfw.wmnet, wikikube-worker2017.codfw.wmnet, mw2375.codfw.wmnet, wikikube-worker2026.codfw.wmnet, kubernetes2024.codfw.wmnet, wikikube-worker2084.codf [04:16:05] wikikube-worker2150.codfw.wmnet, kubernetes2014.codfw.wmnet, wikikube-worker2136.codfw.wmnet, wikikube-worker2091.codfw.wmnet, 
wikikube-worker2040.codfw.wmnet, wikikube-worker2083.codfw.wmnet, wikikube-worker2071.codfw.wmnet, kubernetes2050.codfw.wmnet, wikikube-worker2022.codfw.wmnet, wikikube-worker2086.codfw.wmnet, wikikube-worker2007.codfw.wmnet, wikikube-worker2161.codfw.wmnet, wikikube-worker2157.codfw.wmnet, wikikube-worker2130.co [04:16:05] t, wikikube-worker2096.codfw.wmnet, wikikube-worker2065.codfw.wmnet, wikikube-worker2125.codfw.wmnet, wikikube-worker2041.codfw.wmnet, mw2359.codfw.wmnet, wikikube-worker2124.codfw.wmne https://wikitech.wikimedia.org/wiki/PyBal [04:16:17] PROBLEM - MariaDB read only pc4 on pc1016 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [04:16:17] PROBLEM - MariaDB Event Scheduler pc4 on pc1016 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [04:16:45] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-web_4450: Servers wikikube-worker1322.eqiad.wmnet, mw1433.eqiad.wmnet, wikikube-worker1306.eqiad.wmnet, wikikube-worker1259.eqiad.wmnet, parse1011.eqiad.wmnet, wikikube-worker1036.eqiad.wmnet, wikikube-worker1029.eqiad.wmnet, wikikube-worker1320.eqiad.wmnet, wikikube-worker1025.eqiad.wmnet, mw1430.eqiad.wmnet, mw1415.eqiad.wmnet, wikikube-worker10 [04:16:45] .wmnet, wikikube-worker1273.eqiad.wmnet, kubernetes1030.eqiad.wmnet, wikikube-worker1260.eqiad.wmnet, mw1435.eqiad.wmnet, wikikube-worker1279.eqiad.wmnet, wikikube-worker1307.eqiad.wmnet, mw1454.eqiad.wmnet, wikikube-worker1287.eqiad.wmnet, parse1005.eqiad.wmnet, wikikube-worker1270.eqiad.wmnet, wikikube-worker1015.eqiad.wmnet, wikikube-worker1037.eqiad.wmnet, wikikube-worker1278.eqiad.wmnet, mw1425.eqiad.wmnet, wikikube-worker1020.eqiad. 
[04:16:45] w1483.eqiad.wmnet, kubernetes1059.eqiad.wmnet, kubernetes1058.eqiad.wmnet, wikikube-worker1272.eqiad.wmnet, wikikube-worker1305.eqiad.wmnet, parse1001.eqiad.wmnet, wikikube-worker1267.e https://wikitech.wikimedia.org/wiki/PyBal [04:16:47] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-web_4450: Servers wikikube-worker1280.eqiad.wmnet, parse1011.eqiad.wmnet, parse1013.eqiad.wmnet, wikikube-worker1268.eqiad.wmnet, wikikube-worker1304.eqiad.wmnet, mw1419.eqiad.wmnet, mw1442.eqiad.wmnet, wikikube-worker1007.eqiad.wmnet, wikikube-worker1036.eqiad.wmnet, wikikube-worker1315.eqiad.wmnet, mw1430.eqiad.wmnet, mw1480.eqiad.wmnet, parse10 [04:16:47] .wmnet, mw1484.eqiad.wmnet, wikikube-worker1016.eqiad.wmnet, wikikube-worker1273.eqiad.wmnet, kubernetes1030.eqiad.wmnet, wikikube-worker1322.eqiad.wmnet, kubernetes1038.eqiad.wmnet, wikikube-worker1263.eqiad.wmnet, mw1424.eqiad.wmnet, wikikube-worker1307.eqiad.wmnet, mw1488.eqiad.wmnet, mw1454.eqiad.wmnet, wikikube-worker1313.eqiad.wmnet, wikikube-worker1287.eqiad.wmnet, parse1005.eqiad.wmnet, wikikube-worker1270.eqiad.wmnet, wikikube-wo [04:16:47] .eqiad.wmnet, wikikube-worker1278.eqiad.wmnet, mw1425.eqiad.wmnet, kubernetes1033.eqiad.wmnet, mw1466.eqiad.wmnet, mw1483.eqiad.wmnet, mw1469.eqiad.wmnet, wikikube-worker1022.eqiad.wmne https://wikitech.wikimedia.org/wiki/PyBal [04:17:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 3.728% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [04:17:44] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - 
https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [04:19:13] FIRING: [9x] ProbeDown: Service mw-web:4450 has failed probes (http_mw-web_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:19:15] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-web (k8s) 20.32s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:20:14] !incidents [04:20:15] 5509 (UNACKED) pc2015 (paged)/MariaDB read only pc4 (paged) [04:20:15] 5510 (UNACKED) [29x] ProbeDown sre (probes/service) [04:20:15] 5511 (UNACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [04:20:15] 5512 (UNACKED) Manual (paged) by Seddon (jseddon@wikimedia.org): Site inaccessible to logged in users. [04:20:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [04:20:57] FIRING: [30x] ProbeDown: Service mw-web:4450 has failed probes (http_mw-web_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:21:41] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [04:22:08] RESOLVED: [17x] ProbeDown: Service mw-web:4450 has failed probes (http_mw-web_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:22:15] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 6.25% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [04:22:43] PROBLEM - MariaDB Replica Lag: pc4 on pc1016 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:22:45] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:22:47] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:24:15] FIRING: [3x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 3.292s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:24:28] FIRING: [4x] SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:25:21] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [04:25:51] RESOLVED: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [04:25:57] FIRING: [30x] ProbeDown: Service mw-web:4450 has failed probes (http_mw-web_ip4) #page - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:26:53] FIRING: VarnishUnavailable: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [04:27:05] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:27:05] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:27:08] FIRING: [14x] ProbeDown: Service mw-web:4450 has failed probes (http_mw-web_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:29:13] RESOLVED: [13x] ProbeDown: Service mw-web:4450 has failed probes (http_mw-web_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:29:28] FIRING: [5x] SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:29:51] FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [04:30:03] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [04:30:57] FIRING: [30x] ProbeDown: Service mw-web:4450 has failed probes (http_mw-web_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:31:52] RESOLVED: VarnishUnavailable: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [04:33:21] FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [04:33:36] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.codfw.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [04:34:13] FIRING: [14x] ProbeDown: Service mw-web:4450 has failed probes (http_mw-web_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:34:28] FIRING: [7x] SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:34:51] FIRING: [4x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [04:35:57] FIRING: [30x] ProbeDown: Service mw-web:4450 has failed 
probes (http_mw-web_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:36:07] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [04:37:08] FIRING: [14x] ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:37:57] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [04:39:51] FIRING: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [04:40:57] FIRING: [13x] ProbeDown: Service mw-web:4450 has failed probes (http_mw-web_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:41:33] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [04:42:08] RESOLVED: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:45:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [04:45:57] FIRING: [13x] ProbeDown: Service mw-web:4450 has failed probes (http_mw-web_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:50:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [04:50:57] FIRING: [13x] ProbeDown: Service mw-web:4450 has failed probes (http_mw-web_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:51:07] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [04:52:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [04:53:19] RECOVERY - MariaDB read only pc4 on pc1016 is OK: Version 10.6.14-MariaDB-log, Uptime 14s, 
read_only: False, event_scheduler: True, 549.78 QPS, connection latency: 0.039957s, query latency: 0.000992s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [04:53:19] RECOVERY - MariaDB Event Scheduler pc4 on pc1016 is OK: Version 10.6.14-MariaDB-log, Uptime 14s, read_only: False, event_scheduler: True, 552.85 QPS, connection latency: 0.037481s, query latency: 0.001266s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [04:53:21] RECOVERY - MariaDB Event Scheduler pc4 on pc2015 is OK: Version 10.6.18-MariaDB-log, Uptime 38s, read_only: False, event_scheduler: True, 2695.95 QPS, connection latency: 0.037786s, query latency: 0.001148s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [04:53:22] RECOVERY - MariaDB read only pc4 #page on pc2015 is OK: Version 10.6.18-MariaDB-log, Uptime 38s, read_only: False, event_scheduler: True, 2699.28 QPS, connection latency: 0.039377s, query latency: 0.000919s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [04:53:44] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [04:53:45] RECOVERY - MariaDB Replica SQL: pc4 on pc1016 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:53:45] RECOVERY - MariaDB Replica Lag: pc4 on pc1016 is OK: OK slave_sql_lag Replication lag: 47.27 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:53:49] RECOVERY - MariaDB Replica IO: pc4 on pc1016 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:54:13] FIRING: 
[13x] ProbeDown: Service mw-web:4450 has failed probes (http_mw-web_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:54:51] FIRING: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [04:55:57] RESOLVED: [14x] ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:57:08] RESOLVED: [13x] ProbeDown: Service mw-web:4450 has failed probes (http_mw-web_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:57:15] RESOLVED: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid at codfw: 4.392% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [04:57:59] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [04:59:15] RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid (k8s) 2.048s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:59:28] FIRING: [8x] SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:59:51] RESOLVED: [6x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from mw-web-ro.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [05:03:01] !incidents [05:03:01] 5512 (RESOLVED) Manual (paged) by Seddon (jseddon@wikimedia.org): Site inaccessible to logged in users. [05:03:01] 5514 (RESOLVED) [3x] ATSBackendErrorsHigh cache_text sre (mw-web-ro.discovery.wmnet) [05:03:02] 5515 (RESOLVED) HaproxyUnavailable cache_text global sre (thanos-rule) [05:03:02] 5510 (RESOLVED) [29x] ProbeDown sre (probes/service) [05:03:02] 5509 (RESOLVED) pc2015 (paged)/MariaDB read only pc4 (paged) [05:03:02] 5511 (RESOLVED) HaproxyUnavailable cache_text global sre (thanos-rule) [05:03:02] 5513 (RESOLVED) VarnishUnavailable global sre (varnish-text thanos-rule) [05:04:44] 10ops-codfw, 06Data-Persistence, 06DC-Ops: es2045 went down: CPU error - https://phabricator.wikimedia.org/T381549 (10Marostegui) 03NEW [05:05:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2022 (re)pooling @ 10%: Pooling in production', diff saved to https://phabricator.wikimedia.org/P71583 and previous config saved to /var/cache/conftool/dbconfig/20241205-050550-root.json [05:06:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2024 (re)pooling @ 10%: Repooling cloning', diff saved to https://phabricator.wikimedia.org/P71584 and previous config saved to /var/cache/conftool/dbconfig/20241205-050609-root.json [05:07:51] (03PS1) 10Marostegui: instances.yaml: Add es2043 [puppet] - 10https://gerrit.wikimedia.org/r/1100580 (https://phabricator.wikimedia.org/T381259) [05:08:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T371742)', diff saved to https://phabricator.wikimedia.org/P71585 and previous config saved to 
/var/cache/conftool/dbconfig/20241205-050858-ladsgroup.json [05:09:02] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [05:09:15] (03CR) 10Marostegui: [C:03+2] instances.yaml: Add es2043 [puppet] - 10https://gerrit.wikimedia.org/r/1100580 (https://phabricator.wikimedia.org/T381259) (owner: 10Marostegui) [05:15:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Add es2043 depooled T381259', diff saved to https://phabricator.wikimedia.org/P71586 and previous config saved to /var/cache/conftool/dbconfig/20241205-051545-marostegui.json [05:15:50] T381259: Productionize es204[1-6] - https://phabricator.wikimedia.org/T381259 [05:16:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2043 (re)pooling @ 1%: Pooling in es5', diff saved to https://phabricator.wikimedia.org/P71587 and previous config saved to /var/cache/conftool/dbconfig/20241205-051604-root.json [05:17:10] (03PS1) 10Marostegui: es2043: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1100581 [05:17:42] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:18:30] (03CR) 10Marostegui: [C:03+2] es2043: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1100581 (owner: 10Marostegui) [05:20:07] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db[2134,2160].codfw.wmnet,db[1159,1213,1217].eqiad.wmnet with reason: m3 master switchover T381365 [05:20:10] T381365: Switchover m3 (phabricator) master db1159 -> db1213 - https://phabricator.wikimedia.org/T381365 [05:20:24] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2134,2160].codfw.wmnet,db[1159,1213,1217].eqiad.wmnet with reason: m3 master 
switchover T381365 [05:20:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2022 (re)pooling @ 25%: Pooling in production', diff saved to https://phabricator.wikimedia.org/P71588 and previous config saved to /var/cache/conftool/dbconfig/20241205-052056-root.json [05:21:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2024 (re)pooling @ 25%: Repooling cloning', diff saved to https://phabricator.wikimedia.org/P71589 and previous config saved to /var/cache/conftool/dbconfig/20241205-052114-root.json [05:21:41] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [05:22:59] (03PS1) 10Marostegui: mariadb: Promote db1213 to m3 master [puppet] - 10https://gerrit.wikimedia.org/r/1100583 (https://phabricator.wikimedia.org/T381365) [05:24:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P71590 and previous config saved to /var/cache/conftool/dbconfig/20241205-052405-ladsgroup.json [05:24:28] FIRING: [8x] SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:25:21] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [05:26:14] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db1213 to m3 master [puppet] - 10https://gerrit.wikimedia.org/r/1100583 (https://phabricator.wikimedia.org/T381365) (owner: 10Marostegui) [05:27:57] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is OK: OK: Status of the systemd unit 
httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [05:28:45] !log Failover m3 from db1159 to db1213 - T381365 [05:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:28:47] T381365: Switchover m3 (phabricator) master db1159 -> db1213 - https://phabricator.wikimedia.org/T381365 [05:29:28] FIRING: [8x] SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:30:03] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [05:31:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Pooling in es5', diff saved to https://phabricator.wikimedia.org/P71591 and previous config saved to /var/cache/conftool/dbconfig/20241205-053109-root.json [05:31:33] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:32:30] FIRING: [2x] Traffic bill over quota: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota got better - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [05:34:20] (03PS1) 10Marostegui: db1159: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1100586 (https://phabricator.wikimedia.org/T381550) [05:35:04] (03CR) 10Marostegui: [C:03+2] db1159: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1100586 (https://phabricator.wikimedia.org/T381550) (owner: 10Marostegui) [05:36:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2022 (re)pooling @ 50%: Pooling in production', diff saved to 
https://phabricator.wikimedia.org/P71592 and previous config saved to /var/cache/conftool/dbconfig/20241205-053601-root.json [05:36:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2024 (re)pooling @ 50%: Repooling cloning', diff saved to https://phabricator.wikimedia.org/P71593 and previous config saved to /var/cache/conftool/dbconfig/20241205-053620-root.json [05:39:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P71595 and previous config saved to /var/cache/conftool/dbconfig/20241205-053912-ladsgroup.json [05:40:23] (03PS1) 10Marostegui: mariadb: Productionize es2044 [puppet] - 10https://gerrit.wikimedia.org/r/1100587 (https://phabricator.wikimedia.org/T381259) [05:41:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es2025 to es5 master T381259', diff saved to https://phabricator.wikimedia.org/P71596 and previous config saved to /var/cache/conftool/dbconfig/20241205-054114-marostegui.json [05:41:18] T381259: Productionize es204[1-6] - https://phabricator.wikimedia.org/T381259 [05:41:33] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on es2025.codfw.wmnet with reason: cloning [05:41:36] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on es2025.codfw.wmnet with reason: cloning [05:42:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2023 to clone es2044', diff saved to https://phabricator.wikimedia.org/P71597 and previous config saved to /var/cache/conftool/dbconfig/20241205-054200-marostegui.json [05:42:14] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on es2023.codfw.wmnet with reason: cloning [05:42:29] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on es2023.codfw.wmnet with reason: cloning [05:42:41] (03CR) 10Marostegui: [C:03+2] mariadb: Productionize es2044 
[puppet] - 10https://gerrit.wikimedia.org/r/1100587 (https://phabricator.wikimedia.org/T381259) (owner: 10Marostegui) [05:44:13] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 136 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:46:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2043 (re)pooling @ 25%: Pooling in es5', diff saved to https://phabricator.wikimedia.org/P71598 and previous config saved to /var/cache/conftool/dbconfig/20241205-054615-root.json [05:51:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2022 (re)pooling @ 75%: Pooling in production', diff saved to https://phabricator.wikimedia.org/P71599 and previous config saved to /var/cache/conftool/dbconfig/20241205-055106-root.json [05:51:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2024 (re)pooling @ 75%: Repooling cloning', diff saved to https://phabricator.wikimedia.org/P71600 and previous config saved to /var/cache/conftool/dbconfig/20241205-055125-root.json [05:52:30] RESOLVED: [2x] Traffic bill over quota: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota got better - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [05:54:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T371742)', diff saved to https://phabricator.wikimedia.org/P71601 and previous config saved to /var/cache/conftool/dbconfig/20241205-055420-ladsgroup.json [05:54:22] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1214.eqiad.wmnet with reason: Maintenance [05:54:24] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [05:54:36] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1214.eqiad.wmnet with reason: Maintenance [05:54:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1214 (T371742)', diff saved to 
https://phabricator.wikimedia.org/P71602 and previous config saved to /var/cache/conftool/dbconfig/20241205-055442-ladsgroup.json [06:01:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2043 (re)pooling @ 50%: Pooling in es5', diff saved to https://phabricator.wikimedia.org/P71603 and previous config saved to /var/cache/conftool/dbconfig/20241205-060121-root.json [06:06:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2022 (re)pooling @ 100%: Pooling in production', diff saved to https://phabricator.wikimedia.org/P71604 and previous config saved to /var/cache/conftool/dbconfig/20241205-060612-root.json [06:06:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2024 (re)pooling @ 100%: Repooling cloning', diff saved to https://phabricator.wikimedia.org/P71605 and previous config saved to /var/cache/conftool/dbconfig/20241205-060631-root.json [06:06:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-logging-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [06:16:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Pooling in es5', diff saved to https://phabricator.wikimedia.org/P71606 and previous config saved to /var/cache/conftool/dbconfig/20241205-061626-root.json [06:31:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Pooling in es5', diff saved to https://phabricator.wikimedia.org/P71607 and previous config saved to /var/cache/conftool/dbconfig/20241205-063132-root.json [06:42:42] RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:00:05] Deploy window MediaWiki infrastructure (UTC 
early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241205T0700) [07:00:05] marostegui, Amir1, and arnaudb: That opportune time for a Primary database switchover deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241205T0700). [07:12:37] (03CR) 10Kevin Bazira: [C:03+2] ml-services: update article-country image in experimental ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100571 (https://phabricator.wikimedia.org/T371897) (owner: 10Kevin Bazira) [07:13:47] (03Merged) 10jenkins-bot: ml-services: update article-country image in experimental ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100571 (https://phabricator.wikimedia.org/T371897) (owner: 10Kevin Bazira) [07:16:46] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [07:22:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T371742)', diff saved to https://phabricator.wikimedia.org/P71608 and previous config saved to /var/cache/conftool/dbconfig/20241205-072223-ladsgroup.json [07:22:27] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [07:31:16] (03PS1) 10Jelto: Rename kubernetes102[5-6] to wikikube-worker104[4-5] [puppet] - 10https://gerrit.wikimedia.org/r/1100749 (https://phabricator.wikimedia.org/T377876) [07:32:55] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes[1025-1026].eqiad.wmnet [07:36:42] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes[1025-1026].eqiad.wmnet [07:37:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P71609 and previous config saved to /var/cache/conftool/dbconfig/20241205-073730-ladsgroup.json [07:52:38] !log 
ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P71610 and previous config saved to /var/cache/conftool/dbconfig/20241205-075237-ladsgroup.json [07:57:21] (03CR) 10Gmodena: [C:03+1] "LGTM. Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100460 (https://phabricator.wikimedia.org/T381322) (owner: 10Brouberol) [08:00:07] Amir1, Urbanecm, and awight: May I have your attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241205T0800) [08:00:07] No Gerrit patches in the queue for this window AFAICS. [08:07:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T371742)', diff saved to https://phabricator.wikimedia.org/P71611 and previous config saved to /var/cache/conftool/dbconfig/20241205-080745-ladsgroup.json [08:07:47] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1216.eqiad.wmnet with reason: Maintenance [08:07:49] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [08:08:01] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1216.eqiad.wmnet with reason: Maintenance [08:13:22] (03CR) 10JMeybohm: [C:03+1] Rename kubernetes102[5-6] to wikikube-worker104[4-5] [puppet] - 10https://gerrit.wikimedia.org/r/1100749 (https://phabricator.wikimedia.org/T377876) (owner: 10Jelto) [08:25:47] (03CR) 10JMeybohm: "Thanks for taking the time. Unfortunately I'm not able to reproduce the issue you saw." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099837 (owner: 10Wziko) [08:29:00] (03CR) 10JMeybohm: "I now realize that this might depend on the helm version used. We're currently using v3.11.3, would you share yours?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1099837 (owner: 10Wziko) [08:42:03] (03CR) 10Kosta Harlan: [C:03+1] Update MediaModeration module to run scans automatically [puppet] - 10https://gerrit.wikimedia.org/r/1100427 (https://phabricator.wikimedia.org/T355169) (owner: 10Dreamy Jazz) [08:44:09] (03CR) 10Brouberol: [C:03+2] mw-dump-rev-content-reconcile-enrich: rename namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100460 (https://phabricator.wikimedia.org/T381322) (owner: 10Brouberol) [08:44:59] (03CR) 10Jelto: [C:03+2] Rename kubernetes102[5-6] to wikikube-worker104[4-5] [puppet] - 10https://gerrit.wikimedia.org/r/1100749 (https://phabricator.wikimedia.org/T377876) (owner: 10Jelto) [08:46:20] !log rebalance Ganeti eqiad/D following server refreshes [08:46:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:49] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [08:47:37] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[08:48:23] (03CR) 10Slyngshede: [C:03+2] P:idp enable JMX exporter [puppet] - 10https://gerrit.wikimedia.org/r/1098023 (https://phabricator.wikimedia.org/T380402) (owner: 10Slyngshede) [08:49:38] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes1025 to wikikube-worker1044 [08:49:59] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [08:50:54] PROBLEM - BGP status on lsw1-f3-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:54:48] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1025 to wikikube-worker1044 - jelto@cumin1002" [08:55:04] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1025 to wikikube-worker1044 - jelto@cumin1002" [08:55:04] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:55:04] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1044 [08:57:31] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1044 [08:58:10] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes1025 to wikikube-worker1044 [08:58:46] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes1026 to wikikube-worker1045 [08:58:55] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [09:03:11] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1026 to wikikube-worker1045 - jelto@cumin1002" [09:03:51] !log jelto@cumin1002 
END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1026 to wikikube-worker1045 - jelto@cumin1002" [09:03:51] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:03:51] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1045 [09:04:05] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1045 [09:04:44] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes1026 to wikikube-worker1045 [09:04:47] (03CR) 10Santiago Faci: Add Metrics Platform stream configuration for translate_extension (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097499 (https://phabricator.wikimedia.org/T364460) (owner: 10Wangombe) [09:05:06] 06SRE, 06Infrastructure-Foundations, 10netops: Announce internal/core routes from CRs to L3 switches - https://phabricator.wikimedia.org/T344547#10382536 (10cmooney) >>! In T344547#9301201, @cmooney wrote: > One other observation is that the MED setting does not optimize the outbound path where we are us... 
[09:06:34] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1044.eqiad.wmnet wikikube-worker1045.eqiad.wmnet on all recursors [09:06:38] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1044.eqiad.wmnet wikikube-worker1045.eqiad.wmnet on all recursors [09:06:55] (03PS1) 10Slyngshede: Enable JMX exporter for IDP [dns] - 10https://gerrit.wikimedia.org/r/1100761 [09:09:11] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1044.eqiad.wmnet with OS bookworm [09:10:42] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:14:06] (03CR) 10Fabfur: [C:03+2] hiera: enable haproxykafka on drmrs and magru [puppet] - 10https://gerrit.wikimedia.org/r/1090814 (https://phabricator.wikimedia.org/T378578) (owner: 10Fabfur) [09:15:53] !log deploying haproxykafka also on magru and drmrs (T378578) [09:15:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:56] T378578: Rollout haproxykafka on all hosts - https://phabricator.wikimedia.org/T378578 [09:16:47] RESOLVED: PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-logging-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [09:20:41] !log destroyed unused expiring puppet certs - T381474 [09:20:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:45] T381474: Handle expiring puppet certificates - https://phabricator.wikimedia.org/T381474 [09:25:28] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1044.eqiad.wmnet with reason: host reimage [09:28:02] (03CR) 
10Muehlenhoff: [C:03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1100761 (owner: 10Slyngshede) [09:29:28] FIRING: [4x] SystemdUnitFailed: kafka-mirror-main-codfw_to_main-eqiad@0.service on kafka-main1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:31:07] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1044.eqiad.wmnet with reason: host reimage [09:34:52] Hi all, I'm planning to run a maintenance script to add wikidata support for idwikivoyage as per T381083. Is that disruptive to any current activities? [09:34:53] T381083: Add Wikidata support for idwikivoyage - https://phabricator.wikimedia.org/T381083 [09:37:36] !log jayme@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host mw[2448-2449].codfw.wmnet [09:38:46] !log jayme@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw[2448-2449].codfw.wmnet [09:38:53] !log jayme@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on mw[2448-2451].codfw.wmnet with reason: reimage [09:39:02] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on mw[2448-2451].codfw.wmnet with reason: reimage [09:39:13] !log jayme@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host mw[2450-2451].codfw.wmnet [09:40:23] !log jayme@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw[2450-2451].codfw.wmnet [09:44:00] (03PS1) 10JMeybohm: Rename mw[2448-2451] to wikikube-worker217[6-9] [puppet] - 10https://gerrit.wikimedia.org/r/1100765 (https://phabricator.wikimedia.org/T377877) [09:45:59] (03CR) 10Jelto: [C:03+1] Rename mw[2448-2451] to wikikube-worker217[6-9] [puppet] - 10https://gerrit.wikimedia.org/r/1100765 (https://phabricator.wikimedia.org/T377877) (owner: 
10JMeybohm) [09:47:46] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1044.eqiad.wmnet with OS bookworm [09:48:32] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1045.eqiad.wmnet with OS bookworm [09:49:37] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:50:33] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:53:25] (03CR) 10Muehlenhoff: [C:03+2] debmonitor: Restrict access to Envoy port [puppet] - 10https://gerrit.wikimedia.org/r/1093350 (owner: 10Muehlenhoff) [09:54:06] ACKNOWLEDGEMENT - MD RAID on mw2448 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 2, Failed: 0, Spare: 1 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T381558 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [09:54:11] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on mw2448 - https://phabricator.wikimedia.org/T381558 (10ops-monitoring-bot) 03NEW [09:54:23] (03CR) 10Slyngshede: [C:03+2] Enable JMX exporter for IDP [dns] - 10https://gerrit.wikimedia.org/r/1100761 (owner: 10Slyngshede) [09:54:37] !log jayme@cumin2002 START - Cookbook sre.hosts.provision for host mw2448.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [09:55:08] !log jayme@cumin2002 
START - Cookbook sre.hosts.provision for host mw2449.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [09:55:34] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1226.eqiad.wmnet with reason: Maintenance [09:55:39] !log jayme@cumin2002 START - Cookbook sre.hosts.provision for host mw2450.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [09:55:47] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1226.eqiad.wmnet with reason: Maintenance [09:55:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1226 (T371742)', diff saved to https://phabricator.wikimedia.org/P71614 and previous config saved to /var/cache/conftool/dbconfig/20241205-095554-ladsgroup.json [09:55:58] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [09:56:08] !log jayme@cumin2002 START - Cookbook sre.hosts.provision for host mw2451.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [10:02:35] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 220, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:02:41] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 304, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:04:59] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1045.eqiad.wmnet with reason: host reimage [10:06:43] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:08:18] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1045.eqiad.wmnet with reason: host reimage [10:10:55] 06SRE, 
06Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Backport facter to bullseye - https://phabricator.wikimedia.org/T381538#10382761 (10Volans) If we go the backport way let's make sure to test it on a bunch of canary hosts first with different hardware as in the past we ha... [10:11:11] (03PS1) 10Muehlenhoff: Fix prometheus-web firewall config [puppet] - 10https://gerrit.wikimedia.org/r/1100768 [10:12:40] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:12:44] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:14:05] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1100768 (owner: 10Muehlenhoff) [10:14:50] PROBLEM - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is CRITICAL: CRITICAL: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [10:15:16] (03PS1) 10Slyngshede: JMX collector for IDP hosts [puppet] - 10https://gerrit.wikimedia.org/r/1100771 (https://phabricator.wikimedia.org/T380402) [10:16:50] RECOVERY - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is OK: OK: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [10:20:23] (03PS2) 10Slyngshede: JMX collector for IDP hosts [puppet] - 
10https://gerrit.wikimedia.org/r/1100771 (https://phabricator.wikimedia.org/T380402) [10:20:48] (03PS1) 10Volans: style: a pass of black on all files [software/spicerack] - 10https://gerrit.wikimedia.org/r/1100772 [10:22:28] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4644/co" [puppet] - 10https://gerrit.wikimedia.org/r/1100771 (https://phabricator.wikimedia.org/T380402) (owner: 10Slyngshede) [10:24:25] jouncebot: next [10:24:25] In 0 hour(s) and 35 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241205T1100) [10:27:24] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1045.eqiad.wmnet with OS bookworm [10:27:54] (03CR) 10Filippo Giunchedi: [C:03+1] Fix prometheus-web firewall config [puppet] - 10https://gerrit.wikimedia.org/r/1100768 (owner: 10Muehlenhoff) [10:28:00] RECOVERY - BGP status on lsw1-f3-eqiad.mgmt is OK: BGP OK - up: 22, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:28:37] !log homer 'lsw1-f3-eqiad*' commit 'T377876' [10:28:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:41] T377876: Migrate wikikube-eqiad to containerd - https://phabricator.wikimedia.org/T377876 [10:29:03] (03CR) 10Slyngshede: [V:03+1] "The JMX response is about 200K, so would it make sense to tweak the scraping interval for this one, maybe every two minutes (if we have a " [puppet] - 10https://gerrit.wikimedia.org/r/1100771 (https://phabricator.wikimedia.org/T380402) (owner: 10Slyngshede) [10:29:53] (03PS3) 10Slyngshede: P:prometheus::ops JMX collector for IDP hosts [puppet] - 10https://gerrit.wikimedia.org/r/1100771 (https://phabricator.wikimedia.org/T380402) [10:33:02] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host 
wikikube-worker[1044-1045].eqiad.wmnet [10:33:04] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1044-1045].eqiad.wmnet [10:33:54] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops, and 3 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T381504#10382835 (10Jelto) [10:37:18] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2448.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [10:37:22] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2449.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [10:37:25] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2450.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [10:37:30] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2451.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [10:37:38] (03CR) 10JMeybohm: [C:03+2] Rename mw[2448-2451] to wikikube-worker217[6-9] [puppet] - 10https://gerrit.wikimedia.org/r/1100765 (https://phabricator.wikimedia.org/T377877) (owner: 10JMeybohm) [10:37:48] (03CR) 10Muehlenhoff: [C:03+2] Assign builder role to build2002 [puppet] - 10https://gerrit.wikimedia.org/r/1098558 (https://phabricator.wikimedia.org/T379343) (owner: 10Muehlenhoff) [10:41:50] !log jayme@cumin2002 START - Cookbook sre.hosts.rename from mw2448 to wikikube-worker2176 [10:42:00] !log jayme@cumin2002 START - Cookbook sre.dns.netbox [10:42:13] !log jayme@cumin2002 START - Cookbook sre.hosts.rename from mw2449 to wikikube-worker2177 [10:42:34] !log jayme@cumin2002 START - Cookbook sre.hosts.rename from mw2450 to wikikube-worker2178 [10:42:38] !log jayme@cumin2002 END (FAIL) - Cookbook 
sre.hosts.rename (exit_code=93) from mw2449 to wikikube-worker2177 [10:42:45] !log jayme@cumin2002 START - Cookbook sre.hosts.rename from mw2451 to wikikube-worker2179 [10:43:56] !log reindexed all wikidata entity schemas (T376252) [10:43:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:00] T376252: [ES-M3]: Create a EntitySearchHelper implementation that uses elastic - https://phabricator.wikimedia.org/T376252 [10:45:39] !log jayme@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2448 to wikikube-worker2176 - jayme@cumin2002" [10:46:04] !log jayme@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2448 to wikikube-worker2176 - jayme@cumin2002" [10:46:04] !log jayme@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:46:05] !log jayme@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2176 [10:46:21] !log jayme@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2176 [10:46:32] !log jayme@cumin2002 START - Cookbook sre.dns.netbox [10:47:01] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2448 to wikikube-worker2176 [10:48:25] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:50:20] !log jayme@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2450 to wikikube-worker2178 - jayme@cumin2002" [10:50:25] !log jayme@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) 
generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2450 to wikikube-worker2178 - jayme@cumin2002" [10:50:25] !log jayme@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:50:26] !log jayme@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2178 [10:50:33] !log jayme@cumin2002 START - Cookbook sre.dns.netbox [10:50:40] !log jayme@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2178 [10:51:20] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2450 to wikikube-worker2178 [10:51:54] 06SRE, 06Infrastructure-Foundations, 10netops: Export routes generated from ARP/ND in EVPN - https://phabricator.wikimedia.org/T329369#10382861 (10cmooney) Huh so I've been looking at some of these old tasks while working on the Nokia testing. It's clear in the above the before / after are both the AFTE... [10:52:54] !log jayme@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:52:55] !log jayme@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2179 [10:53:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:53:45] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM! 
I checked the response and it is ~2k metrics, which is totally fine in terms of load" [puppet] - 10https://gerrit.wikimedia.org/r/1100771 (https://phabricator.wikimedia.org/T380402) (owner: 10Slyngshede) [10:54:15] !log jayme@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2179 [10:54:55] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2451 to wikikube-worker2179 [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241205T1100) [11:01:41] (03CR) 10Muehlenhoff: [C:03+2] Fix prometheus-web firewall config [puppet] - 10https://gerrit.wikimedia.org/r/1100768 (owner: 10Muehlenhoff) [11:05:37] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitorin [11:05:38] status [11:05:38] !log jayme@cumin2002 START - Cookbook sre.hosts.rename from mw2449 to wikikube-worker2177 [11:05:47] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitorin [11:05:47] status [11:05:49] !log jayme@cumin2002 START - Cookbook sre.dns.netbox [11:09:22] !log jayme@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2179.codfw.wmnet with OS bookworm 
[11:09:27] !log jayme@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2449 to wikikube-worker2177 - jayme@cumin2002" [11:10:20] !log jayme@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2179.codfw.wmnet with OS bookworm [11:11:07] !log jayme@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2179.codfw.wmnet with OS bookworm [11:11:17] !log jayme@cumin2002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2179 [11:11:24] !log jayme@cumin2002 START - Cookbook sre.dns.netbox [11:11:42] !log jayme@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2449 to wikikube-worker2177 - jayme@cumin2002" [11:11:42] !log jayme@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:11:43] !log jayme@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2177 [11:12:09] !log jayme@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2178.codfw.wmnet with OS bookworm [11:12:28] !log jayme@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2177 [11:13:09] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2449 to wikikube-worker2177 [11:13:34] !log jayme@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2177.codfw.wmnet with OS bookworm [11:13:46] (03CR) 10Muehlenhoff: P:prometheus::ops JMX collector for IDP hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1100771 (https://phabricator.wikimedia.org/T380402) (owner: 10Slyngshede) [11:14:23] !log jayme@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2176.codfw.wmnet with OS bookworm [11:15:00] !log jayme@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox 
hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2179 - jayme@cumin2002" [11:15:06] !log jayme@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2179 - jayme@cumin2002" [11:15:07] !log jayme@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:15:07] !log jayme@cumin2002 START - Cookbook sre.dns.wipe-cache wikikube-worker2179.codfw.wmnet 207.48.192.10.in-addr.arpa 7.0.2.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:15:10] !log jayme@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2179.codfw.wmnet 207.48.192.10.in-addr.arpa 7.0.2.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:15:11] !log jayme@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2179 [11:15:25] !log jayme@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2179 [11:15:25] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2179 [11:15:31] !log jayme@cumin2002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2177 [11:15:32] !log jayme@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2176.codfw.wmnet with OS bookworm [11:15:49] !log jayme@cumin2002 START - Cookbook sre.dns.netbox [11:15:58] !log jayme@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2176.codfw.wmnet with OS bookworm [11:19:03] 06SRE, 06serviceops, 13Patch-For-Review: mw2420-mw2451 do have unnecessary raid controllers (configured) - https://phabricator.wikimedia.org/T358489#10382937 (10JMeybohm) [11:19:30] !log jayme@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: Update records for host wikikube-worker2177 - jayme@cumin2002" [11:19:36] !log jayme@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2177 - jayme@cumin2002" [11:19:36] !log jayme@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:19:36] !log jayme@cumin2002 START - Cookbook sre.dns.wipe-cache wikikube-worker2177.codfw.wmnet 83.48.192.10.in-addr.arpa 3.8.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:19:39] !log jayme@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2177.codfw.wmnet 83.48.192.10.in-addr.arpa 3.8.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:19:40] !log jayme@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2177 [11:19:51] !log jayme@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2177 [11:19:51] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2177 [11:19:57] !log jayme@cumin2002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2176 [11:19:59] (03CR) 10Hnowlan: [C:03+1] gateway-check: fix invalid config handling [puppet] - 10https://gerrit.wikimedia.org/r/1084247 (owner: 10Scott French) [11:20:04] !log jayme@cumin2002 START - Cookbook sre.dns.netbox [11:24:59] !log jayme@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2176 - jayme@cumin2002" [11:25:04] !log jayme@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2176 - jayme@cumin2002" [11:25:04] !log 
jayme@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:25:05] !log jayme@cumin2002 START - Cookbook sre.dns.wipe-cache wikikube-worker2176.codfw.wmnet 81.48.192.10.in-addr.arpa 1.8.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:25:08] !log jayme@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2176.codfw.wmnet 81.48.192.10.in-addr.arpa 1.8.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:25:09] !log jayme@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2176 [11:25:19] !log jayme@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2176 [11:25:19] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2176 [11:26:40] !log jayme@cumin2002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2178 [11:30:12] PROBLEM - BGP status on lsw1-d6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:30:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T371742)', diff saved to https://phabricator.wikimedia.org/P71615 and previous config saved to /var/cache/conftool/dbconfig/20241205-113048-ladsgroup.json [11:30:52] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [11:33:42] 06SRE, 07SRE-Unowned, 10Maps: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565 (10MoritzMuehlenhoff) 03NEW [11:34:53] !log jayme@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on 
wikikube-worker2179.codfw.wmnet with reason: host reimage [11:38:33] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2179.codfw.wmnet with reason: host reimage [11:38:40] !log jayme@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2177.codfw.wmnet with reason: host reimage [11:42:15] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2177.codfw.wmnet with reason: host reimage [11:42:19] (03PS4) 10Hnowlan: mediawiki: add multi-job support to mercurius [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099752 (https://phabricator.wikimedia.org/T371701) [11:43:56] !log jayme@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2176.codfw.wmnet with reason: host reimage [11:44:16] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 137 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [11:45:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P71616 and previous config saved to /var/cache/conftool/dbconfig/20241205-114555-ladsgroup.json [11:46:45] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2176.codfw.wmnet with reason: host reimage [11:49:29] (03PS1) 10Muehlenhoff: maps: Remove support for osm2pgsql as OSM engine [puppet] - 10https://gerrit.wikimedia.org/r/1100784 (https://phabricator.wikimedia.org/T381565) [11:50:25] FIRING: SystemdUnitFailed: user-runtime-dir@499.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:51:31] (03CR) 10Nik Gkountas: [C:03+1] Add Metrics Platform stream configuration for translate_extension [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1097499 (https://phabricator.wikimedia.org/T364460) (owner: 10Wangombe) [11:52:05] (03PS4) 10Slyngshede: P:prometheus::ops JMX collector for IDP hosts [puppet] - 10https://gerrit.wikimedia.org/r/1100771 (https://phabricator.wikimedia.org/T380402) [11:54:10] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4645/co" [puppet] - 10https://gerrit.wikimedia.org/r/1100771 (https://phabricator.wikimedia.org/T380402) (owner: 10Slyngshede) [11:56:29] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1100784 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [11:57:20] (03CR) 10Slyngshede: [V:03+1] P:prometheus::ops JMX collector for IDP hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1100771 (https://phabricator.wikimedia.org/T380402) (owner: 10Slyngshede) [11:58:09] !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2179.codfw.wmnet with OS bookworm [16:04:28] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [16:04:55] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [16:04:57] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [16:05:05] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [16:05:07] !log jiji@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [16:05:30] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:05:30] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:05:42] !log jiji@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [16:05:44] !log jiji@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
[16:05:52] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.7 point update - https://phabricator.wikimedia.org/T373783#10383886 (10MoritzMuehlenhoff) [16:05:59] !log jiji@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [16:06:00] !log jiji@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [16:06:10] (03PS1) 10Muehlenhoff: Revert build2002 to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1100837 (https://phabricator.wikimedia.org/T379343) [16:06:22] (03CR) 10Effie Mouzeli: [C:03+2] Update various kafka-main connection strings for kafka-main1010 Replacing kafka-main1005 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100827 (https://phabricator.wikimedia.org/T363214) (owner: 10Effie Mouzeli) [16:06:38] !log jiji@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [16:06:39] !log jiji@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [16:06:57] !log jiji@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [16:06:59] !log jiji@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [16:07:04] (03CR) 10JHathaway: [C:03+1] apt::repository: Fix configuration of source-only repositories on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1100814 (https://phabricator.wikimedia.org/T379343) (owner: 10Muehlenhoff) [16:07:07] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T376150, initialize wdqs internal main tier) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs1026.eqiad.wmnet w/ force delete existing files, repooling source-only afterwards [16:07:10] T376150: Prepare hosts to serve wdqs-internal-main & wdqs-internal-scholarly - https://phabricator.wikimedia.org/T376150 [16:07:16] !log jiji@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [16:07:18] !log jiji@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[16:07:55] !log jiji@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [16:07:57] !log jiji@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [16:07:58] (03Merged) 10jenkins-bot: Update various kafka-main connection strings for kafka-main1010 Replacing kafka-main1005 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100827 (https://phabricator.wikimedia.org/T363214) (owner: 10Effie Mouzeli) [16:08:36] !log jiji@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [16:08:53] !log jiji@cumin1002 START - Cookbook sre.hosts.remove-downtime for kafka-main1010.eqiad.wmnet [16:08:54] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kafka-main1010.eqiad.wmnet [16:10:07] (03CR) 10Alexandros Kosiaris: [C:03+1] maps: Remove support for osm2pgsql as OSM engine [puppet] - 10https://gerrit.wikimedia.org/r/1100784 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [16:11:00] 10SRE-Access-Requests, 06Gerrit-Privilege-Requests: Add dbrant to wmf-deployment - https://phabricator.wikimedia.org/T381591 (10Jelto) 03NEW [16:11:08] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/benthos-cache-invalidator: apply [16:11:20] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [16:11:22] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [16:11:26] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [16:11:34] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/benthos-cache-invalidator: apply [16:11:48] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [16:11:50] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T376150, initialize wdqs internal main tier) xfer wikidata_main from wdqs2021.codfw.wmnet -> wdqs2018.codfw.wmnet w/ force delete existing files, repooling 
source-only afterwards [16:11:52] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [16:12:38] (03CR) 10Muehlenhoff: [C:03+2] Revert build2002 to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1100837 (https://phabricator.wikimedia.org/T379343) (owner: 10Muehlenhoff) [16:12:54] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [16:13:33] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [16:13:39] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes1032 to wikikube-worker1051 [16:14:01] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [16:14:42] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [16:14:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [16:14:51] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudelastic - jclark@cumin1002" [16:15:10] RECOVERY - Host ganeti2042 is UP: PING OK - Packet loss = 0%, RTA = 30.36 ms [16:15:14] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudelastic - jclark@cumin1002" [16:15:14] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:15:14] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [16:15:25] FIRING: SystemdUnitFailed: user@499.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:17:01] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudelastic1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:17:14] (03PS1) 10Effie Mouzeli: mw-videoscaler: update kafka brokers in eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100841 [16:17:18] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudelastic1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:17:25] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudelastic1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:17:51] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudelastic1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:17:58] (03CR) 10Hnowlan: [C:03+1] mw-videoscaler: update kafka brokers in eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100841 (owner: 10Effie Mouzeli) [16:18:13] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams: apply [16:18:31] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [16:18:32] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply [16:19:19] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply [16:19:44] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-main: apply [16:20:01] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply [16:20:02] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply [16:20:04] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudelastic1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:20:14] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudelastic1012.mgmt.eqiad.wmnet with chassis set 
policy FORCE_RESTART [16:20:27] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply [16:21:14] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1032 to wikikube-worker1051 - jelto@cumin1002" [16:21:18] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1032 to wikikube-worker1051 - jelto@cumin1002" [16:21:19] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:21:19] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1051 [16:22:22] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1051 [16:22:31] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudelastic1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:22:45] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudelastic1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:22:55] (03PS1) 10Kamila Součková: Rename mw143[0-5] to wikikube-worker105[2-7] [puppet] - 10https://gerrit.wikimedia.org/r/1100842 (https://phabricator.wikimedia.org/T377876) [16:23:01] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes1032 to wikikube-worker1051 [16:23:28] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1048.eqiad.wmnet wikikube-worker1049.eqiad.wmnet wikikube-worker1050.eqiad.wmnet wikikube-worker1051.eqiad.wmnet on all recursors [16:23:31] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1048.eqiad.wmnet wikikube-worker1049.eqiad.wmnet wikikube-worker1050.eqiad.wmnet wikikube-worker1051.eqiad.wmnet on all recursors [16:24:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU 
sensor over limit - https://phabricator.wikimedia.org/T381540#10383979 (10phaultfinder) [16:25:36] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1048.eqiad.wmnet with OS bookworm [16:26:03] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1049.eqiad.wmnet with OS bookworm [16:26:19] 10ops-codfw, 06SRE, 06DC-Ops: ganeti2042 seems to have a broken CPU? (new Supermicro node) - https://phabricator.wikimedia.org/T378358#10383988 (10Jhancock.wm) the part arrived finally (delivery driver said they didn't have security clearance to deliver the part yesterday. which me and the worker at the dock... [16:26:28] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1050.eqiad.wmnet with OS bookworm [16:26:46] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1051.eqiad.wmnet with OS bookworm [16:27:42] (03PS5) 10Tiziano Fogli: blackbox/icmp: deployment sites controlled by input parameter instead of ::site [puppet] - 10https://gerrit.wikimedia.org/r/1100782 (https://phabricator.wikimedia.org/T381561) [16:27:43] 10ops-codfw, 06SRE, 06DC-Ops: ganeti2042 seems to have a broken CPU? (new Supermicro node) - https://phabricator.wikimedia.org/T378358#10383997 (10MoritzMuehlenhoff) Ack, I'm going to re-add it to the cluster tomorrow. 
[16:27:46] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudelastic1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:27:51] (03PS3) 10Tiziano Fogli: blackbox/http: deployment sites controlled by input parameter instead of ::site [puppet] - 10https://gerrit.wikimedia.org/r/1100838 (https://phabricator.wikimedia.org/T381561) [16:27:58] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudelastic1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:28:02] (03PS3) 10Tiziano Fogli: blackbox/tcp: deployment sites controlled by input parameter instead of ::site [puppet] - 10https://gerrit.wikimedia.org/r/1100839 (https://phabricator.wikimedia.org/T381561) [16:28:17] (03PS4) 10Tiziano Fogli: cloudgw: move icmp checks under wmcs [puppet] - 10https://gerrit.wikimedia.org/r/1100819 (https://phabricator.wikimedia.org/T381580) [16:28:21] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: apply [16:28:38] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: apply [16:28:39] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: apply [16:29:08] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [16:29:29] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/benthos-cache-invalidator: apply [16:29:32] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/benthos-cache-invalidator: apply [16:29:42] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on mw2448 - https://phabricator.wikimedia.org/T381558#10384000 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm T 381568 - renamed server to wikikube-worker2176 T 358489 - probably false alert from this ticket. 
[16:29:57] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [16:30:08] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [16:31:00] 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.10 point update - https://phabricator.wikimedia.org/T368288#10384005 (10MoritzMuehlenhoff) [16:31:08] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudelastic1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:31:21] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudelastic1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:36:18] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudelastic1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:36:31] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudelastic1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [16:41:39] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: es2045 went down: CPU error - https://phabricator.wikimedia.org/T381549#10384077 (10Jhancock.wm) I need to do a few maintenance things first, per the error code instructions, but I will open a ticket with Dell about this to get the part replaced. I'll keep... 
[16:43:11] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1048.eqiad.wmnet with reason: host reimage [16:43:46] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1049.eqiad.wmnet with reason: host reimage [16:43:53] (03PS3) 10Thcipriani: Reinstate the banner for the developer survey [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1100163 (owner: 10Hashar) [16:45:28] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudelastic1011.eqiad.wmnet with OS bullseye [16:45:34] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.11.30 - 2024.12.20): Q2:rack/setup/install cloudelastic101[12] - https://phabricator.wikimedia.org/T378368#10384098 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudelastic... [16:46:21] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudelastic1012.eqiad.wmnet with OS bullseye [16:46:31] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.11.30 - 2024.12.20): Q2:rack/setup/install cloudelastic101[12] - https://phabricator.wikimedia.org/T378368#10384102 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudelastic... [16:46:50] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1048.eqiad.wmnet with reason: host reimage [16:47:07] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 05 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100158 (https://phabricator.wikimedia.org/T377128) (owner: 10Ebernhardson) [16:47:16] (03CR) 10Urbanecm: [C:03+1] "Resolved in T380518, removing CR-1." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087584 (https://phabricator.wikimedia.org/T356294) (owner: 10Tchanders) [16:47:51] (03CR) 10Hnowlan: [C:03+2] mw-videoscaler: update kafka brokers in eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100841 (owner: 10Effie Mouzeli) [16:49:02] (03Merged) 10jenkins-bot: mw-videoscaler: update kafka brokers in eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100841 (owner: 10Effie Mouzeli) [16:49:35] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1049.eqiad.wmnet with reason: host reimage [16:58:38] (03PS6) 10Hnowlan: mediawiki: add multi-job support to mercurius [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099752 (https://phabricator.wikimedia.org/T371701) [17:00:05] jhathaway and rzl: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241205T1700). [17:00:05] Dreamy_Jazz and MatmaRex: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:00:12] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudelastic1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:00:15] \o [17:00:16] hello [17:00:28] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudelastic1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:01:21] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudelastic1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:01:30] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudelastic1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:03:00] jhathaway: are you able to look at the puppet window? 
otherwise I can but I'm running a few minutes behind [17:03:17] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: es2045 went down: CPU error - https://phabricator.wikimedia.org/T381549#10384171 (10Marostegui) Sure, you can do whatever is needed. The host has no data and it's not in production. Thanks! [17:03:44] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudelastic1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:03:53] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudelastic1012.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:03:58] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudelastic1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:04:06] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudelastic1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:05:12] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1048.eqiad.wmnet with OS bookworm [17:05:50] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.11.30 - 2024.12.20): Q2:rack/setup/install cloudelastic101[12] - https://phabricator.wikimedia.org/T378368#10384174 (10Jclark-ctr) [17:06:42] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.11.30 - 2024.12.20): Q2:rack/setup/install cloudelastic101[12] - https://phabricator.wikimedia.org/T378368#10384189 (10Jclark-ctr) @elukey Hey luca these two are failing to provision these are custom configs [17:06:58] (03PS4) 10Thcipriani: Reinstate the banner for the developer survey [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1100163 (owner: 10Hashar) [17:08:12] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1049.eqiad.wmnet with OS bookworm [17:09:46] (03CR) 10Thcipriani: [C:03+1] "works for me locally" 
[software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1100163 (owner: 10Hashar) [17:10:04] MatmaRex: sorry, I'd like you to get these reviewed by someone more familiar with the beta cluster setup than I am [17:10:22] rzl: who would that be? [17:10:28] I can take care of the puppet merge if you still need that, but I'm not comfortable being the sole reviewer on them [17:10:32] I don't know, sorry [17:10:50] hmm [17:11:31] !log jelto@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1050.eqiad.wmnet with OS bookworm [17:11:39] !log jelto@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1051.eqiad.wmnet with OS bookworm [17:12:43] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1050.eqiad.wmnet with OS bookworm [17:12:52] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: es2045 went down: CPU error - https://phabricator.wikimedia.org/T381549#10384206 (10Jhancock.wm) Dell Service Request Submitted: SR202125350 [17:14:18] (03CR) 10RLazarus: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4647/co" [puppet] - 10https://gerrit.wikimedia.org/r/1100427 (https://phabricator.wikimedia.org/T355169) (owner: 10Dreamy Jazz) [17:15:19] (03CR) 10RLazarus: [V:03+1 C:03+2] Update MediaModeration module to run scans automatically [puppet] - 10https://gerrit.wikimedia.org/r/1100427 (https://phabricator.wikimedia.org/T355169) (owner: 10Dreamy Jazz) [17:17:15] thanks rzl for taking care of the puppet window, I had stepped away from the computer for moment [17:17:39] Dreamy_Jazz: merged, and the next regular puppet run will happen on mwmaint2002 before the first job kicks off at :34 -- I assume you're okay with waiting until then to see how it goes? 
[17:17:53] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T381568#10384244 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [17:17:55] jhathaway: no worries! [17:18:12] Yeah. It's best to avoid having more than one job running at once, so would prefer to wait for it to be automatically started. Thanks. [17:18:39] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: es2045 went down: CPU error - https://phabricator.wikimedia.org/T381549#10384254 (10Jhancock.wm) a:03Jhancock.wm [17:18:42] Dreamy_Jazz: sgtm, feel free ping me later if you need any followups [17:18:47] *to [17:18:47] Thanks! [17:19:15] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [17:19:38] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1051.eqiad.wmnet with OS bookworm [17:19:49] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [17:20:29] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudcephosd2004-dev - https://phabricator.wikimedia.org/T378825#10384256 (10Jhancock.wm) [17:21:27] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host cloudelastic1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:21:43] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudelastic1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:22:07] (03CR) 10Krinkle: [C:03+1] MediaWiki: Redirect auth domain root to wikimedia.org portal [puppet] - 10https://gerrit.wikimedia.org/r/1100532 (https://phabricator.wikimedia.org/T380551) (owner: 10Bartosz Dziewoński) [17:25:30] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on db1125.eqiad.wmnet with reason: Test setup should not alert [17:25:44] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 3 days, 0:00:00 on db1125.eqiad.wmnet with reason: Test setup should not alert [17:25:58] (03CR) 10Krinkle: [C:03+1] MediaWiki: Remove duplicate ErrorDocument 404 from beta cluster [puppet] - 10https://gerrit.wikimedia.org/r/1100533 (owner: 10Bartosz Dziewoński) [17:30:46] 10ops-eqiad, 06SRE, 06Data-Persistence, 06Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10384322 (10Jhancock.wm) [17:30:48] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1050.eqiad.wmnet with reason: host reimage [17:34:28] FIRING: [2x] SystemdUnitFailed: mediawiki_job_purge_parsercache_pc4.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:34:41] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1050.eqiad.wmnet with reason: host reimage [17:34:53] (03PS3) 10Bartosz Dziewoński: MediaWiki: Redirect auth domain root to wikimedia.org portal [puppet] - 10https://gerrit.wikimedia.org/r/1100532 (https://phabricator.wikimedia.org/T380551) [17:36:46] !log upgrading facter on bullseye puppet nodes [17:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:54] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1051.eqiad.wmnet with reason: host reimage [17:39:42] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1051.eqiad.wmnet with reason: host reimage [17:40:56] !log pt1979@cumin1002 START - Cookbook sre.hosts.provision for host cloudelastic1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:41:23] !log pt1979@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host 
cloudelastic1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [17:46:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [17:47:48] (03CR) 10Bartosz Dziewoński: "This seems to make https://auth.wikimedia.beta.wmflabs.org/enwiki/ redirect as well, not sure why. I don't think that's desired." [puppet] - 10https://gerrit.wikimedia.org/r/1100532 (https://phabricator.wikimedia.org/T380551) (owner: 10Bartosz Dziewoński) [17:51:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [17:53:28] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1050.eqiad.wmnet with OS bookworm [17:53:56] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host es1043.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:57:44] (03PS1) 10Isabelle Hurbain-Palatin: Reactivate Parsoid+Kartographer on hewiki and commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100850 (https://phabricator.wikimedia.org/T373454) [17:57:59] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1051.eqiad.wmnet with OS bookworm [17:59:00] !log homer 'cr*eqiad*' commit 'T377876' [17:59:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:04] T377876: Migrate wikikube-eqiad to containerd - https://phabricator.wikimedia.org/T377876 [18:00:05] bd808: #bothumor Q:How do functions break up? 
A:They stop calling each other. Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241205T1800). [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241205T1800) [18:00:11] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es1043.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [18:00:40] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host es1043.eqiad.wmnet with OS bookworm [18:00:47] 10ops-eqiad, 06SRE, 06Data-Persistence, 06Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10384502 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host es1043.eqiad.wmnet with OS bookworm [18:04:29] 06SRE, 06Traffic: Upgrade pdns-recursor to 5.x on all prod DNS hosts (all C:dnsrecursor and so possibly WMCS) - https://phabricator.wikimedia.org/T381608 (10ssingh) 03NEW [18:04:34] 06SRE, 06Traffic: Upgrade pdns-recursor to 5.x on all prod DNS hosts (all C:dnsrecursor and so possibly WMCS) - https://phabricator.wikimedia.org/T381608#10384526 (10ssingh) p:05Triage→03Low [18:07:07] 10ops-eqiad, 06SRE, 06Data-Persistence, 06Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10384545 (10Jhancock.wm) es1043 is gonna fail again puppetmaster1001:~$ sudo puppet cert --list Warning: `puppet cert` is deprecated and will... [18:07:39] (03CR) 10Isabelle Hurbain-Palatin: "We're probably not going to deploy this this year (... I think?), but at least the patch is around to not forget about it." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100850 (https://phabricator.wikimedia.org/T373454) (owner: 10Isabelle Hurbain-Palatin) [18:10:58] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:12:23] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops, and 3 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T381504#10384568 (10Jelto) [18:13:01] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1048-1049,1051].eqiad.wmnet [18:13:02] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1048-1049,1051].eqiad.wmnet [18:44:26] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es1043.eqiad.wmnet with OS bookworm [18:44:32] 10ops-eqiad, 06SRE, 06Data-Persistence, 06Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10384668 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host es1043.eqiad.wmnet with OS bookworm... 
[18:56:53] (03CR) 10CDobbins: [C:03+2] lvs: Deploy node_ferm_mss exporter on ferm based realservers [puppet] - 10https://gerrit.wikimedia.org/r/1099792 (https://phabricator.wikimedia.org/T365689) (owner: 10CDobbins) [19:00:05] jeena and hashar: Deploy window MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241205T1900) [19:05:42] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:13:46] (03PS1) 10TrainBranchBot: group2 to 1.44.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100856 (https://phabricator.wikimedia.org/T375665) [19:13:48] (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.44.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100856 (https://phabricator.wikimedia.org/T375665) (owner: 10TrainBranchBot) [19:14:33] (03Merged) 10jenkins-bot: group2 to 1.44.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100856 (https://phabricator.wikimedia.org/T375665) (owner: 10TrainBranchBot) [19:28:26] !log jhuneidi@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.44.0-wmf.6 refs T375665 [19:28:30] T375665: 1.44.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T375665 [19:31:34] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:34:33] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T376150, initialize wdqs internal main tier) xfer wikidata_main from wdqs2021.codfw.wmnet -> wdqs2019.codfw.wmnet w/ force delete existing files, repooling both afterwards [19:34:36] T376150: Prepare hosts to serve wdqs-internal-main & wdqs-internal-scholarly - https://phabricator.wikimedia.org/T376150 
[19:40:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 1.089s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:45:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 1.089s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:49:26] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:49:26] RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:49:56] RECOVERY - BGP status on cr1-magru is OK: BGP OK - up: 12, down: 1, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:51:54] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:55:11] (03PS1) 10Hnowlan: Revert "jobqueue: disable webVideoTranscodePrioritized" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100857 [19:56:41] (03CR) 10Hnowlan: [C:03+2] Revert "jobqueue: disable webVideoTranscodePrioritized" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100857 (owner: 10Hnowlan) [19:57:44] (03Merged) 10jenkins-bot: Revert "jobqueue: disable webVideoTranscodePrioritized" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1100857 (owner: 10Hnowlan) [20:09:22] (03CR) 10Ebomani: [C:03+1] "Nice and smooth! 
Great job :))" [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1100163 (owner: 10Hashar) [20:12:03] (03PS5) 10Eevans: cassandra: configurations merged from upstream 4.1.7 [puppet] - 10https://gerrit.wikimedia.org/r/1100549 (https://phabricator.wikimedia.org/T380420) [20:12:03] (03PS1) 10Eevans: cassandra_dev: upgrade Cassandra to 'dev' (aka 4.1.7) [puppet] - 10https://gerrit.wikimedia.org/r/1100859 (https://phabricator.wikimedia.org/T380420) [20:15:25] FIRING: SystemdUnitFailed: user@499.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:17:40] (03PS2) 10Eevans: cassandra_dev: upgrade Cassandra to 'dev' (aka 4.1.7) [puppet] - 10https://gerrit.wikimedia.org/r/1100859 (https://phabricator.wikimedia.org/T380420) [20:17:40] (03PS1) 10Eevans: cassandra: pin 'dev' to cassandra 4.1.7 [puppet] - 10https://gerrit.wikimedia.org/r/1100860 (https://phabricator.wikimedia.org/T380420) [20:20:36] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:21:06] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1100860 (https://phabricator.wikimedia.org/T380420) (owner: 10Eevans) [20:27:09] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T376150, initialize wdqs internal main tier) xfer wikidata_main from wdqs2021.codfw.wmnet -> wdqs2019.codfw.wmnet w/ force delete existing files, repooling both afterwards [20:27:12] T376150: Prepare hosts to serve wdqs-internal-main & wdqs-internal-scholarly - https://phabricator.wikimedia.org/T376150 [20:28:52] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1100859 (https://phabricator.wikimedia.org/T380420) (owner: 10Eevans) [20:35:14] (03PS15) 10Dzahn: 
phabricator: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1055493 (https://phabricator.wikimedia.org/T370677) [20:39:18] (03CR) 10Eevans: [C:03+2] cassandra: configurations merged from upstream 4.1.7 [puppet] - 10https://gerrit.wikimedia.org/r/1100549 (https://phabricator.wikimedia.org/T380420) (owner: 10Eevans) [20:40:36] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:42:22] (03CR) 10Eevans: [C:03+2] cassandra: pin 'dev' to cassandra 4.1.7 [puppet] - 10https://gerrit.wikimedia.org/r/1100860 (https://phabricator.wikimedia.org/T380420) (owner: 10Eevans) [20:51:52] (03PS1) 10JHathaway: facter: fix facter conf location, attempt 3 [puppet] - 10https://gerrit.wikimedia.org/r/1100861 (https://phabricator.wikimedia.org/T330490) [20:52:07] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1100861 (https://phabricator.wikimedia.org/T330490) (owner: 10JHathaway) [20:58:01] (03CR) 10JHathaway: [C:03+2] facter: fix facter conf location, attempt 3 [puppet] - 10https://gerrit.wikimedia.org/r/1100861 (https://phabricator.wikimedia.org/T330490) (owner: 10JHathaway) [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: That opportune time for a UTC late backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241205T2100). [21:00:05] ebernhardson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:04:02] ebernhardson: are you able to self-deploy? 
otherwise happy to deploy for you if needed
[21:05:34] (CR) Gergő Tisza: [C:+1] MediaWiki: Ensure nice 404 instead of php-fpm 404 on auth domain (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1100530 (https://phabricator.wikimedia.org/T380551) (owner: Bartosz Dziewoński)
[21:08:49] (CR) Gergő Tisza: [C:+1] MediaWiki: Define wikimedia.org portal on beta cluster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1100531 (https://phabricator.wikimedia.org/T173887) (owner: Bartosz Dziewoński)
[21:17:35] (CR) CDanis: [C:+2] Update chart renderer service with locale option support [deployment-charts] - https://gerrit.wikimedia.org/r/1100557 (owner: Aude)
[21:19:29] (Merged) jenkins-bot: Update chart renderer service with locale option support [deployment-charts] - https://gerrit.wikimedia.org/r/1100557 (owner: Aude)
[21:19:45] !log cdanis@deploy2002 helmfile [staging] START helmfile.d/services/chart-renderer: apply
[21:20:20] !log cdanis@deploy2002 helmfile [staging] DONE helmfile.d/services/chart-renderer: apply
[21:21:04] !log cdanis@deploy2002 helmfile [codfw] START helmfile.d/services/chart-renderer: apply
[21:21:34] !log cdanis@deploy2002 helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply
[21:23:41] !log cdanis@deploy2002 helmfile [eqiad] START helmfile.d/services/chart-renderer: apply
[21:24:17] !log cdanis@deploy2002 helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply
[21:34:28] FIRING: [2x] SystemdUnitFailed: mediawiki_job_purge_parsercache_pc4.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:41:38] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:46:16] (CR) Gergő Tisza: "In theory, Apache would rewrite `/enwiki/` to `/` and then apply the rewrite rule for the redirect again. This is somewhat hidden in the d" [puppet] - https://gerrit.wikimedia.org/r/1100532 (https://phabricator.wikimedia.org/T380551) (owner: Bartosz Dziewoński)
[21:47:34] (PS1) Cwhite: Disable stats collection when WMF_MAINTENANCE_OFFLINE is set [mediawiki-config] - https://gerrit.wikimedia.org/r/1100864 (https://phabricator.wikimedia.org/T380609)
[21:48:23] (CR) CI reject: [V:-1] Disable stats collection when WMF_MAINTENANCE_OFFLINE is set [mediawiki-config] - https://gerrit.wikimedia.org/r/1100864 (https://phabricator.wikimedia.org/T380609) (owner: Cwhite)
[21:50:07] (PS2) Cwhite: Disable stats collection when WMF_MAINTENANCE_OFFLINE is set [mediawiki-config] - https://gerrit.wikimedia.org/r/1100864 (https://phabricator.wikimedia.org/T380609)
[22:00:53] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T376150, initialize wdqs internal main tier) xfer wikidata_main from wdqs2021.codfw.wmnet -> wdqs2020.codfw.wmnet w/ force delete existing files, repooling both afterwards
[22:00:57] T376150: Prepare hosts to serve wdqs-internal-main & wdqs-internal-scholarly - https://phabricator.wikimedia.org/T376150
[22:08:10] (CR) Cwhite: "If you have any tips on how to see if this fixes the issue outside of trying a scap deploy, please share!" [mediawiki-config] - https://gerrit.wikimedia.org/r/1100864 (https://phabricator.wikimedia.org/T380609) (owner: Cwhite)
[22:13:29] (PS1) Ryan Kemper: wdqs-internal: fix graph split monitoring checks [puppet] - https://gerrit.wikimedia.org/r/1100866 (https://phabricator.wikimedia.org/T379329)
[22:14:16] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1100866 (https://phabricator.wikimedia.org/T379329) (owner: Ryan Kemper)
[22:16:30] (CR) Bking: [C:+1] wdqs-internal: fix graph split monitoring checks [puppet] - https://gerrit.wikimedia.org/r/1100866 (https://phabricator.wikimedia.org/T379329) (owner: Ryan Kemper)
[22:18:37] (PS2) Ryan Kemper: wdqs-internal: fix graph split monitoring checks [puppet] - https://gerrit.wikimedia.org/r/1100866 (https://phabricator.wikimedia.org/T379329)
[22:18:50] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1100866 (https://phabricator.wikimedia.org/T379329) (owner: Ryan Kemper)
[22:19:16] (CR) CI reject: [V:-1] wdqs-internal: fix graph split monitoring checks [puppet] - https://gerrit.wikimedia.org/r/1100866 (https://phabricator.wikimedia.org/T379329) (owner: Ryan Kemper)
[22:20:14] (PS3) Ryan Kemper: wdqs-internal: fix graph split monitoring checks [puppet] - https://gerrit.wikimedia.org/r/1100866 (https://phabricator.wikimedia.org/T379329)
[22:20:32] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1100866 (https://phabricator.wikimedia.org/T379329) (owner: Ryan Kemper)
[22:23:25] (PS4) Ryan Kemper: wdqs-internal: fix graph split monitoring checks [puppet] - https://gerrit.wikimedia.org/r/1100866 (https://phabricator.wikimedia.org/T379329)
[22:23:37] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1100866 (https://phabricator.wikimedia.org/T379329) (owner: Ryan Kemper)
[22:26:46] (PS1) Ryan Kemper: wdqs-internal: remove absented monitoring check [puppet] - https://gerrit.wikimedia.org/r/1100868 (https://phabricator.wikimedia.org/T379329)
[22:27:10] (CR) Bking: [C:+1] wdqs-internal: fix graph split monitoring checks [puppet] - https://gerrit.wikimedia.org/r/1100866 (https://phabricator.wikimedia.org/T379329) (owner: Ryan Kemper)
[22:27:21] (CR) Ryan Kemper: [C:+2] wdqs-internal: fix graph split monitoring checks [puppet] - https://gerrit.wikimedia.org/r/1100866 (https://phabricator.wikimedia.org/T379329) (owner: Ryan Kemper)
[22:30:43] (PS1) Jdlrobson: Enable A/B test on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/1100869 (https://phabricator.wikimedia.org/T378115)
[22:30:51] (CR) CI reject: [V:-1] Enable A/B test on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/1100869 (https://phabricator.wikimedia.org/T378115) (owner: Jdlrobson)
[22:30:55] (PS2) Jdlrobson: Enable Empty search A/B test on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/1100869 (https://phabricator.wikimedia.org/T378115)
[22:31:03] (CR) CI reject: [V:-1] Enable Empty search A/B test on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/1100869 (https://phabricator.wikimedia.org/T378115) (owner: Jdlrobson)
[22:31:12] (PS3) Jdlrobson: Enable Empty search A/B test on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/1100869 (https://phabricator.wikimedia.org/T378115)
[22:31:31] ops-esams, ops-magru, SRE, DC-Ops, Traffic: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#10385144 (BCornwall) Unfortunately, it appears that we're still having throttling issues in magru: ` brett@cumin2002:~$ sudo -i cumin 'A:cp' 'zgrep "Core temperature is...
[22:32:25] (CR) Bking: [C:+1] wdqs-internal: remove absented monitoring check [puppet] - https://gerrit.wikimedia.org/r/1100868 (https://phabricator.wikimedia.org/T379329) (owner: Ryan Kemper)
[22:32:39] SRE, LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T381464#10385158 (Dzahn) ☑ https://app.betterworks.com/app/#/profile/710331 🠢 https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests#WMF_Group
[22:32:58] SRE, LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T381464#10385161 (Dzahn) Open→In progress
[22:33:53] (CR) Ryan Kemper: [C:+2] wdqs-internal: remove absented monitoring check [puppet] - https://gerrit.wikimedia.org/r/1100868 (https://phabricator.wikimedia.org/T379329) (owner: Ryan Kemper)
[22:34:57] (PS4) Bartosz Dziewoński: MediaWiki: Define wikimedia.org portal on beta cluster [puppet] - https://gerrit.wikimedia.org/r/1100531 (https://phabricator.wikimedia.org/T173887)
[22:35:10] (CR) Bartosz Dziewoński: MediaWiki: Define wikimedia.org portal on beta cluster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1100531 (https://phabricator.wikimedia.org/T173887) (owner: Bartosz Dziewoński)
[22:35:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[22:35:51] (PS1) Ryan Kemper: Revert "wdqs-internal: remove absented monitoring check" [puppet] - https://gerrit.wikimedia.org/r/1100870
[22:36:01] (CR) Ryan Kemper: [V:+2 C:+2] Revert "wdqs-internal: remove absented monitoring check" [puppet] - https://gerrit.wikimedia.org/r/1100870 (owner: Ryan Kemper)
[22:38:37] (PS1) Ryan Kemper: wdqs-internal: remove absented monitoring check [puppet] - https://gerrit.wikimedia.org/r/1100871 (https://phabricator.wikimedia.org/T379329)
[22:38:56] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:38:56] PROBLEM - BGP status on cr1-magru is CRITICAL: BGP CRITICAL - AS12956/IPv6: Connect - Telxius, AS12956/IPv4: Connect - Telxius https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[22:39:26] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:39:28] PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:41:25] (CR) Bartosz Dziewoński: "Huh, it doesn't redirect for me either now, but it definitely was redirecting when I wrote that comment. I had a similar situation with an" [puppet] - https://gerrit.wikimedia.org/r/1100532 (https://phabricator.wikimedia.org/T380551) (owner: Bartosz Dziewoński)
[22:44:56] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:44:56] RECOVERY - BGP status on cr1-magru is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[22:45:26] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:45:34] RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:50:48] (CR) Jdrewniak: [C:+2] Enable Empty search A/B test on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/1100869 (https://phabricator.wikimedia.org/T378115) (owner: Jdlrobson)
[22:51:30] (Merged) jenkins-bot: Enable Empty search A/B test on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/1100869 (https://phabricator.wikimedia.org/T378115) (owner: Jdlrobson)
[22:53:43] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T376150, initialize wdqs internal main tier) xfer wikidata_main from wdqs2021.codfw.wmnet -> wdqs2020.codfw.wmnet w/ force delete existing files, repooling both afterwards
[22:53:47] T376150: Prepare hosts to serve wdqs-internal-main & wdqs-internal-scholarly - https://phabricator.wikimedia.org/T376150
[22:56:34] (PS4) Bartosz Dziewoński: MediaWiki: Redirect auth domain root to wikimedia.org portal [puppet] - https://gerrit.wikimedia.org/r/1100532 (https://phabricator.wikimedia.org/T380551)
[22:59:12] (PS3) Bartosz Dziewoński: MediaWiki: Remove duplicate ErrorDocument 404 from beta cluster [puppet] - https://gerrit.wikimedia.org/r/1100533
[22:59:25] (PS3) Bartosz Dziewoński: MediaWiki: Only proxy existing .php files, otherwise return nice 404 [puppet] - https://gerrit.wikimedia.org/r/1100534 (https://phabricator.wikimedia.org/T380551)
[23:00:40] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:04:28] FIRING: [8x] SystemdUnitFailed: wdqs-blazegraph.service on wdqs2020:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:04:35] FIRING: [16x] ProbeDown: Service wdqs1026:443 has failed probes (http_wdqs_internal_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:05:42] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:15:46] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:19:30] RESOLVED: [2x] ProbeDown: Service wdqs2020:443 has failed probes (http_wdqs_internal_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2020:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:20:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[23:23:54] !log Start revalidateLinkRecommendations.php for Add Link-enabled wikis via mwscript-k8s (T380455)
[23:23:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:23:58] T380455: Run revalidateLinkRecommendations.php for wikis with more than 25 excluded sections - https://phabricator.wikimedia.org/T380455
[23:25:49] (CR) Eevans: [C:+2] cassandra_dev: upgrade Cassandra to 'dev' (aka 4.1.7) [puppet] - https://gerrit.wikimedia.org/r/1100859 (https://phabricator.wikimedia.org/T380420) (owner: Eevans)
[23:26:27] !log looking at puppet failures on an-workers
[23:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:30:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[23:55:22] ops-esams, ops-magru, SRE, DC-Ops, Traffic: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#10385350 (BCornwall) Some observations: * [[ https://grafana.wikimedia.org/goto/_53fKoVHR?orgId=1 | magru has the highest average CPU temperature by site yet the lowest...