[00:00:06] RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:00:08] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:00:47] (PS1) Zabe: beta: Remove deployment-db10 [mediawiki-config] - https://gerrit.wikimedia.org/r/888814 (https://phabricator.wikimedia.org/T329577)
[00:01:09] (PS2) Dzahn: serviceops-collab: switch alert severity to 'task' globally [puppet] - https://gerrit.wikimedia.org/r/888813 (https://phabricator.wikimedia.org/T329587)
[00:01:14] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2439.codfw.wmnet with reason: host reimage
[00:01:33] PROBLEM - Host mc-gp2003 is DOWN: PING CRITICAL - Packet loss = 100%
[00:01:39] (CR) Zabe: [C: +2] beta: Remove deployment-db10 [mediawiki-config] - https://gerrit.wikimedia.org/r/888814 (https://phabricator.wikimedia.org/T329577) (owner: Zabe)
[00:02:18] (Merged) jenkins-bot: beta: Remove deployment-db10 [mediawiki-config] - https://gerrit.wikimedia.org/r/888814 (https://phabricator.wikimedia.org/T329577) (owner: Zabe)
[00:02:59] (CR) Dzahn: [C: +2] serviceops-collab: switch alert severity to 'task' globally [puppet] - https://gerrit.wikimedia.org/r/888813 (https://phabricator.wikimedia.org/T329587) (owner: Dzahn)
[00:03:27] PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: planet_sync_tile_generation-gis.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:03:27] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:04:07] PROBLEM - Disk space on thanos-be2002
is CRITICAL: DISK CRITICAL - free space: / 1886 MB (3% inode=98%): /tmp 1886 MB (3% inode=98%): /var/tmp 1886 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2002&var-datasource=codfw+prometheus/ops
[00:04:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T328255)', diff saved to https://phabricator.wikimedia.org/P44540 and previous config saved to /var/cache/conftool/dbconfig/20230214-000419-ladsgroup.json
[00:04:23] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[00:04:48] (PS3) Dzahn: planet: add blackbox::check::http monitor [puppet] - https://gerrit.wikimedia.org/r/884390 (https://phabricator.wikimedia.org/T327977)
[00:04:49] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['mc-gp2003']
[00:05:37] (CR) Dzahn: [C: +2] planet: add blackbox::check::http monitor [puppet] - https://gerrit.wikimedia.org/r/884390 (https://phabricator.wikimedia.org/T327977) (owner: Dzahn)
[00:06:05] RECOVERY - Host mc-gp2003 is UP: PING OK - Packet loss = 0%, RTA = 33.21 ms
[00:06:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126', diff saved to https://phabricator.wikimedia.org/P44541 and previous config saved to /var/cache/conftool/dbconfig/20230214-000629-marostegui.json
[00:06:59] PROBLEM - Disk space on thanos-be2003 is CRITICAL: DISK CRITICAL - free space: / 293 MB (0% inode=98%): /tmp 293 MB (0% inode=98%): /var/tmp 293 MB (0% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2003&var-datasource=codfw+prometheus/ops
[00:10:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P44542 and
previous config saved to /var/cache/conftool/dbconfig/20230214-001053-marostegui.json
[00:13:52] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:17:14] (PS3) Dzahn: doc: add blackbox::check::http monitor [puppet] - https://gerrit.wikimedia.org/r/884393 (https://phabricator.wikimedia.org/T329587)
[00:17:30] SRE, ops-codfw, ops-eqiad, DC-Ops, serviceops: Update iDRAC and NIC firmware on mc-gp* hosts - https://phabricator.wikimedia.org/T329323 (Papaul)
[00:17:56] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:19:07] (CR) CI reject: [V: -1] doc: add blackbox::check::http monitor [puppet] - https://gerrit.wikimedia.org/r/884393 (https://phabricator.wikimedia.org/T329587) (owner: Dzahn)
[00:21:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126 (T328817)', diff saved to https://phabricator.wikimedia.org/P44543 and previous config saved to /var/cache/conftool/dbconfig/20230214-002136-marostegui.json
[00:21:38] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1167.eqiad.wmnet with reason: Maintenance
[00:21:40] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[00:21:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1167.eqiad.wmnet with reason: Maintenance
[00:21:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[00:22:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason:
Maintenance
[00:22:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1167 (T328817)', diff saved to https://phabricator.wikimedia.org/P44544 and previous config saved to /var/cache/conftool/dbconfig/20230214-002214-marostegui.json
[00:22:40] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:22:41] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2439.codfw.wmnet with OS buster
[00:22:44] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:22:45] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2438.codfw.wmnet with OS buster
[00:22:47] SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2439.codfw.wmnet with OS buster completed: - mw2439 (**PASS**) - Removed from Pupp...
[00:22:51] SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2438.codfw.wmnet with OS buster completed: - mw2438 (**PASS**) - Removed from Pupp...
[00:23:13] (KubernetesCalicoDown) firing: (4) ml-staging-ctrl2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[00:24:45] RECOVERY - Disk space on thanos-be2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2002&var-datasource=codfw+prometheus/ops
[00:25:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T329203)', diff saved to https://phabricator.wikimedia.org/P44545 and previous config saved to /var/cache/conftool/dbconfig/20230214-002559-marostegui.json
[00:26:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[00:26:03] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[00:26:14] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[00:26:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T329203)', diff saved to https://phabricator.wikimedia.org/P44546 and previous config saved to /var/cache/conftool/dbconfig/20230214-002620-marostegui.json
[00:27:43] RECOVERY - Disk space on thanos-be2003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2003&var-datasource=codfw+prometheus/ops
[00:32:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T329203)', diff saved to https://phabricator.wikimedia.org/P44547 and previous config saved to /var/cache/conftool/dbconfig/20230214-003201-marostegui.json
[00:32:05] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf
wikis - https://phabricator.wikimedia.org/T329203
[00:34:18] (PS4) Dzahn: doc: add blackbox::check::http monitor [puppet] - https://gerrit.wikimedia.org/r/884393 (https://phabricator.wikimedia.org/T329587)
[00:34:39] (CR) CI reject: [V: -1] doc: add blackbox::check::http monitor [puppet] - https://gerrit.wikimedia.org/r/884393 (https://phabricator.wikimedia.org/T329587) (owner: Dzahn)
[00:35:23] PROBLEM - Check systemd state on logstash1023 is CRITICAL: CRITICAL - degraded: The following units failed: run-dashboards-backup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:35:55] (PS1) Zabe: beta: Switch beta to read only on mediawiki level [mediawiki-config] - https://gerrit.wikimedia.org/r/888817
[00:36:20] (PS2) Zabe: beta: Switch beta to read only on mediawiki level [mediawiki-config] - https://gerrit.wikimedia.org/r/888817
[00:36:27] (CR) Zabe: [C: +2] beta: Switch beta to read only on mediawiki level [mediawiki-config] - https://gerrit.wikimedia.org/r/888817 (owner: Zabe)
[00:36:53] (CR) TrainBranchBot: [C: +2] "Approved by zabe@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/888817 (owner: Zabe)
[00:37:05] (Merged) jenkins-bot: beta: Switch beta to read only on mediawiki level [mediawiki-config] - https://gerrit.wikimedia.org/r/888817 (owner: Zabe)
[00:39:04] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2440.codfw.wmnet with OS buster
[00:39:11] SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2440.codfw.wmnet with OS buster
[00:40:51] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2441.codfw.wmnet with OS buster
[00:40:58] SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 -
https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2441.codfw.wmnet with OS buster
[00:42:42] (CR) Cwhite: [C: +1] Add logs-api service [puppet] - https://gerrit.wikimedia.org/r/888700 (https://phabricator.wikimedia.org/T320702) (owner: Filippo Giunchedi)
[00:43:37] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2442.codfw.wmnet with OS buster
[00:43:46] SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2442.codfw.wmnet with OS buster
[00:45:09] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:46:36] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2443.codfw.wmnet with OS buster
[00:46:46] SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2443.codfw.wmnet with OS buster
[00:47:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P44548 and previous config saved to /var/cache/conftool/dbconfig/20230214-004707-marostegui.json
[00:50:25] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:50:29] (PS5) Dzahn: doc: add blackbox::check::http monitor [puppet] - https://gerrit.wikimedia.org/r/884393 (https://phabricator.wikimedia.org/T329587)
[00:52:21] (CR) CI reject: [V: -1] doc: add blackbox::check::http
monitor [puppet] - https://gerrit.wikimedia.org/r/884393 (https://phabricator.wikimedia.org/T329587) (owner: Dzahn)
[01:00:57] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:01:18] (PS1) Superpes15: [blkwiki] Add an alias for "SPECIAL:" Namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/888821 (https://phabricator.wikimedia.org/T317598)
[01:01:48] (CR) CI reject: [V: -1] [blkwiki] Add an alias for "SPECIAL:" Namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/888821 (https://phabricator.wikimedia.org/T317598) (owner: Superpes15)
[01:02:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P44549 and previous config saved to /var/cache/conftool/dbconfig/20230214-010214-marostegui.json
[01:04:13] (CR) Superpes15: "This change is ready for review."
[mediawiki-config] - https://gerrit.wikimedia.org/r/888821 (https://phabricator.wikimedia.org/T317598) (owner: Superpes15)
[01:04:15] (PS6) Dzahn: doc: add blackbox::check::http monitor [puppet] - https://gerrit.wikimedia.org/r/884393 (https://phabricator.wikimedia.org/T329587)
[01:04:38] ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T329595 (phaultfinder)
[01:05:33] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:06:19] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:06:21] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:08:09] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49567 bytes in 7.597 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:08:59] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.228 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:15:23] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:17:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T329203)', diff saved to https://phabricator.wikimedia.org/P44550 and previous config saved to /var/cache/conftool/dbconfig/20230214-011720-marostegui.json
[01:17:22] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[01:17:25] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis -
https://phabricator.wikimedia.org/T329203
[01:17:36] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[01:17:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[01:17:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[01:17:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T329203)', diff saved to https://phabricator.wikimedia.org/P44551 and previous config saved to /var/cache/conftool/dbconfig/20230214-011758-marostegui.json
[01:19:14] (PuppetCertificateAboutToExpire) firing: (2) Puppet CA certificate labstore1006.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[01:20:49] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:22:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T328817)', diff saved to https://phabricator.wikimedia.org/P44552 and previous config saved to /var/cache/conftool/dbconfig/20230214-012230-marostegui.json
[01:22:34] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[01:23:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T329203)', diff saved to https://phabricator.wikimedia.org/P44553 and previous config saved to /var/cache/conftool/dbconfig/20230214-012312-marostegui.json
[01:23:16] T329203: Add new column cuc_only_for_read_old to cu_changes for
migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[01:32:06] (PS6) Urbanecm: [tox] Make running `tox` work [mediawiki-config] - https://gerrit.wikimedia.org/r/887830 (https://phabricator.wikimedia.org/T329231)
[01:32:19] (CR) Urbanecm: [tox] Make running `tox` work (3 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/887830 (https://phabricator.wikimedia.org/T329231) (owner: Urbanecm)
[01:35:22] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw2440.codfw.wmnet with OS buster
[01:35:29] SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2440.codfw.wmnet with OS buster executed with errors: - mw2440 (**FAIL**) - Remove...
[01:37:10] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw2441.codfw.wmnet with OS buster
[01:37:17] SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2441.codfw.wmnet with OS buster executed with errors: - mw2441 (**FAIL**) - Remove...
[01:37:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P44554 and previous config saved to /var/cache/conftool/dbconfig/20230214-013736-marostegui.json
[01:38:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P44555 and previous config saved to /var/cache/conftool/dbconfig/20230214-013818-marostegui.json
[01:39:55] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw2442.codfw.wmnet with OS buster
[01:40:01] SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2442.codfw.wmnet with OS buster executed with errors: - mw2442 (**FAIL**) - Remove...
[01:42:54] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw2443.codfw.wmnet with OS buster
[01:43:00] SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2443.codfw.wmnet with OS buster executed with errors: - mw2443 (**FAIL**) - Remove...
[01:51:44] (CR) Urbanecm: "check experimental" [mediawiki-config] - https://gerrit.wikimedia.org/r/887830 (https://phabricator.wikimedia.org/T329231) (owner: Urbanecm)
[01:52:19] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2440.codfw.wmnet with OS buster
[01:52:26] SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2440.codfw.wmnet with OS buster
[01:52:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P44556 and previous config saved to /var/cache/conftool/dbconfig/20230214-015242-marostegui.json
[01:53:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P44557 and previous config saved to /var/cache/conftool/dbconfig/20230214-015325-marostegui.json
[01:54:14] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[02:01:31] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host mw2440.codfw.wmnet with OS buster
[02:01:38] SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2440.codfw.wmnet with OS buster executed with errors: - mw2440 (**FAIL**) - Remove...
[02:04:56] SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (Papaul) @Jhancock.wm can you please take a look at mw244[0-3], it looks like you connected the network cable to NIC 2 and not NIC 1. Thank you
[02:07:27] SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (Papaul)
[02:07:29] (JobUnavailable) firing: (4) Reduced availability for job calico-felix in k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:07:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T328817)', diff saved to https://phabricator.wikimedia.org/P44558 and previous config saved to /var/cache/conftool/dbconfig/20230214-020748-marostegui.json
[02:07:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[02:07:53] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[02:08:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[02:08:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T329203)', diff saved to https://phabricator.wikimedia.org/P44559 and previous config saved to /var/cache/conftool/dbconfig/20230214-020831-marostegui.json
[02:08:33] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[02:08:35] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[02:08:46] !log marostegui@cumin1001 END (PASS) - Cookbook
sre.hosts.downtime (exit_code=0) for 12:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[02:08:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T329203)', diff saved to https://phabricator.wikimedia.org/P44560 and previous config saved to /var/cache/conftool/dbconfig/20230214-020852-marostegui.json
[02:11:50] (PS4) Superpes15: [blkwiki] Add an alias for "SPECIAL:" namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/888821 (https://phabricator.wikimedia.org/T317598)
[02:12:05] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2444.codfw.wmnet with OS buster
[02:12:12] SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2444.codfw.wmnet with OS buster
[02:13:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T329203)', diff saved to https://phabricator.wikimedia.org/P44561 and previous config saved to /var/cache/conftool/dbconfig/20230214-021358-marostegui.json
[02:14:02] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[02:18:27] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2445.codfw.wmnet with OS buster
[02:18:35] SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2445.codfw.wmnet with OS buster
[02:19:17] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2446.codfw.wmnet with OS buster
[02:19:25] SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was
started by pt1979@cumin2002 for host mw2446.codfw.wmnet with OS buster
[02:20:27] (Abandoned) Superpes15: [blkwiki] Add an alias for "SPECIAL:" namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/888821 (https://phabricator.wikimedia.org/T317598) (owner: Superpes15)
[02:22:29] (JobUnavailable) firing: (4) Reduced availability for job calico-felix in k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:29:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P44562 and previous config saved to /var/cache/conftool/dbconfig/20230214-022904-marostegui.json
[02:31:41] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2444.codfw.wmnet with reason: host reimage
[02:34:48] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2444.codfw.wmnet with reason: host reimage
[02:38:05] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2445.codfw.wmnet with reason: host reimage
[02:39:05] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2446.codfw.wmnet with reason: host reimage
[02:41:14] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2445.codfw.wmnet with reason: host reimage
[02:43:37] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2446.codfw.wmnet with reason: host reimage
[02:44:04] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2447.codfw.wmnet with OS buster
[02:44:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P44563 and previous config saved to
/var/cache/conftool/dbconfig/20230214-024410-marostegui.json
[02:44:12] SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2447.codfw.wmnet with OS buster
[02:52:08] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[02:53:30] !log tgr: Deployed security patch for T328643
[02:55:56] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[02:55:57] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2444.codfw.wmnet with OS buster
[02:56:04] SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2444.codfw.wmnet with OS buster completed: - mw2444 (**PASS**) - Removed from Pupp...
[02:57:51] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[02:58:40] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[02:59:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T329203)', diff saved to https://phabricator.wikimedia.org/P44564 and previous config saved to /var/cache/conftool/dbconfig/20230214-025917-marostegui.json
[02:59:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[02:59:21] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[02:59:21] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[03:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230214T0300)
[03:03:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[03:03:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[03:03:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T329203)', diff saved to https://phabricator.wikimedia.org/P44565 and previous config saved to /var/cache/conftool/dbconfig/20230214-030345-marostegui.json
[03:04:15] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2447.codfw.wmnet with reason: host reimage
[03:04:28] !log pt1979@cumin2002 END (FAIL) - Cookbook
sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [03:04:29] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2446.codfw.wmnet with OS buster [03:04:31] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [03:04:31] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2445.codfw.wmnet with OS buster [03:04:35] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2446.codfw.wmnet with OS buster completed: - mw2446 (**PASS**) - Removed from Pupp... [03:04:40] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2445.codfw.wmnet with OS buster completed: - mw2445 (**PASS**) - Removed from Pupp... 
[03:05:54] (03PS1) 10Raymond Ndibe: puppet: improvements to replica_cnf_api functional tests [puppet] - 10https://gerrit.wikimedia.org/r/888827 (https://phabricator.wikimedia.org/T303663) [03:07:03] (03PS1) 10Legoktm: gitlab_runner: Set pull_policy = ["always", "if-not-present"] [puppet] - 10https://gerrit.wikimedia.org/r/888828 (https://phabricator.wikimedia.org/T329216) [03:07:29] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2447.codfw.wmnet with reason: host reimage [03:07:44] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.40.0-wmf.23 [core] (wmf/1.40.0-wmf.23) - 10https://gerrit.wikimedia.org/r/888729 (https://phabricator.wikimedia.org/T325586) [03:07:50] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.40.0-wmf.23 [core] (wmf/1.40.0-wmf.23) - 10https://gerrit.wikimedia.org/r/888729 (https://phabricator.wikimedia.org/T325586) (owner: 10TrainBranchBot) [03:08:10] (03PS2) 10Legoktm: gitlab_runner: Set pull_policy = ["always", "if-not-present"] [puppet] - 10https://gerrit.wikimedia.org/r/888828 (https://phabricator.wikimedia.org/T329216) [03:09:32] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [03:10:28] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [03:11:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T329203)', diff saved to https://phabricator.wikimedia.org/P44566 and previous config saved to /var/cache/conftool/dbconfig/20230214-031059-marostegui.json [03:11:04] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [03:22:43] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox 
hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [03:22:49] (03Merged) 10jenkins-bot: Branch commit for wmf/1.40.0-wmf.23 [core] (wmf/1.40.0-wmf.23) - 10https://gerrit.wikimedia.org/r/888729 (https://phabricator.wikimedia.org/T325586) (owner: 10TrainBranchBot) [03:26:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P44567 and previous config saved to /var/cache/conftool/dbconfig/20230214-032606-marostegui.json [03:29:03] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [03:29:04] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2447.codfw.wmnet with OS buster [03:29:11] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2447.codfw.wmnet with OS buster completed: - mw2447 (**PASS**) - Removed from Pupp... 
[03:41:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P44568 and previous config saved to /var/cache/conftool/dbconfig/20230214-034112-marostegui.json [03:56:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T329203)', diff saved to https://phabricator.wikimedia.org/P44569 and previous config saved to /var/cache/conftool/dbconfig/20230214-035618-marostegui.json [03:56:20] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1181.eqiad.wmnet with reason: Maintenance [03:56:23] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [03:56:33] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1181.eqiad.wmnet with reason: Maintenance [03:56:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1181 (T329203)', diff saved to https://phabricator.wikimedia.org/P44570 and previous config saved to /var/cache/conftool/dbconfig/20230214-035639-marostegui.json [03:58:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T329203)', diff saved to https://phabricator.wikimedia.org/P44571 and previous config saved to /var/cache/conftool/dbconfig/20230214-035852-marostegui.json [03:59:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1172.eqiad.wmnet with reason: Maintenance [03:59:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1172.eqiad.wmnet with reason: Maintenance [03:59:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1172 (T328817)', diff saved to https://phabricator.wikimedia.org/P44572 and previous config saved to /var/cache/conftool/dbconfig/20230214-035922-marostegui.json [03:59:26] T328817: Drop cuc_user and cuc_user_text from 
cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [04:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230214T0400) [04:06:30] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: train-presync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:13:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P44573 and previous config saved to /var/cache/conftool/dbconfig/20230214-041359-marostegui.json [04:21:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T328817)', diff saved to https://phabricator.wikimedia.org/P44574 and previous config saved to /var/cache/conftool/dbconfig/20230214-042104-marostegui.json [04:21:08] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [04:24:14] (KubernetesCalicoDown) firing: (4) ml-staging-ctrl2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [04:29:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P44575 and previous config saved to /var/cache/conftool/dbconfig/20230214-042905-marostegui.json [04:36:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P44576 and previous config saved to /var/cache/conftool/dbconfig/20230214-043610-marostegui.json [04:44:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T329203)', diff saved to 
https://phabricator.wikimedia.org/P44577 and previous config saved to /var/cache/conftool/dbconfig/20230214-044411-marostegui.json [04:44:13] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1191.eqiad.wmnet with reason: Maintenance [04:44:16] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [04:44:26] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1191.eqiad.wmnet with reason: Maintenance [04:44:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1191 (T329203)', diff saved to https://phabricator.wikimedia.org/P44578 and previous config saved to /var/cache/conftool/dbconfig/20230214-044432-marostegui.json [04:47:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T329203)', diff saved to https://phabricator.wikimedia.org/P44579 and previous config saved to /var/cache/conftool/dbconfig/20230214-044745-marostegui.json [04:51:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P44580 and previous config saved to /var/cache/conftool/dbconfig/20230214-045117-marostegui.json [05:02:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P44581 and previous config saved to /var/cache/conftool/dbconfig/20230214-050252-marostegui.json [05:06:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T328817)', diff saved to https://phabricator.wikimedia.org/P44582 and previous config saved to /var/cache/conftool/dbconfig/20230214-050623-marostegui.json [05:06:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1177.eqiad.wmnet with reason: Maintenance [05:06:27] T328817: Drop cuc_user and cuc_user_text from cu_changes in 
wmf wikis - https://phabricator.wikimedia.org/T328817 [05:06:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1177.eqiad.wmnet with reason: Maintenance [05:06:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1177 (T328817)', diff saved to https://phabricator.wikimedia.org/P44583 and previous config saved to /var/cache/conftool/dbconfig/20230214-050644-marostegui.json [05:17:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P44584 and previous config saved to /var/cache/conftool/dbconfig/20230214-051758-marostegui.json [05:22:29] (PuppetCertificateAboutToExpire) firing: (2) Puppet CA certificate labstore1006.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [05:28:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T328817)', diff saved to https://phabricator.wikimedia.org/P44585 and previous config saved to /var/cache/conftool/dbconfig/20230214-052854-marostegui.json [05:28:59] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [05:33:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T329203)', diff saved to https://phabricator.wikimedia.org/P44586 and previous config saved to /var/cache/conftool/dbconfig/20230214-053304-marostegui.json [05:33:06] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1194.eqiad.wmnet with reason: Maintenance [05:33:09] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [05:33:19] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 
db1194.eqiad.wmnet with reason: Maintenance [05:33:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1194 (T329203)', diff saved to https://phabricator.wikimedia.org/P44587 and previous config saved to /var/cache/conftool/dbconfig/20230214-053325-marostegui.json [05:35:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T329203)', diff saved to https://phabricator.wikimedia.org/P44588 and previous config saved to /var/cache/conftool/dbconfig/20230214-053538-marostegui.json [05:44:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P44589 and previous config saved to /var/cache/conftool/dbconfig/20230214-054400-marostegui.json [05:50:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P44590 and previous config saved to /var/cache/conftool/dbconfig/20230214-055044-marostegui.json [05:57:29] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [05:59:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P44591 and previous config saved to /var/cache/conftool/dbconfig/20230214-055906-marostegui.json [06:05:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P44592 and previous config saved to /var/cache/conftool/dbconfig/20230214-060551-marostegui.json [06:14:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T328817)', diff saved to https://phabricator.wikimedia.org/P44593 and previous config saved to 
/var/cache/conftool/dbconfig/20230214-061413-marostegui.json [06:14:15] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1178.eqiad.wmnet with reason: Maintenance [06:14:17] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [06:14:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1178.eqiad.wmnet with reason: Maintenance [06:14:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1178 (T328817)', diff saved to https://phabricator.wikimedia.org/P44594 and previous config saved to /var/cache/conftool/dbconfig/20230214-061434-marostegui.json [06:20:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T329203)', diff saved to https://phabricator.wikimedia.org/P44595 and previous config saved to /var/cache/conftool/dbconfig/20230214-062057-marostegui.json [06:20:59] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1202.eqiad.wmnet with reason: Maintenance [06:21:01] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [06:21:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1202.eqiad.wmnet with reason: Maintenance [06:21:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1202 (T329203)', diff saved to https://phabricator.wikimedia.org/P44596 and previous config saved to /var/cache/conftool/dbconfig/20230214-062118-marostegui.json [06:22:29] (JobUnavailable) firing: (2) Reduced availability for job calico-felix in k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:36:18] !log marostegui@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db1178 (T328817)', diff saved to https://phabricator.wikimedia.org/P44597 and previous config saved to /var/cache/conftool/dbconfig/20230214-063617-marostegui.json [06:36:22] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [06:39:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T329203)', diff saved to https://phabricator.wikimedia.org/P44598 and previous config saved to /var/cache/conftool/dbconfig/20230214-063933-marostegui.json [06:39:38] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [06:40:25] 10SRE, 10ops-eqiad, 10DBA, 10Infrastructure-Foundations: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 (10Marostegui) 05Open→03Resolved Thanks everyone! [06:41:17] 10ops-codfw, 10DBA: db2181 crashed - https://phabricator.wikimedia.org/T328623 (10Marostegui) Thanks @Jhancock.wm - I am starting the host again and I will close this task once it is repooled. The memory count also looks good from my side. [06:48:55] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: spicerack.mysql_legacy errors on get_core_masters_heartbeats when checking x2 - https://phabricator.wikimedia.org/T329533 (10Marostegui) They should probably be skipped as we have two masters being written (one per DC) and they need to remai... 
[06:51:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P44599 and previous config saved to /var/cache/conftool/dbconfig/20230214-065123-marostegui.json [06:54:07] (03PS1) 10Marostegui: db1176: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/888987 (https://phabricator.wikimedia.org/T329478) [06:54:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P44600 and previous config saved to /var/cache/conftool/dbconfig/20230214-065440-marostegui.json [06:54:41] (03CR) 10Marostegui: [C: 03+2] db1176: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/888987 (https://phabricator.wikimedia.org/T329478) (owner: 10Marostegui) [06:56:12] (03PS1) 10Marostegui: mariadb: Decommission db1099 [puppet] - 10https://gerrit.wikimedia.org/r/888988 (https://phabricator.wikimedia.org/T329181) [06:57:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1099.eqiad.wmnet [06:58:38] (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission db1099 [puppet] - 10https://gerrit.wikimedia.org/r/888988 (https://phabricator.wikimedia.org/T329181) (owner: 10Marostegui) [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230214T0700) [07:00:05] kormat, marostegui, and Amir1: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230214T0700). 
[07:01:38] !log marostegui@cumin1001 START - Cookbook sre.dns.netbox [07:02:56] PROBLEM - BGP status on cr2-eqdfw is CRITICAL: BGP CRITICAL - AS6939/IPv4: Active - HE, AS6939/IPv6: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:06:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P44601 and previous config saved to /var/cache/conftool/dbconfig/20230214-070630-marostegui.json [07:09:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P44602 and previous config saved to /var/cache/conftool/dbconfig/20230214-070946-marostegui.json [07:10:45] !log marostegui@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1099.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001" [07:11:08] 10ops-eqiad, 10DBA, 10decommission-hardware: decommission db1099.eqiad.wmnet - https://phabricator.wikimedia.org/T329181 (10Marostegui) a:05Marostegui→03None [07:12:11] !log marostegui@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1099.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001" [07:12:11] !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:12:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1099.eqiad.wmnet [07:12:16] 10ops-eqiad, 10DBA, 10decommission-hardware: decommission db1099.eqiad.wmnet - https://phabricator.wikimedia.org/T329181 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db1099.eqiad.wmnet` - db1099.eqiad.wmnet (**WARN**) - Downtimed host on Icinga/Alertm... 
[07:12:28] 10ops-eqiad, 10DBA, 10decommission-hardware: decommission db1099.eqiad.wmnet - https://phabricator.wikimedia.org/T329181 (10Marostegui) This is ready for DC-Ops [07:12:39] 10ops-eqiad, 10decommission-hardware: decommission db1099.eqiad.wmnet - https://phabricator.wikimedia.org/T329181 (10Marostegui) [07:21:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T328817)', diff saved to https://phabricator.wikimedia.org/P44603 and previous config saved to /var/cache/conftool/dbconfig/20230214-072136-marostegui.json [07:21:38] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1192.eqiad.wmnet with reason: Maintenance [07:21:41] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [07:21:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1192.eqiad.wmnet with reason: Maintenance [07:21:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1192 (T328817)', diff saved to https://phabricator.wikimedia.org/P44604 and previous config saved to /var/cache/conftool/dbconfig/20230214-072157-marostegui.json [07:24:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T329203)', diff saved to https://phabricator.wikimedia.org/P44605 and previous config saved to /var/cache/conftool/dbconfig/20230214-072452-marostegui.json [07:24:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [07:24:56] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [07:25:07] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [07:38:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 
db2118.codfw.wmnet with reason: Maintenance [07:38:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2118.codfw.wmnet with reason: Maintenance [07:43:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1136.eqiad.wmnet with reason: Maintenance [07:43:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T328817)', diff saved to https://phabricator.wikimedia.org/P44606 and previous config saved to /var/cache/conftool/dbconfig/20230214-074335-marostegui.json [07:43:36] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1136.eqiad.wmnet with reason: Maintenance [07:43:39] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [07:43:50] (03PS1) 10Marostegui: Revert "db2181: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/888776 [07:45:38] (03CR) 10Nicolas Fraison: [V: 03+1 C: 03+2] fix(presto): do not set query.max*per-node config on coordinator [puppet] - 10https://gerrit.wikimedia.org/r/888685 (owner: 10Nicolas Fraison) [07:49:25] (03PS2) 10Slyngshede: P:installserver::dhcp remove dhcp config for VMs [puppet] - 10https://gerrit.wikimedia.org/r/888692 [07:51:57] (03CR) 10Marostegui: [C: 03+2] Revert "db2181: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/888776 (owner: 10Marostegui) [07:52:44] (03PS2) 10Nicolas Fraison: fix(presto): create intermediate ${data_dir}/var fodler [puppet] - 10https://gerrit.wikimedia.org/r/888760 (https://phabricator.wikimedia.org/T329361) [07:54:11] (03CR) 10Elukey: fix(presto): do not set query.max*per-node config on coordinator (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/888685 (owner: 10Nicolas Fraison) [07:55:29] (03CR) 10Nicolas Fraison: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39553/console" 
[puppet] - 10https://gerrit.wikimedia.org/r/888760 (https://phabricator.wikimedia.org/T329361) (owner: 10Nicolas Fraison) [07:56:29] 10ops-codfw, 10DBA: db2181 crashed - https://phabricator.wikimedia.org/T328623 (10Marostegui) Host caught up - doing data checks now. [07:57:04] (03CR) 10Nicolas Fraison: [V: 03+1 C: 03+2] fix(presto): create intermediate ${data_dir}/var fodler [puppet] - 10https://gerrit.wikimedia.org/r/888760 (https://phabricator.wikimedia.org/T329361) (owner: 10Nicolas Fraison) [07:57:47] !log ayounsi@cumin1001 START - Cookbook sre.network.cf [07:57:47] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [07:57:54] (03CR) 10Slyngshede: [C: 03+2] P:installserver::dhcp remove dhcp config for VMs [puppet] - 10https://gerrit.wikimedia.org/r/888692 (owner: 10Slyngshede) [07:58:03] !log ayounsi@cumin1001 START - Cookbook sre.network.cf [07:58:05] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [07:58:13] !log enable CF in esams [07:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P44607 and previous config saved to /var/cache/conftool/dbconfig/20230214-075842-marostegui.json [08:00:04] Amir1 and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230214T0800). [08:00:04] No Gerrit patches in the queue for this window AFAICS. [08:04:24] (03PS1) 10Muehlenhoff: Fix Cumin alias for etcd/ML [puppet] - 10https://gerrit.wikimedia.org/r/889046 [08:04:34] (03PS2) 10Muehlenhoff: Fix Cumin alias for etcd/ML [puppet] - 10https://gerrit.wikimedia.org/r/889046 [08:06:56] (03CR) 10Elukey: [C: 03+1] "Thanks and sorry! 
:)" [puppet] - 10https://gerrit.wikimedia.org/r/889046 (owner: 10Muehlenhoff) [08:07:43] (03PS1) 10Elukey: sre.k8s.upgrade-cluster: simplify etcd cluster procedure [cookbooks] - 10https://gerrit.wikimedia.org/r/889048 (https://phabricator.wikimedia.org/T327767) [08:09:44] (03CR) 10Muehlenhoff: [C: 03+2] Fix Cumin alias for etcd/ML [puppet] - 10https://gerrit.wikimedia.org/r/889046 (owner: 10Muehlenhoff) [08:10:12] (03CR) 10Filippo Giunchedi: wmnet: add logs-api svc records (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/888696 (https://phabricator.wikimedia.org/T320702) (owner: 10Filippo Giunchedi) [08:10:44] (03PS2) 10Filippo Giunchedi: wmnet: add logs-api svc records [dns] - 10https://gerrit.wikimedia.org/r/888696 (https://phabricator.wikimedia.org/T320702) [08:13:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P44608 and previous config saved to /var/cache/conftool/dbconfig/20230214-081348-marostegui.json [08:14:28] (03CR) 10Elukey: [C: 03+1] Remove non-kafka logstash nodes from kafka configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/886862 (https://phabricator.wikimedia.org/T329142) (owner: 10Cwhite) [08:16:06] (03CR) 10Filippo Giunchedi: [C: 03+2] wmnet: add logs-api svc records [dns] - 10https://gerrit.wikimedia.org/r/888696 (https://phabricator.wikimedia.org/T320702) (owner: 10Filippo Giunchedi) [08:17:29] (03PS2) 10Elukey: admin_ng: update ml-staging-codfw's settings for k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/884038 (https://phabricator.wikimedia.org/T327767) [08:17:33] (03PS3) 10Elukey: admin_ng: update ml-staging-codfw's settings for k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/884038 (https://phabricator.wikimedia.org/T327767) [08:17:35] (03CR) 10CI reject: [V: 04-1] admin_ng: update ml-staging-codfw's settings for k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/884038 
(https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [08:18:32] (03PS4) 10Elukey: admin_ng: update ml-staging-codfw's settings for k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/884038 (https://phabricator.wikimedia.org/T327767) [08:19:25] (03CR) 10Filippo Giunchedi: [C: 03+2] admin: add Santiago Faci (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/888045 (https://phabricator.wikimedia.org/T329296) (owner: 10Filippo Giunchedi) [08:19:40] (03CR) 10Filippo Giunchedi: elasticsearch: service depends on tmpfile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/886059 (https://phabricator.wikimedia.org/T328674) (owner: 10Filippo Giunchedi) [08:20:15] (03CR) 10Muehlenhoff: swift::ring_manager: Enable profile::auto_restarts::service for rsyncd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/888170 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [08:22:55] (03PS1) 10Vgutierrez: cache::haproxy: Update to 2.6.8 in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/889053 (https://phabricator.wikimedia.org/T321775) [08:24:48] (03PS3) 10Jelto: gitlab_runner: Set pull_policy = ["always", "if-not-present"] on WMCS runners [puppet] - 10https://gerrit.wikimedia.org/r/888828 (https://phabricator.wikimedia.org/T329216) (owner: 10Legoktm) [08:24:52] (03PS7) 10DCausse: flink-app: add support for custom config files [deployment-charts] - 10https://gerrit.wikimedia.org/r/888231 [08:24:54] (03PS10) 10DCausse: [WIP] rdf-streaming-updater: add a test job using the k8s operator... 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/886005 (https://phabricator.wikimedia.org/T328675) [08:25:05] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts testvm6001.drmrs.wmnet [08:25:13] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39554/console" [puppet] - 10https://gerrit.wikimedia.org/r/889053 (https://phabricator.wikimedia.org/T321775) (owner: 10Vgutierrez) [08:26:15] !log rolling upgrade to HAProxy 2.6.8 in eqsin - T321775 [08:26:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:19] T321775: Upgrade HAProxy on cp nodes to 2.6.x LTS - https://phabricator.wikimedia.org/T321775 [08:26:19] (03CR) 10Elukey: [C: 03+2] admin_ng: update ml-staging-codfw's settings for k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/884038 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [08:26:26] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] cache::haproxy: Update to 2.6.8 in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/889053 (https://phabricator.wikimedia.org/T321775) (owner: 10Vgutierrez) [08:27:29] (KubernetesCalicoDown) firing: (4) ml-staging-ctrl2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:28:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T328817)', diff saved to https://phabricator.wikimedia.org/P44609 and previous config saved to /var/cache/conftool/dbconfig/20230214-082854-marostegui.json [08:28:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1193.eqiad.wmnet with reason: Maintenance [08:28:59] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [08:29:10] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 12:00:00 on db1193.eqiad.wmnet with reason: Maintenance [08:29:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1193 (T328817)', diff saved to https://phabricator.wikimedia.org/P44610 and previous config saved to /var/cache/conftool/dbconfig/20230214-082915-marostegui.json [08:30:21] (03PS2) 10Filippo Giunchedi: Add logs-api service [puppet] - 10https://gerrit.wikimedia.org/r/888700 (https://phabricator.wikimedia.org/T320702) [08:30:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T328817)', diff saved to https://phabricator.wikimedia.org/P44611 and previous config saved to /var/cache/conftool/dbconfig/20230214-083022-marostegui.json [08:30:32] (03PS1) 10Muehlenhoff: Remove testvm6001 [puppet] - 10https://gerrit.wikimedia.org/r/889055 [08:31:01] (03CR) 10Klausman: [C: 03+1] sre.k8s.upgrade-cluster: simplify etcd cluster procedure [cookbooks] - 10https://gerrit.wikimedia.org/r/889048 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [08:32:11] (03CR) 10Muehlenhoff: [C: 03+2] Remove testvm6001 [puppet] - 10https://gerrit.wikimedia.org/r/889055 (owner: 10Muehlenhoff) [08:32:34] (03CR) 10Filippo Giunchedi: [C: 03+2] Add logs-api service [puppet] - 10https://gerrit.wikimedia.org/r/888700 (https://phabricator.wikimedia.org/T320702) (owner: 10Filippo Giunchedi) [08:33:24] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [08:37:29] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:39:14] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:40:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1193 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P44612 and previous config saved to /var/cache/conftool/dbconfig/20230214-084020-root.json [08:42:25] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netbox, and 3 others: Netbox: use the netbox to also sync networks and network devices - https://phabricator.wikimedia.org/T329272 (10ayounsi) That's awesome! * Usecase #1 is to populate: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/r... [08:44:29] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: testvm6001.drmrs.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [08:50:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: testvm6001.drmrs.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [08:50:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:50:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts testvm6001.drmrs.wmnet [08:50:50] 10SRE, 10Infrastructure-Foundations: Migrate the install servers to Bullseye - https://phabricator.wikimedia.org/T327867 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `testvm6001.drmrs.wmnet` - testvm6001.drmrs.wmnet (**PASS**) - Downtimed host on Icinga/Alertma... 
[08:51:05] (ConfdResourceFailed) firing: (2) confd resource _srv_config-master_pybal_codfw_logs-api.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [08:51:51] (03CR) 10DCausse: [C: 03+2] flink-app: add support for custom config files [deployment-charts] - 10https://gerrit.wikimedia.org/r/888231 (owner: 10DCausse) [08:51:56] godog: ^^ [08:52:54] gah of course, I'll take a look! service being setup [08:53:07] vgutierrez: I'll followup shortly with the pybal bits FYI [08:53:13] also, happy name day ! [08:53:22] cheers :) [08:54:49] !log filippo@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: service=logs-api [08:55:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P44613 and previous config saved to /var/cache/conftool/dbconfig/20230214-085525-root.json [08:55:55] !log filippo@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: service=logs-api,dc=codfw [08:56:11] sigh that's a confctl bug ^ that actually did nothing [08:56:47] (03Merged) 10jenkins-bot: flink-app: add support for custom config files [deployment-charts] - 10https://gerrit.wikimedia.org/r/888231 (owner: 10DCausse) [08:57:17] !log filippo@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: dc=codfw,service=logs-api [08:57:21] (03PS1) 10Ayounsi: Remove pfw BFD special case [puppet] - 10https://gerrit.wikimedia.org/r/889062 (https://phabricator.wikimedia.org/T329272) [09:00:26] godog: there is nothing for service=logs-api in codfw according to confctl [09:00:58] (03PS1) 10Filippo Giunchedi: conftool-data: add logs-api codfw [puppet] - 10https://gerrit.wikimedia.org/r/889063 (https://phabricator.wikimedia.org/T320702) [09:01:06] (ConfdResourceFailed) firing: (2) confd resource _srv_config-master_pybal_codfw_logs-api.toml 
has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [09:01:22] vgutierrez: yeah I came to the same conclusion, fixed in https://gerrit.wikimedia.org/r/c/operations/puppet/+/889063 [09:01:32] BTW, filippo@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: service=logs-api --> if it wasn't the case already, that pooled everything for logs-api in eqiad [09:01:47] yeah that was intended, new service [09:02:15] I was surprised by confctl effectively doing nothing yet announcing [09:02:44] (03CR) 10Filippo Giunchedi: [C: 03+2] conftool-data: add logs-api codfw [puppet] - 10https://gerrit.wikimedia.org/r/889063 (https://phabricator.wikimedia.org/T320702) (owner: 10Filippo Giunchedi) [09:02:46] (03CR) 10Vgutierrez: [C: 03+1] conftool-data: add logs-api codfw [puppet] - 10https://gerrit.wikimedia.org/r/889063 (https://phabricator.wikimedia.org/T320702) (owner: 10Filippo Giunchedi) [09:03:15] (03CR) 10MVernon: [C: 03+1] swift::ring_manager: Enable profile::auto_restarts::service for rsyncd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/888170 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [09:04:44] !log filippo@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: service=logs-api [09:04:55] ok confd should be happy now [09:05:35] brb [09:09:56] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netbox, and 3 others: Netbox: use the netbox to also sync networks and network devices - https://phabricator.wikimedia.org/T329272 (10ayounsi) Usecase #2 is to replace the hardcoded values from: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/pupp... 
[09:10:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1193 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P44614 and previous config saved to /var/cache/conftool/dbconfig/20230214-091030-root.json [09:10:54] (03PS4) 10Jelto: gitlab_runner: Set pull_policy = ["always", "if-not-present"] on WMCS runners [puppet] - 10https://gerrit.wikimedia.org/r/888828 (https://phabricator.wikimedia.org/T329216) (owner: 10Legoktm) [09:11:05] (ConfdResourceFailed) firing: (3) confd resource _srv_config-master_pybal_codfw_logs-api.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [09:11:19] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: spicerack.mysql_legacy errors on get_core_masters_heartbeats when checking x2 - https://phabricator.wikimedia.org/T329533 (10Volans) @Marostegui Just to clarify and avoid confusion, are you suggesting to remove `x2` entirely from the spicer... [09:13:23] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netbox, and 3 others: Netbox: use the netbox to also sync networks and network devices - https://phabricator.wikimedia.org/T329272 (10ayounsi) Usecase #3 is to generate https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production... 
[09:15:44] (03PS1) 10Filippo Giunchedi: hieradata: logs-api to lvs_setup state [puppet] - 10https://gerrit.wikimedia.org/r/889066 (https://phabricator.wikimedia.org/T320702) [09:16:06] (ConfdResourceFailed) firing: (3) confd resource _srv_config-master_pybal_codfw_logs-api.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [09:21:05] (ConfdResourceFailed) resolved: (3) confd resource _srv_config-master_pybal_codfw_logs-api.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [09:21:07] (03PS5) 10Jelto: gitlab_runner: Set pull_policy = ["always", "if-not-present"] on WMCS runners [puppet] - 10https://gerrit.wikimedia.org/r/888828 (https://phabricator.wikimedia.org/T329216) (owner: 10Legoktm) [09:22:29] (PuppetCertificateAboutToExpire) firing: (2) Puppet CA certificate labstore1006.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [09:25:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1193 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P44615 and previous config saved to /var/cache/conftool/dbconfig/20230214-092535-root.json [09:27:42] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/889048 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [09:29:30] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1109.eqiad.wmnet with reason: Maintenance [09:29:43] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1109.eqiad.wmnet with reason: Maintenance [09:29:50] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39559/console" [puppet] - 10https://gerrit.wikimedia.org/r/889066 (https://phabricator.wikimedia.org/T320702) (owner: 10Filippo Giunchedi) [09:32:12] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: spicerack.mysql_legacy errors on get_core_masters_heartbeats when checking x2 - https://phabricator.wikimedia.org/T329533 (10Marostegui) @Volans I am unsure. How do we treat parsercache at the moment? x2 is special in the sense that it does... [09:34:30] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: logs-api to lvs_setup state [puppet] - 10https://gerrit.wikimedia.org/r/889066 (https://phabricator.wikimedia.org/T320702) (owner: 10Filippo Giunchedi) [09:34:32] (03PS8) 10Clément Goubert: sre.discovery.datacenter: rename and add status command [cookbooks] - 10https://gerrit.wikimedia.org/r/887740 (owner: 10Giuseppe Lavagetto) [09:35:28] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: spicerack.mysql_legacy errors on get_core_masters_heartbeats when checking x2 - https://phabricator.wikimedia.org/T329533 (10Volans) >>! In T329533#8613691, @Marostegui wrote: > @Volans I am unsure. How do we treat parsercache at the moment?... 
[09:37:21] (03CR) 10Clément Goubert: [C: 03+2] sre.discovery.datacenter: rename and add status command [cookbooks] - 10https://gerrit.wikimedia.org/r/887740 (owner: 10Giuseppe Lavagetto) [09:39:09] (03Merged) 10jenkins-bot: sre.discovery.datacenter: rename and add status command [cookbooks] - 10https://gerrit.wikimedia.org/r/887740 (owner: 10Giuseppe Lavagetto) [09:39:24] PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.79:443]) https://wikitech.wikimedia.org/wiki/PyBal [09:40:02] PROBLEM - PyBal connections to etcd on lvs1020 is CRITICAL: CRITICAL: 121 connections established with conf1007.eqiad.wmnet:4001 (min=122) https://wikitech.wikimedia.org/wiki/PyBal [09:40:06] PROBLEM - PyBal connections to etcd on lvs1019 is CRITICAL: CRITICAL: 75 connections established with conf1007.eqiad.wmnet:4001 (min=76) https://wikitech.wikimedia.org/wiki/PyBal [09:40:23] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: spicerack.mysql_legacy errors on get_core_masters_heartbeats when checking x2 - https://phabricator.wikimedia.org/T329533 (10Marostegui) >>! In T329533#8613695, @Volans wrote: >>>! In T329533#8613691, @Marostegui wrote: >> @Volans I am unsur... 
[09:40:24] known/expected ^ pending pybal restart [09:40:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P44616 and previous config saved to /var/cache/conftool/dbconfig/20230214-094040-root.json [09:41:46] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.79:443]) https://wikitech.wikimedia.org/wiki/PyBal [09:42:23] (03PS3) 10Clément Goubert: sre.discovery.datacenter: fix rollback logic [cookbooks] - 10https://gerrit.wikimedia.org/r/887806 (https://phabricator.wikimedia.org/T329175) (owner: 10Giuseppe Lavagetto) [09:42:34] PROBLEM - PyBal connections to etcd on lvs2009 is CRITICAL: CRITICAL: 70 connections established with conf2005.codfw.wmnet:4001 (min=71) https://wikitech.wikimedia.org/wiki/PyBal [09:42:34] PROBLEM - PyBal connections to etcd on lvs2010 is CRITICAL: CRITICAL: 88 connections established with conf2004.codfw.wmnet:4001 (min=89) https://wikitech.wikimedia.org/wiki/PyBal [09:43:38] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.79:443]) https://wikitech.wikimedia.org/wiki/PyBal [09:44:24] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.79:443]) https://wikitech.wikimedia.org/wiki/PyBal [09:45:52] (03CR) 10Clément Goubert: [C: 03+2] sre.discovery.datacenter: fix rollback logic [cookbooks] - 10https://gerrit.wikimedia.org/r/887806 (https://phabricator.wikimedia.org/T329175) (owner: 10Giuseppe Lavagetto) [09:46:03] (03PS6) 10Jelto: gitlab_runner: Set pull_policy = ["always", "if-not-present"] on WMCS runners [puppet] - 10https://gerrit.wikimedia.org/r/888828 (https://phabricator.wikimedia.org/T329216) (owner: 10Legoktm) [09:47:20] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39560/console" [puppet] - 10https://gerrit.wikimedia.org/r/888828 (https://phabricator.wikimedia.org/T329216) (owner: 10Legoktm) [09:47:41] (03Merged) 10jenkins-bot: sre.discovery.datacenter: fix rollback logic [cookbooks] - 10https://gerrit.wikimedia.org/r/887806 (https://phabricator.wikimedia.org/T329175) (owner: 10Giuseppe Lavagetto) [09:48:54] (03PS2) 10Elukey: sre.k8s.upgrade-cluster: simplify etcd cluster procedure [cookbooks] - 10https://gerrit.wikimedia.org/r/889048 (https://phabricator.wikimedia.org/T327767) [09:49:14] (03CR) 10Elukey: sre.k8s.upgrade-cluster: simplify etcd cluster procedure (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/889048 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [09:49:42] (03PS7) 10Jelto: gitlab_runner: Set pull_policy = ["always", "if-not-present"] on WMCS runners [puppet] - 10https://gerrit.wikimedia.org/r/888828 (https://phabricator.wikimedia.org/T329216) (owner: 10Legoktm) [09:50:08] (03PS3) 10Elukey: sre.k8s.upgrade-cluster: simplify etcd cluster procedure [cookbooks] - 10https://gerrit.wikimedia.org/r/889048 (https://phabricator.wikimedia.org/T327767) [09:50:10] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [09:50:14] !log roll-restart pybal in eqiad/codfw to pick up logs-api service - T320702 [09:50:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:17] T320702: Jaeger secure access to OpenSearch cluster - https://phabricator.wikimedia.org/T320702 [09:50:40] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:51:12] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39561/console" [puppet] - 
10https://gerrit.wikimedia.org/r/888828 (https://phabricator.wikimedia.org/T329216) (owner: 10Legoktm) [09:51:43] (03CR) 10Elukey: "Ben was this tried in the test cluster? It should be relatively easy, just to make sure if we see exceptions or not before hitting the res" [puppet] - 10https://gerrit.wikimedia.org/r/888718 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [09:52:14] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:52:15] (03PS1) 10Volans: Makefile.deploy: fix bundle CA linking [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/889068 [09:52:20] (03PS1) 10Ayounsi: k8s FERM: allow gateway and infra ranges by default [puppet] - 10https://gerrit.wikimedia.org/r/889069 (https://phabricator.wikimedia.org/T306649) [09:52:36] (03PS8) 10Jelto: gitlab_runner: Set pull_policy = ["always", "if-not-present"] on WMCS runners [puppet] - 10https://gerrit.wikimedia.org/r/888828 (https://phabricator.wikimedia.org/T329216) (owner: 10Legoktm) [09:52:40] (03CR) 10CI reject: [V: 04-1] k8s FERM: allow gateway and infra ranges by default [puppet] - 10https://gerrit.wikimedia.org/r/889069 (https://phabricator.wikimedia.org/T306649) (owner: 10Ayounsi) [09:52:43] (03PS2) 10Ayounsi: k8s FERM: allow gateway and infra ranges by default [puppet] - 10https://gerrit.wikimedia.org/r/889069 (https://phabricator.wikimedia.org/T306649) [09:52:53] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "after https://phabricator.wikimedia.org/T329611 I think this patch requires further discussion with the team." 
[puppet] - 10https://gerrit.wikimedia.org/r/888347 (https://phabricator.wikimedia.org/T329467) (owner: 10Majavah) [09:53:08] (03CR) 10CI reject: [V: 04-1] k8s FERM: allow gateway and infra ranges by default [puppet] - 10https://gerrit.wikimedia.org/r/889069 (https://phabricator.wikimedia.org/T306649) (owner: 10Ayounsi) [09:53:22] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [09:53:26] (03CR) 10Elukey: [C: 03+2] sre.k8s.upgrade-cluster: simplify etcd cluster procedure [cookbooks] - 10https://gerrit.wikimedia.org/r/889048 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [09:53:33] (03PS4) 10Elukey: sre.k8s.upgrade-cluster: simplify etcd cluster procedure [cookbooks] - 10https://gerrit.wikimedia.org/r/889048 (https://phabricator.wikimedia.org/T327767) [09:53:35] (03CR) 10Elukey: [V: 03+2] sre.k8s.upgrade-cluster: simplify etcd cluster procedure [cookbooks] - 10https://gerrit.wikimedia.org/r/889048 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [09:54:07] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39562/console" [puppet] - 10https://gerrit.wikimedia.org/r/888828 (https://phabricator.wikimedia.org/T329216) (owner: 10Legoktm) [09:54:08] RECOVERY - PyBal connections to etcd on lvs2010 is OK: OK: 89 connections established with conf2004.codfw.wmnet:4001 (min=89) https://wikitech.wikimedia.org/wiki/PyBal [09:54:09] (03CR) 10Ayounsi: [C: 03+1] Makefile.deploy: fix bundle CA linking [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/889068 (owner: 10Volans) [09:54:41] (03CR) 10Ayounsi: [C: 04-1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/889069 (https://phabricator.wikimedia.org/T306649) (owner: 10Ayounsi) [09:55:29] (03CR) 10Volans: [V: 03+2 C: 03+2] Makefile.deploy: fix bundle CA linking [software/netbox-deploy] (wmf-next) - 
10https://gerrit.wikimedia.org/r/889068 (owner: 10Volans) [09:55:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P44618 and previous config saved to /var/cache/conftool/dbconfig/20230214-095544-root.json [09:57:08] (03PS9) 10Jelto: gitlab_runner: Set pull_policy = ["always", "if-not-present"] on WMCS runners [puppet] - 10https://gerrit.wikimedia.org/r/888828 (https://phabricator.wikimedia.org/T329216) (owner: 10Legoktm) [09:57:24] RECOVERY - PyBal connections to etcd on lvs1020 is OK: OK: 122 connections established with conf1007.eqiad.wmnet:4001 (min=122) https://wikitech.wikimedia.org/wiki/PyBal [09:57:29] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:57:48] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - logs-api_443: Servers logstash1032.eqiad.wmnet, logstash1030.eqiad.wmnet, logstash1025.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:58:21] looking ^ [09:58:46] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39563/console" [puppet] - 10https://gerrit.wikimedia.org/r/888828 (https://phabricator.wikimedia.org/T329216) (owner: 10Legoktm) [09:59:52] RECOVERY - PyBal connections to etcd on lvs2009 is OK: OK: 71 connections established with conf2005.codfw.wmnet:4001 (min=71) https://wikitech.wikimedia.org/wiki/PyBal [10:00:16] (03CR) 10Btullis: [V: 03+1] Try libmariadb-java with sqoop on bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/888718 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [10:00:56] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal 
https://wikitech.wikimedia.org/wiki/PyBal [10:01:53] !log volans@cumin1001 START - Cookbook sre.deploy.python-code netbox to netbox-dev2002.codfw.wmnet with reason: Release v3.2.9 to netbox-next - volans@cumin1001 [10:02:06] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - logs-api_443: Servers logstash1032.eqiad.wmnet, logstash1025.eqiad.wmnet, logstash1024.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:02:28] RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:03:14] RECOVERY - PyBal connections to etcd on lvs1019 is OK: OK: 76 connections established with conf1007.eqiad.wmnet:4001 (min=76) https://wikitech.wikimedia.org/wiki/PyBal [10:03:21] !log volans@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) netbox to netbox-dev2002.codfw.wmnet with reason: Release v3.2.9 to netbox-next - volans@cumin1001 [10:07:29] (JobUnavailable) firing: (3) Reduced availability for job calico-felix in k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:07:38] (03CR) 10Muehlenhoff: Try libmariadb-java with sqoop on bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/888718 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [10:08:11] !log volans@cumin1001 START - Cookbook sre.deploy.python-code netbox to netbox-dev2002.codfw.wmnet with reason: Release v3.2.9 to netbox-next - volans@cumin1001 [10:08:17] (03CR) 10Muehlenhoff: [C: 03+2] Add binder to the kernel module block list [puppet] - 10https://gerrit.wikimedia.org/r/888709 (owner: 10Muehlenhoff) [10:09:08] (03CR) 10Jelto: [V: 03+1] "thanks for opening the change!" 
[puppet] - 10https://gerrit.wikimedia.org/r/888828 (https://phabricator.wikimedia.org/T329216) (owner: 10Legoktm) [10:09:24] (03PS2) 10Btullis: Try libmariadb-java with sqoop on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/888718 (https://phabricator.wikimedia.org/T329363) [10:09:38] !log volans@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) netbox to netbox-dev2002.codfw.wmnet with reason: Release v3.2.9 to netbox-next - volans@cumin1001 [10:10:16] 10SRE-swift-storage: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253 (10akosiaris) I 've found this task via a different pathway, trying to help editors in T328875. Debugging that one I ended up dealing with a swift ghost from 2017. While this is old enough to... [10:10:23] (03PS3) 10Ayounsi: k8s FERM: allow gateway and infra ranges by default [puppet] - 10https://gerrit.wikimedia.org/r/889069 (https://phabricator.wikimedia.org/T306649) [10:10:44] (03CR) 10CI reject: [V: 04-1] k8s FERM: allow gateway and infra ranges by default [puppet] - 10https://gerrit.wikimedia.org/r/889069 (https://phabricator.wikimedia.org/T306649) (owner: 10Ayounsi) [10:11:28] (03PS4) 10Clément Goubert: sre.discovery.datacenter: add --fast-insecure switch for pool/depool [cookbooks] - 10https://gerrit.wikimedia.org/r/887741 (owner: 10Giuseppe Lavagetto) [10:12:13] (03CR) 10Clément Goubert: sre.discovery.datacenter: add --fast-insecure switch for pool/depool (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/887741 (owner: 10Giuseppe Lavagetto) [10:12:31] (03PS8) 10Clément Goubert: sre.discovery.datacenter: Add progress logging [cookbooks] - 10https://gerrit.wikimedia.org/r/887774 [10:13:28] (03CR) 10Btullis: Try libmariadb-java with sqoop on bullseye (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/888718 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [10:13:42] (03PS4) 10Ayounsi: k8s FERM: allow gateway and infra ranges by 
default [puppet] - 10https://gerrit.wikimedia.org/r/889069 (https://phabricator.wikimedia.org/T306649) [10:15:53] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/889069 (https://phabricator.wikimedia.org/T306649) (owner: 10Ayounsi) [10:16:58] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/888718 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [10:17:06] (03CR) 10Elukey: [C: 03+1] Try libmariadb-java with sqoop on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/888718 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [10:17:30] (03CR) 10Btullis: [C: 03+2] Try libmariadb-java with sqoop on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/888718 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [10:17:37] (JobUnavailable) firing: (3) Reduced availability for job calico-felix in k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:19:41] !log installing imagemagick security updates on bullseye [10:19:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:10] (03CR) 10Btullis: [C: 03+2] "Doh!" 
[puppet] - 10https://gerrit.wikimedia.org/r/888718 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [10:21:18] (03PS1) 10Volans: Makefile.deploy: restart services [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/889077 [10:22:05] (03CR) 10Clément Goubert: [C: 03+2] sre.discovery.datacenter: Add progress logging [cookbooks] - 10https://gerrit.wikimedia.org/r/887774 (owner: 10Clément Goubert) [10:22:14] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49567 bytes in 5.775 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:22:32] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 8.788 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:23:51] (03Merged) 10jenkins-bot: sre.discovery.datacenter: Add progress logging [cookbooks] - 10https://gerrit.wikimedia.org/r/887774 (owner: 10Clément Goubert) [10:23:59] (03PS2) 10Gehel: miscweb / query_service: remove ability to list directories [puppet] - 10https://gerrit.wikimedia.org/r/883272 (https://phabricator.wikimedia.org/T324667) [10:25:27] (03CR) 10Ayounsi: [C: 03+1] Makefile.deploy: restart services [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/889077 (owner: 10Volans) [10:25:29] (03PS1) 10Btullis: Fix a compilation error in bigtop::mysql_jdbc [puppet] - 10https://gerrit.wikimedia.org/r/889079 (https://phabricator.wikimedia.org/T329363) [10:26:00] (03PS5) 10Ayounsi: k8s FERM: allow gateway and infra ranges by default [puppet] - 10https://gerrit.wikimedia.org/r/889069 (https://phabricator.wikimedia.org/T306649) [10:26:51] (03CR) 10Gehel: [C: 03+2] miscweb / query_service: remove ability to list directories [puppet] - 10https://gerrit.wikimedia.org/r/883272 (https://phabricator.wikimedia.org/T324667) (owner: 10Gehel) [10:27:05] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/889069 
(https://phabricator.wikimedia.org/T306649) (owner: 10Ayounsi) [10:27:07] (03CR) 10Btullis: "A fix for my previous error." [puppet] - 10https://gerrit.wikimedia.org/r/889079 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [10:27:09] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39564/console" [puppet] - 10https://gerrit.wikimedia.org/r/889079 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [10:28:04] (03CR) 10Btullis: [V: 03+1 C: 03+2] Fix a compilation error in bigtop::mysql_jdbc [puppet] - 10https://gerrit.wikimedia.org/r/889079 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [10:32:00] (03CR) 10Muehlenhoff: [C: 03+2] swift::ring_manager: Enable profile::auto_restarts::service for rsyncd [puppet] - 10https://gerrit.wikimedia.org/r/888170 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [10:36:52] (03PS1) 10Gehel: miscweb / query_service: remove ability to list directories [puppet] - 10https://gerrit.wikimedia.org/r/889080 (https://phabricator.wikimedia.org/T324667) [10:37:10] (03PS5) 10David Caro: node_pinger: use jumbo frames [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) [10:39:49] (03CR) 10Hnowlan: [C: 03+2] api-gateway: reformat templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/887991 (https://phabricator.wikimedia.org/T329049) (owner: 10Hnowlan) [10:41:17] (03CR) 10Volans: [V: 03+2 C: 03+2] Makefile.deploy: restart services [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/889077 (owner: 10Volans) [10:41:48] (03PS1) 10Btullis: Fix the bigtop::jdbc class on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/889081 (https://phabricator.wikimedia.org/T329363) [10:42:46] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39565/console" [puppet] - 
10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [10:43:20] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39566/console" [puppet] - 10https://gerrit.wikimedia.org/r/889081 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [10:43:31] (03CR) 10Gehel: [C: 03+2] miscweb / query_service: remove ability to list directories [puppet] - 10https://gerrit.wikimedia.org/r/889080 (https://phabricator.wikimedia.org/T324667) (owner: 10Gehel) [10:43:46] (03CR) 10Btullis: [V: 03+1 C: 03+2] Fix the bigtop::jdbc class on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/889081 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [10:44:53] (03CR) 10Hnowlan: fluent-bit: install wmf-certificates (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/883605 (owner: 10Hnowlan) [10:45:01] (03Merged) 10jenkins-bot: api-gateway: reformat templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/887991 (https://phabricator.wikimedia.org/T329049) (owner: 10Hnowlan) [10:48:15] (03PS1) 10Elukey: role::etcd::v3::ml_etcd::staging: use PKI certs [puppet] - 10https://gerrit.wikimedia.org/r/889082 (https://phabricator.wikimedia.org/T329556) [10:49:55] (03PS1) 10Filippo Giunchedi: logs-api: allow GET / only for health check [puppet] - 10https://gerrit.wikimedia.org/r/889083 (https://phabricator.wikimedia.org/T320702) [10:50:05] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39567/console" [puppet] - 10https://gerrit.wikimedia.org/r/889082 (https://phabricator.wikimedia.org/T329556) (owner: 10Elukey) [10:52:04] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10Technical-Debt: Convert all of our site.pp/roles to the role/profile paradigm - https://phabricator.wikimedia.org/T159412 (10MoritzMuehlenhoff) >>! 
In T159412#8599008, @Dzahn wrote: > @Muehlenhoff Here was my attempt to fix the "mediaw... [10:52:45] (03CR) 10Elukey: [V: 03+1 C: 03+2] role::etcd::v3::ml_etcd::staging: use PKI certs [puppet] - 10https://gerrit.wikimedia.org/r/889082 (https://phabricator.wikimedia.org/T329556) (owner: 10Elukey) [10:53:02] (03PS6) 10David Caro: node_pinger: use jumbo frames [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) [10:54:11] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39568/console" [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [10:56:02] !log volans@cumin1001 START - Cookbook sre.deploy.python-code netbox to netbox-dev2002.codfw.wmnet with reason: Release v3.2.9 to netbox-next - volans@cumin1001 [10:56:13] !log volans@cumin1001 END (FAIL) - Cookbook sre.deploy.python-code (exit_code=99) netbox to netbox-dev2002.codfw.wmnet with reason: Release v3.2.9 to netbox-next - volans@cumin1001 [10:56:45] !log volans@cumin1001 START - Cookbook sre.deploy.python-code netbox to netbox-dev2002.codfw.wmnet with reason: Release v3.2.9 to netbox-next - volans@cumin1001 [10:58:19] !log volans@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) netbox to netbox-dev2002.codfw.wmnet with reason: Release v3.2.9 to netbox-next - volans@cumin1001 [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230214T1100) [11:04:20] (03PS4) 10EoghanGaffney: Add insetup puppet role for aphlict vm in codfw [puppet] - 10https://gerrit.wikimedia.org/r/888690 (https://phabricator.wikimedia.org/T322369) [11:05:21] (03PS7) 10David Caro: node_pinger: use jumbo frames [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) [11:05:54] (03CR) 10EoghanGaffney: [C: 03+2] Add insetup puppet role 
for aphlict vm in codfw [puppet] - 10https://gerrit.wikimedia.org/r/888690 (https://phabricator.wikimedia.org/T322369) (owner: 10EoghanGaffney) [11:06:34] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39569/console" [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [11:10:19] (03PS1) 10Elukey: profile::etcd::v3: add discovery SAN record on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/889084 (https://phabricator.wikimedia.org/T329556) [11:10:40] (03CR) 10CI reject: [V: 04-1] profile::etcd::v3: add discovery SAN record on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/889084 (https://phabricator.wikimedia.org/T329556) (owner: 10Elukey) [11:11:07] (03PS34) 10Stevemunene: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) [11:11:28] (03PS2) 10Elukey: profile::etcd::v3: add discovery SAN record on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/889084 (https://phabricator.wikimedia.org/T329556) [11:11:30] (03CR) 10CI reject: [V: 04-1] Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [11:16:22] (03PS3) 10Elukey: profile::etcd::v3: add discovery SAN record on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/889084 (https://phabricator.wikimedia.org/T329556) [11:16:41] (03PS1) 10Arturo Borrero Gonzalez: tools-manifests: don't collect statsd metrics [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/889085 (https://phabricator.wikimedia.org/T244809) [11:16:46] (03PS1) 10Arturo Borrero Gonzalez: tools-manifest: refresh reference to obsolete 'labs' things [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/889106 [11:20:57] 
!log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host schema1003.eqiad.wmnet [11:23:27] (03PS4) 10Elukey: profile::etcd::v3: add discovery SAN record on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/889084 (https://phabricator.wikimedia.org/T329556) [11:24:36] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host schema1003.eqiad.wmnet [11:25:33] (03PS5) 10Elukey: profile::etcd::v3: add discovery SAN record on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/889084 (https://phabricator.wikimedia.org/T329556) [11:25:52] (03CR) 10CI reject: [V: 04-1] profile::etcd::v3: add discovery SAN record on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/889084 (https://phabricator.wikimedia.org/T329556) (owner: 10Elukey) [11:28:22] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host schema1004.eqiad.wmnet [11:28:55] (03PS2) 10Arturo Borrero Gonzalez: tools-manifest: refresh reference to obsolete 'labs' things [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/889106 [11:29:01] (03PS1) 10Arturo Borrero Gonzalez: tools-manifest: add d/gbp.conf file [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/889110 [11:29:07] (03PS1) 10Arturo Borrero Gonzalez: gitignore: ignore nano .swp file [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/889111 [11:29:13] (03PS1) 10Arturo Borrero Gonzalez: d/changelog: generate entry for 0.25 buster [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/889112 [11:30:16] PROBLEM - Check systemd state on thanos-fe2001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_rsync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:30:17] (03PS4) 10Clément Goubert: sre.switchdc.services: Exclude wdqs and wdqs-ssl [cookbooks] - 10https://gerrit.wikimedia.org/r/888208 (https://phabricator.wikimedia.org/T329193) [11:30:19] (03PS3) 10Clément Goubert: sre.switchdc.services: import 
sre.discovery.datacenter excludes [cookbooks] - 10https://gerrit.wikimedia.org/r/888213 (https://phabricator.wikimedia.org/T329193) [11:32:02] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host schema1004.eqiad.wmnet [11:32:40] (03CR) 10Clément Goubert: sre.switchdc.services: Exclude wdqs and wdqs-ssl (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/888208 (https://phabricator.wikimedia.org/T329193) (owner: 10Clément Goubert) [11:38:50] (03PS1) 10Volans: python_deploy: call also a post-deploy target [puppet] - 10https://gerrit.wikimedia.org/r/889113 [11:39:04] (03PS6) 10Elukey: profile::etcd::v3: add discovery SAN record on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/889084 (https://phabricator.wikimedia.org/T329556) [11:39:26] (03CR) 10CI reject: [V: 04-1] profile::etcd::v3: add discovery SAN record on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/889084 (https://phabricator.wikimedia.org/T329556) (owner: 10Elukey) [11:40:15] (03PS7) 10Elukey: profile::etcd::v3: add discovery SAN record on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/889084 (https://phabricator.wikimedia.org/T329556) [11:40:25] (03CR) 10Clément Goubert: [C: 03+2] sre.mediawiki.restart-appservers: Fix clusters (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/870548 (https://phabricator.wikimedia.org/T325739) (owner: 10Clément Goubert) [11:40:36] (03CR) 10CI reject: [V: 04-1] profile::etcd::v3: add discovery SAN record on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/889084 (https://phabricator.wikimedia.org/T329556) (owner: 10Elukey) [11:40:57] (03PS1) 10Volans: Makefile.deploy: add post-deploy target [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/889116 [11:41:24] (03PS1) 10Volans: Rake taskgen: use shellcheck from $PATH [puppet] - 10https://gerrit.wikimedia.org/r/889117 [11:41:32] (03PS8) 10Elukey: profile::etcd::v3: add discovery SAN record on bullseye [puppet] - 
10https://gerrit.wikimedia.org/r/889084 (https://phabricator.wikimedia.org/T329556) [11:42:51] (03PS9) 10Elukey: profile::etcd::v3: add discovery SAN record on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/889084 (https://phabricator.wikimedia.org/T329556) [11:43:22] (03CR) 10Ayounsi: [C: 03+1] python_deploy: call also a post-deploy target [puppet] - 10https://gerrit.wikimedia.org/r/889113 (owner: 10Volans) [11:43:53] (03PS1) 10Zabe: beta: Add deployment-db11 and deployment-db12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889118 (https://phabricator.wikimedia.org/T329577) [11:43:55] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39578/console" [puppet] - 10https://gerrit.wikimedia.org/r/889084 (https://phabricator.wikimedia.org/T329556) (owner: 10Elukey) [11:44:09] (03CR) 10Ayounsi: [C: 03+1] "Should we make the post-deploy optional? Eg. not fail if it doesn't exist?" [puppet] - 10https://gerrit.wikimedia.org/r/889113 (owner: 10Volans) [11:44:11] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host schema2003.codfw.wmnet [11:44:23] (03CR) 10Zabe: [C: 03+2] beta: Add deployment-db11 and deployment-db12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889118 (https://phabricator.wikimedia.org/T329577) (owner: 10Zabe) [11:44:34] (03CR) 10Ayounsi: [C: 03+1] Makefile.deploy: add post-deploy target [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/889116 (owner: 10Volans) [11:44:55] (03PS2) 10Clément Goubert: sre.discovery.datacenter: status improvements [cookbooks] - 10https://gerrit.wikimedia.org/r/889108 [11:45:24] (03Merged) 10jenkins-bot: beta: Add deployment-db11 and deployment-db12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889118 (https://phabricator.wikimedia.org/T329577) (owner: 10Zabe) [11:46:21] (03PS10) 10JMeybohm: profile::etcd::v3: add discovery SAN record on bullseye [puppet] - 
10https://gerrit.wikimedia.org/r/889084 (https://phabricator.wikimedia.org/T329556) (owner: 10Elukey) [11:46:38] (03CR) 10Jbond: "lgtm minor comment inline" [puppet] - 10https://gerrit.wikimedia.org/r/889113 (owner: 10Volans) [11:46:45] 10SRE-swift-storage: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253 (10MatthewVernon) Whatever you've found is not the same issue as with ghost objects - a ghost object as defined here is one which appears in `swift list` (or asking swift for the contents of a... [11:47:04] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/889116 (owner: 10Volans) [11:47:06] (03PS1) 10Volans: Makefile.deploy: add post-deploy target [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/889119 [11:47:35] 10SRE-swift-storage: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253 (10MatthewVernon) But yes, thumbnails are transient, so it should always be OK to delete them. 
[11:47:45] (03CR) 10Ayounsi: [C: 03+1] Makefile.deploy: add post-deploy target [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/889119 (owner: 10Volans) [11:47:52] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host schema2003.codfw.wmnet [11:48:21] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39579/console" [puppet] - 10https://gerrit.wikimedia.org/r/889084 (https://phabricator.wikimedia.org/T329556) (owner: 10Elukey) [11:49:06] (03CR) 10Jbond: "seems my comment was lost, here it is again" [puppet] - 10https://gerrit.wikimedia.org/r/889113 (owner: 10Volans) [11:49:24] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host schema2004.codfw.wmnet [11:49:45] (03CR) 10JMeybohm: [V: 03+1 C: 03+1] profile::etcd::v3: add discovery SAN record on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/889084 (https://phabricator.wikimedia.org/T329556) (owner: 10Elukey) [11:51:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2181 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P44619 and previous config saved to /var/cache/conftool/dbconfig/20230214-115144-root.json [11:51:55] (03CR) 10Jbond: [C: 03+1] Makefile.deploy: add post-deploy target [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/889119 (owner: 10Volans) [11:53:05] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host schema2004.codfw.wmnet [11:54:08] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] tools-manifests: don't collect statsd metrics [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/889085 (https://phabricator.wikimedia.org/T244809) (owner: 10Arturo Borrero Gonzalez) [11:54:11] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] tools-manifest: refresh reference to obsolete 'labs' things [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/889106 (owner: 10Arturo 
Borrero Gonzalez) [11:54:15] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] tools-manifest: add d/gbp.conf file [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/889110 (owner: 10Arturo Borrero Gonzalez) [11:54:18] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] gitignore: ignore nano .swp file [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/889111 (owner: 10Arturo Borrero Gonzalez) [11:54:21] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] d/changelog: generate entry for 0.25 buster [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/889112 (owner: 10Arturo Borrero Gonzalez) [11:54:28] (03PS2) 10Volans: python_deploy: call also a post-deploy target [puppet] - 10https://gerrit.wikimedia.org/r/889113 [11:54:30] (03PS2) 10Volans: Rake taskgen: use shellcheck from $PATH [puppet] - 10https://gerrit.wikimedia.org/r/889117 [11:54:39] (03Merged) 10jenkins-bot: tools-manifests: don't collect statsd metrics [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/889085 (https://phabricator.wikimedia.org/T244809) (owner: 10Arturo Borrero Gonzalez) [11:54:42] (03CR) 10Volans: "addressed comments" [puppet] - 10https://gerrit.wikimedia.org/r/889113 (owner: 10Volans) [11:55:34] (03Abandoned) 10Volans: Makefile.deploy: add post-deploy target [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/889119 (owner: 10Volans) [11:56:01] (03CR) 10Volans: [V: 03+2 C: 03+2] Makefile.deploy: add post-deploy target [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/889116 (owner: 10Volans) [11:56:46] (03Merged) 10jenkins-bot: tools-manifest: refresh reference to obsolete 'labs' things [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/889106 (owner: 10Arturo Borrero Gonzalez) [11:56:51] (03Merged) 10jenkins-bot: tools-manifest: add d/gbp.conf file [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/889110 (owner: 10Arturo Borrero Gonzalez) [11:56:57] (03Merged) 10jenkins-bot: gitignore: ignore nano .swp file 
[software/tools-manifest] - 10https://gerrit.wikimedia.org/r/889111 (owner: 10Arturo Borrero Gonzalez) [11:57:35] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netbox, and 3 others: Netbox: use the netbox to also sync networks and network devices - https://phabricator.wikimedia.org/T329272 (10ayounsi) Usecase #4 is to centrally manage the list BGP routers (core routers or ToR switches) used for host to configure t... [11:57:45] (03Merged) 10jenkins-bot: d/changelog: generate entry for 0.25 buster [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/889112 (owner: 10Arturo Borrero Gonzalez) [11:59:54] (03PS35) 10Stevemunene: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) [12:01:00] PROBLEM - Check systemd state on ms-fe1010 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_rsync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:01:26] (03CR) 10Ayounsi: [C: 03+1] python_deploy: call also a post-deploy target [puppet] - 10https://gerrit.wikimedia.org/r/889113 (owner: 10Volans) [12:03:33] (03CR) 10Stevemunene: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39580/console" [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [12:06:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2181 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P44620 and previous config saved to /var/cache/conftool/dbconfig/20230214-120649-root.json [12:07:55] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "Thanks, LGTM. I suggest we collect a +1 from Cathal as well." 
[puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [12:08:02] (03PS1) 10Muehlenhoff: swift::ring_manager: Only enable auto restart on active ring manager nodes [puppet] - 10https://gerrit.wikimedia.org/r/889122 [12:20:07] (03CR) 10FNegri: [C: 03+1] "LGTM, let's wait for Cathal to confirm 8972 is the best value to use." [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [12:21:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2181 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P44621 and previous config saved to /var/cache/conftool/dbconfig/20230214-122154-root.json [12:26:19] (03CR) 10Hnowlan: [C: 03+1] changeprop: use a more generic name for events in liftwing's config [deployment-charts] - 10https://gerrit.wikimedia.org/r/888653 (https://phabricator.wikimedia.org/T328576) (owner: 10Elukey) [12:27:29] (KubernetesCalicoDown) firing: (4) ml-staging-ctrl2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:36:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2181 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P44622 and previous config saved to /var/cache/conftool/dbconfig/20230214-123659-root.json [12:37:08] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] node_pinger: use jumbo frames (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [12:42:34] PROBLEM - Check systemd state on thanos-fe2002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_rsync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:52:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2181 (re)pooling @ 75%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P44623 and previous config saved to /var/cache/conftool/dbconfig/20230214-125203-root.json [12:58:33] (03CR) 10Ottomata: "I know you are still working, just a couple of thoughts on latest patches." [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [13:00:49] (03PS3) 10Jbond: sre.puppet.sync-netbox-hiera: add network devices to netbox hiera export [cookbooks] - 10https://gerrit.wikimedia.org/r/888759 (https://phabricator.wikimedia.org/T329272) [13:02:34] (03CR) 10CI reject: [V: 04-1] sre.puppet.sync-netbox-hiera: add network devices to netbox hiera export [cookbooks] - 10https://gerrit.wikimedia.org/r/888759 (https://phabricator.wikimedia.org/T329272) (owner: 10Jbond) [13:05:35] (03PS4) 10Jbond: sre.puppet.sync-netbox-hiera: add network devices to netbox hiera export [cookbooks] - 10https://gerrit.wikimedia.org/r/888759 (https://phabricator.wikimedia.org/T329272) [13:07:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2181 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P44624 and previous config saved to /var/cache/conftool/dbconfig/20230214-130708-root.json [13:07:14] (03CR) 10CI reject: [V: 04-1] sre.puppet.sync-netbox-hiera: add network devices to netbox hiera export [cookbooks] - 10https://gerrit.wikimedia.org/r/888759 (https://phabricator.wikimedia.org/T329272) (owner: 10Jbond) [13:08:22] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1001.eqiad.wmnet with OS bullseye [13:08:34] (03CR) 10David Caro: [V: 03+1] node_pinger: use jumbo frames (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [13:08:54] (03CR) 10Volans: [C: 03+2] python_deploy: call also a post-deploy target [puppet] - 10https://gerrit.wikimedia.org/r/889113 (owner: 10Volans) [13:08:57] (03PS8) 10David Caro: node_pinger: use jumbo frames [puppet] - 
10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) [13:15:16] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netbox, and 3 others: Netbox: use the netbox to also sync networks and network devices - https://phabricator.wikimedia.org/T329272 (10jbond) [13:20:47] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1001.eqiad.wmnet with reason: host reimage [13:22:29] (PuppetCertificateAboutToExpire) firing: (2) Puppet CA certificate labstore1006.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [13:23:02] (03PS3) 10Clément Goubert: sre.discovery.datacenter: status improvements [cookbooks] - 10https://gerrit.wikimedia.org/r/889108 [13:23:53] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1001.eqiad.wmnet with reason: host reimage [13:26:23] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [13:27:19] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [13:28:47] (03CR) 10Volans: [C: 03+1] "I didn't test the output of status but LGTM, optional nit inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/889108 (owner: 10Clément Goubert) [13:32:43] 10SRE, 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, and 2 others: Deleted files can remain on swift due to race conditions - https://phabricator.wikimedia.org/T168002 (10zhuyifei1999) [13:35:32] 10ops-codfw, 10DBA: db2181 crashed - https://phabricator.wikimedia.org/T328623 (10Marostegui) The data checksum was clean, so I am repooling this host. 
[13:35:48] (03PS1) 10Zabe: beta: Add deployment-db11 and deployment-db12 (part 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889126 (https://phabricator.wikimedia.org/T329577) [13:36:12] (03CR) 10Zabe: [C: 03+2] beta: Add deployment-db11 and deployment-db12 (part 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889126 (https://phabricator.wikimedia.org/T329577) (owner: 10Zabe) [13:36:49] (03Merged) 10jenkins-bot: beta: Add deployment-db11 and deployment-db12 (part 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889126 (https://phabricator.wikimedia.org/T329577) (owner: 10Zabe) [13:36:51] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netbox, and 3 others: Netbox: use the netbox to also sync networks and network devices - https://phabricator.wikimedia.org/T329272 (10jbond) > alarms: true we can set based on the device model (false by default as we have more mx204s, then if mx480: true) J... [13:38:20] (03CR) 10Muehlenhoff: [C: 03+2] mediawiki: drop pybal-check user [puppet] - 10https://gerrit.wikimedia.org/r/886478 (https://phabricator.wikimedia.org/T111899) (owner: 10Majavah) [13:39:14] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:42:29] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:42:43] (03PS1) 10Muehlenhoff: Remove further files related to removed pybal health checks [puppet] - 10https://gerrit.wikimedia.org/r/889127 (https://phabricator.wikimedia.org/T111899) [13:44:06] 
(03CR) 10MVernon: [C: 03+1] "Good catch, sorry I missed this in the first review." [puppet] - 10https://gerrit.wikimedia.org/r/889122 (owner: 10Muehlenhoff) [13:44:23] (03PS1) 10Zabe: beta: Pool deployment-db11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889128 (https://phabricator.wikimedia.org/T329577) [13:45:10] (03CR) 10Zabe: [C: 03+2] beta: Pool deployment-db11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889128 (https://phabricator.wikimedia.org/T329577) (owner: 10Zabe) [13:45:41] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.02387 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:45:52] (03Merged) 10jenkins-bot: beta: Pool deployment-db11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889128 (https://phabricator.wikimedia.org/T329577) (owner: 10Zabe) [13:46:07] (03PS4) 10Clément Goubert: sre.discovery.datacenter: status improvements [cookbooks] - 10https://gerrit.wikimedia.org/r/889108 [13:46:20] (03CR) 10Clément Goubert: sre.discovery.datacenter: status improvements (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/889108 (owner: 10Clément Goubert) [13:46:35] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/889127 (https://phabricator.wikimedia.org/T111899) (owner: 10Muehlenhoff) [13:50:40] (03CR) 10Cathal Mooney: "Overall LGTM, and definitely a good idea. However looking at a cloudcephmon host it has an interface MTU of 1500 set on it? 
Perhaps I di" [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [13:51:30] (03CR) 10Cathal Mooney: node_pinger: use jumbo frames (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [13:53:22] Group[pybal-check] --> that's triggering issues in puppet [13:53:54] taavi: that seems triggered by 1e2f1c0814cdd3547b20c0279b13d45fae07a926 [13:53:55] (03PS1) 10Zabe: Revert "beta: Switch beta to read only on mediawiki level" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889097 [13:54:09] (03CR) 10Zabe: [C: 03+2] Revert "beta: Switch beta to read only on mediawiki level" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889097 (owner: 10Zabe) [13:54:43] (03CR) 10David Caro: node_pinger: use jumbo frames (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [13:54:45] (03Merged) 10jenkins-bot: Revert "beta: Switch beta to read only on mediawiki level" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889097 (owner: 10Zabe) [13:54:53] moritzm: ^^ [13:56:18] Feb 14 13:52:02 mw1351 puppet-agent[28879]: Could not delete group pybal-check: Execution of '/usr/sbin/groupdel pybal-check' returned 8: groupdel: cannot remove the primary group of user 'pybal-check' [13:56:18] Feb 14 13:52:02 mw1351 puppet-agent[28879]: (/Stage[main]/Mediawiki::Users/Group[pybal-check]/ensure) change from 'present' to 'absent' failed: Could not delete group pybal-check: Execution of '/usr/sbin/groupdel pybal-check' returned 8: groupdel: cannot remove the primary group of user 'pybal-check' [13:57:29] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:58:03] (03PS1) 10Elukey: Replace 
underscores with hypens in ml-staging's SRV records [dns] - 10https://gerrit.wikimedia.org/r/889134 (https://phabricator.wikimedia.org/T329556) [13:58:58] hmm seems like a second puppet run clears the issue [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: My dear minions, it's time we take the moon! Just kidding. Time for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230214T1400). [14:00:05] No Gerrit patches in the queue for this window AFAICS. [14:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230214T1400) [14:02:40] (03CR) 10Elukey: [C: 03+2] Replace underscores with hypens in ml-staging's SRV records [dns] - 10https://gerrit.wikimedia.org/r/889134 (https://phabricator.wikimedia.org/T329556) (owner: 10Elukey) [14:03:03] 10ops-codfw, 10DBA: db2181 crashed - https://phabricator.wikimedia.org/T328623 (10Marostegui) 05Open→03Resolved Thanks everyone for all the help! 
[14:03:32] (03CR) 10Andrew Bogott: [C: 03+2] wmcs ceph:Move cloudcephosd1001/1002 to e4 [puppet] - 10https://gerrit.wikimedia.org/r/888659 (https://phabricator.wikimedia.org/T329498) (owner: 10David Caro) [14:04:25] (03PS11) 10Elukey: profile::etcd::v3: add discovery SAN record on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/889084 (https://phabricator.wikimedia.org/T329556) [14:04:27] (03PS1) 10Elukey: role::etcd::v3::ml_etcd::staging: replace discovery endpoint [puppet] - 10https://gerrit.wikimedia.org/r/889136 (https://phabricator.wikimedia.org/T329556) [14:05:17] (03CR) 10Elukey: [C: 03+2] role::etcd::v3::ml_etcd::staging: replace discovery endpoint [puppet] - 10https://gerrit.wikimedia.org/r/889136 (https://phabricator.wikimedia.org/T329556) (owner: 10Elukey) [14:05:57] (03PS1) 10Jgiannelos: mobileapps: Bump to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/889138 [14:06:11] (03CR) 10Muehlenhoff: [C: 03+2] swift::ring_manager: Only enable auto restart on active ring manager nodes [puppet] - 10https://gerrit.wikimedia.org/r/889122 (owner: 10Muehlenhoff) [14:09:04] (03PS12) 10Elukey: profile::etcd::v3: add discovery SAN record on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/889084 (https://phabricator.wikimedia.org/T329556) [14:11:14] !log installing libde265 security updates [14:11:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:54] (03CR) 10Jgiannelos: [C: 03+2] mobileapps: Bump to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/889138 (owner: 10Jgiannelos) [14:12:45] (03CR) 10Atieno: [C: 03+1] Bump Thumbor minor version [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/888034 (https://phabricator.wikimedia.org/T329290) (owner: 10Hnowlan) [14:13:26] (03CR) 10Elukey: [C: 03+2] changeprop: use a more generic name for events in liftwing's config [deployment-charts] - 10https://gerrit.wikimedia.org/r/888653 (https://phabricator.wikimedia.org/T328576) (owner: 
10Elukey) [14:13:44] (03PS1) 10Jgiannelos: proton: Bump to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/889139 [14:14:51] (03CR) 10CDanis: [C: 03+1] profile::etcd::v3: add discovery SAN record on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/889084 (https://phabricator.wikimedia.org/T329556) (owner: 10Elukey) [14:16:57] (03Merged) 10jenkins-bot: mobileapps: Bump to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/889138 (owner: 10Jgiannelos) [14:17:37] (JobUnavailable) firing: (2) Reduced availability for job calico-felix in k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:17:45] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [14:18:18] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [14:18:35] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [14:19:06] (03CR) 10Jgiannelos: [C: 03+2] proton: Bump to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/889139 (owner: 10Jgiannelos) [14:19:16] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/886039 (owner: 10Slyngshede) [14:19:28] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [14:20:17] !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [14:20:32] (03PS9) 10David Caro: node_pinger: use jumbo frames [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) [14:21:01] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [14:21:16] !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/mobileapps: 
apply [14:21:45] (03PS1) 10Cathal Mooney: Adjust interface names for cloudcephosd1001 and cloudcephosd1002 [puppet] - 10https://gerrit.wikimedia.org/r/889142 (https://phabricator.wikimedia.org/T329498) [14:22:33] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [14:22:38] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/889142 (https://phabricator.wikimedia.org/T329498) (owner: 10Cathal Mooney) [14:22:51] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/proton: apply [14:22:53] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/proton: apply [14:23:25] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/proton: apply [14:23:27] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/proton: apply [14:23:29] (03CR) 10David Caro: [C: 03+2] Adjust interface names for cloudcephosd1001 and cloudcephosd1002 [puppet] - 10https://gerrit.wikimedia.org/r/889142 (https://phabricator.wikimedia.org/T329498) (owner: 10Cathal Mooney) [14:23:34] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/proton: apply [14:23:36] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/proton: apply [14:23:49] (03Merged) 10jenkins-bot: proton: Bump to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/889139 (owner: 10Jgiannelos) [14:24:30] (03CR) 10Elukey: [C: 03+2] profile::etcd::v3: add discovery SAN record on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/889084 (https://phabricator.wikimedia.org/T329556) (owner: 10Elukey) [14:24:33] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/proton: apply [14:25:28] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/proton: apply [14:26:37] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/proton: apply [14:27:26] 
(03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39581/console" [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [14:28:33] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/proton: apply [14:28:43] !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/proton: apply [14:30:22] !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/proton: apply [14:40:08] 10SRE, 10SRE-swift-storage, 10Data-Persistence, 10Thumbor Migration: Pooling thumbor-k8s causes spikes in swift 500 errors - https://phabricator.wikimedia.org/T328033 (10Ladsgroup) Super stupid question: Would this help here? https://gerrit.wikimedia.org/r/c/operations/puppet/+/888657/2/modules/thumbor/fil... [14:40:13] (03CR) 10Herron: [C: 03+1] "LGTM, it'd give a slightly more accurate representation of node health too" [puppet] - 10https://gerrit.wikimedia.org/r/889083 (https://phabricator.wikimedia.org/T320702) (owner: 10Filippo Giunchedi) [14:40:18] !log andrew@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin2002" [14:41:08] !log elukey@cumin1001 START - Cookbook sre.k8s.upgrade-cluster Upgrade K8s version: Upgrade ml-staging-codfw cluster to 1.23 [14:41:21] !log andrew@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin2002" [14:41:28] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1001.eqiad.wmnet with OS bullseye [14:42:21] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.000994 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [14:43:59] !log 
elukey@cumin1001 END (FAIL) - Cookbook sre.k8s.upgrade-cluster (exit_code=99) Upgrade K8s version: Upgrade ml-staging-codfw cluster to 1.23 [14:47:29] (JobUnavailable) firing: (3) Reduced availability for job calico-felix in k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:48:33] (03CR) 10Filippo Giunchedi: [C: 03+2] logs-api: allow GET / only for health check [puppet] - 10https://gerrit.wikimedia.org/r/889083 (https://phabricator.wikimedia.org/T320702) (owner: 10Filippo Giunchedi) [14:49:14] (JobUnavailable) firing: (6) Reduced availability for job calico-felix in k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:53:15] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:54:05] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:54:49] !log roll-restart pybal in eqiad/codfw to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/889083 [14:54:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:31] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1002.eqiad.wmnet with OS bullseye [14:59:41] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2448.codfw.wmnet with OS buster [14:59:47] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2448.codfw.wmnet with OS buster [15:01:50] !log pt1979@cumin2002 
START - Cookbook sre.hosts.reimage for host mw2449.codfw.wmnet with OS buster [15:01:57] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2449.codfw.wmnet with OS buster [15:04:17] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2450.codfw.wmnet with OS buster [15:04:24] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2450.codfw.wmnet with OS buster [15:05:49] !log installing openjdk-11 security updates [15:05:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:22] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2451.codfw.wmnet with OS buster [15:07:31] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2451.codfw.wmnet with OS buster [15:08:38] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10Papaul) [15:10:37] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1002.eqiad.wmnet with reason: host reimage [15:11:43] (03PS1) 10Zabe: beta: Pool deployment-db12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889150 (https://phabricator.wikimedia.org/T329577) [15:13:19] (03CR) 10Zabe: [C: 03+2] beta: Pool deployment-db12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889150 (https://phabricator.wikimedia.org/T329577) (owner: 10Zabe) [15:13:42] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcephosd1002.eqiad.wmnet with reason: host reimage [15:13:54] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by zabe@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889150 (https://phabricator.wikimedia.org/T329577) (owner: 10Zabe) [15:13:56] (03Merged) 10jenkins-bot: beta: Pool deployment-db12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889150 (https://phabricator.wikimedia.org/T329577) (owner: 10Zabe) [15:14:04] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/889108 (owner: 10Clément Goubert) [15:15:30] 10SRE-swift-storage: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253 (10MatthewVernon) We're due another full backup of swift contents in the next few days, but I think we need a cookbook or similar to script handling these. In outline, assuming we specify eqia... [15:15:32] (03CR) 10Clément Goubert: [C: 03+2] sre.discovery.datacenter: status improvements [cookbooks] - 10https://gerrit.wikimedia.org/r/889108 (owner: 10Clément Goubert) [15:16:04] (03CR) 10Volans: "suggestion inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/889133 (owner: 10Clément Goubert) [15:17:12] (03Merged) 10jenkins-bot: sre.discovery.datacenter: status improvements [cookbooks] - 10https://gerrit.wikimedia.org/r/889108 (owner: 10Clément Goubert) [15:19:45] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2448.codfw.wmnet with reason: host reimage [15:21:19] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2449.codfw.wmnet with reason: host reimage [15:21:29] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1002.eqiad.wmnet with OS bullseye [15:22:58] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2448.codfw.wmnet with reason: host reimage [15:23:30] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on 
mw2450.codfw.wmnet with reason: host reimage [15:24:03] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1002.eqiad.wmnet with OS bullseye [15:24:45] (03PS1) 10Elukey: sre.k8s.upgrade-cluster: wrap run_sync actions with try/except [cookbooks] - 10https://gerrit.wikimedia.org/r/889151 (https://phabricator.wikimedia.org/T327767) [15:25:24] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2449.codfw.wmnet with reason: host reimage [15:26:31] (03PS2) 10Elukey: sre.k8s.upgrade-cluster: wrap run_sync actions with try/except [cookbooks] - 10https://gerrit.wikimedia.org/r/889151 (https://phabricator.wikimedia.org/T327767) [15:27:19] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2451.codfw.wmnet with reason: host reimage [15:27:56] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2450.codfw.wmnet with reason: host reimage [15:30:25] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2451.codfw.wmnet with reason: host reimage [15:32:08] (03PS1) 10Muehlenhoff: Fail over to idp1002 [dns] - 10https://gerrit.wikimedia.org/r/889153 [15:33:20] (03CR) 10Btullis: [C: 03+1] profile::etcd::v3: add discovery SAN record on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/889084 (https://phabricator.wikimedia.org/T329556) (owner: 10Elukey) [15:34:58] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1002.eqiad.wmnet with reason: host reimage [15:35:54] (03PS10) 10David Caro: node_pinger: use jumbo frames [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) [15:38:01] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1002.eqiad.wmnet with reason: host reimage [15:39:09] (03PS1) 10Hnowlan: changeprop, jobqueue: bump container version 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/889154 [15:39:22] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [15:39:26] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39582/console" [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [15:41:25] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [15:43:30] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1002.eqiad.wmnet with OS bullseye [15:44:04] (03PS1) 10Bking: rdf-streaming-updater: Use S3 instead of Swift for bucket access [deployment-charts] - 10https://gerrit.wikimedia.org/r/889155 (https://phabricator.wikimedia.org/T304914) [15:44:36] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [15:45:32] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [15:45:53] (03CR) 10Hnowlan: [C: 03+2] changeprop, jobqueue: bump container version [deployment-charts] - 10https://gerrit.wikimedia.org/r/889154 (owner: 10Hnowlan) [15:48:07] (03PS1) 10CDanis: pki: Add intermediates for aux k8s cluster (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/889158 (https://phabricator.wikimedia.org/T329633) [15:48:24] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1002.eqiad.wmnet with OS bullseye [15:49:14] (03PS2) 10CDanis: pki: Add intermediates for aux k8s cluster (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/889158 
(https://phabricator.wikimedia.org/T329633) [15:49:59] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: sync [15:50:04] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/889158 (https://phabricator.wikimedia.org/T329633) (owner: 10CDanis) [15:50:10] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: sync [15:50:25] jouncebot: now [15:50:25] No deployments scheduled for the next 1 hour(s) and 9 minute(s) [15:50:33] !log bking@deploy1002 'deploying rdf-streaming-updater prod eqiad T304914' [15:50:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:37] T304914: Remove the presto client for swift from the flink image - https://phabricator.wikimedia.org/T304914 [15:51:49] (03Merged) 10jenkins-bot: changeprop, jobqueue: bump container version [deployment-charts] - 10https://gerrit.wikimedia.org/r/889154 (owner: 10Hnowlan) [15:52:30] !log uploaded src:icu67 67.1-7~wmf1 to buster-wikimedia/component/icu67 T329491 [15:52:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:34] T329491: ICU transition towards ICU 67 - https://phabricator.wikimedia.org/T329491 [15:53:36] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [15:54:06] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [15:55:15] (03PS3) 10CDanis: pki: Add intermediates for aux k8s cluster (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/889158 (https://phabricator.wikimedia.org/T329633) [15:55:20] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/889158 (https://phabricator.wikimedia.org/T329633) (owner: 10CDanis) [15:55:50] PROBLEM - Disk space on thanos-be2001 is CRITICAL: DISK CRITICAL - free space: / 1985 MB (3% inode=97%): /srv/swift-storage/sda3 10261 MB (5% inode=99%): /tmp 1985 MB (3% inode=97%): /var/tmp 1985 MB (3% 
inode=97%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2001&var-datasource=codfw+prometheus/ops [15:56:21] (03CR) 10Ahmon Dancy: "ok w/ me." [puppet] - 10https://gerrit.wikimedia.org/r/888828 (https://phabricator.wikimedia.org/T329216) (owner: 10Legoktm) [15:56:25] (03CR) 10Ahmon Dancy: [C: 03+1] gitlab_runner: Set pull_policy = ["always", "if-not-present"] on WMCS runners [puppet] - 10https://gerrit.wikimedia.org/r/888828 (https://phabricator.wikimedia.org/T329216) (owner: 10Legoktm) [15:56:46] (03PS1) 10Krinkle: webperf: Remove broken HeaderName/ReadmeName for arclamp file listing [puppet] - 10https://gerrit.wikimedia.org/r/889161 [15:57:00] (03PS4) 10CDanis: pki: Add intermediates for aux k8s cluster (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/889158 (https://phabricator.wikimedia.org/T329633) [15:57:05] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/889158 (https://phabricator.wikimedia.org/T329633) (owner: 10CDanis) [15:57:38] (03PS1) 10Ottomata: Produce rc1.mediawik.page_change to eventgate-main [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889162 [15:59:03] (03PS2) 10Ottomata: Produce rc1.mediawik.page_change to eventgate-main [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889162 [15:59:07] (03Abandoned) 10Sbailey: Enable Linter migration scripts for namespace and tag and template [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888111 (https://phabricator.wikimedia.org/T329342) (owner: 10Sbailey) [15:59:12] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [15:59:18] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1002.eqiad.wmnet with reason: host reimage [16:00:49] (WdqsStreamingUpdaterFlinkJobNotRunning) firing: WDQS_Streaming_Updater in eqiad (k8s) is not running - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning [16:01:42] (03PS1) 10Btullis: Remove the ores::base class from the analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/889164 (https://phabricator.wikimedia.org/T329363) [16:02:27] 10SRE, 10DNS, 10Traffic-Icebox, 10Wikimedia-Apache-configuration, 10Patch-For-Review: Remove aliases `minnan` and `zh-cfr` for the Min Nan Wikipedia - https://phabricator.wikimedia.org/T230382 (10Ladsgroup) hmm, it's not too complicated, my only concern is the order they should go in, I don't think that... [16:02:41] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1002.eqiad.wmnet with reason: host reimage [16:04:00] (03CR) 10Ottomata: [C: 03+2] Produce rc1.mediawik.page_change to eventgate-main [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889162 (owner: 10Ottomata) [16:04:23] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [16:04:24] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2449.codfw.wmnet with OS buster [16:04:25] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [16:04:26] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2450.codfw.wmnet with OS buster [16:04:26] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [16:04:27] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage 
(exit_code=0) for host mw2448.codfw.wmnet with OS buster [16:04:27] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [16:04:28] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2451.codfw.wmnet with OS buster [16:04:31] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2449.codfw.wmnet with OS buster completed: - mw2449 (**PASS**) - Removed from Pupp... [16:04:34] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2450.codfw.wmnet with OS buster completed: - mw2450 (**PASS**) - Removed from Pupp... [16:04:37] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2448.codfw.wmnet with OS buster completed: - mw2448 (**PASS**) - Removed from Pupp... [16:04:40] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2451.codfw.wmnet with OS buster completed: - mw2451 (**PASS**) - Removed from Pupp... 
[16:04:53] (03Merged) 10jenkins-bot: Produce rc1.mediawik.page_change to eventgate-main [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889162 (owner: 10Ottomata) [16:05:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [16:05:57] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39583/console" [puppet] - 10https://gerrit.wikimedia.org/r/889164 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [16:06:05] (03CR) 10Zabe: [C: 03+1] Remove aliases 'minnan' and 'zh-cfr' [dns] - 10https://gerrit.wikimedia.org/r/529829 (https://phabricator.wikimedia.org/T230382) (owner: 10Fomafix) [16:06:13] (03CR) 10Zabe: [C: 03+1] Remove aliases 'minnan' and 'zh-cfr' [puppet] - 10https://gerrit.wikimedia.org/r/529830 (https://phabricator.wikimedia.org/T230382) (owner: 10Fomafix) [16:09:11] (03CR) 10David Caro: puppet: improvements to replica_cnf_api functional tests (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/888827 (https://phabricator.wikimedia.org/T303663) (owner: 10Raymond Ndibe) [16:09:36] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [16:11:37] (03CR) 10JHathaway: [V: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/889158 (https://phabricator.wikimedia.org/T329633) (owner: 10CDanis) [16:12:07] (03CR) 10CDanis: [C: 03+2] pki: Add intermediates for aux k8s cluster (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/889158 (https://phabricator.wikimedia.org/T329633) (owner: 10CDanis) [16:12:17] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netbox, and 3 others: Netbox: use the netbox to also sync networks and network devices - 
https://phabricator.wikimedia.org/T329272 (10jbond) > The OOB ones are tricky and should probably be kept for last, probably by fetching the OOB circuits, and not the d... [16:12:21] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [16:12:56] (03CR) 10Jbond: [C: 03+1] "lgtm thanks" [puppet] - 10https://gerrit.wikimedia.org/r/889117 (owner: 10Volans) [16:13:17] (03PS11) 10David Caro: node_pinger: use jumbo frames [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) [16:13:19] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [16:14:51] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39584/console" [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [16:14:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST events) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:15:05] (03CR) 10David Caro: node_pinger: use jumbo frames (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [16:15:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - 
https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [16:16:13] (03CR) 10Volans: [C: 03+2] Rake taskgen: use shellcheck from $PATH [puppet] - 10https://gerrit.wikimedia.org/r/889117 (owner: 10Volans) [16:16:34] !log andrew@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin2002" [16:16:37] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39586/console" [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [16:17:12] !log otto@deploy1002 Synchronized wmf-config/InitialiseSettings.php: wgEventStreams - Produce rc1.mediawiki.page_change to eventgate-main (duration: 09m 01s) [16:18:32] (03PS12) 10David Caro: node_pinger: use jumbo frames [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) [16:18:34] (03PS1) 10Btullis: Do not install spark2 on bullseye or later [puppet] - 10https://gerrit.wikimedia.org/r/889166 (https://phabricator.wikimedia.org/T329363) [16:19:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST events) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:22:19] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [16:22:42] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39588/console" [puppet] - 
10https://gerrit.wikimedia.org/r/889166 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [16:22:45] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [16:23:39] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39589/console" [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [16:25:29] (03CR) 10Elukey: Remove the ores::base class from the analytics cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/889164 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [16:26:27] (03CR) 10Elukey: [C: 03+1] k8s::package: Ensure the apt component is registered first [puppet] - 10https://gerrit.wikimedia.org/r/887981 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [16:27:02] (03CR) 10Jbond: [C: 03+1] "LGTM thanks" [puppet] - 10https://gerrit.wikimedia.org/r/888065 (owner: 10Herron) [16:27:19] (WdqsStreamingUpdaterFlinkJobNotRunning) resolved: WDQS_Streaming_Updater in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning [16:27:29] (KubernetesCalicoDown) firing: (4) ml-staging-ctrl2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:27:42] !log andrew@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin2002" [16:27:48] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1002.eqiad.wmnet with OS bullseye [16:27:52] !log hnowlan@deploy1002 helmfile [codfw] 
START helmfile.d/services/changeprop-jobqueue: apply [16:28:02] (03CR) 10David Caro: [V: 03+1] "New version ready for review, here's an output of a run of the script (manually copied from the pcc output):" [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [16:28:54] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [16:29:07] (03PS13) 10David Caro: node_pinger: use jumbo frames [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) [16:29:15] (03PS2) 10Btullis: Remove the ores::base class from the analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/889164 (https://phabricator.wikimedia.org/T329363) [16:29:27] (03CR) 10Btullis: Remove the ores::base class from the analytics cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/889164 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [16:29:45] (03PS1) 10Ottomata: eventgate-main - bump to image version 2023-02-14-162241-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/889171 [16:29:47] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [16:30:53] (03PS2) 10Ottomata: eventgate-main - bump to image version 2023-02-14-162241-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/889171 [16:31:35] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [16:32:19] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [16:32:40] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate-main - bump to image version 
2023-02-14-162241-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/889171 (owner: 10Ottomata) [16:33:10] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: apply [16:33:13] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1002.eqiad.wmnet with OS bullseye [16:33:13] (03PS5) 10Herron: service::catalog: add prometheus-https [puppet] - 10https://gerrit.wikimedia.org/r/863380 (https://phabricator.wikimedia.org/T301944) [16:33:22] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: apply [16:34:01] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-main: apply [16:34:20] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply [16:34:33] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop: apply [16:34:51] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [16:35:53] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: apply [16:36:01] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-main: apply [16:36:08] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [16:36:32] (03CR) 10BCornwall: [C: 03+1] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/889127 (https://phabricator.wikimedia.org/T111899) (owner: 10Muehlenhoff) [16:36:58] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply [16:37:03] (03CR) 10Herron: [V: 03+1] "PCC SUCCESS (NOOP 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39591/console" [puppet] - 10https://gerrit.wikimedia.org/r/863380 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron) [16:37:45] !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply [16:38:28] (03PS1) 10Bking: rdf-streaming-updater: Increase memory alloc from 2 to 3GB [deployment-charts] - 10https://gerrit.wikimedia.org/r/889172 (https://phabricator.wikimedia.org/T302494) [16:38:41] !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply [16:39:01] (03CR) 10Herron: [C: 03+2] rsync: remove rsync::server::wrap_with_stunnel [puppet] - 10https://gerrit.wikimedia.org/r/888065 (owner: 10Herron) [16:39:48] (03CR) 10Herron: [C: 03+2] "thanks for the reviews!" [puppet] - 10https://gerrit.wikimedia.org/r/888065 (owner: 10Herron) [16:44:06] (03CR) 10Andrew Bogott: "One comment inline; the setup/teardown seems good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/888827 (https://phabricator.wikimedia.org/T303663) (owner: 10Raymond Ndibe) [16:45:48] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1002.eqiad.wmnet with reason: host reimage [16:47:07] (03PS1) 10Ottomata: wgEventStreams - rc1.mediawiki.page_change: enable on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889174 [16:48:28] (03CR) 10Raymond Ndibe: puppet: improvements to replica_cnf_api functional tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/888827 (https://phabricator.wikimedia.org/T303663) (owner: 10Raymond Ndibe) [16:48:29] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1002.eqiad.wmnet with reason: host reimage [16:49:24] (03CR) 10Ottomata: [C: 03+2] wgEventStreams - rc1.mediawiki.page_change: enable on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889174 (owner: 10Ottomata) [16:50:03] (03Merged) 10jenkins-bot: wgEventStreams - rc1.mediawiki.page_change: enable on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889174 (owner: 10Ottomata) [16:50:28] (03CR) 10Raymond Ndibe: puppet: improvements to replica_cnf_api functional tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/888827 (https://phabricator.wikimedia.org/T303663) (owner: 10Raymond Ndibe) [16:50:50] (03CR) 10Raymond Ndibe: puppet: improvements to replica_cnf_api functional tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/888827 (https://phabricator.wikimedia.org/T303663) (owner: 10Raymond Ndibe) [16:52:56] (03PS2) 10Clément Goubert: sre.discovery.datacenter: ConfctlError handling [cookbooks] - 10https://gerrit.wikimedia.org/r/889133 [16:53:15] (03CR) 10Clément Goubert: sre.discovery.datacenter: ConfctlError handling (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/889133 (owner: 10Clément Goubert) [16:53:41] 10SRE-swift-storage: >=27k objects listed in swift 
containers but not extant - https://phabricator.wikimedia.org/T327253 (10akosiaris) >>! In T327253#8614005, @MatthewVernon wrote: > Whatever you've found is not the same issue as with ghost objects - a ghost object as defined here is one which appears in `swift... [16:53:47] (03CR) 10Vgutierrez: [C: 04-1] "from PCC output proxyfetch URL doesn't look good: proxyfetch.url = ["http://prometheus/"]" [puppet] - 10https://gerrit.wikimedia.org/r/863380 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron) [16:55:43] (03CR) 10JMeybohm: [C: 04-1] sre.k8s.upgrade-cluster: wrap run_sync actions with try/except (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/889151 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [16:56:06] (03PS3) 10Clément Goubert: sre.discovery.datacenter: ConfctlError handling [cookbooks] - 10https://gerrit.wikimedia.org/r/889133 [16:56:25] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1002.eqiad.wmnet with OS bullseye [16:57:32] (03CR) 10Raymond Ndibe: puppet: improvements to replica_cnf_api functional tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/888827 (https://phabricator.wikimedia.org/T303663) (owner: 10Raymond Ndibe) [16:58:04] (03CR) 10CI reject: [V: 04-1] sre.discovery.datacenter: ConfctlError handling [cookbooks] - 10https://gerrit.wikimedia.org/r/889133 (owner: 10Clément Goubert) [16:58:35] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1002.eqiad.wmnet with OS bullseye [16:59:03] (03PS3) 10Elukey: sre.k8s.upgrade-cluster: wrap run_sync actions with try/except [cookbooks] - 10https://gerrit.wikimedia.org/r/889151 (https://phabricator.wikimedia.org/T327767) [16:59:22] (03CR) 10Elukey: sre.k8s.upgrade-cluster: wrap run_sync actions with try/except (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/889151 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [17:00:04] jbond and rzl: #bothumor Q:Why 
did functions stop calling each other? A:They had arguments. Rise for Puppet request window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230214T1700). [17:00:04] No Gerrit patches in the queue for this window AFAICS. [17:00:16] (03CR) 10Elukey: [C: 03+1] Remove the ores::base class from the analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/889164 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [17:01:34] (03PS6) 10Jcrespo: Add unit tests & coverage report [software/mediabackups] - 10https://gerrit.wikimedia.org/r/885428 [17:03:03] (03PS1) 10CDanis: pki: Add intermediates for aux k8s cluster (2/2) [puppet] - 10https://gerrit.wikimedia.org/r/889175 (https://phabricator.wikimedia.org/T329633) [17:04:21] (03PS4) 10Clément Goubert: sre.discovery.datacenter: ConfctlError handling [cookbooks] - 10https://gerrit.wikimedia.org/r/889133 [17:05:02] !log otto@deploy1002 Synchronized wmf-config/InitialiseSettings.php: wgEventStreams - rc1.mediawiki.page_change: enable on all wikis (duration: 07m 11s) [17:05:18] (03CR) 10DCausse: [C: 03+1] "lgtm," [deployment-charts] - 10https://gerrit.wikimedia.org/r/889172 (https://phabricator.wikimedia.org/T302494) (owner: 10Bking) [17:05:47] (03PS1) 10CDanis: pki: dummy secrets for k8s_aux intermediates [labs/private] - 10https://gerrit.wikimedia.org/r/889176 (https://phabricator.wikimedia.org/T329633) [17:05:57] (03CR) 10JMeybohm: "This won't work as the current maximum memory request of a container is 3Gi by default (see helmfile.d/admin_ng/values/common.yaml)." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/889172 (https://phabricator.wikimedia.org/T302494) (owner: 10Bking) [17:06:01] (03CR) 10JMeybohm: [C: 04-1] rdf-streaming-updater: Increase memory alloc from 2 to 3GB [deployment-charts] - 10https://gerrit.wikimedia.org/r/889172 (https://phabricator.wikimedia.org/T302494) (owner: 10Bking) [17:06:19] (03CR) 10CDanis: [V: 03+2 C: 03+2] pki: dummy secrets for k8s_aux intermediates [labs/private] - 10https://gerrit.wikimedia.org/r/889176 (https://phabricator.wikimedia.org/T329633) (owner: 10CDanis) [17:07:01] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] k8s::package: Ensure the apt component is registered first [puppet] - 10https://gerrit.wikimedia.org/r/887981 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [17:09:48] (03CR) 10Btullis: [C: 03+2] Remove the ores::base class from the analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/889164 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [17:09:48] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1002.eqiad.wmnet with reason: host reimage [17:09:56] (03PS2) 10CDanis: pki: Add intermediates for aux k8s cluster (2/2) [puppet] - 10https://gerrit.wikimedia.org/r/889175 (https://phabricator.wikimedia.org/T329633) [17:10:11] (03PS4) 10Elukey: sre.k8s.upgrade-cluster: wrap run_sync actions with try/except [cookbooks] - 10https://gerrit.wikimedia.org/r/889151 (https://phabricator.wikimedia.org/T327767) [17:10:13] (03PS3) 10CDanis: pki: Add intermediates for aux k8s cluster (2/2) [puppet] - 10https://gerrit.wikimedia.org/r/889175 (https://phabricator.wikimedia.org/T329633) [17:12:50] (03PS3) 10Nray: Enable Page Tools for logged in users across all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888764 (https://phabricator.wikimedia.org/T328692) (owner: 10Bernard Wang) [17:12:56] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcephosd1002.eqiad.wmnet with reason: host reimage [17:13:09] (03CR) 10David Caro: node_pinger: use jumbo frames (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [17:13:53] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/889175 (https://phabricator.wikimedia.org/T329633) (owner: 10CDanis) [17:16:33] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [17:17:45] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:18:15] !log aokoth@cumin1001 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on vrts2001.codfw.wmnet with reason: installation failed due to read-only database [17:18:28] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on vrts2001.codfw.wmnet with reason: installation failed due to read-only database [17:19:51] (03PS13) 10Jbond: sre.puppet.sync-netbox-hiera: Use netbox GraphQL endpoint to fetch data [cookbooks] - 10https://gerrit.wikimedia.org/r/888051 [17:20:43] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [17:21:46] (03PS5) 10Jbond: sre.puppet.sync-netbox-hiera: add network devices to netbox hiera export [cookbooks] - 10https://gerrit.wikimedia.org/r/888759 (https://phabricator.wikimedia.org/T329272) [17:22:29] (PuppetCertificateAboutToExpire) firing: (2) Puppet CA certificate labstore1006.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [17:23:27] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS records for cloudcephosd1002 - cmooney@cumin1001" [17:23:29] (03CR) 10CI reject: [V: 04-1] sre.puppet.sync-netbox-hiera: add network devices to netbox hiera export [cookbooks] - 
10https://gerrit.wikimedia.org/r/888759 (https://phabricator.wikimedia.org/T329272) (owner: 10Jbond) [17:24:43] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS records for cloudcephosd1002 - cmooney@cumin1001" [17:24:43] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:25:02] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [17:27:04] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS records for cloudcephosd1002 - cmooney@cumin1001" [17:28:08] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS records for cloudcephosd1002 - cmooney@cumin1001" [17:28:08] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:28:09] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns4004.wikimedia.org with OS bullseye [17:28:18] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns4004.wikimedia.org with OS bullseye [17:28:24] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [17:29:11] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1002.eqiad.wmnet with OS bullseye [17:29:26] (03PS6) 10Herron: service::catalog: add prometheus-https [puppet] - 10https://gerrit.wikimedia.org/r/863380 (https://phabricator.wikimedia.org/T301944) [17:30:26] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS records for cloudcephosd1002 - cmooney@cumin1001" [17:31:32] !log cmooney@cumin1001 END (PASS) - 
Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS records for cloudcephosd1002 - cmooney@cumin1001" [17:31:32] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:32:13] PROBLEM - Host 2620:0:863:1:198:35:26:8 is DOWN: PING CRITICAL - Packet loss = 100% [17:32:14] (03PS9) 10Elukey: services: add the first lift wing stream to change-prop [deployment-charts] - 10https://gerrit.wikimedia.org/r/886918 (https://phabricator.wikimedia.org/T328576) [17:32:29] (03CR) 10Elukey: services: add the first lift wing stream to change-prop (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/886918 (https://phabricator.wikimedia.org/T328576) (owner: 10Elukey) [17:32:35] PROBLEM - BFD status on cr3-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:32:38] ^ expected [17:32:41] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:32:42] ack [17:32:59] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:33:17] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:33:26] (03PS4) 10CDanis: pki: Add intermediates for aux k8s cluster (2/2) [puppet] - 10https://gerrit.wikimedia.org/r/889175 (https://phabricator.wikimedia.org/T329633) [17:33:28] (03PS1) 10CDanis: pki: Again add intermediates for aux k8s cluster (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/889181 (https://phabricator.wikimedia.org/T329633) [17:33:53] PROBLEM - Recursive DNS on 198.35.26.8 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [17:34:02] ^ expected [17:34:32] (03CR) 10CDanis: [C: 03+2] pki: Again add 
intermediates for aux k8s cluster (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/889181 (https://phabricator.wikimedia.org/T329633) (owner: 10CDanis) [17:34:40] (03CR) 10Legoktm: [C: 03+1] "Thanks, your changes make sense to me!" [puppet] - 10https://gerrit.wikimedia.org/r/888828 (https://phabricator.wikimedia.org/T329216) (owner: 10Legoktm) [17:35:12] (03PS10) 10Legoktm: gitlab_runner: Set pull_policy = ["always", "if-not-present"] on WMCS runners [puppet] - 10https://gerrit.wikimedia.org/r/888828 (https://phabricator.wikimedia.org/T329216) [17:35:36] (03PS7) 10Herron: service::catalog: add prometheus-https [puppet] - 10https://gerrit.wikimedia.org/r/863380 (https://phabricator.wikimedia.org/T301944) [17:36:27] (03CR) 10Vgutierrez: service::catalog: add prometheus-https (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/863380 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron) [17:36:58] (03CR) 10CI reject: [V: 04-1] gitlab_runner: Set pull_policy = ["always", "if-not-present"] on WMCS runners [puppet] - 10https://gerrit.wikimedia.org/r/888828 (https://phabricator.wikimedia.org/T329216) (owner: 10Legoktm) [17:37:24] (03PS11) 10Legoktm: gitlab_runner: Set pull_policy = ["always", "if-not-present"] on WMCS runners [puppet] - 10https://gerrit.wikimedia.org/r/888828 (https://phabricator.wikimedia.org/T329216) [17:37:29] (JobUnavailable) firing: (8) Reduced availability for job calico-felix in k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:37:39] (03CR) 10Herron: [V: 03+1] "PCC SUCCESS (NOOP 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39593/console" [puppet] - 10https://gerrit.wikimedia.org/r/863380 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron) [17:39:34] (03PS5) 10CDanis: pki: Add intermediates for aux k8s cluster (2/2) 
[puppet] - 10https://gerrit.wikimedia.org/r/889175 (https://phabricator.wikimedia.org/T329633) [17:43:28] (03CR) 10Herron: [V: 03+1] service::catalog: add prometheus-https (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/863380 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron) [17:44:29] (03PS1) 10CDanis: rename k8s aux [labs/private] - 10https://gerrit.wikimedia.org/r/889185 (https://phabricator.wikimedia.org/T329633) [17:44:42] (03CR) 10CDanis: [V: 03+2 C: 03+2] rename k8s aux [labs/private] - 10https://gerrit.wikimedia.org/r/889185 (https://phabricator.wikimedia.org/T329633) (owner: 10CDanis) [17:45:48] (03CR) 10JHathaway: [C: 03+1] "looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/889175 (https://phabricator.wikimedia.org/T329633) (owner: 10CDanis) [17:45:58] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns4004.wikimedia.org with reason: host reimage [17:46:05] RECOVERY - Host 2620:0:863:1:198:35:26:8 is UP: PING OK - Packet loss = 0%, RTA = 70.93 ms [17:48:46] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns4004.wikimedia.org with reason: host reimage [17:49:29] PROBLEM - Recursive DNS on 2620:0:863:1:198:35:26:8 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [17:49:45] (03PS6) 10CDanis: pki: Add intermediates for aux k8s cluster (2/2) [puppet] - 10https://gerrit.wikimedia.org/r/889175 (https://phabricator.wikimedia.org/T329633) [17:49:51] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/889175 (https://phabricator.wikimedia.org/T329633) (owner: 10CDanis) [17:51:06] (03CR) 10JMeybohm: [C: 03+1] pki: Add intermediates for aux k8s cluster (2/2) [puppet] - 10https://gerrit.wikimedia.org/r/889175 (https://phabricator.wikimedia.org/T329633) (owner: 10CDanis) [17:51:18] (03CR) 10David Caro: puppet: improvements to replica_cnf_api functional tests (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/888827 (https://phabricator.wikimedia.org/T303663) (owner: 10Raymond Ndibe) [17:53:19] (03CR) 10CDanis: [C: 03+2] pki: Add intermediates for aux k8s cluster (2/2) [puppet] - 10https://gerrit.wikimedia.org/r/889175 (https://phabricator.wikimedia.org/T329633) (owner: 10CDanis) [17:53:45] (03CR) 10JMeybohm: [V: 03+1 C: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39595/console" [puppet] - 10https://gerrit.wikimedia.org/r/889175 (https://phabricator.wikimedia.org/T329633) (owner: 10CDanis) [17:57:29] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230214T1800) [18:00:18] (03CR) 10JMeybohm: [C: 03+1] sre.k8s.upgrade-cluster: wrap run_sync actions with try/except (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/889151 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [18:02:25] (03CR) 10David Caro: "I got some questions, can you elaborate on what errors are you seeing with the tests currently?" [puppet] - 10https://gerrit.wikimedia.org/r/888827 (https://phabricator.wikimedia.org/T303663) (owner: 10Raymond Ndibe) [18:05:48] (03PS1) 10Bas dehaan: Added extended confirmed on nlwiki Implemented configuration changes regarding page protection for nlwiki, per request of local community. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888736 (https://phabricator.wikimedia.org/T329642) [18:05:50] (03CR) 10Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! 
:) To learn how to get your code changes reviewed faster and more likely to get" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888736 (https://phabricator.wikimedia.org/T329642) (owner: 10Bas dehaan) [18:06:17] (03PS1) 10CDanis: role::aux_k8s: upgrade cluster settings for k8s 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/889189 (https://phabricator.wikimedia.org/T329633) [18:06:20] 10SRE, 10ops-codfw, 10ops-eqiad, 10DC-Ops, 10serviceops: Update iDRAC and NIC firmware on mc-gp* hosts - https://phabricator.wikimedia.org/T329323 (10Papaul) [18:06:25] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['mc-gp1002'] [18:10:30] !log refactored failed security patch for T278365 [18:10:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:13] !log upgrading firmware on mc-gp1002 [18:15:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:50] 10SRE, 10ops-codfw, 10SRE Observability (FY2022/2023-Q3): Decommission netmon2001 - https://phabricator.wikimedia.org/T322695 (10Papaul) [18:16:19] (03PS1) 10TrainBranchBot: testwikis wikis to 1.40.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889191 (https://phabricator.wikimedia.org/T325586) [18:16:21] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.40.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889191 (https://phabricator.wikimedia.org/T325586) (owner: 10TrainBranchBot) [18:16:26] 10SRE, 10ops-codfw, 10SRE Observability (FY2022/2023-Q3): Decommission netmon2001 - https://phabricator.wikimedia.org/T322695 (10Papaul) 05Open→03Resolved Complete [18:16:45] (03PS1) 10CDanis: admin_ng: update aux's settings for k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/889194 [18:16:47] (03PS1) 10Jbond: sre.puppet.sync-netbox-hiera: Add network data to the hiera files [cookbooks] - 10https://gerrit.wikimedia.org/r/889195 [18:16:59] (03Merged) 10jenkins-bot: testwikis 
wikis to 1.40.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889191 (https://phabricator.wikimedia.org/T325586) (owner: 10TrainBranchBot) [18:17:21] !log dduvall@deploy1002 Started scap: testwikis wikis to 1.40.0-wmf.23 refs T325586 [18:17:25] T325586: 1.40.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T325586 [18:17:44] ok [18:18:56] (03CR) 10CI reject: [V: 04-1] sre.puppet.sync-netbox-hiera: Add network data to the hiera files [cookbooks] - 10https://gerrit.wikimedia.org/r/889195 (owner: 10Jbond) [18:20:05] PROBLEM - Host mc-gp1002 is DOWN: PING CRITICAL - Packet loss = 100% [18:24:49] RECOVERY - Host mc-gp1002 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [18:24:57] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['mc-gp1002'] [18:29:41] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['mc-gp1002'] [18:30:32] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T329595 (10Papaul) @Marostegui we are getting the error below on the interface where db2099 is connected. It might be a bad cable or bad port or something else. I tried clearing the statistics last week end but this came back up. I will pi... 
[18:31:09] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: Migrate sre.switchdc.mediawiki to spicerack class API - https://phabricator.wikimedia.org/T328908 (10Krinkle) [18:31:20] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: March 2023 Datacenter Switchover eqiad pooling schedule - https://phabricator.wikimedia.org/T328903 (10Krinkle) [18:31:30] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: Ensure sre.switchdc.mediawiki live test multi-DC compatibility - https://phabricator.wikimedia.org/T329065 (10Krinkle) [18:31:41] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover Excluded services - https://phabricator.wikimedia.org/T329193 (10Krinkle) [18:31:58] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: Post March 2023 Datacenter Switchover Tasks - https://phabricator.wikimedia.org/T328907 (10Krinkle) [18:32:08] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: March 2023 Datacenter Switchover Blockers - https://phabricator.wikimedia.org/T328770 (10Krinkle) [18:32:17] 10SRE, 10Data-Persistence, 10cloud-services-team, 10serviceops, and 2 others: Wikitech issues for datacentre switchover (March 2023) - https://phabricator.wikimedia.org/T328768 (10Krinkle) [18:32:35] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/889189 (https://phabricator.wikimedia.org/T329633) (owner: 10CDanis) [18:33:01] (03CR) 10JHathaway: [C: 03+1] "looks good" [deployment-charts] - 10https://gerrit.wikimedia.org/r/889194 (owner: 10CDanis) [18:33:30] Krinkle: Sorry, I should clean the tags up when creating subtask. It's kind of annoying that phabricator does that. [18:34:00] claime: aye, no problem. 
+1 at T239378 if you like :) [18:34:00] T239378: Disable parent task metadata by default for new sub tasks - https://phabricator.wikimedia.org/T239378 [18:34:13] 10SRE, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10JMeybohm) [18:34:30] fwiw, it is sometimes intended, but usually not indeed. I'd say it's easy enough to set them directly when creating the subtask if/when it is intended. [18:34:39] PROBLEM - Host mc-gp1002 is DOWN: PING CRITICAL - Packet loss = 100% [18:34:41] yea, I find myself just creating a task first and then linking it as subtask after the fact. because more often than not I dont want all the subscribers and tags [18:35:54] !log cdanis@cumin1001 START - Cookbook sre.k8s.upgrade-cluster Upgrade K8s version: upgrade to v1.23 [18:35:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST events) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:36:54] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dns4004.wikimedia.org with OS bullseye [18:37:03] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns4004.wikimedia.org with OS bullseye executed with errors: - dns4004 (**FAIL**) - Downtimed o... 
[18:37:20] !log cdanis@cumin1001 END (FAIL) - Cookbook sre.k8s.upgrade-cluster (exit_code=99) Upgrade K8s version: upgrade to v1.23 [18:37:33] RECOVERY - Host mc-gp1002 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [18:37:47] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['mc-gp1002'] [18:38:00] !log reimage dns4004 back to buster to resolve pdns-rec Prometheus endpoint issues: T321309 [18:38:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:04] T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 [18:38:14] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns4004.wikimedia.org with OS buster [18:38:24] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns4004.wikimedia.org with OS buster [18:39:14] (03PS1) 10Herron: pontoon: don't deploy benthos instances with prod config [puppet] - 10https://gerrit.wikimedia.org/r/889198 [18:39:43] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - aux-k8s-ctrl_6443: Servers aux-k8s-ctrl1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:40:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST events) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:41:58] (03CR) 10Cwhite: [C: 03+1] pontoon: don't deploy benthos instances with prod config [puppet] - 10https://gerrit.wikimedia.org/r/889198 (owner: 10Herron) [18:42:21] (03PS2) 10Herron: pontoon: don't deploy benthos instances with prod config [puppet] - 10https://gerrit.wikimedia.org/r/889198 
[18:42:29] (JobUnavailable) firing: (9) Reduced availability for job calico-felix in k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:43:27] (03CR) 10CDanis: [C: 03+2] sre.k8s.upgrade-cluster: wrap run_sync actions with try/except [cookbooks] - 10https://gerrit.wikimedia.org/r/889151 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [18:44:14] (JobUnavailable) firing: (12) Reduced availability for job calico-felix in k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:45:12] (03Merged) 10jenkins-bot: sre.k8s.upgrade-cluster: wrap run_sync actions with try/except [cookbooks] - 10https://gerrit.wikimedia.org/r/889151 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [18:45:25] (03PS1) 10Ssingh: P:dns::recursor: skip installation of prometheus-pdns-rec-exporter [puppet] - 10https://gerrit.wikimedia.org/r/889199 (https://phabricator.wikimedia.org/T321309) [18:46:29] PROBLEM - Host 2620:0:863:1:198:35:26:8 is DOWN: CRITICAL - Destination Unreachable (2620:0:863:1:198:35:26:8) [18:46:38] !log cdanis@cumin1001 START - Cookbook sre.k8s.upgrade-cluster Upgrade K8s version: upgrade to v1.23 [18:47:17] 2620:0:863:1:198:35:26:8 down is expected (dns4004) [18:47:37] (03CR) 10CDanis: [C: 03+2] role::aux_k8s: upgrade cluster settings for k8s 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/889189 (https://phabricator.wikimedia.org/T329633) (owner: 10CDanis) [18:47:41] thanks, I was just wondering why I have never seen a MAC there [18:47:53] eh, v6 IP of course [18:47:54] !log cdanis@cumin1001 END (FAIL) - Cookbook sre.k8s.upgrade-cluster (exit_code=99) Upgrade K8s version: upgrade to v1.23 [18:48:10] :) [18:48:58] 
(PS2) Ssingh: P:dns::recursor: skip installation of prometheus-pdns-rec-exporter [puppet] - https://gerrit.wikimedia.org/r/889199 (https://phabricator.wikimedia.org/T321309)
[18:49:58] (CR) Ssingh: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39597/console" [puppet] - https://gerrit.wikimedia.org/r/889199 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[18:51:16] (PS1) CDanis: k8s.upgrade-cluster: fix bug in re-enabling Puppet [cookbooks] - https://gerrit.wikimedia.org/r/889200
[18:51:56] (CR) CDanis: [C: +2] k8s.upgrade-cluster: fix bug in re-enabling Puppet [cookbooks] - https://gerrit.wikimedia.org/r/889200 (owner: CDanis)
[18:53:46] (Merged) jenkins-bot: k8s.upgrade-cluster: fix bug in re-enabling Puppet [cookbooks] - https://gerrit.wikimedia.org/r/889200 (owner: CDanis)
[18:54:15] RECOVERY - Host 2620:0:863:1:198:35:26:8 is UP: PING OK - Packet loss = 0%, RTA = 70.93 ms
[18:54:27] !log cdanis@cumin1001 START - Cookbook sre.k8s.upgrade-cluster Upgrade K8s version: upgrade to v1.23
[18:55:14] !log cdanis@cumin1001 START - Cookbook sre.ganeti.reimage for host aux-k8s-ctrl1001.eqiad.wmnet with OS bullseye
[18:55:49] (CR) Dzahn: ci: move lists of contint and zuul hosts to hieradata/common.yaml (1 comment) [puppet] - https://gerrit.wikimedia.org/r/850593 (owner: Dzahn)
[18:56:20] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns4004.wikimedia.org with reason: host reimage
[18:58:54] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns4004.wikimedia.org with reason: host reimage
[18:59:42] RECOVERY - Check systemd state on logstash1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:59:47] (PS1) Ahmon Dancy: Merge branch 'master' into train-dev [mediawiki-config] (train-dev) -
https://gerrit.wikimedia.org/r/889202
[19:00:05] dduvall and ^demon: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230214T1900).
[19:00:32] (PS1) Ssingh: dnsrecursor: enable webserver for bullseye installation of pdns-rec [puppet] - https://gerrit.wikimedia.org/r/889203 (https://phabricator.wikimedia.org/T321309)
[19:00:42] (CR) Ahmon Dancy: [C: +2] Merge branch 'master' into train-dev [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/889202 (owner: Ahmon Dancy)
[19:00:53] (CR) CI reject: [V: -1] dnsrecursor: enable webserver for bullseye installation of pdns-rec [puppet] - https://gerrit.wikimedia.org/r/889203 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[19:01:18] (Merged) jenkins-bot: Merge branch 'master' into train-dev [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/889202 (owner: Ahmon Dancy)
[19:01:38] (CR) Ssingh: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39598/console" [puppet] - https://gerrit.wikimedia.org/r/889203 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[19:01:50] (PS2) Ssingh: dnsrecursor: enable webserver for bullseye installation of pdns-rec [puppet] - https://gerrit.wikimedia.org/r/889203 (https://phabricator.wikimedia.org/T321309)
[19:01:58] (KubernetesCalicoDown) firing: aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Faux-k8s&var-instance=aux-k8s-ctrl1001.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[19:02:11] (CR) CI reject: [V: -1] dnsrecursor: enable webserver for bullseye installation of pdns-rec [puppet] - https://gerrit.wikimedia.org/r/889203
(https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[19:03:20] (CR) Dzahn: "Antoine said on IRC: "I think the change to the zuul_merger_hosts variable should not be in this change, but the jenkins_master_hosts sho" [puppet] - https://gerrit.wikimedia.org/r/850593 (owner: Dzahn)
[19:03:56] (CR) Dzahn: ci: move lists of contint and zuul hosts to hieradata/common.yaml (1 comment) [puppet] - https://gerrit.wikimedia.org/r/850593 (owner: Dzahn)
[19:06:22] !log cdanis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-ctrl1001.eqiad.wmnet with reason: host reimage
[19:06:53] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['mc-gp1003']
[19:07:03] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['mc-gp1003']
[19:07:23] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['mc-gp1003']
[19:07:29] (JobUnavailable) firing: (12) Reduced availability for job calico-felix in k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:08:24] !log dduvall@deploy1002 Finished scap: testwikis wikis to 1.40.0-wmf.23 refs T325586 (duration: 51m 03s)
[19:08:28] T325586: 1.40.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T325586
[19:09:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[19:09:26] !log cdanis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-ctrl1001.eqiad.wmnet with reason: host reimage
[19:09:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[19:09:52] (PS1) Bartosz Dziewoński:
Enable DiscussionTools on mobile at almost all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/889204 (https://phabricator.wikimedia.org/T328940)
[19:09:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[19:09:55] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - aux-k8s-ctrl_6443: Servers aux-k8s-ctrl1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[19:09:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[19:10:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2101.codfw.wmnet with reason: Maintenance
[19:10:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2101.codfw.wmnet with reason: Maintenance
[19:10:53] (PS3) Ssingh: dnsrecursor: enable webserver for bullseye installation of pdns-rec [puppet] - https://gerrit.wikimedia.org/r/889203 (https://phabricator.wikimedia.org/T321309)
[19:11:50] (CR) Ssingh: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39599/console" [puppet] - https://gerrit.wikimedia.org/r/889203 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[19:14:56] (PS6) Dzahn: ci: move lists of contint hosts to hieradata/common.yaml [puppet] - https://gerrit.wikimedia.org/r/850593
[19:15:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2111.codfw.wmnet with reason: Maintenance
[19:15:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2111.codfw.wmnet with reason: Maintenance
[19:15:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2111 (T328255)', diff saved to https://phabricator.wikimedia.org/P44628 and previous config
saved to /var/cache/conftool/dbconfig/20230214-191550-ladsgroup.json
[19:15:54] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[19:16:40] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/889199 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[19:17:05] Puppet, SRE, Infrastructure-Foundations, netbox, and 2 others: Netbox: use the netbox to also sync networks - https://phabricator.wikimedia.org/T329669 (jbond)
[19:17:10] !log upgrading firmware on mc-gp1003
[19:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:17:17] Puppet, SRE, Infrastructure-Foundations, netbox, and 2 others: Netbox: use the netbox to also sync networks - https://phabricator.wikimedia.org/T329669 (jbond) p:Triage→Medium
[19:19:23] Puppet, SRE, Infrastructure-Foundations, netbox, and 2 others: Netbox: use the netbox to also sync networks - https://phabricator.wikimedia.org/T329669 (jbond)
[19:20:12] (PS4) Ssingh: dnsrecursor: enable webserver for bullseye installation of pdns-rec [puppet] - https://gerrit.wikimedia.org/r/889203 (https://phabricator.wikimedia.org/T321309)
[19:20:47] PROBLEM - Host mc-gp1003 is DOWN: PING CRITICAL - Packet loss = 100%
[19:21:09] (CR) Ssingh: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39601/console" [puppet] - https://gerrit.wikimedia.org/r/889203 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[19:21:43] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[19:21:56] !log cdanis@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host aux-k8s-ctrl1001.eqiad.wmnet with OS bullseye
[19:22:00] (PS2) Jbond: sre.puppet.sync-netbox-hiera: Add network data to the hiera files [cookbooks] -
https://gerrit.wikimedia.org/r/889195
[19:22:06] !log cdanis@cumin1001 START - Cookbook sre.ganeti.reimage for host aux-k8s-ctrl1002.eqiad.wmnet with OS bullseye
[19:22:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T328255)', diff saved to https://phabricator.wikimedia.org/P44629 and previous config saved to /var/cache/conftool/dbconfig/20230214-192242-ladsgroup.json
[19:22:46] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[19:23:41] (CR) CI reject: [V: -1] sre.puppet.sync-netbox-hiera: Add network data to the hiera files [cookbooks] - https://gerrit.wikimedia.org/r/889195 (owner: Jbond)
[19:25:31] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['mc-gp1003']
[19:25:37] RECOVERY - Host mc-gp1003 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[19:26:57] Puppet, SRE, Infrastructure-Foundations, netbox, and 2 others: Netbox: use the netbox to also sync networks - https://phabricator.wikimedia.org/T329669 (jbond)
[19:27:13] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['mc-gp1003']
[19:28:59] we are out of disk space on deploy1002 :/
[19:29:08] in /srv
[19:29:14] (JobUnavailable) firing: (12) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:30:02] doh!
[19:31:04] PROBLEM - Host mc-gp1003 is DOWN: PING CRITICAL - Packet loss = 100%
[19:31:11] SRE, Diff-blog, Technical Blog, HTTPS: Send HSTS header on all Wordpress VIP-hosted domains - https://phabricator.wikimedia.org/T270034 (Varnent) a:Varnent→None
[19:31:31] !log cdanis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-ctrl1002.eqiad.wmnet with reason: host reimage
[19:31:48] !log scap sync-world failed due to lack of disk space on deploy1002 /srv (cc T325586)
[19:31:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:31:52] T325586: 1.40.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T325586
[19:31:58] (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[19:33:23] (CR) BCornwall: [C: +1] P:dns::recursor: skip installation of prometheus-pdns-rec-exporter [puppet] - https://gerrit.wikimedia.org/r/889199 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[19:33:47] !log running `docker system prune` on deploy1002 to free up disk space on /srv
[19:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:33:58] nooo
[19:34:00] dduvall: a huge chunk of space is used by "deployment.T307349" which seems like a copy of the normal deployment dir. so that ticket number is probably where we should comment
[19:34:01] T307349: Accidental removal of some files under /srv/deployment on deploy1002 - https://phabricator.wikimedia.org/T307349
[19:34:06] that'll make k8s build slow
[19:34:20] T307349
[19:34:27] scap is broken atm. seems like it's worth it?
[19:34:33] !log cdanis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-ctrl1002.eqiad.wmnet with reason: host reimage
[19:34:36] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['mc-gp1003']
[19:34:38] RECOVERY - Host mc-gp1003 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[19:34:38] SRE, Diff-blog, Technical Blog, HTTPS: Send HSTS header on all Wordpress VIP-hosted domains - https://phabricator.wikimedia.org/T270034 (Varnent) I believe it is in pipeline for any requests lingering - but probably best to check with @CKoerner_WMF for diff. For the other two - while not her dire...
[19:34:40] Definitely delete /srv/deployment.T307349
[19:34:40] reclaimable: 111G
[19:35:20] SRE, Deployments, bacula, Parsoid (Tracking), Release-Engineering-Team (Doing): Accidental removal of some files under /srv/deployment on deploy1002 - https://phabricator.wikimedia.org/T307349 (Dzahn) There is a directory "deployment.T307349" under /srv/ on deploy1002 that uses 47GB. And the...
[19:36:25] dduvall: Did you capture a `docker system df -v` ahead of time? I'd like to see it if so
[19:36:28] alright. but we should implement some docker clean up job or move its store elsewhere i think
[19:36:34] !log root@deploy1002:/srv# rm -rf deployment.T307349/
[19:36:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:36:37] i haven't pruned
[19:36:41] nod.. I added a note to self to run docker-gc on deploy1002
[19:36:52] /dev/mapper/vg0-srv 277G 216G 47G 83% /srv
[19:36:54] ah good. then I'll look myself.
[19:36:56] here you go
[19:37:01] https://www.irccloud.com/pastebin/f9tUZ4EU/
[19:37:09] mutante: thank you :)
[19:37:13] yw
[19:37:25] !log did not run `docker system prune` due to objections
[19:37:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:37:32] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[19:37:45] SRE, Deployments, bacula, Parsoid (Tracking), Release-Engineering-Team (Doing): Accidental removal of some files under /srv/deployment on deploy1002 - https://phabricator.wikimedia.org/T307349 (Dzahn) I deleted it. 19:36 < mutante> !log root@deploy1002:/srv# rm -rf deployment.T307349/ 19:36...
[19:37:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P44630 and previous config saved to /var/cache/conftool/dbconfig/20230214-193748-ladsgroup.json
[19:38:41] (PS1) Gehel: conftool / cirrus: elastic2069 in wrong LVS pool [puppet] - https://gerrit.wikimedia.org/r/889212 (https://phabricator.wikimedia.org/T329145)
[19:38:56] SRE, ops-codfw, ops-eqiad, DC-Ops, serviceops: Update iDRAC and NIC firmware on mc-gp* hosts - https://phabricator.wikimedia.org/T329323 (Papaul)
[19:39:15] SRE, ops-codfw, ops-eqiad, DC-Ops, serviceops: Update iDRAC and NIC firmware on mc-gp* hosts - https://phabricator.wikimedia.org/T329323 (Papaul) Open→Resolved a:Papaul @jijiki complete
[19:39:25] dancy: i'm confused though. wouldn't today's image be the base for subsequent builds?
[19:39:33] since it's `scap stage-train -Dfull_image_build:True --yes auto`
[19:39:45] oh right.. first train of the week
[19:39:48] ok. objection removed.
[19:39:53] :)
[19:40:17] we have space again. i'll leave it for now
[19:40:23] thx..
that'll make testing docker-gc easier
[19:40:31] (CR) Ryan Kemper: [C: +1] conftool / cirrus: elastic2069 in wrong LVS pool [puppet] - https://gerrit.wikimedia.org/r/889212 (https://phabricator.wikimedia.org/T329145) (owner: Gehel)
[19:40:40] w00t
[19:40:41] (CR) Gehel: [C: +2] conftool / cirrus: elastic2069 in wrong LVS pool [puppet] - https://gerrit.wikimedia.org/r/889212 (https://phabricator.wikimedia.org/T329145) (owner: Gehel)
[19:40:43] (CR) Bking: [C: +1] conftool / cirrus: elastic2069 in wrong LVS pool [puppet] - https://gerrit.wikimedia.org/r/889212 (https://phabricator.wikimedia.org/T329145) (owner: Gehel)
[19:41:36] (PS3) BCornwall: Remove aliases 'minnan' and 'zh-cfr' [dns] - https://gerrit.wikimedia.org/r/529829 (https://phabricator.wikimedia.org/T230382) (owner: Fomafix)
[19:41:37] !log dduvall@deploy1002 Started scap: testwikis wikis to 1.40.0-wmf.23 refs T325586
[19:41:40] T325586: 1.40.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T325586
[19:42:31] (CR) Dzahn: ci: move lists of contint hosts to hieradata/common.yaml (1 comment) [puppet] - https://gerrit.wikimedia.org/r/850593 (owner: Dzahn)
[19:42:36] dancy: fyi i'm running `stage-train` again (but without the full image build), just because... idempotence
[19:43:02] Sounds good.
[19:43:14] !log gehel@puppetmaster1001 conftool action : set/pooled=active; selector: name=elastic2069.cofdw.wmnet
[19:43:21] !log gehel@puppetmaster1001 conftool action : set/weight=10; selector: name=elastic2069.cofdw.wmnet
[19:45:02] !log gehel@puppetmaster1001 conftool action : set/weight=10; selector: name=elastic2069.cofdw.wmnet,service=elasticsearch-psi-ssl
[19:46:41] !log gehel@puppetmaster1001 conftool action : set/weight=10; selector: name=elastic2069.codfw.wmnet,service=elasticsearch-psi-ssl
[19:46:52] !log cdanis@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host aux-k8s-ctrl1002.eqiad.wmnet with OS bullseye
[19:46:53] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=elastic2069.codfw.wmnet,service=elasticsearch-psi-ssl
[19:47:29] (JobUnavailable) firing: (12) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:48:45] (CR) Volans: "post-merge comment" [cookbooks] - https://gerrit.wikimedia.org/r/889151 (https://phabricator.wikimedia.org/T327767) (owner: Elukey)
[19:50:51] !log dduvall@deploy1002 Finished scap: testwikis wikis to 1.40.0-wmf.23 refs T325586 (duration: 09m 14s)
[19:50:55] T325586: 1.40.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T325586
[19:51:38] RECOVERY - Disk space on deploy1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy1002&var-datasource=eqiad+prometheus/ops
[19:51:45] SRE, Diff-blog, Technical Blog, HTTPS: Send HSTS header on all Wordpress VIP-hosted domains - https://phabricator.wikimedia.org/T270034 (Sbenchagra) @Dzahn Thanks for following up. I am just seeing this ticket. I will follow up and get back to you.
[19:52:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P44631 and previous config saved to /var/cache/conftool/dbconfig/20230214-195255-ladsgroup.json
[19:53:03] !log dduvall@deploy1002 Pruned MediaWiki: 1.40.0-wmf.21 (duration: 02m 10s)
[19:53:20] (CR) Ssingh: [C: +1] Remove aliases 'minnan' and 'zh-cfr' [dns] - https://gerrit.wikimedia.org/r/529829 (https://phabricator.wikimedia.org/T230382) (owner: Fomafix)
[20:03:20] SRE, ops-eqsin, ops-ulsfo, DC-Ops: eqsin & ulsfo: new R450s drawing far more power than R440s (power over contracted caps in both sites) - https://phabricator.wikimedia.org/T328957 (wiki_willy) Tim's previous suggestion was from T315398. However, that applies primarily to mediawiki servers and w...
[20:04:35] (ConfdResourceFailed) firing: (64) confd resource _var_lib_gdnsd_discovery-apertium.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[20:04:43] (PS1) TrainBranchBot: group0 wikis to 1.40.0-wmf.23 [mediawiki-config] - https://gerrit.wikimedia.org/r/889217 (https://phabricator.wikimedia.org/T325586)
[20:04:45] (CR) TrainBranchBot: [C: +2] group0 wikis to 1.40.0-wmf.23 [mediawiki-config] - https://gerrit.wikimedia.org/r/889217 (https://phabricator.wikimedia.org/T325586) (owner: TrainBranchBot)
[20:05:20] (Merged) jenkins-bot: group0 wikis to 1.40.0-wmf.23 [mediawiki-config] - https://gerrit.wikimedia.org/r/889217 (https://phabricator.wikimedia.org/T325586) (owner: TrainBranchBot)
[20:06:15] (PS3) BCornwall: Remove aliases 'minnan' and 'zh-cfr' [puppet] - https://gerrit.wikimedia.org/r/529830 (https://phabricator.wikimedia.org/T230382) (owner: Fomafix)
[20:07:51] (CR) Dzahn: [C: +2] "valid ISO-639-3 code - https://iso639-3.sil.org/code/vro" [puppet] -
https://gerrit.wikimedia.org/r/527915 (https://phabricator.wikimedia.org/T31186) (owner: Fomafix)
[20:08:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T328255)', diff saved to https://phabricator.wikimedia.org/P44632 and previous config saved to /var/cache/conftool/dbconfig/20230214-200801-ladsgroup.json
[20:08:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2123.codfw.wmnet with reason: Maintenance
[20:08:05] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[20:08:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2123.codfw.wmnet with reason: Maintenance
[20:08:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2123 (T328255)', diff saved to https://phabricator.wikimedia.org/P44633 and previous config saved to /var/cache/conftool/dbconfig/20230214-200822-ladsgroup.json
[20:09:29] (CR) Vgutierrez: service::catalog: add prometheus-https (1 comment) [puppet] - https://gerrit.wikimedia.org/r/863380 (https://phabricator.wikimedia.org/T301944) (owner: Herron)
[20:09:35] (ConfdResourceFailed) firing: (68) confd resource _var_lib_gdnsd_discovery-apertium.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[20:09:46] !log bking@cumin1001 START - Cookbook sre.ganeti.reimage for host an-airflow1005.eqiad.wmnet with OS bullseye
[20:10:26] (CR) BCornwall: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39604/console" [puppet] - https://gerrit.wikimedia.org/r/529830 (https://phabricator.wikimedia.org/T230382) (owner: Fomafix)
[20:11:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T328255)', diff saved to
https://phabricator.wikimedia.org/P44634 and previous config saved to /var/cache/conftool/dbconfig/20230214-201114-ladsgroup.json
[20:11:55] (CR) BCornwall: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39605/console" [puppet] - https://gerrit.wikimedia.org/r/529830 (https://phabricator.wikimedia.org/T230382) (owner: Fomafix)
[20:12:18] (CR) Dzahn: [C: +1] "valid ISO-639-3 code - https://www.ethnologue.com/language/sgs" [dns] - https://gerrit.wikimedia.org/r/481539 (https://phabricator.wikimedia.org/T204830) (owner: Fomafix)
[20:12:34] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.40.0-wmf.23 refs T325586
[20:12:38] T325586: 1.40.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T325586
[20:12:42] (CR) Dzahn: [C: +2] Add 'sgs' as alias for 'bat-smg' [dns] - https://gerrit.wikimedia.org/r/481539 (https://phabricator.wikimedia.org/T204830) (owner: Fomafix)
[20:12:46] (PS5) Dzahn: Add 'sgs' as alias for 'bat-smg' [dns] - https://gerrit.wikimedia.org/r/481539 (https://phabricator.wikimedia.org/T204830) (owner: Fomafix)
[20:14:24] (CR) BCornwall: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39608/console" [puppet] - https://gerrit.wikimedia.org/r/529830 (https://phabricator.wikimedia.org/T230382) (owner: Fomafix)
[20:14:35] (ConfdResourceFailed) resolved: (68) confd resource _var_lib_gdnsd_discovery-apertium.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[20:15:01] (CR) Dzahn: "also bat-smg matches sgs per https://meta.wikimedia.org/wiki/Template:List_of_language_names_ordered_by_code" [dns] - https://gerrit.wikimedia.org/r/481539 (https://phabricator.wikimedia.org/T204830) (owner: Fomafix)
[20:17:13]
sukhe: authdns-update returns an error currently because dns4004 does not have /usr/sbin/gdnsd yet but is in the list of sync hosts
[20:17:19] yep
[20:17:23] (PS1) JHathaway: Purge unused kernels on boot [puppet] - https://gerrit.wikimedia.org/r/889219 (https://phabricator.wikimedia.org/T277011)
[20:17:24] trying to figure it out
[20:17:34] is it ok if I keep using it though?
[20:17:34] worst case, I will just remove it from there
[20:17:35] (ConfdResourceFailed) firing: (64) confd resource _var_lib_gdnsd_discovery-apertium.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[20:17:35] thanks
[20:17:49] yeah I think I am going to just remove it
[20:17:56] alright
[20:18:07] mutante: are you merging a dns change?
[20:18:13] if you can wait for a bit, please do
[20:18:15] SRE, Infrastructure-Foundations, Patch-For-Review, User-MoritzMuehlenhoff: Automated removal of obsolete kernels - https://phabricator.wikimedia.org/T277011 (jhathaway) a:jhathaway
[20:18:20] (PS1) Sbailey: Change linter maintenance scripts to use existing config varaibles [extensions/Linter] (wmf/1.40.0-wmf.23) - https://gerrit.wikimedia.org/r/889220 (https://phabricator.wikimedia.org/T329342)
[20:18:25] I will either resolve it or not and failing which just remove dns4004
[20:18:34] sukhe: I have like 10 of them :)
[20:18:38] I will wait
[20:19:04] thanks
[20:19:04] SRE, Data-Persistence, cloud-services-team, serviceops, and 2 others: Wikitech issues for datacentre switchover (March 2023) - https://phabricator.wikimedia.org/T328768 (Ladsgroup) Option 5 sounds good too, I think we can also reuse this solution in toolhub too (T329319)
[20:19:31] (CR) CI reject: [V: -1] Purge unused kernels on boot [puppet] - https://gerrit.wikimedia.org/r/889219 (https://phabricator.wikimedia.org/T277011) (owner: JHathaway)
[20:20:19]
!log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dns4004.wikimedia.org with OS buster
[20:20:32] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns4004.wikimedia.org with OS buster executed with errors: - dns4004 (**FAIL**) - Downtimed on...
[20:20:32] yeah this will take a while as it did last time, we have some weird Puppet dependency failure here
[20:20:36] SRE, Wikimedia-Apache-configuration, Wikimedia-Site-requests, Patch-For-Review: Temporarily redirect sgs.wikipedia.org to bat-smg.wikipedia.org until bat-smg->sgs move can be done - https://phabricator.wikimedia.org/T204830 (Dzahn) sgs.wikpedia.org has been added to DNS now sgs.wikipedia.org is...
[20:20:38] for now I am going to remove dns4004 from the list
[20:20:44] (CR) Sbailey: "I think this is correctly constructed" [extensions/Linter] (wmf/1.40.0-wmf.23) - https://gerrit.wikimedia.org/r/889220 (https://phabricator.wikimedia.org/T329342) (owner: Sbailey)
[20:21:01] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dns4004.wikimedia.org with reason: failure during reimaging
[20:21:05] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dns4004.wikimedia.org with reason: failure during reimaging
[20:21:24] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-airflow1005.eqiad.wmnet with reason: host reimage
[20:22:35] (ConfdResourceFailed) firing: (68) confd resource _var_lib_gdnsd_discovery-apertium.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[20:22:46] (CR) Arlolra: [C: +1] Change linter maintenance scripts to use existing config varaibles
[extensions/Linter] (wmf/1.40.0-wmf.23) - https://gerrit.wikimedia.org/r/889220 (https://phabricator.wikimedia.org/T329342) (owner: Sbailey)
[20:23:29] (CR) BCornwall: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39611/console" [puppet] - https://gerrit.wikimedia.org/r/529830 (https://phabricator.wikimedia.org/T230382) (owner: Fomafix)
[20:23:40] mutante: patch coming
[20:23:43] (PS2) JHathaway: Purge unused kernels on boot [puppet] - https://gerrit.wikimedia.org/r/889219 (https://phabricator.wikimedia.org/T277011)
[20:24:04] sukhe: thank you, no rush. but let me know when it's ok to sync again
[20:24:04] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-airflow1005.eqiad.wmnet with reason: host reimage
[20:24:26] I think should be OK
[20:24:29] (PS2) Dzahn: Add 'rup' as alias for 'roa-rup' [dns] - https://gerrit.wikimedia.org/r/527916 (https://phabricator.wikimedia.org/T17988) (owner: Fomafix)
[20:24:29] but let's make it clean
[20:25:17] (CR) BCornwall: [V: +1 C: +1] "Sorry for the braindead PCC operations."
[puppet] - https://gerrit.wikimedia.org/r/529830 (https://phabricator.wikimedia.org/T230382) (owner: Fomafix)
[20:25:31] (PS1) Ssingh: hiera: temporarily remove references to dns4004 [puppet] - https://gerrit.wikimedia.org/r/889221 (https://phabricator.wikimedia.org/T321309)
[20:25:50] (CR) CI reject: [V: -1] Purge unused kernels on boot [puppet] - https://gerrit.wikimedia.org/r/889219 (https://phabricator.wikimedia.org/T277011) (owner: JHathaway)
[20:26:20] (CR) Ssingh: [C: +2] hiera: temporarily remove references to dns4004 [puppet] - https://gerrit.wikimedia.org/r/889221 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[20:26:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P44635 and previous config saved to /var/cache/conftool/dbconfig/20230214-202620-ladsgroup.json
[20:27:11] (CR) Dzahn: "needs deployment from serviceops team (might need apache restarts across cluster and syncing to k8s)" [puppet] - https://gerrit.wikimedia.org/r/529830 (https://phabricator.wikimedia.org/T230382) (owner: Fomafix)
[20:27:29] (KubernetesCalicoDown) firing: (4) ml-staging-ctrl2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[20:27:47] (CR) Dzahn: [C: +1] "valid per https://iso639-3.sil.org/code/rup | matches https://meta.wikimedia.org/wiki/Template:List_of_language_names_ordered_by_code" [dns] - https://gerrit.wikimedia.org/r/527916 (https://phabricator.wikimedia.org/T17988) (owner: Fomafix)
[20:29:48] (CR) Dzahn: [C: +1] "valid code per https://iso639-3.sil.org/code/egl but not yet in https://meta.wikimedia.org/wiki/Template:List_of_language_names_ordered_by" [dns] - https://gerrit.wikimedia.org/r/527932 (https://phabricator.wikimedia.org/T36217) (owner: Fomafix)
[20:30:25] (PS4) BCornwall:
Remove aliases 'minnan' and 'zh-cfr' [dns] - 10https://gerrit.wikimedia.org/r/529829 (https://phabricator.wikimedia.org/T230382) (owner: 10Fomafix) [20:30:38] (03CR) 10Dzahn: [C: 03+1] "valid per https://iso639-3.sil.org/code/cbk but not yet in https://meta.wikimedia.org/wiki/Template:List_of_language_names_ordered_by_code" [dns] - 10https://gerrit.wikimedia.org/r/527911 (https://phabricator.wikimedia.org/T124657) (owner: 10Fomafix) [20:31:36] (03CR) 10BCornwall: [V: 03+1 C: 03+2] Remove aliases 'minnan' and 'zh-cfr' [dns] - 10https://gerrit.wikimedia.org/r/529829 (https://phabricator.wikimedia.org/T230382) (owner: 10Fomafix) [20:33:00] (03CR) 10Herron: [V: 03+1] service::catalog: add prometheus-https (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/863380 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron) [20:33:07] (03CR) 10Dzahn: [C: 03+1] "valid per https://iso639-3.sil.org/code/bho but bho not yet a comment with "bh" in https://meta.wikimedia.org/wiki/Template:List_of_langua" [dns] - 10https://gerrit.wikimedia.org/r/528781 (https://phabricator.wikimedia.org/T41968) (owner: 10Fomafix) [20:34:28] sukhe: I already merged the dns changes and was running authdns-update (dns4004) failed with no such file for /usr/sbin/gdnsd [20:34:52] yeah [20:35:16] (03CR) 10Dzahn: [C: 03+1] "valid per https://iso639-3.sil.org/code/nrf" [dns] - 10https://gerrit.wikimedia.org/r/527908 (https://phabricator.wikimedia.org/T25216) (owner: 10Fomafix) [20:35:19] should be OK, just that good if others don't come across it. 
creates confusion :) [20:35:24] removed it from the list [20:35:31] mutante: please feel free to go ahead [20:35:54] sukhe: thanks:) [20:36:02] thanks for waiting and sorry [20:37:16] RECOVERY - Check systemd state on an-airflow1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:37:35] (ConfdResourceFailed) resolved: (68) confd resource _var_lib_gdnsd_discovery-apertium.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [20:39:15] (03CR) 10BCornwall: [V: 03+1 C: 03+1] "@RLazarus and @joe, I'm told you should handle the merging of this. If that's true, would you be kind enough to do that? 😊" [puppet] - 10https://gerrit.wikimedia.org/r/529830 (https://phabricator.wikimedia.org/T230382) (owner: 10Fomafix) [20:39:38] no worries at all sukhe :)) [20:39:44] (03CR) 10Dzahn: [C: 03+2] Add 'rup' as alias for 'roa-rup' [dns] - 10https://gerrit.wikimedia.org/r/527916 (https://phabricator.wikimedia.org/T17988) (owner: 10Fomafix) [20:39:48] (03PS3) 10Dzahn: Add 'rup' as alias for 'roa-rup' [dns] - 10https://gerrit.wikimedia.org/r/527916 (https://phabricator.wikimedia.org/T17988) (owner: 10Fomafix) [20:41:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P44636 and previous config saved to /var/cache/conftool/dbconfig/20230214-204126-ladsgroup.json [20:43:14] 10SRE, 10DNS, 10Traffic-Icebox, 10Wikimedia-Apache-configuration, and 2 others: Remove aliases `minnan` and `zh-cfr` for the Min Nan Wikipedia - https://phabricator.wikimedia.org/T230382 (10BCornwall) DNS merged and deployed! Now just waiting for deployment of the appserver stuff. 
[20:44:38] RECOVERY - Recursive DNS on 2620:0:863:1:198:35:26:8 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [20:45:10] (03CR) 10Dzahn: [C: 03+2] Add 'bho' as alias for 'bh' [dns] - 10https://gerrit.wikimedia.org/r/528781 (https://phabricator.wikimedia.org/T41968) (owner: 10Fomafix) [20:45:12] (03PS2) 10Dzahn: Add 'bho' as alias for 'bh' [dns] - 10https://gerrit.wikimedia.org/r/528781 (https://phabricator.wikimedia.org/T41968) (owner: 10Fomafix) [20:46:14] RECOVERY - Recursive DNS on 198.35.26.8 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [20:46:56] (03PS3) 10JHathaway: Purge unused kernels on boot [puppet] - 10https://gerrit.wikimedia.org/r/889219 (https://phabricator.wikimedia.org/T277011) [20:49:45] (03PS4) 10Dzahn: Add 'nrf' as alias for 'nrm' [dns] - 10https://gerrit.wikimedia.org/r/527908 (https://phabricator.wikimedia.org/T25216) (owner: 10Fomafix) [20:50:21] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/889219 (https://phabricator.wikimedia.org/T277011) (owner: 10JHathaway) [20:50:49] (03CR) 10Dzahn: [C: 03+2] Add 'nrf' as alias for 'nrm' [dns] - 10https://gerrit.wikimedia.org/r/527908 (https://phabricator.wikimedia.org/T25216) (owner: 10Fomafix) [20:53:07] (03PS2) 10Dzahn: Add 'egl' as alias for 'eml' [dns] - 10https://gerrit.wikimedia.org/r/527932 (https://phabricator.wikimedia.org/T36217) (owner: 10Fomafix) [20:54:07] (03CR) 10Dzahn: [C: 03+2] Add 'egl' as alias for 'eml' [dns] - 10https://gerrit.wikimedia.org/r/527932 (https://phabricator.wikimedia.org/T36217) (owner: 10Fomafix) [20:54:29] (03PS1) 10Ahmon Dancy: mw-debug/values-traindev.yaml: Move mcrouter config into cache section [deployment-charts] - 10https://gerrit.wikimedia.org/r/889227 [20:56:03] (03PS2) 10Dzahn: Add 'cbk' as alias for 'cbk-zam' [dns] - 10https://gerrit.wikimedia.org/r/527911 (https://phabricator.wikimedia.org/T124657) (owner: 10Fomafix) [20:56:33] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db2123 (T328255)', diff saved to https://phabricator.wikimedia.org/P44637 and previous config saved to /var/cache/conftool/dbconfig/20230214-205633-ladsgroup.json [20:56:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2128.codfw.wmnet with reason: Maintenance [20:56:37] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [20:56:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2128.codfw.wmnet with reason: Maintenance [20:56:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [20:57:01] (03CR) 10Dzahn: [C: 03+2] Add 'cbk' as alias for 'cbk-zam' [dns] - 10https://gerrit.wikimedia.org/r/527911 (https://phabricator.wikimedia.org/T124657) (owner: 10Fomafix) [20:57:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [20:57:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2128 (T328255)', diff saved to https://phabricator.wikimedia.org/P44638 and previous config saved to /var/cache/conftool/dbconfig/20230214-205709-ladsgroup.json [21:00:00] (03CR) 10Ahmon Dancy: [C: 03+2] mw-debug/values-traindev.yaml: Move mcrouter config into cache section [deployment-charts] - 10https://gerrit.wikimedia.org/r/889227 (owner: 10Ahmon Dancy) [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Time to snap out of that daydream and deploy UTC late backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230214T2100). [21:00:05] danisztls, nray, and sbailey: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[21:00:32] o/ [21:00:39] I am here [21:01:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T328255)', diff saved to https://phabricator.wikimedia.org/P44639 and previous config saved to /var/cache/conftool/dbconfig/20230214-210102-ladsgroup.json [21:05:18] (03Merged) 10jenkins-bot: mw-debug/values-traindev.yaml: Move mcrouter config into cache section [deployment-charts] - 10https://gerrit.wikimedia.org/r/889227 (owner: 10Ahmon Dancy) [21:05:53] (03PS8) 10Herron: service::catalog: add prometheus-https [puppet] - 10https://gerrit.wikimedia.org/r/863380 (https://phabricator.wikimedia.org/T301944) [21:07:29] (JobUnavailable) firing: (11) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:07:59] (03CR) 10Herron: [V: 03+1] "PCC SUCCESS (NOOP 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39612/console" [puppet] - 10https://gerrit.wikimedia.org/r/863380 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron) [21:09:58] (KubernetesRsyslogDown) firing: rsyslog on aux-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=aux-k8s-ctrl1001 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:11:21] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/889198 (owner: 10Herron) [21:11:45] (03CR) 10Herron: [V: 03+1] service::catalog: add prometheus-https (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/863380 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron) [21:13:57] Is anyone available to do the UTC late backport window? 
[21:16:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P44640 and previous config saved to /var/cache/conftool/dbconfig/20230214-211608-ladsgroup.json [21:16:25] TheresNoTime: poke, you around? [21:16:55] I can help w/ deploys if nobody shows up. [21:17:28] dancy: I'd assume 15 minutes is a long enough delay [21:17:55] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install ms-be207[0-3] - https://phabricator.wikimedia.org/T326352 (10Jhancock.wm) a:03Jhancock.wm [21:18:02] Fair enough. [21:19:25] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2440.codfw.wmnet with OS buster [21:19:33] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2440.codfw.wmnet with OS buster [21:20:11] !log dancy@deploy1002 Backport cancelled. [21:20:23] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by dancy@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888764 (https://phabricator.wikimedia.org/T328692) (owner: 10Bernard Wang) [21:20:50] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2441.codfw.wmnet with OS buster [21:21:01] (03Merged) 10jenkins-bot: Enable Page Tools for logged in users across all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888764 (https://phabricator.wikimedia.org/T328692) (owner: 10Bernard Wang) [21:21:01] danisztls are you around? 
[21:21:01] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2441.codfw.wmnet with OS buster [21:21:29] !log dancy@deploy1002 Started scap: Backport for [[gerrit:888764|Enable Page Tools for logged in users across all wikis (T328692)]] [21:21:33] T328692: Enable page tools everywhere for logged in users - https://phabricator.wikimedia.org/T328692 [21:22:29] (PuppetCertificateAboutToExpire) firing: (2) Puppet CA certificate labstore1006.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [21:23:20] !log dancy@deploy1002 dancy and bwang: Backport for [[gerrit:888764|Enable Page Tools for logged in users across all wikis (T328692)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [21:23:49] nray: ^^ That one is yours.. Ready for testing on mwdebug [21:24:01] Great, thanks @dancy . Looking now [21:28:08] @dancy looks good! 
You can proceed [21:28:45] Proceeding [21:30:44] (03PS1) 10Andrea Denisse: quickdatacopy: Add option to show progress during transfer [puppet] - 10https://gerrit.wikimedia.org/r/889231 (https://phabricator.wikimedia.org/T318778) [21:31:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P44642 and previous config saved to /var/cache/conftool/dbconfig/20230214-213115-ladsgroup.json [21:32:19] (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39614/console" [puppet] - 10https://gerrit.wikimedia.org/r/889231 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse) [21:33:28] (03PS1) 10Andrew Bogott: OpenStack nova: increase nova-api workers per node from 2 to 6 [puppet] - 10https://gerrit.wikimedia.org/r/889235 (https://phabricator.wikimedia.org/T328155) [21:33:44] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 91, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:33:50] RECOVERY - BFD status on cr3-ulsfo is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:34:00] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:34:18] (03PS2) 10Andrew Bogott: OpenStack nova: increase nova-api workers per node from 2 to 6 [puppet] - 10https://gerrit.wikimedia.org/r/889235 (https://phabricator.wikimedia.org/T328155) [21:34:20] !log dancy@deploy1002 Finished scap: Backport for [[gerrit:888764|Enable Page Tools for logged in users across all wikis (T328692)]] (duration: 12m 50s) [21:34:24] T328692: Enable page tools everywhere for logged in users - https://phabricator.wikimedia.org/T328692 [21:34:26] nray: All done [21:34:31] (03PS3) 10Andrew Bogott: OpenStack nova: increase nova-api workers per node from 
2 to 8 [puppet] - 10https://gerrit.wikimedia.org/r/889235 (https://phabricator.wikimedia.org/T328155) [21:34:37] sbailey: You're next [21:34:40] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 107, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:34:41] thanks so much @dancy ! [21:34:42] ok [21:34:50] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10Jhancock.wm) a:05Papaul→03Jhancock.wm [21:34:58] (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:34:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:35:13] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack nova: increase nova-api workers per node from 2 to 8 [puppet] - 10https://gerrit.wikimedia.org/r/889235 (https://phabricator.wikimedia.org/T328155) (owner: 10Andrew Bogott) [21:35:33] !log dancy@deploy1002 Backport cancelled. [21:36:17] sbailey: I notice there is an extension.json file in the commit. That may imply that there's a certain sync order required. [21:36:46] BTW, 889220 updates maintenance code that is only run manually. 
No need to worry about extension.json order [21:36:59] ok great [21:37:08] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by dancy@deploy1002 using scap backport" [extensions/Linter] (wmf/1.40.0-wmf.23) - 10https://gerrit.wikimedia.org/r/889220 (https://phabricator.wikimedia.org/T329342) (owner: 10Sbailey) [21:37:17] This is job-executed code, so sync after merge; no way to test in a timely fashion [21:39:08] (03PS2) 10Andrea Denisse: quickdatacopy: Add option to show progress during transfer [puppet] - 10https://gerrit.wikimedia.org/r/889231 (https://phabricator.wikimedia.org/T329683) [21:39:20] (03Merged) 10jenkins-bot: Change linter maintenance scripts to use existing config varaibles [extensions/Linter] (wmf/1.40.0-wmf.23) - 10https://gerrit.wikimedia.org/r/889220 (https://phabricator.wikimedia.org/T329342) (owner: 10Sbailey) [21:39:36] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2440.codfw.wmnet with reason: host reimage [21:39:48] !log dancy@deploy1002 Started scap: Backport for [[gerrit:889220|Change linter maintenance scripts to use existing config varaibles (T329342)]] [21:39:51] T329342: Enable maintenance Linter data migration scripts for namespace and tag and template - https://phabricator.wikimedia.org/T329342 [21:39:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:40:51] (03CR) 10Andrea Denisse: "I added the rationale for this change in the task. 
😊" [puppet] - 10https://gerrit.wikimedia.org/r/889231 (https://phabricator.wikimedia.org/T329683) (owner: 10Andrea Denisse) [21:41:34] !log dancy@deploy1002 dancy and sbailey: Backport for [[gerrit:889220|Change linter maintenance scripts to use existing config varaibles (T329342)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [21:41:52] continuing [21:42:28] Thank you @Dancy :-) [21:42:40] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2440.codfw.wmnet with reason: host reimage [21:44:22] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2441.codfw.wmnet with reason: host reimage [21:46:07] (03PS1) 10Andrea Denisse: centrallog: Show transfer progress when using quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/889239 (https://phabricator.wikimedia.org/T318778) [21:46:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T328255)', diff saved to https://phabricator.wikimedia.org/P44643 and previous config saved to /var/cache/conftool/dbconfig/20230214-214621-ladsgroup.json [21:46:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2137.codfw.wmnet with reason: Maintenance [21:46:26] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [21:46:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2137.codfw.wmnet with reason: Maintenance [21:46:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3315 (T328255)', diff saved to https://phabricator.wikimedia.org/P44644 and previous config saved to /var/cache/conftool/dbconfig/20230214-214642-ladsgroup.json [21:47:29] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2441.codfw.wmnet with reason: host reimage [21:47:32] !log dancy@deploy1002 Finished 
scap: Backport for [[gerrit:889220|Change linter maintenance scripts to use existing config varaibles (T329342)]] (duration: 07m 44s) [21:47:36] T329342: Enable maintenance Linter data migration scripts for namespace and tag and template - https://phabricator.wikimedia.org/T329342 [21:48:09] Great, thanks @dancy [21:48:18] No problem. [21:48:43] (03CR) 10Andrea Denisse: "Hello, this task is possibly going to fail CI because a prerequisite for it is merging patch #889231." [puppet] - 10https://gerrit.wikimedia.org/r/889239 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse) [21:48:52] Last call for danisztls ! [21:53:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T328255)', diff saved to https://phabricator.wikimedia.org/P44645 and previous config saved to /var/cache/conftool/dbconfig/20230214-215351-ladsgroup.json [21:53:55] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [21:57:29] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:57:33] (03CR) 10JHathaway: "kindly review!" [puppet] - 10https://gerrit.wikimedia.org/r/889219 (https://phabricator.wikimedia.org/T277011) (owner: 10JHathaway) [21:59:05] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [22:03:07] 10SRE, 10Wikimedia-Interwiki-links, 10Patch-For-Review: Please add ISO code interwikis for non-standard language codes - https://phabricator.wikimedia.org/T23915 (10Dzahn) 05Resolved→03Open Seems like it can't be both "resolved" and "needs patches" / "aliases in ticket still missing" at the same time. 
[22:04:22] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [22:08:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P44646 and previous config saved to /var/cache/conftool/dbconfig/20230214-220857-ladsgroup.json [22:10:37] !log eoghan@cumin2002 START - Cookbook sre.ganeti.reimage for host aphlict2001.codfw.wmnet with OS bullseye [22:12:20] (03PS1) 10Krinkle: mc: Add new $wgWANObjectCache setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889245 (https://phabricator.wikimedia.org/T329680) [22:12:24] (03PS1) 10Krinkle: mc: Remove unused $wgWANObjectCaches and $wgMainWANCache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889246 (https://phabricator.wikimedia.org/T329680) [22:13:22] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "valid code per https://iso639-3.sil.org/code/cmn" [dns] - 10https://gerrit.wikimedia.org/r/528831 (https://phabricator.wikimedia.org/T23915) (owner: 10Fomafix) [22:13:26] (03PS3) 10Dzahn: Add 'cmn' as alias for 'zh' [dns] - 10https://gerrit.wikimedia.org/r/528831 (https://phabricator.wikimedia.org/T23915) (owner: 10Fomafix) [22:23:56] 10SRE, 10Wikimedia-Interwiki-links, 10Patch-For-Review: Please add ISO code interwikis for non-standard language codes - https://phabricator.wikimedia.org/T23915 (10Dzahn) nan = zh-min-nan - already there since https://gerrit.wikimedia.org/r/c/operations/dns/+/479890 vro = fiu-vro - added in https://gerrit.w... 
[22:24:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P44647 and previous config saved to /var/cache/conftool/dbconfig/20230214-222403-ladsgroup.json [22:24:11] (03PS1) 10Brennen Bearnes: phabricator config: add gitlab_api_key [puppet] - 10https://gerrit.wikimedia.org/r/889248 (https://phabricator.wikimedia.org/T324149) [22:24:33] (03CR) 10CI reject: [V: 04-1] phabricator config: add gitlab_api_key [puppet] - 10https://gerrit.wikimedia.org/r/889248 (https://phabricator.wikimedia.org/T324149) (owner: 10Brennen Bearnes) [22:25:19] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [22:25:20] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2440.codfw.wmnet with OS buster [22:25:23] !log eoghan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on aphlict2001.codfw.wmnet with reason: host reimage [22:25:27] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2440.codfw.wmnet with OS buster completed: - mw2440 (**PASS**) - Removed from Pupp... 
[22:25:31] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [22:25:31] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2441.codfw.wmnet with OS buster [22:25:38] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2441.codfw.wmnet with OS buster completed: - mw2441 (**PASS**) - Removed from Pupp... [22:25:43] (03PS2) 10Dzahn: Add 'vro' as alias for 'fiu-vro' [dns] - 10https://gerrit.wikimedia.org/r/527914 (https://phabricator.wikimedia.org/T31186) (owner: 10Fomafix) [22:25:58] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2442.codfw.wmnet with OS buster [22:26:06] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2442.codfw.wmnet with OS buster [22:26:11] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2443.codfw.wmnet with OS buster [22:26:20] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2443.codfw.wmnet with OS buster [22:26:29] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "valid code per https://iso639-3.sil.org/code/vro | mapping also in https://meta.wikimedia.org/wiki/Template:List_of_language_names_ordered" [dns] - 10https://gerrit.wikimedia.org/r/527914 (https://phabricator.wikimedia.org/T31186) (owner: 10Fomafix) [22:26:42] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on an-airflow1005.eqiad.wmnet 
with reason: new OS but some puppet stuff doesn't work yet [22:26:45] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-airflow1005.eqiad.wmnet with reason: new OS but some puppet stuff doesn't work yet [22:28:04] 10SRE, 10Wikimedia-Interwiki-links, 10Patch-For-Review: Please add ISO code interwikis for non-standard language codes - https://phabricator.wikimedia.org/T23915 (10Dzahn) - cmn.wikipedia.org has been added to DNS - vro.wikipedia.org has been added to DNS [22:28:16] 10SRE, 10Wikimedia-Interwiki-links, 10Patch-For-Review: Please add ISO code interwikis for non-standard language codes - https://phabricator.wikimedia.org/T23915 (10Dzahn) 05Open→03Resolved [22:28:31] !log eoghan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aphlict2001.codfw.wmnet with reason: host reimage [22:35:40] (03PS1) 10Dzahn: add language 'gsw' for Alemannic, Alsatian, Swiss German [dns] - 10https://gerrit.wikimedia.org/r/889250 (https://phabricator.wikimedia.org/T6793) [22:39:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T328255)', diff saved to https://phabricator.wikimedia.org/P44648 and previous config saved to /var/cache/conftool/dbconfig/20230214-223910-ladsgroup.json [22:39:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2157.codfw.wmnet with reason: Maintenance [22:39:14] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [22:39:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2157.codfw.wmnet with reason: Maintenance [22:39:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2157 (T328255)', diff saved to https://phabricator.wikimedia.org/P44649 and previous config saved to /var/cache/conftool/dbconfig/20230214-223931-ladsgroup.json [22:39:50] !log eoghan@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage 
(exit_code=0) for host aphlict2001.codfw.wmnet with OS bullseye [22:41:15] * mutante works on Bugzilla tickets (authored by bzimport) [22:42:02] (03CR) 10Dzahn: [C: 03+2] add language 'gsw' for Alemannic, Alsatian, Swiss German [dns] - 10https://gerrit.wikimedia.org/r/889250 (https://phabricator.wikimedia.org/T6793) (owner: 10Dzahn) [22:42:34] mutante: doh, good ol' bugzy [22:43:46] when the bug number is 4 digit [22:44:45] just recently we had some place where it was about a regex for "T" followed by numbers and it almost became "at least 5 digits".. well here are the counter examples [22:45:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T328255)', diff saved to https://phabricator.wikimedia.org/P44650 and previous config saved to /var/cache/conftool/dbconfig/20230214-224519-ladsgroup.json [22:45:21] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2442.codfw.wmnet with reason: host reimage [22:45:23] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [22:46:22] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2443.codfw.wmnet with reason: host reimage [22:47:31] (03PS2) 10Brennen Bearnes: phabricator config: add gitlab_api_key [puppet] - 10https://gerrit.wikimedia.org/r/889248 (https://phabricator.wikimedia.org/T324149) [22:48:48] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2442.codfw.wmnet with reason: host reimage [22:50:34] (03PS1) 10Dzahn: add language code 'syc' for Syriac [dns] - 10https://gerrit.wikimedia.org/r/889254 (https://phabricator.wikimedia.org/T28725) [22:51:17] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2443.codfw.wmnet with reason: host reimage [22:52:38] (03CR) 10Dzahn: [C: 03+2] add language code 'syc' for Syriac [dns] - 10https://gerrit.wikimedia.org/r/889254 (https://phabricator.wikimedia.org/T28725) (owner: 
10Dzahn) [23:00:14] (03PS4) 10Dzahn: Add 'egl' as alias for 'eml' [puppet] - 10https://gerrit.wikimedia.org/r/527933 (https://phabricator.wikimedia.org/T36217) (owner: 10Fomafix) [23:00:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P44651 and previous config saved to /var/cache/conftool/dbconfig/20230214-230025-ladsgroup.json [23:03:25] (03CR) 10Dzahn: "do you need us to add an actual API key in private repo?" [puppet] - 10https://gerrit.wikimedia.org/r/889248 (https://phabricator.wikimedia.org/T324149) (owner: 10Brennen Bearnes) [23:03:33] !log cwhite@deploy1002 Started deploy [releng/phatality@eaa4c16]: T314098 [23:03:37] T314098: Update Phatality to reference ECS fields - https://phabricator.wikimedia.org/T314098 [23:05:17] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [23:06:19] (ProbeDown) firing: (2) Service kibana7:443 has failed probes (http_kibana7_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#kibana7:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:06:36] (03CR) 10Dzahn: "removing the new host contint2002 I had added meanwhile.. amending . 
https://puppet-compiler.wmflabs.org/output/850593/39616/doc1002.eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/850593 (owner: 10Dzahn) [23:06:49] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - kibana7_443: Servers logstash1032.eqiad.wmnet, logstash1031.eqiad.wmnet, logstash1025.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:06:58] hi, looking [23:07:05] PROBLEM - Check systemd state on logstash1025 is CRITICAL: CRITICAL - degraded: The following units failed: opensearch-dashboards.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:07:07] PROBLEM - Check systemd state on logstash2025 is CRITICAL: CRITICAL - degraded: The following units failed: opensearch-dashboards.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:07:08] o/ [23:07:09] 🧐 [23:07:10] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [23:07:11] * cwhite on it [23:07:15] PROBLEM - Check systemd state on logstash2024 is CRITICAL: CRITICAL - degraded: The following units failed: opensearch-dashboards.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:07:21] cwhite: ahh was just about to ask, thanks :) [23:07:30] deploy didn't go quite right [23:07:34] (ProbeDown) firing: (2) Service kibana7:443 has failed probes (http_kibana7_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kibana7:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:07:38] need a hand with anything? 
[23:07:45] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - kibana7_443: Servers logstash1032.eqiad.wmnet, logstash1031.eqiad.wmnet, logstash1025.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:07:49] PROBLEM - Check systemd state on logstash2031 is CRITICAL: CRITICAL - degraded: The following units failed: opensearch-dashboards.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:07:49] PROBLEM - Check systemd state on logstash1031 is CRITICAL: CRITICAL - degraded: The following units failed: opensearch-dashboards.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:08:02] * jhathaway here as well [23:08:13] PROBLEM - Check systemd state on logstash2032 is CRITICAL: CRITICAL - degraded: The following units failed: opensearch-dashboards.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:08:25] PROBLEM - Check systemd state on logstash2030 is CRITICAL: CRITICAL - degraded: The following units failed: opensearch-dashboards.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:08:31] PROBLEM - Check systemd state on logstash1032 is CRITICAL: CRITICAL - degraded: The following units failed: opensearch-dashboards.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:08:33] PROBLEM - Check systemd state on logstash2023 is CRITICAL: CRITICAL - degraded: The following units failed: opensearch-dashboards.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:09:06] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [23:09:07] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2443.codfw.wmnet with OS buster [23:09:14] (ProbeDown) firing: (2) Service kibana7:443 has failed 
probes (http_kibana7_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kibana7:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:09:14] SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2443.codfw.wmnet with OS buster completed: - mw2443 (**PASS**) - Removed from Pupp... [23:09:31] PROBLEM - Check systemd state on logstash1024 is CRITICAL: CRITICAL - degraded: The following units failed: opensearch-dashboards.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:10:03] RECOVERY - Check systemd state on logstash1032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:10:17] RECOVERY - Check systemd state on logstash2024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:10:37] (PS7) Dzahn: ci: move lists of contint hosts to hieradata/common.yaml [puppet] - https://gerrit.wikimedia.org/r/850593 [23:10:51] RECOVERY - Check systemd state on logstash2031 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:10:53] RECOVERY - Check systemd state on logstash1031 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:11:05] RECOVERY - Check systemd state on logstash1024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:11:17] RECOVERY - Check systemd state on logstash2032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:11:18] 
(ProbeDown) resolved: (2) Service kibana7:443 has failed probes (http_kibana7_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#kibana7:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:11:22] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [23:11:23] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2442.codfw.wmnet with OS buster [23:11:25] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:11:31] SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2442.codfw.wmnet with OS buster completed: - mw2442 (**PASS**) - Removed from Pupp... 
[23:11:33] RECOVERY - Check systemd state on logstash2030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:11:41] RECOVERY - Check systemd state on logstash2023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:11:45] RECOVERY - Check systemd state on logstash1025 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:11:47] RECOVERY - Check systemd state on logstash2025 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:12:25] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:12:43] (ProbeDown) resolved: (2) Service kibana7:443 has failed probes (http_kibana7_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kibana7:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:12:53] (CR) Brennen Bearnes: phabricator config: add gitlab_api_key (1 comment) [puppet] - https://gerrit.wikimedia.org/r/889248 (https://phabricator.wikimedia.org/T324149) (owner: Brennen Bearnes) [23:12:59] (CR) Dzahn: [V: +1 C: +1] "https://puppet-compiler.wmflabs.org/output/850593/39617/" [puppet] - https://gerrit.wikimedia.org/r/850593 (owner: Dzahn) [23:14:13] !log cwhite@deploy1002 Finished deploy [releng/phatality@eaa4c16]: T314098 (duration: 10m 40s) [23:14:17] T314098: Update Phatality to reference ECS fields - https://phabricator.wikimedia.org/T314098 [23:14:24] !log cwhite@deploy1002 Started deploy [releng/phatality@eaa4c16]: T314098 [23:14:32] !log cwhite@deploy1002 Finished deploy [releng/phatality@eaa4c16]: T314098 (duration: 00m 07s) [23:14:52] (CR) Dzahn: phabricator 
config: add gitlab_api_key (1 comment) [puppet] - https://gerrit.wikimedia.org/r/889248 (https://phabricator.wikimedia.org/T324149) (owner: Brennen Bearnes) [23:15:22] (CR) Dzahn: "@Eoghan see the "notification servers" line in https://gitlab.wikimedia.org/repos/phabricator/deployment/-/merge_requests/2/diffs . that r" [puppet] - https://gerrit.wikimedia.org/r/889248 (https://phabricator.wikimedia.org/T324149) (owner: Brennen Bearnes) [23:15:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P44652 and previous config saved to /var/cache/conftool/dbconfig/20230214-231531-ladsgroup.json [23:16:31] (CR) Dzahn: phabricator config: add gitlab_api_key (1 comment) [puppet] - https://gerrit.wikimedia.org/r/889248 (https://phabricator.wikimedia.org/T324149) (owner: Brennen Bearnes) [23:19:36] SRE, Traffic-Icebox, HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (SHust) @Dzahn, After I shared the issue I'm having with Brendan Campbell (who added himself as a subscriber) + a few other colleagues, he suggested that I ask you for help! For Shopi... [23:19:55] (PS3) Brennen Bearnes: phabricator config: add gitlab_api_key [puppet] - https://gerrit.wikimedia.org/r/889248 (https://phabricator.wikimedia.org/T324149) [23:22:21] (CR) Brennen Bearnes: phabricator config: add gitlab_api_key (1 comment) [puppet] - https://gerrit.wikimedia.org/r/889248 (https://phabricator.wikimedia.org/T324149) (owner: Brennen Bearnes) [23:29:26] SRE, Traffic-Icebox, HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (Dzahn) Hi @SHust let me clarify, so shopify is saying that store.wikimedia.org must have a AAAA record? The current status is that store.wikimedia.org is an alias for c.ssl.shopify.c... 
[23:30:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T328255)', diff saved to https://phabricator.wikimedia.org/P44653 and previous config saved to /var/cache/conftool/dbconfig/20230214-233037-ladsgroup.json [23:30:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2171.codfw.wmnet with reason: Maintenance [23:30:42] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [23:30:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2171.codfw.wmnet with reason: Maintenance [23:30:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3315 (T328255)', diff saved to https://phabricator.wikimedia.org/P44654 and previous config saved to /var/cache/conftool/dbconfig/20230214-233058-ladsgroup.json [23:31:58] (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [23:33:08] (PS1) Bartosz Dziewoński: persistRevisionThreadItems: Avoid listing non-discussion pages [extensions/DiscussionTools] (wmf/1.40.0-wmf.22) - https://gerrit.wikimedia.org/r/889267 (https://phabricator.wikimedia.org/T329627) [23:33:20] (PS1) Bartosz Dziewoński: persistRevisionThreadItems: Avoid listing non-discussion pages [extensions/DiscussionTools] (wmf/1.40.0-wmf.23) - https://gerrit.wikimedia.org/r/889268 (https://phabricator.wikimedia.org/T329627) [23:35:36] dancy: sry, I wasn't able today [23:37:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T328255)', diff saved to https://phabricator.wikimedia.org/P44655 and previous config saved to /var/cache/conftool/dbconfig/20230214-233705-ladsgroup.json [23:37:10] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255 [23:37:24] 
SRE, Traffic-Icebox, HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (SHust) @Dzahn, Here's a screenshot from my Shopify chat ( I hope I didn’t share anything I shouldn’t have): {F36845020} {F36845023} {F36845022} [23:38:09] (PS1) Cwhite: dashboards: sudo set noninteractive flag [puppet] - https://gerrit.wikimedia.org/r/888740 (https://phabricator.wikimedia.org/T329688) [23:41:45] (PS1) Arlolra: [WIP] Enabled native gallery editing in Parsoid [mediawiki-config] - https://gerrit.wikimedia.org/r/889257 (https://phabricator.wikimedia.org/T329662) [23:52:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P44656 and previous config saved to /var/cache/conftool/dbconfig/20230214-235211-ladsgroup.json [23:58:54] (PS1) Ladsgroup: [WIP] mwscript: Switch to use run.php [mediawiki-config] - https://gerrit.wikimedia.org/r/889259 (https://phabricator.wikimedia.org/T326800) [23:59:08] SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (Papaul) [23:59:11] PROBLEM - Disk space on thanos-be2003 is CRITICAL: DISK CRITICAL - free space: / 1871 MB (3% inode=98%): /tmp 1871 MB (3% inode=98%): /var/tmp 1871 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2003&var-datasource=codfw+prometheus/ops