[00:00:06] <icinga-wm>	 RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:00:08] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:00:47] <wikibugs>	 (PS1) Zabe: beta: Remove deployment-db10 [mediawiki-config] - https://gerrit.wikimedia.org/r/888814 (https://phabricator.wikimedia.org/T329577)
[00:01:09] <wikibugs>	 (PS2) Dzahn: serviceops-collab: switch alert severity to 'task' globally [puppet] - https://gerrit.wikimedia.org/r/888813 (https://phabricator.wikimedia.org/T329587)
[00:01:14] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2439.codfw.wmnet with reason: host reimage
[00:01:33] <icinga-wm>	 PROBLEM - Host mc-gp2003 is DOWN: PING CRITICAL - Packet loss = 100%
[00:01:39] <wikibugs>	 (CR) Zabe: [C: +2] beta: Remove deployment-db10 [mediawiki-config] - https://gerrit.wikimedia.org/r/888814 (https://phabricator.wikimedia.org/T329577) (owner: Zabe)
[00:02:18] <wikibugs>	 (Merged) jenkins-bot: beta: Remove deployment-db10 [mediawiki-config] - https://gerrit.wikimedia.org/r/888814 (https://phabricator.wikimedia.org/T329577) (owner: Zabe)
[00:02:59] <wikibugs>	 (CR) Dzahn: [C: +2] serviceops-collab: switch alert severity to 'task' globally [puppet] - https://gerrit.wikimedia.org/r/888813 (https://phabricator.wikimedia.org/T329587) (owner: Dzahn)
[00:03:27] <icinga-wm>	 PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: planet_sync_tile_generation-gis.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:03:27] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:04:07] <icinga-wm>	 PROBLEM - Disk space on thanos-be2002 is CRITICAL: DISK CRITICAL - free space: / 1886 MB (3% inode=98%): /tmp 1886 MB (3% inode=98%): /var/tmp 1886 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2002&var-datasource=codfw+prometheus/ops
[00:04:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T328255)', diff saved to https://phabricator.wikimedia.org/P44540 and previous config saved to /var/cache/conftool/dbconfig/20230214-000419-ladsgroup.json
[00:04:23] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[00:04:48] <wikibugs>	 (PS3) Dzahn: planet: add blackbox::check::http monitor [puppet] - https://gerrit.wikimedia.org/r/884390 (https://phabricator.wikimedia.org/T327977)
[00:04:49] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['mc-gp2003']
[00:05:37] <wikibugs>	 (CR) Dzahn: [C: +2] planet: add blackbox::check::http monitor [puppet] - https://gerrit.wikimedia.org/r/884390 (https://phabricator.wikimedia.org/T327977) (owner: Dzahn)
[00:06:05] <icinga-wm>	 RECOVERY - Host mc-gp2003 is UP: PING OK - Packet loss = 0%, RTA = 33.21 ms
[00:06:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126', diff saved to https://phabricator.wikimedia.org/P44541 and previous config saved to /var/cache/conftool/dbconfig/20230214-000629-marostegui.json
[00:06:59] <icinga-wm>	 PROBLEM - Disk space on thanos-be2003 is CRITICAL: DISK CRITICAL - free space: / 293 MB (0% inode=98%): /tmp 293 MB (0% inode=98%): /var/tmp 293 MB (0% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2003&var-datasource=codfw+prometheus/ops
[00:10:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P44542 and previous config saved to /var/cache/conftool/dbconfig/20230214-001053-marostegui.json
[00:13:52] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:17:14] <wikibugs>	 (PS3) Dzahn: doc: add blackbox::check::http monitor [puppet] - https://gerrit.wikimedia.org/r/884393 (https://phabricator.wikimedia.org/T329587)
[00:17:30] <wikibugs>	 SRE, ops-codfw, ops-eqiad, DC-Ops, serviceops: Update iDRAC and NIC firmware on mc-gp* hosts - https://phabricator.wikimedia.org/T329323 (Papaul)
[00:17:56] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:19:07] <wikibugs>	 (CR) CI reject: [V: -1] doc: add blackbox::check::http monitor [puppet] - https://gerrit.wikimedia.org/r/884393 (https://phabricator.wikimedia.org/T329587) (owner: Dzahn)
[00:21:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126 (T328817)', diff saved to https://phabricator.wikimedia.org/P44543 and previous config saved to /var/cache/conftool/dbconfig/20230214-002136-marostegui.json
[00:21:38] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1167.eqiad.wmnet with reason: Maintenance
[00:21:40] <stashbot>	 T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[00:21:51] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1167.eqiad.wmnet with reason: Maintenance
[00:21:53] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[00:22:08] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[00:22:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1167 (T328817)', diff saved to https://phabricator.wikimedia.org/P44544 and previous config saved to /var/cache/conftool/dbconfig/20230214-002214-marostegui.json
[00:22:40] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:22:41] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2439.codfw.wmnet with OS buster
[00:22:44] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:22:45] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2438.codfw.wmnet with OS buster
[00:22:47] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2439.codfw.wmnet with OS buster completed: - mw2439 (**PASS**)   - Removed from Pupp...
[00:22:51] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2438.codfw.wmnet with OS buster completed: - mw2438 (**PASS**)   - Removed from Pupp...
[00:23:13] <jinxer-wm>	 (KubernetesCalicoDown) firing: (4) ml-staging-ctrl2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[00:24:45] <icinga-wm>	 RECOVERY - Disk space on thanos-be2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2002&var-datasource=codfw+prometheus/ops
[00:25:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T329203)', diff saved to https://phabricator.wikimedia.org/P44545 and previous config saved to /var/cache/conftool/dbconfig/20230214-002559-marostegui.json
[00:26:01] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[00:26:03] <stashbot>	 T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[00:26:14] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[00:26:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T329203)', diff saved to https://phabricator.wikimedia.org/P44546 and previous config saved to /var/cache/conftool/dbconfig/20230214-002620-marostegui.json
[00:27:43] <icinga-wm>	 RECOVERY - Disk space on thanos-be2003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2003&var-datasource=codfw+prometheus/ops
[00:32:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T329203)', diff saved to https://phabricator.wikimedia.org/P44547 and previous config saved to /var/cache/conftool/dbconfig/20230214-003201-marostegui.json
[00:32:05] <stashbot>	 T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[00:34:18] <wikibugs>	 (PS4) Dzahn: doc: add blackbox::check::http monitor [puppet] - https://gerrit.wikimedia.org/r/884393 (https://phabricator.wikimedia.org/T329587)
[00:34:39] <wikibugs>	 (CR) CI reject: [V: -1] doc: add blackbox::check::http monitor [puppet] - https://gerrit.wikimedia.org/r/884393 (https://phabricator.wikimedia.org/T329587) (owner: Dzahn)
[00:35:23] <icinga-wm>	 PROBLEM - Check systemd state on logstash1023 is CRITICAL: CRITICAL - degraded: The following units failed: run-dashboards-backup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:35:55] <wikibugs>	 (PS1) Zabe: beta: Switch beta to read only on mediawiki level [mediawiki-config] - https://gerrit.wikimedia.org/r/888817
[00:36:20] <wikibugs>	 (PS2) Zabe: beta: Switch beta to read only on mediawiki level [mediawiki-config] - https://gerrit.wikimedia.org/r/888817
[00:36:27] <wikibugs>	 (CR) Zabe: [C: +2] beta: Switch beta to read only on mediawiki level [mediawiki-config] - https://gerrit.wikimedia.org/r/888817 (owner: Zabe)
[00:36:53] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by zabe@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/888817 (owner: Zabe)
[00:37:05] <wikibugs>	 (Merged) jenkins-bot: beta: Switch beta to read only on mediawiki level [mediawiki-config] - https://gerrit.wikimedia.org/r/888817 (owner: Zabe)
[00:39:04] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2440.codfw.wmnet with OS buster
[00:39:11] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2440.codfw.wmnet with OS buster
[00:40:51] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2441.codfw.wmnet with OS buster
[00:40:58] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2441.codfw.wmnet with OS buster
[00:42:42] <wikibugs>	 (CR) Cwhite: [C: +1] Add logs-api service [puppet] - https://gerrit.wikimedia.org/r/888700 (https://phabricator.wikimedia.org/T320702) (owner: Filippo Giunchedi)
[00:43:37] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2442.codfw.wmnet with OS buster
[00:43:46] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2442.codfw.wmnet with OS buster
[00:45:09] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:46:36] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2443.codfw.wmnet with OS buster
[00:46:46] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2443.codfw.wmnet with OS buster
[00:47:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P44548 and previous config saved to /var/cache/conftool/dbconfig/20230214-004707-marostegui.json
[00:50:25] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:50:29] <wikibugs>	 (PS5) Dzahn: doc: add blackbox::check::http monitor [puppet] - https://gerrit.wikimedia.org/r/884393 (https://phabricator.wikimedia.org/T329587)
[00:52:21] <wikibugs>	 (CR) CI reject: [V: -1] doc: add blackbox::check::http monitor [puppet] - https://gerrit.wikimedia.org/r/884393 (https://phabricator.wikimedia.org/T329587) (owner: Dzahn)
[01:00:57] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:01:18] <wikibugs>	 (PS1) Superpes15: [blkwiki] Add an alias for "SPECIAL:" Namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/888821 (https://phabricator.wikimedia.org/T317598)
[01:01:48] <wikibugs>	 (CR) CI reject: [V: -1] [blkwiki] Add an alias for "SPECIAL:" Namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/888821 (https://phabricator.wikimedia.org/T317598) (owner: Superpes15)
[01:02:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P44549 and previous config saved to /var/cache/conftool/dbconfig/20230214-010214-marostegui.json
[01:04:13] <wikibugs>	 (CR) Superpes15: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/888821 (https://phabricator.wikimedia.org/T317598) (owner: Superpes15)
[01:04:15] <wikibugs>	 (PS6) Dzahn: doc: add blackbox::check::http monitor [puppet] - https://gerrit.wikimedia.org/r/884393 (https://phabricator.wikimedia.org/T329587)
[01:04:38] <wikibugs>	 ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T329595 (phaultfinder)
[01:05:33] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:06:19] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:06:21] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:08:09] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49567 bytes in 7.597 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:08:59] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.228 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:15:23] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:17:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T329203)', diff saved to https://phabricator.wikimedia.org/P44550 and previous config saved to /var/cache/conftool/dbconfig/20230214-011720-marostegui.json
[01:17:22] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[01:17:25] <stashbot>	 T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[01:17:36] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[01:17:37] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[01:17:53] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[01:17:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T329203)', diff saved to https://phabricator.wikimedia.org/P44551 and previous config saved to /var/cache/conftool/dbconfig/20230214-011758-marostegui.json
[01:19:14] <jinxer-wm>	 (PuppetCertificateAboutToExpire) firing: (2) Puppet CA certificate labstore1006.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[01:20:49] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:22:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T328817)', diff saved to https://phabricator.wikimedia.org/P44552 and previous config saved to /var/cache/conftool/dbconfig/20230214-012230-marostegui.json
[01:22:34] <stashbot>	 T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[01:23:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T329203)', diff saved to https://phabricator.wikimedia.org/P44553 and previous config saved to /var/cache/conftool/dbconfig/20230214-012312-marostegui.json
[01:23:16] <stashbot>	 T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[01:32:06] <wikibugs>	 (PS6) Urbanecm: [tox] Make running `tox` work [mediawiki-config] - https://gerrit.wikimedia.org/r/887830 (https://phabricator.wikimedia.org/T329231)
[01:32:19] <wikibugs>	 (CR) Urbanecm: [tox] Make running `tox` work (3 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/887830 (https://phabricator.wikimedia.org/T329231) (owner: Urbanecm)
[01:35:22] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw2440.codfw.wmnet with OS buster
[01:35:29] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2440.codfw.wmnet with OS buster executed with errors: - mw2440 (**FAIL**)   - Remove...
[01:37:10] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw2441.codfw.wmnet with OS buster
[01:37:17] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2441.codfw.wmnet with OS buster executed with errors: - mw2441 (**FAIL**)   - Remove...
[01:37:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P44554 and previous config saved to /var/cache/conftool/dbconfig/20230214-013736-marostegui.json
[01:38:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P44555 and previous config saved to /var/cache/conftool/dbconfig/20230214-013818-marostegui.json
[01:39:55] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw2442.codfw.wmnet with OS buster
[01:40:01] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2442.codfw.wmnet with OS buster executed with errors: - mw2442 (**FAIL**)   - Remove...
[01:42:54] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw2443.codfw.wmnet with OS buster
[01:43:00] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2443.codfw.wmnet with OS buster executed with errors: - mw2443 (**FAIL**)   - Remove...
[01:51:44] <wikibugs>	 (CR) Urbanecm: "check experimental" [mediawiki-config] - https://gerrit.wikimedia.org/r/887830 (https://phabricator.wikimedia.org/T329231) (owner: Urbanecm)
[01:52:19] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2440.codfw.wmnet with OS buster
[01:52:26] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2440.codfw.wmnet with OS buster
[01:52:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P44556 and previous config saved to /var/cache/conftool/dbconfig/20230214-015242-marostegui.json
[01:53:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P44557 and previous config saved to /var/cache/conftool/dbconfig/20230214-015325-marostegui.json
[01:54:14] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[02:01:31] <logmsgbot>	 !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host mw2440.codfw.wmnet with OS buster
[02:01:38] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2440.codfw.wmnet with OS buster executed with errors: - mw2440 (**FAIL**)   - Remove...
[02:04:56] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (Papaul) @Jhancock.wm can you please take a look at mw244[0-3], it looks like you connected the network cable to NIC 2 and not NIC 1. Thank you
[02:07:27] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (Papaul)
[02:07:29] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job calico-felix in k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:07:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T328817)', diff saved to https://phabricator.wikimedia.org/P44558 and previous config saved to /var/cache/conftool/dbconfig/20230214-020748-marostegui.json
[02:07:50] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[02:07:53] <stashbot>	 T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[02:08:04] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[02:08:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T329203)', diff saved to https://phabricator.wikimedia.org/P44559 and previous config saved to /var/cache/conftool/dbconfig/20230214-020831-marostegui.json
[02:08:33] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[02:08:35] <stashbot>	 T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[02:08:46] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[02:08:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T329203)', diff saved to https://phabricator.wikimedia.org/P44560 and previous config saved to /var/cache/conftool/dbconfig/20230214-020852-marostegui.json
[02:11:50] <wikibugs>	 (PS4) Superpes15: [blkwiki] Add an alias for "SPECIAL:" namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/888821 (https://phabricator.wikimedia.org/T317598)
[02:12:05] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2444.codfw.wmnet with OS buster
[02:12:12] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2444.codfw.wmnet with OS buster
[02:13:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T329203)', diff saved to https://phabricator.wikimedia.org/P44561 and previous config saved to /var/cache/conftool/dbconfig/20230214-021358-marostegui.json
[02:14:02] <stashbot>	 T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[02:18:27] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2445.codfw.wmnet with OS buster
[02:18:35] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2445.codfw.wmnet with OS buster
[02:19:17] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2446.codfw.wmnet with OS buster
[02:19:25] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2446.codfw.wmnet with OS buster
[02:20:27] <wikibugs>	 (Abandoned) Superpes15: [blkwiki] Add an alias for "SPECIAL:" namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/888821 (https://phabricator.wikimedia.org/T317598) (owner: Superpes15)
[02:22:29] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job calico-felix in k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:29:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P44562 and previous config saved to /var/cache/conftool/dbconfig/20230214-022904-marostegui.json
[02:31:41] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2444.codfw.wmnet with reason: host reimage
[02:34:48] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2444.codfw.wmnet with reason: host reimage
[02:38:05] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2445.codfw.wmnet with reason: host reimage
[02:39:05] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2446.codfw.wmnet with reason: host reimage
[02:41:14] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2445.codfw.wmnet with reason: host reimage
[02:43:37] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2446.codfw.wmnet with reason: host reimage
[02:44:04] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2447.codfw.wmnet with OS buster
[02:44:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P44563 and previous config saved to /var/cache/conftool/dbconfig/20230214-024410-marostegui.json
[02:44:12] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2447.codfw.wmnet with OS buster
[02:52:08] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[02:53:30] <logmsgbot>	 !log tgr: Deployed security patch for T328643
[02:55:56] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[02:55:57] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2444.codfw.wmnet with OS buster
[02:56:04] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2444.codfw.wmnet with OS buster completed: - mw2444 (**PASS**)   - Removed from Pupp...
[02:57:51] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[02:58:40] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[02:59:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T329203)', diff saved to https://phabricator.wikimedia.org/P44564 and previous config saved to /var/cache/conftool/dbconfig/20230214-025917-marostegui.json
[02:59:19] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[02:59:21] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[02:59:21] <stashbot>	 T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[03:00:05] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230214T0300)
[03:03:26] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[03:03:40] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[03:03:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T329203)', diff saved to https://phabricator.wikimedia.org/P44565 and previous config saved to /var/cache/conftool/dbconfig/20230214-030345-marostegui.json
[03:04:15] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2447.codfw.wmnet with reason: host reimage
[03:04:28] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[03:04:29] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2446.codfw.wmnet with OS buster
[03:04:31] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[03:04:31] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2445.codfw.wmnet with OS buster
[03:04:35] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2446.codfw.wmnet with OS buster completed: - mw2446 (**PASS**)   - Removed from Pupp...
[03:04:40] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2445.codfw.wmnet with OS buster completed: - mw2445 (**PASS**)   - Removed from Pupp...
[03:05:54] <wikibugs>	 (PS1) Raymond Ndibe: puppet: improvements to replica_cnf_api functional tests [puppet] - https://gerrit.wikimedia.org/r/888827 (https://phabricator.wikimedia.org/T303663)
[03:07:03] <wikibugs>	 (PS1) Legoktm: gitlab_runner: Set pull_policy = ["always", "if-not-present"] [puppet] - https://gerrit.wikimedia.org/r/888828 (https://phabricator.wikimedia.org/T329216)
[03:07:29] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2447.codfw.wmnet with reason: host reimage
[03:07:44] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.40.0-wmf.23 [core] (wmf/1.40.0-wmf.23) - https://gerrit.wikimedia.org/r/888729 (https://phabricator.wikimedia.org/T325586)
[03:07:50] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.40.0-wmf.23 [core] (wmf/1.40.0-wmf.23) - https://gerrit.wikimedia.org/r/888729 (https://phabricator.wikimedia.org/T325586) (owner: TrainBranchBot)
[03:08:10] <wikibugs>	 (PS2) Legoktm: gitlab_runner: Set pull_policy = ["always", "if-not-present"] [puppet] - https://gerrit.wikimedia.org/r/888828 (https://phabricator.wikimedia.org/T329216)
[03:09:32] <icinga-wm>	 PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[03:10:28] <icinga-wm>	 RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[03:11:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T329203)', diff saved to https://phabricator.wikimedia.org/P44566 and previous config saved to /var/cache/conftool/dbconfig/20230214-031059-marostegui.json
[03:11:04] <stashbot>	 T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[03:22:43] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[03:22:49] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.40.0-wmf.23 [core] (wmf/1.40.0-wmf.23) - https://gerrit.wikimedia.org/r/888729 (https://phabricator.wikimedia.org/T325586) (owner: TrainBranchBot)
[03:26:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P44567 and previous config saved to /var/cache/conftool/dbconfig/20230214-032606-marostegui.json
[03:29:03] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[03:29:04] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2447.codfw.wmnet with OS buster
[03:29:11] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2447.codfw.wmnet with OS buster completed: - mw2447 (**PASS**)   - Removed from Pupp...
[03:41:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P44568 and previous config saved to /var/cache/conftool/dbconfig/20230214-034112-marostegui.json
[03:56:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T329203)', diff saved to https://phabricator.wikimedia.org/P44569 and previous config saved to /var/cache/conftool/dbconfig/20230214-035618-marostegui.json
[03:56:20] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[03:56:23] <stashbot>	 T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[03:56:33] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[03:56:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1181 (T329203)', diff saved to https://phabricator.wikimedia.org/P44570 and previous config saved to /var/cache/conftool/dbconfig/20230214-035639-marostegui.json
[03:58:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T329203)', diff saved to https://phabricator.wikimedia.org/P44571 and previous config saved to /var/cache/conftool/dbconfig/20230214-035852-marostegui.json
[03:59:03] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1172.eqiad.wmnet with reason: Maintenance
[03:59:17] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1172.eqiad.wmnet with reason: Maintenance
[03:59:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1172 (T328817)', diff saved to https://phabricator.wikimedia.org/P44572 and previous config saved to /var/cache/conftool/dbconfig/20230214-035922-marostegui.json
[03:59:26] <stashbot>	 T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[04:00:05] <jouncebot>	 Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230214T0400)
[04:06:30] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: train-presync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:13:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P44573 and previous config saved to /var/cache/conftool/dbconfig/20230214-041359-marostegui.json
[04:21:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T328817)', diff saved to https://phabricator.wikimedia.org/P44574 and previous config saved to /var/cache/conftool/dbconfig/20230214-042104-marostegui.json
[04:21:08] <stashbot>	 T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[04:24:14] <jinxer-wm>	 (KubernetesCalicoDown) firing: (4) ml-staging-ctrl2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[04:29:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P44575 and previous config saved to /var/cache/conftool/dbconfig/20230214-042905-marostegui.json
[04:36:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P44576 and previous config saved to /var/cache/conftool/dbconfig/20230214-043610-marostegui.json
[04:44:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T329203)', diff saved to https://phabricator.wikimedia.org/P44577 and previous config saved to /var/cache/conftool/dbconfig/20230214-044411-marostegui.json
[04:44:13] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1191.eqiad.wmnet with reason: Maintenance
[04:44:16] <stashbot>	 T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[04:44:26] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1191.eqiad.wmnet with reason: Maintenance
[04:44:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1191 (T329203)', diff saved to https://phabricator.wikimedia.org/P44578 and previous config saved to /var/cache/conftool/dbconfig/20230214-044432-marostegui.json
[04:47:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T329203)', diff saved to https://phabricator.wikimedia.org/P44579 and previous config saved to /var/cache/conftool/dbconfig/20230214-044745-marostegui.json
[04:51:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P44580 and previous config saved to /var/cache/conftool/dbconfig/20230214-045117-marostegui.json
[05:02:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P44581 and previous config saved to /var/cache/conftool/dbconfig/20230214-050252-marostegui.json
[05:06:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T328817)', diff saved to https://phabricator.wikimedia.org/P44582 and previous config saved to /var/cache/conftool/dbconfig/20230214-050623-marostegui.json
[05:06:25] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1177.eqiad.wmnet with reason: Maintenance
[05:06:27] <stashbot>	 T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[05:06:38] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1177.eqiad.wmnet with reason: Maintenance
[05:06:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1177 (T328817)', diff saved to https://phabricator.wikimedia.org/P44583 and previous config saved to /var/cache/conftool/dbconfig/20230214-050644-marostegui.json
[05:17:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P44584 and previous config saved to /var/cache/conftool/dbconfig/20230214-051758-marostegui.json
[05:22:29] <jinxer-wm>	 (PuppetCertificateAboutToExpire) firing: (2) Puppet CA certificate labstore1006.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[05:28:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T328817)', diff saved to https://phabricator.wikimedia.org/P44585 and previous config saved to /var/cache/conftool/dbconfig/20230214-052854-marostegui.json
[05:28:59] <stashbot>	 T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[05:33:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T329203)', diff saved to https://phabricator.wikimedia.org/P44586 and previous config saved to /var/cache/conftool/dbconfig/20230214-053304-marostegui.json
[05:33:06] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1194.eqiad.wmnet with reason: Maintenance
[05:33:09] <stashbot>	 T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[05:33:19] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1194.eqiad.wmnet with reason: Maintenance
[05:33:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1194 (T329203)', diff saved to https://phabricator.wikimedia.org/P44587 and previous config saved to /var/cache/conftool/dbconfig/20230214-053325-marostegui.json
[05:35:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T329203)', diff saved to https://phabricator.wikimedia.org/P44588 and previous config saved to /var/cache/conftool/dbconfig/20230214-053538-marostegui.json
[05:44:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P44589 and previous config saved to /var/cache/conftool/dbconfig/20230214-054400-marostegui.json
[05:50:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P44590 and previous config saved to /var/cache/conftool/dbconfig/20230214-055044-marostegui.json
[05:57:29] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[05:59:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P44591 and previous config saved to /var/cache/conftool/dbconfig/20230214-055906-marostegui.json
[06:05:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P44592 and previous config saved to /var/cache/conftool/dbconfig/20230214-060551-marostegui.json
[06:14:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T328817)', diff saved to https://phabricator.wikimedia.org/P44593 and previous config saved to /var/cache/conftool/dbconfig/20230214-061413-marostegui.json
[06:14:15] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1178.eqiad.wmnet with reason: Maintenance
[06:14:17] <stashbot>	 T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[06:14:28] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1178.eqiad.wmnet with reason: Maintenance
[06:14:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1178 (T328817)', diff saved to https://phabricator.wikimedia.org/P44594 and previous config saved to /var/cache/conftool/dbconfig/20230214-061434-marostegui.json
[06:20:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T329203)', diff saved to https://phabricator.wikimedia.org/P44595 and previous config saved to /var/cache/conftool/dbconfig/20230214-062057-marostegui.json
[06:20:59] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[06:21:01] <stashbot>	 T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[06:21:12] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[06:21:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1202 (T329203)', diff saved to https://phabricator.wikimedia.org/P44596 and previous config saved to /var/cache/conftool/dbconfig/20230214-062118-marostegui.json
[06:22:29] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job calico-felix in k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:36:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T328817)', diff saved to https://phabricator.wikimedia.org/P44597 and previous config saved to /var/cache/conftool/dbconfig/20230214-063617-marostegui.json
[06:36:22] <stashbot>	 T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[06:39:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T329203)', diff saved to https://phabricator.wikimedia.org/P44598 and previous config saved to /var/cache/conftool/dbconfig/20230214-063933-marostegui.json
[06:39:38] <stashbot>	 T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[06:40:25] <wikibugs>	 SRE, ops-eqiad, DBA, Infrastructure-Foundations: Test RAID monitoring on new RAID PERC 755 controllers - https://phabricator.wikimedia.org/T325046 (Marostegui) Open→Resolved Thanks everyone!
[06:41:17] <wikibugs>	 ops-codfw, DBA: db2181 crashed - https://phabricator.wikimedia.org/T328623 (Marostegui) Thanks @Jhancock.wm - I am starting the host again and I will close this task once it is repooled. The memory count also looks good from my side.
[06:48:55] <wikibugs>	 SRE, Data-Persistence, serviceops, Datacenter-Switchover: spicerack.mysql_legacy errors on get_core_masters_heartbeats when checking x2 - https://phabricator.wikimedia.org/T329533 (Marostegui) They should probably be skipped as we have two masters being written (one per DC) and they need to remai...
[06:51:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P44599 and previous config saved to /var/cache/conftool/dbconfig/20230214-065123-marostegui.json
[06:54:07] <wikibugs>	 (PS1) Marostegui: db1176: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/888987 (https://phabricator.wikimedia.org/T329478)
[06:54:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P44600 and previous config saved to /var/cache/conftool/dbconfig/20230214-065440-marostegui.json
[06:54:41] <wikibugs>	 (CR) Marostegui: [C: +2] db1176: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/888987 (https://phabricator.wikimedia.org/T329478) (owner: Marostegui)
[06:56:12] <wikibugs>	 (PS1) Marostegui: mariadb: Decommission db1099 [puppet] - https://gerrit.wikimedia.org/r/888988 (https://phabricator.wikimedia.org/T329181)
[06:57:05] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1099.eqiad.wmnet
[06:58:38] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Decommission db1099 [puppet] - https://gerrit.wikimedia.org/r/888988 (https://phabricator.wikimedia.org/T329181) (owner: Marostegui)
[07:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230214T0700)
[07:00:05] <jouncebot>	 kormat, marostegui, and Amir1: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Primary database switchover . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230214T0700).
[07:01:38] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.dns.netbox
[07:02:56] <icinga-wm>	 PROBLEM - BGP status on cr2-eqdfw is CRITICAL: BGP CRITICAL - AS6939/IPv4: Active - HE, AS6939/IPv6: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:06:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P44601 and previous config saved to /var/cache/conftool/dbconfig/20230214-070630-marostegui.json
[07:09:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P44602 and previous config saved to /var/cache/conftool/dbconfig/20230214-070946-marostegui.json
[07:10:45] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1099.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001"
[07:11:08] <wikibugs>	 ops-eqiad, DBA, decommission-hardware: decommission db1099.eqiad.wmnet - https://phabricator.wikimedia.org/T329181 (Marostegui) a:Marostegui→None
[07:12:11] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1099.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001"
[07:12:11] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:12:12] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1099.eqiad.wmnet
[07:12:16] <wikibugs>	 ops-eqiad, DBA, decommission-hardware: decommission db1099.eqiad.wmnet - https://phabricator.wikimedia.org/T329181 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db1099.eqiad.wmnet` - db1099.eqiad.wmnet (**WARN**)   - Downtimed host on Icinga/Alertm...
[07:12:28] <wikibugs>	 ops-eqiad, DBA, decommission-hardware: decommission db1099.eqiad.wmnet - https://phabricator.wikimedia.org/T329181 (Marostegui) This is ready for DC-Ops
[07:12:39] <wikibugs>	 ops-eqiad, decommission-hardware: decommission db1099.eqiad.wmnet - https://phabricator.wikimedia.org/T329181 (Marostegui)
[07:21:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T328817)', diff saved to https://phabricator.wikimedia.org/P44603 and previous config saved to /var/cache/conftool/dbconfig/20230214-072136-marostegui.json
[07:21:38] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1192.eqiad.wmnet with reason: Maintenance
[07:21:41] <stashbot>	 T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[07:21:51] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1192.eqiad.wmnet with reason: Maintenance
[07:21:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1192 (T328817)', diff saved to https://phabricator.wikimedia.org/P44604 and previous config saved to /var/cache/conftool/dbconfig/20230214-072157-marostegui.json
[07:24:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T329203)', diff saved to https://phabricator.wikimedia.org/P44605 and previous config saved to /var/cache/conftool/dbconfig/20230214-072452-marostegui.json
[07:24:54] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[07:24:56] <stashbot>	 T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[07:25:07] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[07:38:14] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2118.codfw.wmnet with reason: Maintenance
[07:38:27] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2118.codfw.wmnet with reason: Maintenance
[07:43:23] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1136.eqiad.wmnet with reason: Maintenance
[07:43:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T328817)', diff saved to https://phabricator.wikimedia.org/P44606 and previous config saved to /var/cache/conftool/dbconfig/20230214-074335-marostegui.json
[07:43:36] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1136.eqiad.wmnet with reason: Maintenance
[07:43:39] <stashbot>	 T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[07:43:50] <wikibugs>	 (PS1) Marostegui: Revert "db2181: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/888776
[07:45:38] <wikibugs>	 (CR) Nicolas Fraison: [V: +1 C: +2] fix(presto): do not set query.max*per-node config on coordinator [puppet] - https://gerrit.wikimedia.org/r/888685 (owner: Nicolas Fraison)
[07:49:25] <wikibugs>	 (PS2) Slyngshede: P:installserver::dhcp remove dhcp config for VMs [puppet] - https://gerrit.wikimedia.org/r/888692
[07:51:57] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db2181: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/888776 (owner: Marostegui)
[07:52:44] <wikibugs>	 (PS2) Nicolas Fraison: fix(presto): create intermediate ${data_dir}/var fodler [puppet] - https://gerrit.wikimedia.org/r/888760 (https://phabricator.wikimedia.org/T329361)
[07:54:11] <wikibugs>	 (CR) Elukey: fix(presto): do not set query.max*per-node config on coordinator (1 comment) [puppet] - https://gerrit.wikimedia.org/r/888685 (owner: Nicolas Fraison)
[07:55:29] <wikibugs>	 (CR) Nicolas Fraison: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39553/console" [puppet] - https://gerrit.wikimedia.org/r/888760 (https://phabricator.wikimedia.org/T329361) (owner: Nicolas Fraison)
[07:56:29] <wikibugs>	 ops-codfw, DBA: db2181 crashed - https://phabricator.wikimedia.org/T328623 (Marostegui) Host caught up - doing data checks now.
[07:57:04] <wikibugs>	 (CR) Nicolas Fraison: [V: +1 C: +2] fix(presto): create intermediate ${data_dir}/var fodler [puppet] - https://gerrit.wikimedia.org/r/888760 (https://phabricator.wikimedia.org/T329361) (owner: Nicolas Fraison)
[07:57:47] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.cf
[07:57:47] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0)
[07:57:54] <wikibugs>	 (CR) Slyngshede: [C: +2] P:installserver::dhcp remove dhcp config for VMs [puppet] - https://gerrit.wikimedia.org/r/888692 (owner: Slyngshede)
[07:58:03] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.cf
[07:58:05] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0)
[07:58:13] <XioNoX>	 !log enable CF in esams
[07:58:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P44607 and previous config saved to /var/cache/conftool/dbconfig/20230214-075842-marostegui.json
[08:00:04] <jouncebot>	 Amir1 and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230214T0800).
[08:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[08:04:24] <wikibugs>	 (PS1) Muehlenhoff: Fix Cumin alias for etcd/ML [puppet] - https://gerrit.wikimedia.org/r/889046
[08:04:34] <wikibugs>	 (PS2) Muehlenhoff: Fix Cumin alias for etcd/ML [puppet] - https://gerrit.wikimedia.org/r/889046
[08:06:56] <wikibugs>	 (CR) Elukey: [C: +1] "Thanks and sorry! :)" [puppet] - https://gerrit.wikimedia.org/r/889046 (owner: Muehlenhoff)
[08:07:43] <wikibugs>	 (PS1) Elukey: sre.k8s.upgrade-cluster: simplify etcd cluster procedure [cookbooks] - https://gerrit.wikimedia.org/r/889048 (https://phabricator.wikimedia.org/T327767)
[08:09:44] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Fix Cumin alias for etcd/ML [puppet] - https://gerrit.wikimedia.org/r/889046 (owner: Muehlenhoff)
[08:10:12] <wikibugs>	 (CR) Filippo Giunchedi: wmnet: add logs-api svc records (1 comment) [dns] - https://gerrit.wikimedia.org/r/888696 (https://phabricator.wikimedia.org/T320702) (owner: Filippo Giunchedi)
[08:10:44] <wikibugs>	 (PS2) Filippo Giunchedi: wmnet: add logs-api svc records [dns] - https://gerrit.wikimedia.org/r/888696 (https://phabricator.wikimedia.org/T320702)
[08:13:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P44608 and previous config saved to /var/cache/conftool/dbconfig/20230214-081348-marostegui.json
[08:14:28] <wikibugs>	 (CR) Elukey: [C: +1] Remove non-kafka logstash nodes from kafka configs [deployment-charts] - https://gerrit.wikimedia.org/r/886862 (https://phabricator.wikimedia.org/T329142) (owner: Cwhite)
[08:16:06] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] wmnet: add logs-api svc records [dns] - https://gerrit.wikimedia.org/r/888696 (https://phabricator.wikimedia.org/T320702) (owner: Filippo Giunchedi)
[08:17:29] <wikibugs>	 (PS2) Elukey: admin_ng: update ml-staging-codfw's settings for k8s 1.23 [deployment-charts] - https://gerrit.wikimedia.org/r/884038 (https://phabricator.wikimedia.org/T327767)
[08:17:33] <wikibugs>	 (PS3) Elukey: admin_ng: update ml-staging-codfw's settings for k8s 1.23 [deployment-charts] - https://gerrit.wikimedia.org/r/884038 (https://phabricator.wikimedia.org/T327767)
[08:17:35] <wikibugs>	 (CR) CI reject: [V: -1] admin_ng: update ml-staging-codfw's settings for k8s 1.23 [deployment-charts] - https://gerrit.wikimedia.org/r/884038 (https://phabricator.wikimedia.org/T327767) (owner: Elukey)
[08:18:32] <wikibugs>	 (PS4) Elukey: admin_ng: update ml-staging-codfw's settings for k8s 1.23 [deployment-charts] - https://gerrit.wikimedia.org/r/884038 (https://phabricator.wikimedia.org/T327767)
[08:19:25] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] admin: add Santiago Faci (1 comment) [puppet] - https://gerrit.wikimedia.org/r/888045 (https://phabricator.wikimedia.org/T329296) (owner: Filippo Giunchedi)
[08:19:40] <wikibugs>	 (CR) Filippo Giunchedi: elasticsearch: service depends on tmpfile (1 comment) [puppet] - https://gerrit.wikimedia.org/r/886059 (https://phabricator.wikimedia.org/T328674) (owner: Filippo Giunchedi)
[08:20:15] <wikibugs>	 (CR) Muehlenhoff: swift::ring_manager: Enable profile::auto_restarts::service for rsyncd (1 comment) [puppet] - https://gerrit.wikimedia.org/r/888170 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[08:22:55] <wikibugs>	 (PS1) Vgutierrez: cache::haproxy: Update to 2.6.8 in eqsin [puppet] - https://gerrit.wikimedia.org/r/889053 (https://phabricator.wikimedia.org/T321775)
[08:24:48] <wikibugs>	 (PS3) Jelto: gitlab_runner: Set pull_policy = ["always", "if-not-present"] on WMCS runners [puppet] - https://gerrit.wikimedia.org/r/888828 (https://phabricator.wikimedia.org/T329216) (owner: Legoktm)
[08:24:52] <wikibugs>	 (PS7) DCausse: flink-app: add support for custom config files [deployment-charts] - https://gerrit.wikimedia.org/r/888231
[08:24:54] <wikibugs>	 (PS10) DCausse: [WIP] rdf-streaming-updater: add a test job using the k8s operator... [deployment-charts] - https://gerrit.wikimedia.org/r/886005 (https://phabricator.wikimedia.org/T328675)
[08:25:05] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts testvm6001.drmrs.wmnet
[08:25:13] <wikibugs>	 (CR) Vgutierrez: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39554/console" [puppet] - https://gerrit.wikimedia.org/r/889053 (https://phabricator.wikimedia.org/T321775) (owner: Vgutierrez)
[08:26:15] <vgutierrez>	 !log rolling upgrade to HAProxy 2.6.8 in eqsin - T321775
[08:26:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:19] <stashbot>	 T321775: Upgrade HAProxy on cp nodes to 2.6.x LTS - https://phabricator.wikimedia.org/T321775
[08:26:19] <wikibugs>	 (CR) Elukey: [C: +2] admin_ng: update ml-staging-codfw's settings for k8s 1.23 [deployment-charts] - https://gerrit.wikimedia.org/r/884038 (https://phabricator.wikimedia.org/T327767) (owner: Elukey)
[08:26:26] <wikibugs>	 (CR) Vgutierrez: [V: +1 C: +2] cache::haproxy: Update to 2.6.8 in eqsin [puppet] - https://gerrit.wikimedia.org/r/889053 (https://phabricator.wikimedia.org/T321775) (owner: Vgutierrez)
[08:27:29] <jinxer-wm>	 (KubernetesCalicoDown) firing: (4) ml-staging-ctrl2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[08:28:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T328817)', diff saved to https://phabricator.wikimedia.org/P44609 and previous config saved to /var/cache/conftool/dbconfig/20230214-082854-marostegui.json
[08:28:56] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1193.eqiad.wmnet with reason: Maintenance
[08:28:59] <stashbot>	 T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[08:29:10] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1193.eqiad.wmnet with reason: Maintenance
[08:29:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1193 (T328817)', diff saved to https://phabricator.wikimedia.org/P44610 and previous config saved to /var/cache/conftool/dbconfig/20230214-082915-marostegui.json
[08:30:21] <wikibugs>	 (PS2) Filippo Giunchedi: Add logs-api service [puppet] - https://gerrit.wikimedia.org/r/888700 (https://phabricator.wikimedia.org/T320702)
[08:30:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T328817)', diff saved to https://phabricator.wikimedia.org/P44611 and previous config saved to /var/cache/conftool/dbconfig/20230214-083022-marostegui.json
[08:30:32] <wikibugs>	 (PS1) Muehlenhoff: Remove testvm6001 [puppet] - https://gerrit.wikimedia.org/r/889055
[08:31:01] <wikibugs>	 (CR) Klausman: [C: +1] sre.k8s.upgrade-cluster: simplify etcd cluster procedure [cookbooks] - https://gerrit.wikimedia.org/r/889048 (https://phabricator.wikimedia.org/T327767) (owner: Elukey)
[08:32:11] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove testvm6001 [puppet] - https://gerrit.wikimedia.org/r/889055 (owner: Muehlenhoff)
[08:32:34] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] Add logs-api service [puppet] - https://gerrit.wikimedia.org/r/888700 (https://phabricator.wikimedia.org/T320702) (owner: Filippo Giunchedi)
[08:33:24] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[08:37:29] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:39:14] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:40:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1193 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P44612 and previous config saved to /var/cache/conftool/dbconfig/20230214-084020-root.json
[08:42:25] <wikibugs>	 Puppet, SRE, Infrastructure-Foundations, netbox, and 3 others: Netbox: use the netbox to  also sync networks and network devices - https://phabricator.wikimedia.org/T329272 (ayounsi) That's awesome!  * Usecase #1 is to populate: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/r...
[08:44:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: testvm6001.drmrs.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:50:43] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: testvm6001.drmrs.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[08:50:43] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:50:44] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts testvm6001.drmrs.wmnet
[08:50:50] <wikibugs>	 SRE, Infrastructure-Foundations: Migrate the install servers to Bullseye - https://phabricator.wikimedia.org/T327867 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `testvm6001.drmrs.wmnet` - testvm6001.drmrs.wmnet (**PASS**)   - Downtimed host on Icinga/Alertma...
[08:51:05] <jinxer-wm>	 (ConfdResourceFailed) firing: (2) confd resource _srv_config-master_pybal_codfw_logs-api.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[08:51:51] <wikibugs>	 (CR) DCausse: [C: +2] flink-app: add support for custom config files [deployment-charts] - https://gerrit.wikimedia.org/r/888231 (owner: DCausse)
[08:51:56] <vgutierrez>	 godog: ^^
[08:52:54] <godog>	 gah of course, I'll take a look! service being setup
[08:53:07] <godog>	 vgutierrez: I'll followup shortly with the pybal bits FYI
[08:53:13] <godog>	 also, happy name day !
[08:53:22] <vgutierrez>	 cheers :)
[08:54:49] <logmsgbot>	 !log filippo@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: service=logs-api
[08:55:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P44613 and previous config saved to /var/cache/conftool/dbconfig/20230214-085525-root.json
[08:55:55] <logmsgbot>	 !log filippo@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: service=logs-api,dc=codfw
[08:56:11] <godog>	 sigh that's a confctl bug ^ that actually did nothing
[08:56:47] <wikibugs>	 (Merged) jenkins-bot: flink-app: add support for custom config files [deployment-charts] - https://gerrit.wikimedia.org/r/888231 (owner: DCausse)
[08:57:17] <logmsgbot>	 !log filippo@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: dc=codfw,service=logs-api
[08:57:21] <wikibugs>	 (PS1) Ayounsi: Remove pfw BFD special case [puppet] - https://gerrit.wikimedia.org/r/889062 (https://phabricator.wikimedia.org/T329272)
[09:00:26] <vgutierrez>	 godog: there is nothing for service=logs-api in codfw according to confctl
[09:00:58] <wikibugs>	 (PS1) Filippo Giunchedi: conftool-data: add logs-api codfw [puppet] - https://gerrit.wikimedia.org/r/889063 (https://phabricator.wikimedia.org/T320702)
[09:01:06] <jinxer-wm>	 (ConfdResourceFailed) firing: (2) confd resource _srv_config-master_pybal_codfw_logs-api.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[09:01:22] <godog>	 vgutierrez: yeah I came to the same conclusion, fixed in https://gerrit.wikimedia.org/r/c/operations/puppet/+/889063
[09:01:32] <vgutierrez>	 BTW,  filippo@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: service=logs-api --> if it wasn't the case already, that pooled everything for logs-api in eqiad
[09:01:47] <godog>	 yeah that was intended, new service
[09:02:15] <godog>	 I was surprised by confctl effectively doing nothing yet announcing
[09:02:44] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] conftool-data: add logs-api codfw [puppet] - https://gerrit.wikimedia.org/r/889063 (https://phabricator.wikimedia.org/T320702) (owner: Filippo Giunchedi)
[09:02:46] <wikibugs>	 (CR) Vgutierrez: [C: +1] conftool-data: add logs-api codfw [puppet] - https://gerrit.wikimedia.org/r/889063 (https://phabricator.wikimedia.org/T320702) (owner: Filippo Giunchedi)
[09:03:15] <wikibugs>	 (CR) MVernon: [C: +1] swift::ring_manager: Enable profile::auto_restarts::service for rsyncd (1 comment) [puppet] - https://gerrit.wikimedia.org/r/888170 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[09:04:44] <logmsgbot>	 !log filippo@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: service=logs-api
[09:04:55] <godog>	 ok confd should be happy now
[09:05:35] <godog>	 brb
[09:09:56] <wikibugs>	 Puppet, SRE, Infrastructure-Foundations, netbox, and 3 others: Netbox: use the netbox to  also sync networks and network devices - https://phabricator.wikimedia.org/T329272 (ayounsi) Usecase #2 is to replace the hardcoded values from: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/pupp...
[09:10:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1193 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P44614 and previous config saved to /var/cache/conftool/dbconfig/20230214-091030-root.json
[09:10:54] <wikibugs>	 (PS4) Jelto: gitlab_runner: Set pull_policy = ["always", "if-not-present"] on WMCS runners [puppet] - https://gerrit.wikimedia.org/r/888828 (https://phabricator.wikimedia.org/T329216) (owner: Legoktm)
[09:11:05] <jinxer-wm>	 (ConfdResourceFailed) firing: (3) confd resource _srv_config-master_pybal_codfw_logs-api.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[09:11:19] <wikibugs>	 SRE, Data-Persistence, serviceops, Datacenter-Switchover: spicerack.mysql_legacy errors on get_core_masters_heartbeats when checking x2 - https://phabricator.wikimedia.org/T329533 (Volans) @Marostegui  Just to clarify and avoid confusion, are you suggesting to remove `x2` entirely from the spicer...
[09:13:23] <wikibugs>	 Puppet, SRE, Infrastructure-Foundations, netbox, and 3 others: Netbox: use the netbox to  also sync networks and network devices - https://phabricator.wikimedia.org/T329272 (ayounsi) Usecase #3 is to generate https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production...
[09:15:44] <wikibugs>	 (PS1) Filippo Giunchedi: hieradata: logs-api to lvs_setup state [puppet] - https://gerrit.wikimedia.org/r/889066 (https://phabricator.wikimedia.org/T320702)
[09:16:06] <jinxer-wm>	 (ConfdResourceFailed) firing: (3) confd resource _srv_config-master_pybal_codfw_logs-api.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[09:21:05] <jinxer-wm>	 (ConfdResourceFailed) resolved: (3) confd resource _srv_config-master_pybal_codfw_logs-api.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[09:21:07] <wikibugs>	 (PS5) Jelto: gitlab_runner: Set pull_policy = ["always", "if-not-present"] on WMCS runners [puppet] - https://gerrit.wikimedia.org/r/888828 (https://phabricator.wikimedia.org/T329216) (owner: Legoktm)
[09:22:29] <jinxer-wm>	 (PuppetCertificateAboutToExpire) firing: (2) Puppet CA certificate labstore1006.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[09:25:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1193 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P44615 and previous config saved to /var/cache/conftool/dbconfig/20230214-092535-root.json
[09:27:42] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/889048 (https://phabricator.wikimedia.org/T327767) (owner: Elukey)
[09:29:30] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1109.eqiad.wmnet with reason: Maintenance
[09:29:43] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1109.eqiad.wmnet with reason: Maintenance
[09:29:50] <wikibugs>	 (CR) Vgutierrez: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39559/console" [puppet] - https://gerrit.wikimedia.org/r/889066 (https://phabricator.wikimedia.org/T320702) (owner: Filippo Giunchedi)
[09:32:12] <wikibugs>	 SRE, Data-Persistence, serviceops, Datacenter-Switchover: spicerack.mysql_legacy errors on get_core_masters_heartbeats when checking x2 - https://phabricator.wikimedia.org/T329533 (Marostegui) @Volans I am unsure. How do we treat parsercache at the moment? x2 is special in the sense that it does...
[09:34:30] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] hieradata: logs-api to lvs_setup state [puppet] - https://gerrit.wikimedia.org/r/889066 (https://phabricator.wikimedia.org/T320702) (owner: Filippo Giunchedi)
[09:34:32] <wikibugs>	 (PS8) Clément Goubert: sre.discovery.datacenter: rename and add status command [cookbooks] - https://gerrit.wikimedia.org/r/887740 (owner: Giuseppe Lavagetto)
[09:35:28] <wikibugs>	 SRE, Data-Persistence, serviceops, Datacenter-Switchover: spicerack.mysql_legacy errors on get_core_masters_heartbeats when checking x2 - https://phabricator.wikimedia.org/T329533 (Volans) >>! In T329533#8613691, @Marostegui wrote: > @Volans I am unsure. How do we treat parsercache at the moment?...
[09:37:21] <wikibugs>	 (CR) Clément Goubert: [C: +2] sre.discovery.datacenter: rename and add status command [cookbooks] - https://gerrit.wikimedia.org/r/887740 (owner: Giuseppe Lavagetto)
[09:39:09] <wikibugs>	 (Merged) jenkins-bot: sre.discovery.datacenter: rename and add status command [cookbooks] - https://gerrit.wikimedia.org/r/887740 (owner: Giuseppe Lavagetto)
[09:39:24] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.79:443]) https://wikitech.wikimedia.org/wiki/PyBal
[09:40:02] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs1020 is CRITICAL: CRITICAL: 121 connections established with conf1007.eqiad.wmnet:4001 (min=122) https://wikitech.wikimedia.org/wiki/PyBal
[09:40:06] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs1019 is CRITICAL: CRITICAL: 75 connections established with conf1007.eqiad.wmnet:4001 (min=76) https://wikitech.wikimedia.org/wiki/PyBal
[09:40:23] <wikibugs>	 SRE, Data-Persistence, serviceops, Datacenter-Switchover: spicerack.mysql_legacy errors on get_core_masters_heartbeats when checking x2 - https://phabricator.wikimedia.org/T329533 (Marostegui) >>! In T329533#8613695, @Volans wrote: >>>! In T329533#8613691, @Marostegui wrote: >> @Volans I am unsur...
[09:40:24] <godog>	 known/expected ^ pending pybal restart
[09:40:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P44616 and previous config saved to /var/cache/conftool/dbconfig/20230214-094040-root.json
[09:41:46] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.79:443]) https://wikitech.wikimedia.org/wiki/PyBal
[09:42:23] <wikibugs>	 (PS3) Clément Goubert: sre.discovery.datacenter: fix rollback logic [cookbooks] - https://gerrit.wikimedia.org/r/887806 (https://phabricator.wikimedia.org/T329175) (owner: Giuseppe Lavagetto)
[09:42:34] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs2009 is CRITICAL: CRITICAL: 70 connections established with conf2005.codfw.wmnet:4001 (min=71) https://wikitech.wikimedia.org/wiki/PyBal
[09:42:34] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs2010 is CRITICAL: CRITICAL: 88 connections established with conf2004.codfw.wmnet:4001 (min=89) https://wikitech.wikimedia.org/wiki/PyBal
[09:43:38] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.79:443]) https://wikitech.wikimedia.org/wiki/PyBal
[09:44:24] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.79:443]) https://wikitech.wikimedia.org/wiki/PyBal
[09:45:52] <wikibugs>	 (CR) Clément Goubert: [C: +2] sre.discovery.datacenter: fix rollback logic [cookbooks] - https://gerrit.wikimedia.org/r/887806 (https://phabricator.wikimedia.org/T329175) (owner: Giuseppe Lavagetto)
[09:46:03] <wikibugs>	 (PS6) Jelto: gitlab_runner: Set pull_policy = ["always", "if-not-present"] on WMCS runners [puppet] - https://gerrit.wikimedia.org/r/888828 (https://phabricator.wikimedia.org/T329216) (owner: Legoktm)
[09:47:20] <wikibugs>	 (CR) Jelto: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39560/console" [puppet] - https://gerrit.wikimedia.org/r/888828 (https://phabricator.wikimedia.org/T329216) (owner: Legoktm)
[09:47:41] <wikibugs>	 (Merged) jenkins-bot: sre.discovery.datacenter: fix rollback logic [cookbooks] - https://gerrit.wikimedia.org/r/887806 (https://phabricator.wikimedia.org/T329175) (owner: Giuseppe Lavagetto)
[09:48:54] <wikibugs>	 (PS2) Elukey: sre.k8s.upgrade-cluster: simplify etcd cluster procedure [cookbooks] - https://gerrit.wikimedia.org/r/889048 (https://phabricator.wikimedia.org/T327767)
[09:49:14] <wikibugs>	 (CR) Elukey: sre.k8s.upgrade-cluster: simplify etcd cluster procedure (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/889048 (https://phabricator.wikimedia.org/T327767) (owner: Elukey)
[09:49:42] <wikibugs>	 (PS7) Jelto: gitlab_runner: Set pull_policy = ["always", "if-not-present"] on WMCS runners [puppet] - https://gerrit.wikimedia.org/r/888828 (https://phabricator.wikimedia.org/T329216) (owner: Legoktm)
[09:50:08] <wikibugs>	 (PS3) Elukey: sre.k8s.upgrade-cluster: simplify etcd cluster procedure [cookbooks] - https://gerrit.wikimedia.org/r/889048 (https://phabricator.wikimedia.org/T327767)
[09:50:10] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[09:50:14] <godog>	 !log roll-restart pybal in eqiad/codfw to pick up logs-api service - T320702
[09:50:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:17] <stashbot>	 T320702: Jaeger secure access to OpenSearch cluster - https://phabricator.wikimedia.org/T320702
[09:50:40] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:51:12] <wikibugs>	 (CR) Jelto: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39561/console" [puppet] - https://gerrit.wikimedia.org/r/888828 (https://phabricator.wikimedia.org/T329216) (owner: Legoktm)
[09:51:43] <wikibugs>	 (CR) Elukey: "Ben was this tried in the test cluster? It should be relatively easy, just to make sure if we see exceptions or not before hitting the res" [puppet] - https://gerrit.wikimedia.org/r/888718 (https://phabricator.wikimedia.org/T329363) (owner: Btullis)
[09:52:14] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:52:15] <wikibugs>	 (PS1) Volans: Makefile.deploy: fix bundle CA linking [software/netbox-deploy] (wmf-next) - https://gerrit.wikimedia.org/r/889068
[09:52:20] <wikibugs>	 (PS1) Ayounsi: k8s FERM: allow gateway and infra ranges by default [puppet] - https://gerrit.wikimedia.org/r/889069 (https://phabricator.wikimedia.org/T306649)
[09:52:36] <wikibugs>	 (PS8) Jelto: gitlab_runner: Set pull_policy = ["always", "if-not-present"] on WMCS runners [puppet] - https://gerrit.wikimedia.org/r/888828 (https://phabricator.wikimedia.org/T329216) (owner: Legoktm)
[09:52:40] <wikibugs>	 (CR) CI reject: [V: -1] k8s FERM: allow gateway and infra ranges by default [puppet] - https://gerrit.wikimedia.org/r/889069 (https://phabricator.wikimedia.org/T306649) (owner: Ayounsi)
[09:52:43] <wikibugs>	 (PS2) Ayounsi: k8s FERM: allow gateway and infra ranges by default [puppet] - https://gerrit.wikimedia.org/r/889069 (https://phabricator.wikimedia.org/T306649)
[09:52:53] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: -1] "after https://phabricator.wikimedia.org/T329611 I think this patch requires further discussion with the team." [puppet] - https://gerrit.wikimedia.org/r/888347 (https://phabricator.wikimedia.org/T329467) (owner: Majavah)
[09:53:08] <wikibugs>	 (CR) CI reject: [V: -1] k8s FERM: allow gateway and infra ranges by default [puppet] - https://gerrit.wikimedia.org/r/889069 (https://phabricator.wikimedia.org/T306649) (owner: Ayounsi)
[09:53:22] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[09:53:26] <wikibugs>	 (CR) Elukey: [C: +2] sre.k8s.upgrade-cluster: simplify etcd cluster procedure [cookbooks] - https://gerrit.wikimedia.org/r/889048 (https://phabricator.wikimedia.org/T327767) (owner: Elukey)
[09:53:33] <wikibugs>	 (PS4) Elukey: sre.k8s.upgrade-cluster: simplify etcd cluster procedure [cookbooks] - https://gerrit.wikimedia.org/r/889048 (https://phabricator.wikimedia.org/T327767)
[09:53:35] <wikibugs>	 (CR) Elukey: [V: +2] sre.k8s.upgrade-cluster: simplify etcd cluster procedure [cookbooks] - https://gerrit.wikimedia.org/r/889048 (https://phabricator.wikimedia.org/T327767) (owner: Elukey)
[09:54:07] <wikibugs>	 (CR) Jelto: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39562/console" [puppet] - https://gerrit.wikimedia.org/r/888828 (https://phabricator.wikimedia.org/T329216) (owner: Legoktm)
[09:54:08] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs2010 is OK: OK: 89 connections established with conf2004.codfw.wmnet:4001 (min=89) https://wikitech.wikimedia.org/wiki/PyBal
[09:54:09] <wikibugs>	 (CR) Ayounsi: [C: +1] Makefile.deploy: fix bundle CA linking [software/netbox-deploy] (wmf-next) - https://gerrit.wikimedia.org/r/889068 (owner: Volans)
[09:54:41] <wikibugs>	 (CR) Ayounsi: [C: -1] "check experimental" [puppet] - https://gerrit.wikimedia.org/r/889069 (https://phabricator.wikimedia.org/T306649) (owner: Ayounsi)
[09:55:29] <wikibugs>	 (CR) Volans: [V: +2 C: +2] Makefile.deploy: fix bundle CA linking [software/netbox-deploy] (wmf-next) - https://gerrit.wikimedia.org/r/889068 (owner: Volans)
[09:55:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P44618 and previous config saved to /var/cache/conftool/dbconfig/20230214-095544-root.json
[09:57:08] <wikibugs>	 (PS9) Jelto: gitlab_runner: Set pull_policy = ["always", "if-not-present"] on WMCS runners [puppet] - https://gerrit.wikimedia.org/r/888828 (https://phabricator.wikimedia.org/T329216) (owner: Legoktm)
[09:57:24] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs1020 is OK: OK: 122 connections established with conf1007.eqiad.wmnet:4001 (min=122) https://wikitech.wikimedia.org/wiki/PyBal
[09:57:29] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[09:57:48] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - logs-api_443: Servers logstash1032.eqiad.wmnet, logstash1030.eqiad.wmnet, logstash1025.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:58:21] <godog>	 looking ^
[09:58:46] <wikibugs>	 (CR) Jelto: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39563/console" [puppet] - https://gerrit.wikimedia.org/r/888828 (https://phabricator.wikimedia.org/T329216) (owner: Legoktm)
[09:59:52] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs2009 is OK: OK: 71 connections established with conf2005.codfw.wmnet:4001 (min=71) https://wikitech.wikimedia.org/wiki/PyBal
[10:00:16] <wikibugs>	 (CR) Btullis: [V: +1] Try libmariadb-java with sqoop on bullseye (1 comment) [puppet] - https://gerrit.wikimedia.org/r/888718 (https://phabricator.wikimedia.org/T329363) (owner: Btullis)
[10:00:56] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[10:01:53] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.deploy.python-code netbox to netbox-dev2002.codfw.wmnet with reason: Release v3.2.9 to netbox-next - volans@cumin1001
[10:02:06] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - logs-api_443: Servers logstash1032.eqiad.wmnet, logstash1025.eqiad.wmnet, logstash1024.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:02:28] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[10:03:14] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs1019 is OK: OK: 76 connections established with conf1007.eqiad.wmnet:4001 (min=76) https://wikitech.wikimedia.org/wiki/PyBal
[10:03:21] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) netbox to netbox-dev2002.codfw.wmnet with reason: Release v3.2.9 to netbox-next - volans@cumin1001
[10:07:29] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job calico-felix in k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:07:38] <wikibugs>	 (CR) Muehlenhoff: Try libmariadb-java with sqoop on bullseye (1 comment) [puppet] - https://gerrit.wikimedia.org/r/888718 (https://phabricator.wikimedia.org/T329363) (owner: Btullis)
[10:08:11] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.deploy.python-code netbox to netbox-dev2002.codfw.wmnet with reason: Release v3.2.9 to netbox-next - volans@cumin1001
[10:08:17] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add binder to the kernel module block list [puppet] - 10https://gerrit.wikimedia.org/r/888709 (owner: 10Muehlenhoff)
[10:09:08] <wikibugs>	 (03CR) 10Jelto: [V: 03+1] "thanks for opening the change!" [puppet] - 10https://gerrit.wikimedia.org/r/888828 (https://phabricator.wikimedia.org/T329216) (owner: 10Legoktm)
[10:09:24] <wikibugs>	 (03PS2) 10Btullis: Try libmariadb-java with sqoop on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/888718 (https://phabricator.wikimedia.org/T329363)
[10:09:38] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) netbox to netbox-dev2002.codfw.wmnet with reason: Release v3.2.9 to netbox-next - volans@cumin1001
[10:10:16] <wikibugs>	 10SRE-swift-storage: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253 (10akosiaris) I've found this task via a different pathway, trying to help editors in T328875. Debugging that one I ended up dealing with a swift ghost from 2017. While this is old enough to...
[10:10:23] <wikibugs>	 (03PS3) 10Ayounsi: k8s FERM: allow gateway and infra ranges by default [puppet] - 10https://gerrit.wikimedia.org/r/889069 (https://phabricator.wikimedia.org/T306649)
[10:10:44] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] k8s FERM: allow gateway and infra ranges by default [puppet] - 10https://gerrit.wikimedia.org/r/889069 (https://phabricator.wikimedia.org/T306649) (owner: 10Ayounsi)
[10:11:28] <wikibugs>	 (03PS4) 10Clément Goubert: sre.discovery.datacenter: add --fast-insecure switch for pool/depool [cookbooks] - 10https://gerrit.wikimedia.org/r/887741 (owner: 10Giuseppe Lavagetto)
[10:12:13] <wikibugs>	 (03CR) 10Clément Goubert: sre.discovery.datacenter: add --fast-insecure switch for pool/depool (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/887741 (owner: 10Giuseppe Lavagetto)
[10:12:31] <wikibugs>	 (03PS8) 10Clément Goubert: sre.discovery.datacenter: Add progress logging [cookbooks] - 10https://gerrit.wikimedia.org/r/887774
[10:13:28] <wikibugs>	 (03CR) 10Btullis: Try libmariadb-java with sqoop on bullseye (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/888718 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis)
[10:13:42] <wikibugs>	 (03PS4) 10Ayounsi: k8s FERM: allow gateway and infra ranges by default [puppet] - 10https://gerrit.wikimedia.org/r/889069 (https://phabricator.wikimedia.org/T306649)
[10:15:53] <wikibugs>	 (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/889069 (https://phabricator.wikimedia.org/T306649) (owner: 10Ayounsi)
[10:16:58] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/888718 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis)
[10:17:06] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] Try libmariadb-java with sqoop on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/888718 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis)
[10:17:30] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Try libmariadb-java with sqoop on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/888718 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis)
[10:17:37] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job calico-felix in k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:19:41] <moritzm>	 !log installing imagemagick security updates on bullseye
[10:19:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:10] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] "Doh!" [puppet] - 10https://gerrit.wikimedia.org/r/888718 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis)
[10:21:18] <wikibugs>	 (03PS1) 10Volans: Makefile.deploy: restart services [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/889077
[10:22:05] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] sre.discovery.datacenter: Add progress logging [cookbooks] - 10https://gerrit.wikimedia.org/r/887774 (owner: 10Clément Goubert)
[10:22:14] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49567 bytes in 5.775 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:22:32] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 8.788 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:23:51] <wikibugs>	 (03Merged) 10jenkins-bot: sre.discovery.datacenter: Add progress logging [cookbooks] - 10https://gerrit.wikimedia.org/r/887774 (owner: 10Clément Goubert)
[10:23:59] <wikibugs>	 (03PS2) 10Gehel: miscweb / query_service: remove ability to list directories [puppet] - 10https://gerrit.wikimedia.org/r/883272 (https://phabricator.wikimedia.org/T324667)
[10:25:27] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] Makefile.deploy: restart services [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/889077 (owner: 10Volans)
[10:25:29] <wikibugs>	 (03PS1) 10Btullis: Fix a compilation error in bigtop::mysql_jdbc [puppet] - 10https://gerrit.wikimedia.org/r/889079 (https://phabricator.wikimedia.org/T329363)
[10:26:00] <wikibugs>	 (03PS5) 10Ayounsi: k8s FERM: allow gateway and infra ranges by default [puppet] - 10https://gerrit.wikimedia.org/r/889069 (https://phabricator.wikimedia.org/T306649)
[10:26:51] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] miscweb / query_service: remove ability to list directories [puppet] - 10https://gerrit.wikimedia.org/r/883272 (https://phabricator.wikimedia.org/T324667) (owner: 10Gehel)
[10:27:05] <wikibugs>	 (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/889069 (https://phabricator.wikimedia.org/T306649) (owner: 10Ayounsi)
[10:27:07] <wikibugs>	 (03CR) 10Btullis: "A fix for my previous error." [puppet] - 10https://gerrit.wikimedia.org/r/889079 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis)
[10:27:09] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39564/console" [puppet] - 10https://gerrit.wikimedia.org/r/889079 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis)
[10:28:04] <wikibugs>	 (03CR) 10Btullis: [V: 03+1 C: 03+2] Fix a compilation error in bigtop::mysql_jdbc [puppet] - 10https://gerrit.wikimedia.org/r/889079 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis)
[10:32:00] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] swift::ring_manager: Enable profile::auto_restarts::service for rsyncd [puppet] - 10https://gerrit.wikimedia.org/r/888170 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[10:36:52] <wikibugs>	 (03PS1) 10Gehel: miscweb / query_service: remove ability to list directories [puppet] - 10https://gerrit.wikimedia.org/r/889080 (https://phabricator.wikimedia.org/T324667)
[10:37:10] <wikibugs>	 (03PS5) 10David Caro: node_pinger: use jumbo frames [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870)
[10:39:49] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] api-gateway: reformat templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/887991 (https://phabricator.wikimedia.org/T329049) (owner: 10Hnowlan)
[10:41:17] <wikibugs>	 (03CR) 10Volans: [V: 03+2 C: 03+2] Makefile.deploy: restart services [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/889077 (owner: 10Volans)
[10:41:48] <wikibugs>	 (03PS1) 10Btullis: Fix the bigtop::jdbc class on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/889081 (https://phabricator.wikimedia.org/T329363)
[10:42:46] <wikibugs>	 (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39565/console" [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[10:43:20] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39566/console" [puppet] - 10https://gerrit.wikimedia.org/r/889081 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis)
[10:43:31] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] miscweb / query_service: remove ability to list directories [puppet] - 10https://gerrit.wikimedia.org/r/889080 (https://phabricator.wikimedia.org/T324667) (owner: 10Gehel)
[10:43:46] <wikibugs>	 (03CR) 10Btullis: [V: 03+1 C: 03+2] Fix the bigtop::jdbc class on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/889081 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis)
[10:44:53] <wikibugs>	 (03CR) 10Hnowlan: fluent-bit: install wmf-certificates (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/883605 (owner: 10Hnowlan)
[10:45:01] <wikibugs>	 (03Merged) 10jenkins-bot: api-gateway: reformat templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/887991 (https://phabricator.wikimedia.org/T329049) (owner: 10Hnowlan)
[10:48:15] <wikibugs>	 (03PS1) 10Elukey: role::etcd::v3::ml_etcd::staging: use PKI certs [puppet] - 10https://gerrit.wikimedia.org/r/889082 (https://phabricator.wikimedia.org/T329556)
[10:49:55] <wikibugs>	 (03PS1) 10Filippo Giunchedi: logs-api: allow GET / only for health check [puppet] - 10https://gerrit.wikimedia.org/r/889083 (https://phabricator.wikimedia.org/T320702)
[10:50:05] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39567/console" [puppet] - 10https://gerrit.wikimedia.org/r/889082 (https://phabricator.wikimedia.org/T329556) (owner: 10Elukey)
[10:52:04] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10Technical-Debt: Convert all of our site.pp/roles to the role/profile paradigm - https://phabricator.wikimedia.org/T159412 (10MoritzMuehlenhoff) >>! In T159412#8599008, @Dzahn wrote: > @Muehlenhoff Here was my attempt to fix the "mediaw...
[10:52:45] <wikibugs>	 (03CR) 10Elukey: [V: 03+1 C: 03+2] role::etcd::v3::ml_etcd::staging: use PKI certs [puppet] - 10https://gerrit.wikimedia.org/r/889082 (https://phabricator.wikimedia.org/T329556) (owner: 10Elukey)
[10:53:02] <wikibugs>	 (03PS6) 10David Caro: node_pinger: use jumbo frames [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870)
[10:54:11] <wikibugs>	 (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39568/console" [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[10:56:02] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.deploy.python-code netbox to netbox-dev2002.codfw.wmnet with reason: Release v3.2.9 to netbox-next - volans@cumin1001
[10:56:13] <logmsgbot>	 !log volans@cumin1001 END (FAIL) - Cookbook sre.deploy.python-code (exit_code=99) netbox to netbox-dev2002.codfw.wmnet with reason: Release v3.2.9 to netbox-next - volans@cumin1001
[10:56:45] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.deploy.python-code netbox to netbox-dev2002.codfw.wmnet with reason: Release v3.2.9 to netbox-next - volans@cumin1001
[10:58:19] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) netbox to netbox-dev2002.codfw.wmnet with reason: Release v3.2.9 to netbox-next - volans@cumin1001
[11:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230214T1100)
[11:04:20] <wikibugs>	 (03PS4) 10EoghanGaffney: Add insetup puppet role for aphlict vm in codfw [puppet] - 10https://gerrit.wikimedia.org/r/888690 (https://phabricator.wikimedia.org/T322369)
[11:05:21] <wikibugs>	 (03PS7) 10David Caro: node_pinger: use jumbo frames [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870)
[11:05:54] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+2] Add insetup puppet role for aphlict vm in codfw [puppet] - 10https://gerrit.wikimedia.org/r/888690 (https://phabricator.wikimedia.org/T322369) (owner: 10EoghanGaffney)
[11:06:34] <wikibugs>	 (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39569/console" [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[11:10:19] <wikibugs>	 (03PS1) 10Elukey: profile::etcd::v3: add discovery SAN record on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/889084 (https://phabricator.wikimedia.org/T329556)
[11:10:40] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] profile::etcd::v3: add discovery SAN record on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/889084 (https://phabricator.wikimedia.org/T329556) (owner: 10Elukey)
[11:11:07] <wikibugs>	 (03PS34) 10Stevemunene: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580)
[11:11:28] <wikibugs>	 (03PS2) 10Elukey: profile::etcd::v3: add discovery SAN record on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/889084 (https://phabricator.wikimedia.org/T329556)
[11:11:30] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene)
[11:16:22] <wikibugs>	 (03PS3) 10Elukey: profile::etcd::v3: add discovery SAN record on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/889084 (https://phabricator.wikimedia.org/T329556)
[11:16:41] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: tools-manifests: don't collect statsd metrics [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/889085 (https://phabricator.wikimedia.org/T244809)
[11:16:46] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: tools-manifest: refresh reference to obsolete 'labs' things [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/889106
[11:20:57] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host schema1003.eqiad.wmnet
[11:23:27] <wikibugs>	 (03PS4) 10Elukey: profile::etcd::v3: add discovery SAN record on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/889084 (https://phabricator.wikimedia.org/T329556)
[11:24:36] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host schema1003.eqiad.wmnet
[11:25:33] <wikibugs>	 (03PS5) 10Elukey: profile::etcd::v3: add discovery SAN record on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/889084 (https://phabricator.wikimedia.org/T329556)
[11:25:52] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] profile::etcd::v3: add discovery SAN record on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/889084 (https://phabricator.wikimedia.org/T329556) (owner: 10Elukey)
[11:28:22] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host schema1004.eqiad.wmnet
[11:28:55] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: tools-manifest: refresh reference to obsolete 'labs' things [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/889106
[11:29:01] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: tools-manifest: add d/gbp.conf file [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/889110
[11:29:07] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: gitignore: ignore nano .swp file [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/889111
[11:29:13] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: d/changelog: generate entry for 0.25 buster [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/889112
[11:30:16] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe2001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_rsync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:30:17] <wikibugs>	 (03PS4) 10Clément Goubert: sre.switchdc.services: Exclude wdqs and wdqs-ssl [cookbooks] - 10https://gerrit.wikimedia.org/r/888208 (https://phabricator.wikimedia.org/T329193)
[11:30:19] <wikibugs>	 (03PS3) 10Clément Goubert: sre.switchdc.services: import sre.discovery.datacenter excludes [cookbooks] - 10https://gerrit.wikimedia.org/r/888213 (https://phabricator.wikimedia.org/T329193)
[11:32:02] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host schema1004.eqiad.wmnet
[11:32:40] <wikibugs>	 (03CR) 10Clément Goubert: sre.switchdc.services: Exclude wdqs and wdqs-ssl (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/888208 (https://phabricator.wikimedia.org/T329193) (owner: 10Clément Goubert)
[11:38:50] <wikibugs>	 (03PS1) 10Volans: python_deploy: call also a post-deploy target [puppet] - 10https://gerrit.wikimedia.org/r/889113
[11:39:04] <wikibugs>	 (03PS6) 10Elukey: profile::etcd::v3: add discovery SAN record on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/889084 (https://phabricator.wikimedia.org/T329556)
[11:39:26] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] profile::etcd::v3: add discovery SAN record on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/889084 (https://phabricator.wikimedia.org/T329556) (owner: 10Elukey)
[11:40:15] <wikibugs>	 (03PS7) 10Elukey: profile::etcd::v3: add discovery SAN record on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/889084 (https://phabricator.wikimedia.org/T329556)
[11:40:25] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] sre.mediawiki.restart-appservers: Fix clusters (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/870548 (https://phabricator.wikimedia.org/T325739) (owner: 10Clément Goubert)
[11:40:36] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] profile::etcd::v3: add discovery SAN record on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/889084 (https://phabricator.wikimedia.org/T329556) (owner: 10Elukey)
[11:40:57] <wikibugs>	 (03PS1) 10Volans: Makefile.deploy: add post-deploy target [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/889116
[11:41:24] <wikibugs>	 (03PS1) 10Volans: Rake taskgen: use shellcheck from $PATH [puppet] - 10https://gerrit.wikimedia.org/r/889117
[11:41:32] <wikibugs>	 (03PS8) 10Elukey: profile::etcd::v3: add discovery SAN record on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/889084 (https://phabricator.wikimedia.org/T329556)
[11:42:51] <wikibugs>	 (03PS9) 10Elukey: profile::etcd::v3: add discovery SAN record on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/889084 (https://phabricator.wikimedia.org/T329556)
[11:43:22] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] python_deploy: call also a post-deploy target [puppet] - 10https://gerrit.wikimedia.org/r/889113 (owner: 10Volans)
[11:43:53] <wikibugs>	 (03PS1) 10Zabe: beta: Add deployment-db11 and deployment-db12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889118 (https://phabricator.wikimedia.org/T329577)
[11:43:55] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39578/console" [puppet] - 10https://gerrit.wikimedia.org/r/889084 (https://phabricator.wikimedia.org/T329556) (owner: 10Elukey)
[11:44:09] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] "Should we make the post-deploy optional? Eg. not fail if it doesn't exist?" [puppet] - 10https://gerrit.wikimedia.org/r/889113 (owner: 10Volans)
[11:44:11] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host schema2003.codfw.wmnet
[11:44:23] <wikibugs>	 (03CR) 10Zabe: [C: 03+2] beta: Add deployment-db11 and deployment-db12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889118 (https://phabricator.wikimedia.org/T329577) (owner: 10Zabe)
[11:44:34] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] Makefile.deploy: add post-deploy target [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/889116 (owner: 10Volans)
[11:44:55] <wikibugs>	 (03PS2) 10Clément Goubert: sre.discovery.datacenter: status improvements [cookbooks] - 10https://gerrit.wikimedia.org/r/889108
[11:45:24] <wikibugs>	 (03Merged) 10jenkins-bot: beta: Add deployment-db11 and deployment-db12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889118 (https://phabricator.wikimedia.org/T329577) (owner: 10Zabe)
[11:46:21] <wikibugs>	 (03PS10) 10JMeybohm: profile::etcd::v3: add discovery SAN record on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/889084 (https://phabricator.wikimedia.org/T329556) (owner: 10Elukey)
[11:46:38] <wikibugs>	 (03CR) 10Jbond: "lgtm minor comment inline" [puppet] - 10https://gerrit.wikimedia.org/r/889113 (owner: 10Volans)
[11:46:45] <wikibugs>	 10SRE-swift-storage: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253 (10MatthewVernon) Whatever you've found is not the same issue as with ghost objects - a ghost object as defined here is one which appears in `swift list` (or asking swift for the contents of a...
[11:47:04] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/889116 (owner: 10Volans)
[11:47:06] <wikibugs>	 (03PS1) 10Volans: Makefile.deploy: add post-deploy target [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/889119
[11:47:35] <wikibugs>	 10SRE-swift-storage: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253 (10MatthewVernon) But yes, thumbnails are transient, so it should always be OK to delete them.
[11:47:45] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] Makefile.deploy: add post-deploy target [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/889119 (owner: 10Volans)
[11:47:52] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host schema2003.codfw.wmnet
[11:48:21] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39579/console" [puppet] - 10https://gerrit.wikimedia.org/r/889084 (https://phabricator.wikimedia.org/T329556) (owner: 10Elukey)
[11:49:06] <wikibugs>	 (03CR) 10Jbond: "seems my comment was lost, here it is again" [puppet] - 10https://gerrit.wikimedia.org/r/889113 (owner: 10Volans)
[11:49:24] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host schema2004.codfw.wmnet
[11:49:45] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1 C: 03+1] profile::etcd::v3: add discovery SAN record on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/889084 (https://phabricator.wikimedia.org/T329556) (owner: 10Elukey)
[11:51:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2181 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P44619 and previous config saved to /var/cache/conftool/dbconfig/20230214-115144-root.json
[11:51:55] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] Makefile.deploy: add post-deploy target [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/889119 (owner: 10Volans)
[11:53:05] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host schema2004.codfw.wmnet
[11:54:08] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] tools-manifests: don't collect statsd metrics [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/889085 (https://phabricator.wikimedia.org/T244809) (owner: 10Arturo Borrero Gonzalez)
[11:54:11] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] tools-manifest: refresh reference to obsolete 'labs' things [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/889106 (owner: 10Arturo Borrero Gonzalez)
[11:54:15] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] tools-manifest: add d/gbp.conf file [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/889110 (owner: 10Arturo Borrero Gonzalez)
[11:54:18] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] gitignore: ignore nano .swp file [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/889111 (owner: 10Arturo Borrero Gonzalez)
[11:54:21] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] d/changelog: generate entry for 0.25 buster [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/889112 (owner: 10Arturo Borrero Gonzalez)
[11:54:28] <wikibugs>	 (03PS2) 10Volans: python_deploy: call also a post-deploy target [puppet] - 10https://gerrit.wikimedia.org/r/889113
[11:54:30] <wikibugs>	 (03PS2) 10Volans: Rake taskgen: use shellcheck from $PATH [puppet] - 10https://gerrit.wikimedia.org/r/889117
[11:54:39] <wikibugs>	 (03Merged) 10jenkins-bot: tools-manifests: don't collect statsd metrics [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/889085 (https://phabricator.wikimedia.org/T244809) (owner: 10Arturo Borrero Gonzalez)
[11:54:42] <wikibugs>	 (03CR) 10Volans: "addressed comments" [puppet] - 10https://gerrit.wikimedia.org/r/889113 (owner: 10Volans)
[11:55:34] <wikibugs>	 (03Abandoned) 10Volans: Makefile.deploy: add post-deploy target [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/889119 (owner: 10Volans)
[11:56:01] <wikibugs>	 (03CR) 10Volans: [V: 03+2 C: 03+2] Makefile.deploy: add post-deploy target [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/889116 (owner: 10Volans)
[11:56:46] <wikibugs>	 (03Merged) 10jenkins-bot: tools-manifest: refresh reference to obsolete 'labs' things [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/889106 (owner: 10Arturo Borrero Gonzalez)
[11:56:51] <wikibugs>	 (03Merged) 10jenkins-bot: tools-manifest: add d/gbp.conf file [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/889110 (owner: 10Arturo Borrero Gonzalez)
[11:56:57] <wikibugs>	 (03Merged) 10jenkins-bot: gitignore: ignore nano .swp file [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/889111 (owner: 10Arturo Borrero Gonzalez)
[11:57:35] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netbox, and 3 others: Netbox: use the netbox to also sync networks and network devices - https://phabricator.wikimedia.org/T329272 (10ayounsi) Usecase #4 is to centrally manage the list of BGP routers (core routers or ToR switches) used for host to configure t...
[11:57:45] <wikibugs>	 (03Merged) 10jenkins-bot: d/changelog: generate entry for 0.25 buster [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/889112 (owner: 10Arturo Borrero Gonzalez)
[11:59:54] <wikibugs>	 (03PS35) 10Stevemunene: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580)
[12:01:00] <icinga-wm>	 PROBLEM - Check systemd state on ms-fe1010 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_rsync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:01:26] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] python_deploy: call also a post-deploy target [puppet] - 10https://gerrit.wikimedia.org/r/889113 (owner: 10Volans)
[12:03:33] <wikibugs>	 (03CR) 10Stevemunene: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39580/console" [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene)
[12:06:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2181 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P44620 and previous config saved to /var/cache/conftool/dbconfig/20230214-120649-root.json
[12:07:55] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "Thanks, LGTM. I suggest we collect a +1 from Cathal as well." [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[12:08:02] <wikibugs>	 (03PS1) 10Muehlenhoff: swift::ring_manager: Only enable auto restart on active ring manager nodes [puppet] - 10https://gerrit.wikimedia.org/r/889122
[12:20:07] <wikibugs>	 (03CR) 10FNegri: [C: 03+1] "LGTM, let's wait for Cathal to confirm 8972 is the best value to use." [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[12:21:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2181 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P44621 and previous config saved to /var/cache/conftool/dbconfig/20230214-122154-root.json
[12:26:19] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] changeprop: use a more generic name for events in liftwing's config [deployment-charts] - 10https://gerrit.wikimedia.org/r/888653 (https://phabricator.wikimedia.org/T328576) (owner: 10Elukey)
[12:27:29] <jinxer-wm>	 (KubernetesCalicoDown) firing: (4) ml-staging-ctrl2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[12:36:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2181 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P44622 and previous config saved to /var/cache/conftool/dbconfig/20230214-123659-root.json
[12:37:08] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] node_pinger: use jumbo frames (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[12:42:34] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe2002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_rsync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:52:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2181 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P44623 and previous config saved to /var/cache/conftool/dbconfig/20230214-125203-root.json
[12:58:33] <wikibugs>	 (03CR) 10Ottomata: "I know you are still working, just a couple of thoughts on latest patches." [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene)
[13:00:49] <wikibugs>	 (03PS3) 10Jbond: sre.puppet.sync-netbox-hiera: add network devices to netbox hiera export [cookbooks] - 10https://gerrit.wikimedia.org/r/888759 (https://phabricator.wikimedia.org/T329272)
[13:02:34] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.puppet.sync-netbox-hiera: add network devices to netbox hiera export [cookbooks] - 10https://gerrit.wikimedia.org/r/888759 (https://phabricator.wikimedia.org/T329272) (owner: 10Jbond)
[13:05:35] <wikibugs>	 (03PS4) 10Jbond: sre.puppet.sync-netbox-hiera: add network devices to netbox hiera export [cookbooks] - 10https://gerrit.wikimedia.org/r/888759 (https://phabricator.wikimedia.org/T329272)
[13:07:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2181 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P44624 and previous config saved to /var/cache/conftool/dbconfig/20230214-130708-root.json
[13:07:14] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.puppet.sync-netbox-hiera: add network devices to netbox hiera export [cookbooks] - 10https://gerrit.wikimedia.org/r/888759 (https://phabricator.wikimedia.org/T329272) (owner: 10Jbond)
[13:08:22] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1001.eqiad.wmnet with OS bullseye
[13:08:34] <wikibugs>	 (CR) David Caro: [V: +1] node_pinger: use jumbo frames (1 comment) [puppet] - https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) (owner: David Caro)
[13:08:54] <wikibugs>	 (03CR) 10Volans: [C: 03+2] python_deploy: call also a post-deploy target [puppet] - 10https://gerrit.wikimedia.org/r/889113 (owner: 10Volans)
[13:08:57] <wikibugs>	 (03PS8) 10David Caro: node_pinger: use jumbo frames [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870)
[13:15:16] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netbox, and 3 others: Netbox: use the netbox to  also sync networks and network devices - https://phabricator.wikimedia.org/T329272 (10jbond)
[13:20:47] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1001.eqiad.wmnet with reason: host reimage
[13:22:29] <jinxer-wm>	 (PuppetCertificateAboutToExpire) firing: (2) Puppet CA certificate labstore1006.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[13:23:02] <wikibugs>	 (03PS3) 10Clément Goubert: sre.discovery.datacenter: status improvements [cookbooks] - 10https://gerrit.wikimedia.org/r/889108
[13:23:53] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1001.eqiad.wmnet with reason: host reimage
[13:26:23] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[13:27:19] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[13:28:47] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "I didn't test the output of status but LGTM, optional nit inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/889108 (owner: 10Clément Goubert)
[13:32:43] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, and 2 others: Deleted files can remain on swift due to race conditions - https://phabricator.wikimedia.org/T168002 (10zhuyifei1999)
[13:35:32] <wikibugs>	 10ops-codfw, 10DBA: db2181 crashed - https://phabricator.wikimedia.org/T328623 (10Marostegui) The data checksum was clean, so I am repooling this host.
[13:35:48] <wikibugs>	 (03PS1) 10Zabe: beta: Add deployment-db11 and deployment-db12 (part 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889126 (https://phabricator.wikimedia.org/T329577)
[13:36:12] <wikibugs>	 (03CR) 10Zabe: [C: 03+2] beta: Add deployment-db11 and deployment-db12 (part 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889126 (https://phabricator.wikimedia.org/T329577) (owner: 10Zabe)
[13:36:49] <wikibugs>	 (03Merged) 10jenkins-bot: beta: Add deployment-db11 and deployment-db12 (part 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889126 (https://phabricator.wikimedia.org/T329577) (owner: 10Zabe)
[13:36:51] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netbox, and 3 others: Netbox: use the netbox to  also sync networks and network devices - https://phabricator.wikimedia.org/T329272 (10jbond) > alarms: true we can set based on the device model (false by default as we have more mx204s, then if mx480: true) J...
[13:38:20] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] mediawiki: drop pybal-check user [puppet] - 10https://gerrit.wikimedia.org/r/886478 (https://phabricator.wikimedia.org/T111899) (owner: 10Majavah)
[13:39:14] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:42:29] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:42:43] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove further files related to removed pybal health checks [puppet] - 10https://gerrit.wikimedia.org/r/889127 (https://phabricator.wikimedia.org/T111899)
[13:44:06] <wikibugs>	 (03CR) 10MVernon: [C: 03+1] "Good catch, sorry I missed this in the first review." [puppet] - 10https://gerrit.wikimedia.org/r/889122 (owner: 10Muehlenhoff)
[13:44:23] <wikibugs>	 (03PS1) 10Zabe: beta: Pool deployment-db11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889128 (https://phabricator.wikimedia.org/T329577)
[13:45:10] <wikibugs>	 (03CR) 10Zabe: [C: 03+2] beta: Pool deployment-db11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889128 (https://phabricator.wikimedia.org/T329577) (owner: 10Zabe)
[13:45:41] <icinga-wm>	 PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.02387 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[13:45:52] <wikibugs>	 (03Merged) 10jenkins-bot: beta: Pool deployment-db11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889128 (https://phabricator.wikimedia.org/T329577) (owner: 10Zabe)
[13:46:07] <wikibugs>	 (03PS4) 10Clément Goubert: sre.discovery.datacenter: status improvements [cookbooks] - 10https://gerrit.wikimedia.org/r/889108
[13:46:20] <wikibugs>	 (CR) Clément Goubert: sre.discovery.datacenter: status improvements (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/889108 (owner: Clément Goubert)
[13:46:35] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/889127 (https://phabricator.wikimedia.org/T111899) (owner: 10Muehlenhoff)
[13:50:40] <wikibugs>	 (03CR) 10Cathal Mooney: "Overall LGTM, and definitely a good idea.  However looking at a cloudcephmon host it has an interface MTU of 1500 set on it?  Perhaps I di" [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[13:51:30] <wikibugs>	 (CR) Cathal Mooney: node_pinger: use jumbo frames (1 comment) [puppet] - https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) (owner: David Caro)
[13:53:22] <vgutierrez>	 Group[pybal-check] --> that's triggering issues in puppet
[13:53:54] <vgutierrez>	 taavi: that seems triggered by 1e2f1c0814cdd3547b20c0279b13d45fae07a926
[13:53:55] <wikibugs>	 (03PS1) 10Zabe: Revert "beta: Switch beta to read only on mediawiki level" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889097
[13:54:09] <wikibugs>	 (03CR) 10Zabe: [C: 03+2] Revert "beta: Switch beta to read only on mediawiki level" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889097 (owner: 10Zabe)
[13:54:43] <wikibugs>	 (CR) David Caro: node_pinger: use jumbo frames (1 comment) [puppet] - https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) (owner: David Caro)
[13:54:45] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "beta: Switch beta to read only on mediawiki level" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889097 (owner: 10Zabe)
[13:54:53] <vgutierrez>	 moritzm: ^^
[13:56:18] <vgutierrez>	 Feb 14 13:52:02 mw1351 puppet-agent[28879]: Could not delete group pybal-check: Execution of '/usr/sbin/groupdel pybal-check' returned 8: groupdel: cannot remove the primary group of user 'pybal-check'
[13:56:18] <vgutierrez>	 Feb 14 13:52:02 mw1351 puppet-agent[28879]: (/Stage[main]/Mediawiki::Users/Group[pybal-check]/ensure) change from 'present' to 'absent' failed: Could not delete group pybal-check: Execution of '/usr/sbin/groupdel pybal-check' returned 8: groupdel: cannot remove the primary group of user 'pybal-check'
[13:57:29] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[13:58:03] <wikibugs>	 (03PS1) 10Elukey: Replace underscores with hypens in ml-staging's SRV records [dns] - 10https://gerrit.wikimedia.org/r/889134 (https://phabricator.wikimedia.org/T329556)
[13:58:58] <vgutierrez>	 hmm seems like a second puppet run clears the issue
[14:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: My dear minions, it's time we take the moon! Just kidding. Time for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230214T1400).
[14:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[14:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230214T1400)
[14:02:40] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Replace underscores with hypens in ml-staging's SRV records [dns] - 10https://gerrit.wikimedia.org/r/889134 (https://phabricator.wikimedia.org/T329556) (owner: 10Elukey)
[14:03:03] <wikibugs>	 ops-codfw, DBA: db2181 crashed - https://phabricator.wikimedia.org/T328623 (Marostegui) Open→Resolved Thanks everyone for all the help!
[14:03:32] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] wmcs ceph:Move cloudcephosd1001/1002 to e4 [puppet] - 10https://gerrit.wikimedia.org/r/888659 (https://phabricator.wikimedia.org/T329498) (owner: 10David Caro)
[14:04:25] <wikibugs>	 (03PS11) 10Elukey: profile::etcd::v3: add discovery SAN record on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/889084 (https://phabricator.wikimedia.org/T329556)
[14:04:27] <wikibugs>	 (03PS1) 10Elukey: role::etcd::v3::ml_etcd::staging: replace discovery endpoint [puppet] - 10https://gerrit.wikimedia.org/r/889136 (https://phabricator.wikimedia.org/T329556)
[14:05:17] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] role::etcd::v3::ml_etcd::staging: replace discovery endpoint [puppet] - 10https://gerrit.wikimedia.org/r/889136 (https://phabricator.wikimedia.org/T329556) (owner: 10Elukey)
[14:05:57] <wikibugs>	 (03PS1) 10Jgiannelos: mobileapps: Bump to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/889138
[14:06:11] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] swift::ring_manager: Only enable auto restart on active ring manager nodes [puppet] - 10https://gerrit.wikimedia.org/r/889122 (owner: 10Muehlenhoff)
[14:09:04] <wikibugs>	 (03PS12) 10Elukey: profile::etcd::v3: add discovery SAN record on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/889084 (https://phabricator.wikimedia.org/T329556)
[14:11:14] <moritzm>	 !log installing libde265 security updates
[14:11:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:11:54] <wikibugs>	 (03CR) 10Jgiannelos: [C: 03+2] mobileapps: Bump to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/889138 (owner: 10Jgiannelos)
[14:12:45] <wikibugs>	 (03CR) 10Atieno: [C: 03+1] Bump Thumbor minor version [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/888034 (https://phabricator.wikimedia.org/T329290) (owner: 10Hnowlan)
[14:13:26] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] changeprop: use a more generic name for events in liftwing's config [deployment-charts] - 10https://gerrit.wikimedia.org/r/888653 (https://phabricator.wikimedia.org/T328576) (owner: 10Elukey)
[14:13:44] <wikibugs>	 (03PS1) 10Jgiannelos: proton: Bump to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/889139
[14:14:51] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] profile::etcd::v3: add discovery SAN record on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/889084 (https://phabricator.wikimedia.org/T329556) (owner: 10Elukey)
[14:16:57] <wikibugs>	 (03Merged) 10jenkins-bot: mobileapps: Bump to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/889138 (owner: 10Jgiannelos)
[14:17:37] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job calico-felix in k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:17:45] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply
[14:18:18] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[14:18:35] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[14:19:06] <wikibugs>	 (03CR) 10Jgiannelos: [C: 03+2] proton: Bump to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/889139 (owner: 10Jgiannelos)
[14:19:16] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/886039 (owner: 10Slyngshede)
[14:19:28] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[14:20:17] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[14:20:32] <wikibugs>	 (03PS9) 10David Caro: node_pinger: use jumbo frames [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870)
[14:21:01] <icinga-wm>	 RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[14:21:16] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[14:21:45] <wikibugs>	 (03PS1) 10Cathal Mooney: Adjust interface names for cloudcephosd1001 and cloudcephosd1002 [puppet] - 10https://gerrit.wikimedia.org/r/889142 (https://phabricator.wikimedia.org/T329498)
[14:22:33] <icinga-wm>	 RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[14:22:38] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/889142 (https://phabricator.wikimedia.org/T329498) (owner: 10Cathal Mooney)
[14:22:51] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/proton: apply
[14:22:53] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/proton: apply
[14:23:25] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/proton: apply
[14:23:27] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/proton: apply
[14:23:29] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] Adjust interface names for cloudcephosd1001 and cloudcephosd1002 [puppet] - 10https://gerrit.wikimedia.org/r/889142 (https://phabricator.wikimedia.org/T329498) (owner: 10Cathal Mooney)
[14:23:34] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/proton: apply
[14:23:36] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/proton: apply
[14:23:49] <wikibugs>	 (03Merged) 10jenkins-bot: proton: Bump to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/889139 (owner: 10Jgiannelos)
[14:24:30] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] profile::etcd::v3: add discovery SAN record on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/889084 (https://phabricator.wikimedia.org/T329556) (owner: 10Elukey)
[14:24:33] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/proton: apply
[14:25:28] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/proton: apply
[14:26:37] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/proton: apply
[14:27:26] <wikibugs>	 (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39581/console" [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[14:28:33] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/proton: apply
[14:28:43] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/proton: apply
[14:30:22] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/proton: apply
[14:40:08] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Data-Persistence, 10Thumbor Migration: Pooling thumbor-k8s causes spikes in swift 500 errors - https://phabricator.wikimedia.org/T328033 (10Ladsgroup) Super stupid question: Would this help here? https://gerrit.wikimedia.org/r/c/operations/puppet/+/888657/2/modules/thumbor/fil...
[14:40:13] <wikibugs>	 (03CR) 10Herron: [C: 03+1] "LGTM, it'd give a slightly more accurate representation of node health too" [puppet] - 10https://gerrit.wikimedia.org/r/889083 (https://phabricator.wikimedia.org/T320702) (owner: 10Filippo Giunchedi)
[14:40:18] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin2002"
[14:41:08] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.k8s.upgrade-cluster Upgrade K8s version: Upgrade ml-staging-codfw cluster to 1.23
[14:41:21] <logmsgbot>	 !log andrew@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin2002"
[14:41:28] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1001.eqiad.wmnet with OS bullseye
[14:42:21] <icinga-wm>	 RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.000994 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[14:43:59] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.k8s.upgrade-cluster (exit_code=99) Upgrade K8s version: Upgrade ml-staging-codfw cluster to 1.23
[14:47:29] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job calico-felix in k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:48:33] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] logs-api: allow GET / only for health check [puppet] - 10https://gerrit.wikimedia.org/r/889083 (https://phabricator.wikimedia.org/T320702) (owner: 10Filippo Giunchedi)
[14:49:14] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job calico-felix in k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:53:15] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:54:05] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:54:49] <godog>	 !log roll-restart pybal in eqiad/codfw to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/889083
[14:54:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:57:31] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1002.eqiad.wmnet with OS bullseye
[14:59:41] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2448.codfw.wmnet with OS buster
[14:59:47] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2448.codfw.wmnet with OS buster
[15:01:50] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2449.codfw.wmnet with OS buster
[15:01:57] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2449.codfw.wmnet with OS buster
[15:04:17] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2450.codfw.wmnet with OS buster
[15:04:24] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2450.codfw.wmnet with OS buster
[15:05:49] <moritzm>	 !log installing openjdk-11 security updates
[15:05:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:07:22] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2451.codfw.wmnet with OS buster
[15:07:31] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2451.codfw.wmnet with OS buster
[15:08:38] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10Papaul)
[15:10:37] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1002.eqiad.wmnet with reason: host reimage
[15:11:43] <wikibugs>	 (03PS1) 10Zabe: beta: Pool deployment-db12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889150 (https://phabricator.wikimedia.org/T329577)
[15:13:19] <wikibugs>	 (03CR) 10Zabe: [C: 03+2] beta: Pool deployment-db12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889150 (https://phabricator.wikimedia.org/T329577) (owner: 10Zabe)
[15:13:42] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1002.eqiad.wmnet with reason: host reimage
[15:13:54] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by zabe@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889150 (https://phabricator.wikimedia.org/T329577) (owner: 10Zabe)
[15:13:56] <wikibugs>	 (03Merged) 10jenkins-bot: beta: Pool deployment-db12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889150 (https://phabricator.wikimedia.org/T329577) (owner: 10Zabe)
[15:14:04] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/889108 (owner: 10Clément Goubert)
[15:15:30] <wikibugs>	 10SRE-swift-storage: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253 (10MatthewVernon) We're due another full backup of swift contents in the next few days, but I think we need a cookbook or similar to script handling these. In outline, assuming we specify eqia...
[15:15:32] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] sre.discovery.datacenter: status improvements [cookbooks] - 10https://gerrit.wikimedia.org/r/889108 (owner: 10Clément Goubert)
[15:16:04] <wikibugs>	 (03CR) 10Volans: "suggestion inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/889133 (owner: 10Clément Goubert)
[15:17:12] <wikibugs>	 (03Merged) 10jenkins-bot: sre.discovery.datacenter: status improvements [cookbooks] - 10https://gerrit.wikimedia.org/r/889108 (owner: 10Clément Goubert)
[15:19:45] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2448.codfw.wmnet with reason: host reimage
[15:21:19] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2449.codfw.wmnet with reason: host reimage
[15:21:29] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1002.eqiad.wmnet with OS bullseye
[15:22:58] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2448.codfw.wmnet with reason: host reimage
[15:23:30] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2450.codfw.wmnet with reason: host reimage
[15:24:03] <logmsgbot>	 !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1002.eqiad.wmnet with OS bullseye
[15:24:45] <wikibugs>	 (03PS1) 10Elukey: sre.k8s.upgrade-cluster: wrap run_sync actions with try/except [cookbooks] - 10https://gerrit.wikimedia.org/r/889151 (https://phabricator.wikimedia.org/T327767)
[15:25:24] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2449.codfw.wmnet with reason: host reimage
[15:26:31] <wikibugs>	 (03PS2) 10Elukey: sre.k8s.upgrade-cluster: wrap run_sync actions with try/except [cookbooks] - 10https://gerrit.wikimedia.org/r/889151 (https://phabricator.wikimedia.org/T327767)
[15:27:19] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2451.codfw.wmnet with reason: host reimage
[15:27:56] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2450.codfw.wmnet with reason: host reimage
[15:30:25] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2451.codfw.wmnet with reason: host reimage
[15:32:08] <wikibugs>	 (03PS1) 10Muehlenhoff: Fail over to idp1002 [dns] - 10https://gerrit.wikimedia.org/r/889153
[15:33:20] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] profile::etcd::v3: add discovery SAN record on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/889084 (https://phabricator.wikimedia.org/T329556) (owner: 10Elukey)
[15:34:58] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1002.eqiad.wmnet with reason: host reimage
[15:35:54] <wikibugs>	 (03PS10) 10David Caro: node_pinger: use jumbo frames [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870)
[15:38:01] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1002.eqiad.wmnet with reason: host reimage
[15:39:09] <wikibugs>	 (03PS1) 10Hnowlan: changeprop, jobqueue: bump container version [deployment-charts] - 10https://gerrit.wikimedia.org/r/889154
[15:39:22] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[15:39:26] <wikibugs>	 (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39582/console" [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[15:41:25] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[15:43:30] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1002.eqiad.wmnet with OS bullseye
[15:44:04] <wikibugs>	 (03PS1) 10Bking: rdf-streaming-updater: Use S3 instead of Swift for bucket access [deployment-charts] - 10https://gerrit.wikimedia.org/r/889155 (https://phabricator.wikimedia.org/T304914)
[15:44:36] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[15:45:32] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[15:45:53] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] changeprop, jobqueue: bump container version [deployment-charts] - 10https://gerrit.wikimedia.org/r/889154 (owner: 10Hnowlan)
[15:48:07] <wikibugs>	 (03PS1) 10CDanis: pki: Add intermediates for aux k8s cluster (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/889158 (https://phabricator.wikimedia.org/T329633)
[15:48:24] <logmsgbot>	 !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1002.eqiad.wmnet with OS bullseye
[15:49:14] <wikibugs>	 (03PS2) 10CDanis: pki: Add intermediates for aux k8s cluster (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/889158 (https://phabricator.wikimedia.org/T329633)
[15:49:59] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: sync
[15:50:04] <wikibugs>	 (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/889158 (https://phabricator.wikimedia.org/T329633) (owner: 10CDanis)
[15:50:10] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: sync
[15:50:25] <dcausse>	 jouncebot: now
[15:50:25] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 9 minute(s)
[15:50:33] <inflatador>	 !log bking@deploy1002 'deploying rdf-streaming-updater prod eqiad T304914'
[15:50:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:50:37] <stashbot>	 T304914: Remove the presto client for swift from the flink image - https://phabricator.wikimedia.org/T304914
[15:51:49] <wikibugs>	 (03Merged) 10jenkins-bot: changeprop, jobqueue: bump container version [deployment-charts] - 10https://gerrit.wikimedia.org/r/889154 (owner: 10Hnowlan)
[15:52:30] <moritzm>	 !log uploaded src:icu67 67.1-7~wmf1 to buster-wikimedia/component/icu67 T329491
[15:52:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:52:34] <stashbot>	 T329491: ICU transition towards ICU 67 - https://phabricator.wikimedia.org/T329491
[15:53:36] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply
[15:54:06] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply
[15:55:15] <wikibugs>	 (03PS3) 10CDanis: pki: Add intermediates for aux k8s cluster (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/889158 (https://phabricator.wikimedia.org/T329633)
[15:55:20] <wikibugs>	 (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/889158 (https://phabricator.wikimedia.org/T329633) (owner: 10CDanis)
[15:55:50] <icinga-wm>	 PROBLEM - Disk space on thanos-be2001 is CRITICAL: DISK CRITICAL - free space: / 1985 MB (3% inode=97%): /srv/swift-storage/sda3 10261 MB (5% inode=99%): /tmp 1985 MB (3% inode=97%): /var/tmp 1985 MB (3% inode=97%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2001&var-datasource=codfw+prometheus/ops
[15:56:21] <wikibugs>	 (CR) Ahmon Dancy: "ok w/ me." [puppet] - https://gerrit.wikimedia.org/r/888828 (https://phabricator.wikimedia.org/T329216) (owner: Legoktm)
[15:56:25] <wikibugs>	 (CR) Ahmon Dancy: [C: +1] gitlab_runner: Set pull_policy = ["always", "if-not-present"] on WMCS runners [puppet] - https://gerrit.wikimedia.org/r/888828 (https://phabricator.wikimedia.org/T329216) (owner: Legoktm)
[15:56:46] <wikibugs>	 (PS1) Krinkle: webperf: Remove broken HeaderName/ReadmeName for arclamp file listing [puppet] - https://gerrit.wikimedia.org/r/889161
[15:57:00] <wikibugs>	 (PS4) CDanis: pki: Add intermediates for aux k8s cluster (1/2) [puppet] - https://gerrit.wikimedia.org/r/889158 (https://phabricator.wikimedia.org/T329633)
[15:57:05] <wikibugs>	 (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/889158 (https://phabricator.wikimedia.org/T329633) (owner: CDanis)
[15:57:38] <wikibugs>	 (PS1) Ottomata: Produce rc1.mediawik.page_change to eventgate-main [mediawiki-config] - https://gerrit.wikimedia.org/r/889162
[15:59:03] <wikibugs>	 (PS2) Ottomata: Produce rc1.mediawik.page_change to eventgate-main [mediawiki-config] - https://gerrit.wikimedia.org/r/889162
[15:59:07] <wikibugs>	 (Abandoned) Sbailey: Enable Linter migration scripts for namespace and tag and template [mediawiki-config] - https://gerrit.wikimedia.org/r/888111 (https://phabricator.wikimedia.org/T329342) (owner: Sbailey)
[15:59:12] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
[15:59:18] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1002.eqiad.wmnet with reason: host reimage
[16:00:49] <jinxer-wm>	 (WdqsStreamingUpdaterFlinkJobNotRunning) firing: WDQS_Streaming_Updater in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning
[16:01:42] <wikibugs>	 (PS1) Btullis: Remove the ores::base class from the analytics cluster [puppet] - https://gerrit.wikimedia.org/r/889164 (https://phabricator.wikimedia.org/T329363)
[16:02:27] <wikibugs>	 SRE, DNS, Traffic-Icebox, Wikimedia-Apache-configuration, Patch-For-Review: Remove aliases `minnan` and `zh-cfr` for the Min Nan Wikipedia - https://phabricator.wikimedia.org/T230382 (Ladsgroup) hmm, it's not too complicated, my only concern is the order they should go in, I don't think that...
[16:02:41] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1002.eqiad.wmnet with reason: host reimage
[16:04:00] <wikibugs>	 (CR) Ottomata: [C: +2] Produce rc1.mediawik.page_change to eventgate-main [mediawiki-config] - https://gerrit.wikimedia.org/r/889162 (owner: Ottomata)
[16:04:23] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[16:04:24] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2449.codfw.wmnet with OS buster
[16:04:25] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[16:04:26] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2450.codfw.wmnet with OS buster
[16:04:26] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[16:04:27] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2448.codfw.wmnet with OS buster
[16:04:27] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[16:04:28] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2451.codfw.wmnet with OS buster
[16:04:31] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2449.codfw.wmnet with OS buster completed: - mw2449 (**PASS**)   - Removed from Pupp...
[16:04:34] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2450.codfw.wmnet with OS buster completed: - mw2450 (**PASS**)   - Removed from Pupp...
[16:04:37] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2448.codfw.wmnet with OS buster completed: - mw2448 (**PASS**)   - Removed from Pupp...
[16:04:40] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2451.codfw.wmnet with OS buster completed: - mw2451 (**PASS**)   - Removed from Pupp...
[16:04:53] <wikibugs>	 (Merged) jenkins-bot: Produce rc1.mediawik.page_change to eventgate-main [mediawiki-config] - https://gerrit.wikimedia.org/r/889162 (owner: Ottomata)
[16:05:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) firing: WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[16:05:57] <wikibugs>	 (CR) Btullis: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39583/console" [puppet] - https://gerrit.wikimedia.org/r/889164 (https://phabricator.wikimedia.org/T329363) (owner: Btullis)
[16:06:05] <wikibugs>	 (CR) Zabe: [C: +1] Remove aliases 'minnan' and 'zh-cfr' [dns] - https://gerrit.wikimedia.org/r/529829 (https://phabricator.wikimedia.org/T230382) (owner: Fomafix)
[16:06:13] <wikibugs>	 (CR) Zabe: [C: +1] Remove aliases 'minnan' and 'zh-cfr' [puppet] - https://gerrit.wikimedia.org/r/529830 (https://phabricator.wikimedia.org/T230382) (owner: Fomafix)
[16:09:11] <wikibugs>	 (CR) David Caro: puppet: improvements to replica_cnf_api functional tests (2 comments) [puppet] - https://gerrit.wikimedia.org/r/888827 (https://phabricator.wikimedia.org/T303663) (owner: Raymond Ndibe)
[16:09:36] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
[16:11:37] <wikibugs>	 (CR) JHathaway: [V: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/889158 (https://phabricator.wikimedia.org/T329633) (owner: CDanis)
[16:12:07] <wikibugs>	 (CR) CDanis: [C: +2] pki: Add intermediates for aux k8s cluster (1/2) [puppet] - https://gerrit.wikimedia.org/r/889158 (https://phabricator.wikimedia.org/T329633) (owner: CDanis)
[16:12:17] <wikibugs>	 Puppet, SRE, Infrastructure-Foundations, netbox, and 3 others: Netbox: use the netbox to  also sync networks and network devices - https://phabricator.wikimedia.org/T329272 (jbond) > The OOB ones are tricky and should probably be kept for last, probably by fetching the OOB circuits, and not the d...
[16:12:21] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
[16:12:56] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm thanks" [puppet] - https://gerrit.wikimedia.org/r/889117 (owner: Volans)
[16:13:17] <wikibugs>	 (PS11) David Caro: node_pinger: use jumbo frames [puppet] - https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870)
[16:13:19] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[16:14:51] <wikibugs>	 (CR) David Caro: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39584/console" [puppet] - https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) (owner: David Caro)
[16:14:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST events) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:15:05] <wikibugs>	 (CR) David Caro: node_pinger: use jumbo frames (1 comment) [puppet] - https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) (owner: David Caro)
[16:15:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[16:16:13] <wikibugs>	 (CR) Volans: [C: +2] Rake taskgen: use shellcheck from $PATH [puppet] - https://gerrit.wikimedia.org/r/889117 (owner: Volans)
[16:16:34] <logmsgbot>	 !log andrew@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin2002"
[16:16:37] <wikibugs>	 (CR) David Caro: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39586/console" [puppet] - https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) (owner: David Caro)
[16:17:12] <logmsgbot>	 !log otto@deploy1002 Synchronized wmf-config/InitialiseSettings.php: wgEventStreams - Produce rc1.mediawiki.page_change to eventgate-main (duration: 09m 01s)
[16:18:32] <wikibugs>	 (PS12) David Caro: node_pinger: use jumbo frames [puppet] - https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870)
[16:18:34] <wikibugs>	 (PS1) Btullis: Do not install spark2 on bullseye or later [puppet] - https://gerrit.wikimedia.org/r/889166 (https://phabricator.wikimedia.org/T329363)
[16:19:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST events) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:22:19] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) resolved: WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[16:22:42] <wikibugs>	 (CR) Btullis: [V: +1] "PCC SUCCESS (NOOP 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39588/console" [puppet] - https://gerrit.wikimedia.org/r/889166 (https://phabricator.wikimedia.org/T329363) (owner: Btullis)
[16:22:45] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
[16:23:39] <wikibugs>	 (CR) David Caro: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39589/console" [puppet] - https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) (owner: David Caro)
[16:25:29] <wikibugs>	 (CR) Elukey: Remove the ores::base class from the analytics cluster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/889164 (https://phabricator.wikimedia.org/T329363) (owner: Btullis)
[16:26:27] <wikibugs>	 (CR) Elukey: [C: +1] k8s::package: Ensure the apt component is registered first [puppet] - https://gerrit.wikimedia.org/r/887981 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[16:27:02] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM thanks" [puppet] - https://gerrit.wikimedia.org/r/888065 (owner: Herron)
[16:27:19] <jinxer-wm>	 (WdqsStreamingUpdaterFlinkJobNotRunning) resolved: WDQS_Streaming_Updater in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning
[16:27:29] <jinxer-wm>	 (KubernetesCalicoDown) firing: (4) ml-staging-ctrl2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[16:27:42] <logmsgbot>	 !log andrew@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin2002"
[16:27:48] <logmsgbot>	 !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1002.eqiad.wmnet with OS bullseye
[16:27:52] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
[16:28:02] <wikibugs>	 (CR) David Caro: [V: +1] "New version ready for review, here's an output of a run of the script (manually copied from the pcc output):" [puppet] - https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) (owner: David Caro)
[16:28:54] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
[16:29:07] <wikibugs>	 (PS13) David Caro: node_pinger: use jumbo frames [puppet] - https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870)
[16:29:15] <wikibugs>	 (PS2) Btullis: Remove the ores::base class from the analytics cluster [puppet] - https://gerrit.wikimedia.org/r/889164 (https://phabricator.wikimedia.org/T329363)
[16:29:27] <wikibugs>	 (CR) Btullis: Remove the ores::base class from the analytics cluster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/889164 (https://phabricator.wikimedia.org/T329363) (owner: Btullis)
[16:29:45] <wikibugs>	 (PS1) Ottomata: eventgate-main - bump to image version 2023-02-14-162241-production [deployment-charts] - https://gerrit.wikimedia.org/r/889171
[16:29:47] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
[16:30:53] <wikibugs>	 (PS2) Ottomata: eventgate-main - bump to image version 2023-02-14-162241-production [deployment-charts] - https://gerrit.wikimedia.org/r/889171
[16:31:35] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
[16:32:19] <jinxer-wm>	 (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[16:32:40] <wikibugs>	 (CR) Ottomata: [V: +2 C: +2] eventgate-main - bump to image version 2023-02-14-162241-production [deployment-charts] - https://gerrit.wikimedia.org/r/889171 (owner: Ottomata)
[16:33:10] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: apply
[16:33:13] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1002.eqiad.wmnet with OS bullseye
[16:33:13] <wikibugs>	 (PS5) Herron: service::catalog: add prometheus-https [puppet] - https://gerrit.wikimedia.org/r/863380 (https://phabricator.wikimedia.org/T301944)
[16:33:22] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: apply
[16:34:01] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-main: apply
[16:34:20] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply
[16:34:33] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop: apply
[16:34:51] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop: apply
[16:35:53] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: apply
[16:36:01] <logmsgbot>	 !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-main: apply
[16:36:08] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply
[16:36:32] <wikibugs>	 (CR) BCornwall: [C: +1] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/889127 (https://phabricator.wikimedia.org/T111899) (owner: Muehlenhoff)
[16:36:58] <logmsgbot>	 !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply
[16:37:03] <wikibugs>	 (CR) Herron: [V: +1] "PCC SUCCESS (NOOP 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39591/console" [puppet] - https://gerrit.wikimedia.org/r/863380 (https://phabricator.wikimedia.org/T301944) (owner: Herron)
[16:37:45] <logmsgbot>	 !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply
[16:38:28] <wikibugs>	 (PS1) Bking: rdf-streaming-updater: Increase memory alloc from 2 to 3GB [deployment-charts] - https://gerrit.wikimedia.org/r/889172 (https://phabricator.wikimedia.org/T302494)
[16:38:41] <logmsgbot>	 !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply
[16:39:01] <wikibugs>	 (CR) Herron: [C: +2] rsync: remove rsync::server::wrap_with_stunnel [puppet] - https://gerrit.wikimedia.org/r/888065 (owner: Herron)
[16:39:48] <wikibugs>	 (CR) Herron: [C: +2] "thanks for the reviews!" [puppet] - https://gerrit.wikimedia.org/r/888065 (owner: Herron)
[16:44:06] <wikibugs>	 (CR) Andrew Bogott: "One comment inline; the setup/teardown seems good!" [puppet] - https://gerrit.wikimedia.org/r/888827 (https://phabricator.wikimedia.org/T303663) (owner: Raymond Ndibe)
[16:45:48] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1002.eqiad.wmnet with reason: host reimage
[16:47:07] <wikibugs>	 (PS1) Ottomata: wgEventStreams - rc1.mediawiki.page_change: enable on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/889174
[16:48:28] <wikibugs>	 (CR) Raymond Ndibe: puppet: improvements to replica_cnf_api functional tests (1 comment) [puppet] - https://gerrit.wikimedia.org/r/888827 (https://phabricator.wikimedia.org/T303663) (owner: Raymond Ndibe)
[16:48:29] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1002.eqiad.wmnet with reason: host reimage
[16:49:24] <wikibugs>	 (CR) Ottomata: [C: +2] wgEventStreams - rc1.mediawiki.page_change: enable on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/889174 (owner: Ottomata)
[16:50:03] <wikibugs>	 (Merged) jenkins-bot: wgEventStreams - rc1.mediawiki.page_change: enable on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/889174 (owner: Ottomata)
[16:50:28] <wikibugs>	 (CR) Raymond Ndibe: puppet: improvements to replica_cnf_api functional tests (1 comment) [puppet] - https://gerrit.wikimedia.org/r/888827 (https://phabricator.wikimedia.org/T303663) (owner: Raymond Ndibe)
[16:50:50] <wikibugs>	 (CR) Raymond Ndibe: puppet: improvements to replica_cnf_api functional tests (1 comment) [puppet] - https://gerrit.wikimedia.org/r/888827 (https://phabricator.wikimedia.org/T303663) (owner: Raymond Ndibe)
[16:52:56] <wikibugs>	 (PS2) Clément Goubert: sre.discovery.datacenter: ConfctlError handling [cookbooks] - https://gerrit.wikimedia.org/r/889133
[16:53:15] <wikibugs>	 (CR) Clément Goubert: sre.discovery.datacenter: ConfctlError handling (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/889133 (owner: Clément Goubert)
[16:53:41] <wikibugs>	 SRE-swift-storage: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253 (akosiaris) >>! In T327253#8614005, @MatthewVernon wrote: > Whatever you've found is not the same issue as with ghost objects - a ghost object as defined here is one which appears in `swift...
[16:53:47] <wikibugs>	 (CR) Vgutierrez: [C: -1] "from PCC output proxyfetch URL doesn't look good: proxyfetch.url = ["http://prometheus/"]" [puppet] - https://gerrit.wikimedia.org/r/863380 (https://phabricator.wikimedia.org/T301944) (owner: Herron)
[16:55:43] <wikibugs>	 (CR) JMeybohm: [C: -1] sre.k8s.upgrade-cluster: wrap run_sync actions with try/except (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/889151 (https://phabricator.wikimedia.org/T327767) (owner: Elukey)
[16:56:06] <wikibugs>	 (PS3) Clément Goubert: sre.discovery.datacenter: ConfctlError handling [cookbooks] - https://gerrit.wikimedia.org/r/889133
[16:56:25] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1002.eqiad.wmnet with OS bullseye
[16:57:32] <wikibugs>	 (CR) Raymond Ndibe: puppet: improvements to replica_cnf_api functional tests (1 comment) [puppet] - https://gerrit.wikimedia.org/r/888827 (https://phabricator.wikimedia.org/T303663) (owner: Raymond Ndibe)
[16:58:04] <wikibugs>	 (CR) CI reject: [V: -1] sre.discovery.datacenter: ConfctlError handling [cookbooks] - https://gerrit.wikimedia.org/r/889133 (owner: Clément Goubert)
[16:58:35] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1002.eqiad.wmnet with OS bullseye
[16:59:03] <wikibugs>	 (PS3) Elukey: sre.k8s.upgrade-cluster: wrap run_sync actions with try/except [cookbooks] - https://gerrit.wikimedia.org/r/889151 (https://phabricator.wikimedia.org/T327767)
[16:59:22] <wikibugs>	 (CR) Elukey: sre.k8s.upgrade-cluster: wrap run_sync actions with try/except (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/889151 (https://phabricator.wikimedia.org/T327767) (owner: Elukey)
[17:00:04] <jouncebot>	 jbond and rzl: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Puppet request window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230214T1700).
[17:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[17:00:16] <wikibugs>	 (CR) Elukey: [C: +1] Remove the ores::base class from the analytics cluster [puppet] - https://gerrit.wikimedia.org/r/889164 (https://phabricator.wikimedia.org/T329363) (owner: Btullis)
[17:01:34] <wikibugs>	 (PS6) Jcrespo: Add unit tests & coverage report [software/mediabackups] - https://gerrit.wikimedia.org/r/885428
[17:03:03] <wikibugs>	 (PS1) CDanis: pki: Add intermediates for aux k8s cluster (2/2) [puppet] - https://gerrit.wikimedia.org/r/889175 (https://phabricator.wikimedia.org/T329633)
[17:04:21] <wikibugs>	 (PS4) Clément Goubert: sre.discovery.datacenter: ConfctlError handling [cookbooks] - https://gerrit.wikimedia.org/r/889133
[17:05:02] <logmsgbot>	 !log otto@deploy1002 Synchronized wmf-config/InitialiseSettings.php: wgEventStreams - rc1.mediawiki.page_change: enable on all wikis (duration: 07m 11s)
[17:05:18] <wikibugs>	 (CR) DCausse: [C: +1] "lgtm," [deployment-charts] - https://gerrit.wikimedia.org/r/889172 (https://phabricator.wikimedia.org/T302494) (owner: Bking)
[17:05:47] <wikibugs>	 (PS1) CDanis: pki: dummy secrets for k8s_aux intermediates [labs/private] - https://gerrit.wikimedia.org/r/889176 (https://phabricator.wikimedia.org/T329633)
[17:05:57] <wikibugs>	 (CR) JMeybohm: "This won't work as the current maximum memory request of a container is 3Gi by default  (see helmfile.d/admin_ng/values/common.yaml)." [deployment-charts] - https://gerrit.wikimedia.org/r/889172 (https://phabricator.wikimedia.org/T302494) (owner: Bking)
[17:06:01] <wikibugs>	 (CR) JMeybohm: [C: -1] rdf-streaming-updater: Increase memory alloc from 2 to 3GB [deployment-charts] - https://gerrit.wikimedia.org/r/889172 (https://phabricator.wikimedia.org/T302494) (owner: Bking)
[17:06:19] <wikibugs>	 (CR) CDanis: [V: +2 C: +2] pki: dummy secrets for k8s_aux intermediates [labs/private] - https://gerrit.wikimedia.org/r/889176 (https://phabricator.wikimedia.org/T329633) (owner: CDanis)
[17:07:01] <wikibugs>	 (CR) JMeybohm: [V: +1 C: +2] k8s::package: Ensure the apt component is registered first [puppet] - https://gerrit.wikimedia.org/r/887981 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[17:09:48] <wikibugs>	 (CR) Btullis: [C: +2] Remove the ores::base class from the analytics cluster [puppet] - https://gerrit.wikimedia.org/r/889164 (https://phabricator.wikimedia.org/T329363) (owner: Btullis)
[17:09:48] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1002.eqiad.wmnet with reason: host reimage
[17:09:56] <wikibugs>	 (PS2) CDanis: pki: Add intermediates for aux k8s cluster (2/2) [puppet] - https://gerrit.wikimedia.org/r/889175 (https://phabricator.wikimedia.org/T329633)
[17:10:11] <wikibugs>	 (PS4) Elukey: sre.k8s.upgrade-cluster: wrap run_sync actions with try/except [cookbooks] - https://gerrit.wikimedia.org/r/889151 (https://phabricator.wikimedia.org/T327767)
[17:10:13] <wikibugs>	 (PS3) CDanis: pki: Add intermediates for aux k8s cluster (2/2) [puppet] - https://gerrit.wikimedia.org/r/889175 (https://phabricator.wikimedia.org/T329633)
[17:12:50] <wikibugs>	 (PS3) Nray: Enable Page Tools for logged in users across all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/888764 (https://phabricator.wikimedia.org/T328692) (owner: Bernard Wang)
[17:12:56] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1002.eqiad.wmnet with reason: host reimage
[17:13:09] <wikibugs>	 (CR) David Caro: node_pinger: use jumbo frames (1 comment) [puppet] - https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) (owner: David Caro)
[17:13:53] <wikibugs>	 (CR) JHathaway: [C: +1] "looks good" [puppet] - https://gerrit.wikimedia.org/r/889175 (https://phabricator.wikimedia.org/T329633) (owner: CDanis)
[17:16:33] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[17:17:45] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:18:15] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on vrts2001.codfw.wmnet with reason: installation failed due to read-only database
[17:18:28] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on vrts2001.codfw.wmnet with reason: installation failed due to read-only database
[17:19:51] <wikibugs>	 (PS13) Jbond: sre.puppet.sync-netbox-hiera: Use netbox GraphQL endpoint to fetch data [cookbooks] - https://gerrit.wikimedia.org/r/888051
[17:20:43] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[17:21:46] <wikibugs>	 (PS5) Jbond: sre.puppet.sync-netbox-hiera: add network devices to netbox hiera export [cookbooks] - https://gerrit.wikimedia.org/r/888759 (https://phabricator.wikimedia.org/T329272)
[17:22:29] <jinxer-wm>	 (PuppetCertificateAboutToExpire) firing: (2) Puppet CA certificate labstore1006.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[17:23:27] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS records for cloudcephosd1002 - cmooney@cumin1001"
[17:23:29] <wikibugs>	 (CR) CI reject: [V: -1] sre.puppet.sync-netbox-hiera: add network devices to netbox hiera export [cookbooks] - https://gerrit.wikimedia.org/r/888759 (https://phabricator.wikimedia.org/T329272) (owner: Jbond)
[17:24:43] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS records for cloudcephosd1002 - cmooney@cumin1001"
[17:24:43] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:25:02] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[17:27:04] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS records for cloudcephosd1002 - cmooney@cumin1001"
[17:28:08] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS records for cloudcephosd1002 - cmooney@cumin1001"
[17:28:08] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:28:09] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns4004.wikimedia.org with OS bullseye
[17:28:18] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns4004.wikimedia.org with OS bullseye
[17:28:24] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[17:29:11] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1002.eqiad.wmnet with OS bullseye
[17:29:26] <wikibugs>	 (PS6) Herron: service::catalog: add prometheus-https [puppet] - https://gerrit.wikimedia.org/r/863380 (https://phabricator.wikimedia.org/T301944)
[17:30:26] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS records for cloudcephosd1002 - cmooney@cumin1001"
[17:31:32] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS records for cloudcephosd1002 - cmooney@cumin1001"
[17:31:32] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:32:13] <icinga-wm>	 PROBLEM - Host 2620:0:863:1:198:35:26:8 is DOWN: PING CRITICAL - Packet loss = 100%
[17:32:14] <wikibugs>	 (PS9) Elukey: services: add the first lift wing stream to change-prop [deployment-charts] - https://gerrit.wikimedia.org/r/886918 (https://phabricator.wikimedia.org/T328576)
[17:32:29] <wikibugs>	 (CR) Elukey: services: add the first lift wing stream to change-prop (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/886918 (https://phabricator.wikimedia.org/T328576) (owner: Elukey)
[17:32:35] <icinga-wm>	 PROBLEM - BFD status on cr3-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:32:38] <sukhe>	 ^ expected
[17:32:41] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:32:42] <herron>	 ack
[17:32:59] <icinga-wm>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:33:17] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:33:26] <wikibugs>	 (03PS4) 10CDanis: pki: Add intermediates for aux k8s cluster (2/2) [puppet] - 10https://gerrit.wikimedia.org/r/889175 (https://phabricator.wikimedia.org/T329633)
[17:33:28] <wikibugs>	 (03PS1) 10CDanis: pki: Again add intermediates for aux k8s cluster (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/889181 (https://phabricator.wikimedia.org/T329633)
[17:33:53] <icinga-wm>	 PROBLEM - Recursive DNS on 198.35.26.8 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[17:34:02] <sukhe>	 ^ expected
[17:34:32] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] pki: Again add intermediates for aux k8s cluster (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/889181 (https://phabricator.wikimedia.org/T329633) (owner: 10CDanis)
[17:34:40] <wikibugs>	 (03CR) 10Legoktm: [C: 03+1] "Thanks, your changes make sense to me!" [puppet] - 10https://gerrit.wikimedia.org/r/888828 (https://phabricator.wikimedia.org/T329216) (owner: 10Legoktm)
[17:35:12] <wikibugs>	 (03PS10) 10Legoktm: gitlab_runner: Set pull_policy = ["always", "if-not-present"] on WMCS runners [puppet] - 10https://gerrit.wikimedia.org/r/888828 (https://phabricator.wikimedia.org/T329216)
[17:35:36] <wikibugs>	 (03PS7) 10Herron: service::catalog: add prometheus-https [puppet] - 10https://gerrit.wikimedia.org/r/863380 (https://phabricator.wikimedia.org/T301944)
[17:36:27] <wikibugs>	 (03CR) 10Vgutierrez: service::catalog: add prometheus-https (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/863380 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron)
[17:36:58] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] gitlab_runner: Set pull_policy = ["always", "if-not-present"] on WMCS runners [puppet] - 10https://gerrit.wikimedia.org/r/888828 (https://phabricator.wikimedia.org/T329216) (owner: 10Legoktm)
[17:37:24] <wikibugs>	 (03PS11) 10Legoktm: gitlab_runner: Set pull_policy = ["always", "if-not-present"] on WMCS runners [puppet] - 10https://gerrit.wikimedia.org/r/888828 (https://phabricator.wikimedia.org/T329216)
[17:37:29] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job calico-felix in k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:37:39] <wikibugs>	 (03CR) 10Herron: [V: 03+1] "PCC SUCCESS (NOOP 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39593/console" [puppet] - 10https://gerrit.wikimedia.org/r/863380 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron)
[17:39:34] <wikibugs>	 (03PS5) 10CDanis: pki: Add intermediates for aux k8s cluster (2/2) [puppet] - 10https://gerrit.wikimedia.org/r/889175 (https://phabricator.wikimedia.org/T329633)
[17:43:28] <wikibugs>	 (03CR) 10Herron: [V: 03+1] service::catalog: add prometheus-https (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/863380 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron)
[17:44:29] <wikibugs>	 (03PS1) 10CDanis: rename k8s aux [labs/private] - 10https://gerrit.wikimedia.org/r/889185 (https://phabricator.wikimedia.org/T329633)
[17:44:42] <wikibugs>	 (03CR) 10CDanis: [V: 03+2 C: 03+2] rename k8s aux [labs/private] - 10https://gerrit.wikimedia.org/r/889185 (https://phabricator.wikimedia.org/T329633) (owner: 10CDanis)
[17:45:48] <wikibugs>	 (03CR) 10JHathaway: [C: 03+1] "looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/889175 (https://phabricator.wikimedia.org/T329633) (owner: 10CDanis)
[17:45:58] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns4004.wikimedia.org with reason: host reimage
[17:46:05] <icinga-wm>	 RECOVERY - Host 2620:0:863:1:198:35:26:8 is UP: PING OK - Packet loss = 0%, RTA = 70.93 ms
[17:48:46] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns4004.wikimedia.org with reason: host reimage
[17:49:29] <icinga-wm>	 PROBLEM - Recursive DNS on 2620:0:863:1:198:35:26:8 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[17:49:45] <wikibugs>	 (03PS6) 10CDanis: pki: Add intermediates for aux k8s cluster (2/2) [puppet] - 10https://gerrit.wikimedia.org/r/889175 (https://phabricator.wikimedia.org/T329633)
[17:49:51] <wikibugs>	 (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/889175 (https://phabricator.wikimedia.org/T329633) (owner: 10CDanis)
[17:51:06] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] pki: Add intermediates for aux k8s cluster (2/2) [puppet] - 10https://gerrit.wikimedia.org/r/889175 (https://phabricator.wikimedia.org/T329633) (owner: 10CDanis)
[17:51:18] <wikibugs>	 (03CR) 10David Caro: puppet: improvements to replica_cnf_api functional tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/888827 (https://phabricator.wikimedia.org/T303663) (owner: 10Raymond Ndibe)
[17:53:19] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] pki: Add intermediates for aux k8s cluster (2/2) [puppet] - 10https://gerrit.wikimedia.org/r/889175 (https://phabricator.wikimedia.org/T329633) (owner: 10CDanis)
[17:53:45] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1 C: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39595/console" [puppet] - 10https://gerrit.wikimedia.org/r/889175 (https://phabricator.wikimedia.org/T329633) (owner: 10CDanis)
[17:57:29] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[18:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230214T1800)
[18:00:18] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] sre.k8s.upgrade-cluster: wrap run_sync actions with try/except (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/889151 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey)
[18:02:25] <wikibugs>	 (03CR) 10David Caro: "I got some questions, can you elaborate on what errors are you seeing with the tests currently?" [puppet] - 10https://gerrit.wikimedia.org/r/888827 (https://phabricator.wikimedia.org/T303663) (owner: 10Raymond Ndibe)
[18:05:48] <wikibugs>	 (03PS1) 10Bas dehaan: Added extended confirmed on nlwiki Implemented configuration changes regarding page protection for nlwiki, per request of local community. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888736 (https://phabricator.wikimedia.org/T329642)
[18:05:50] <wikibugs>	 (03CR) 10Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888736 (https://phabricator.wikimedia.org/T329642) (owner: 10Bas dehaan)
[18:06:17] <wikibugs>	 (03PS1) 10CDanis: role::aux_k8s: upgrade cluster settings for k8s 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/889189 (https://phabricator.wikimedia.org/T329633)
[18:06:20] <wikibugs>	 10SRE, 10ops-codfw, 10ops-eqiad, 10DC-Ops, 10serviceops: Update iDRAC and NIC firmware on mc-gp* hosts - https://phabricator.wikimedia.org/T329323 (10Papaul)
[18:06:25] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['mc-gp1002']
[18:10:30] <dduvall>	 !log refactored failed security patch for T278365
[18:10:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:15:13] <papaul>	 !log upgrading firmware on mc-gp1002
[18:15:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:15:50] <wikibugs>	 10SRE, 10ops-codfw, 10SRE Observability (FY2022/2023-Q3): Decommission netmon2001 - https://phabricator.wikimedia.org/T322695 (10Papaul)
[18:16:19] <wikibugs>	 (03PS1) 10TrainBranchBot: testwikis wikis to 1.40.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889191 (https://phabricator.wikimedia.org/T325586)
[18:16:21] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.40.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889191 (https://phabricator.wikimedia.org/T325586) (owner: 10TrainBranchBot)
[18:16:26] <wikibugs>	 10SRE, 10ops-codfw, 10SRE Observability (FY2022/2023-Q3): Decommission netmon2001 - https://phabricator.wikimedia.org/T322695 (10Papaul) 05Open→03Resolved Complete
[18:16:45] <wikibugs>	 (03PS1) 10CDanis: admin_ng: update aux's settings for k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/889194
[18:16:47] <wikibugs>	 (03PS1) 10Jbond: sre.puppet.sync-netbox-hiera: Add network data to the hiera files [cookbooks] - 10https://gerrit.wikimedia.org/r/889195
[18:16:59] <wikibugs>	 (03Merged) 10jenkins-bot: testwikis wikis to 1.40.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889191 (https://phabricator.wikimedia.org/T325586) (owner: 10TrainBranchBot)
[18:17:21] <logmsgbot>	 !log dduvall@deploy1002 Started scap: testwikis wikis to 1.40.0-wmf.23  refs T325586
[18:17:25] <stashbot>	 T325586: 1.40.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T325586
[18:17:44] <papaul>	 ok
[18:18:56] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.puppet.sync-netbox-hiera: Add network data to the hiera files [cookbooks] - 10https://gerrit.wikimedia.org/r/889195 (owner: 10Jbond)
[18:20:05] <icinga-wm>	 PROBLEM - Host mc-gp1002 is DOWN: PING CRITICAL - Packet loss = 100%
[18:24:49] <icinga-wm>	 RECOVERY - Host mc-gp1002 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[18:24:57] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['mc-gp1002']
[18:29:41] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['mc-gp1002']
[18:30:32] <wikibugs>	 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T329595 (10Papaul) @Marostegui we are getting the error below on the interface where db2099 is connected. It might be a bad cable or bad port or something else. I tried clearing the statistics last weekend but this came back up. I will pi...
[18:31:09] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: Migrate sre.switchdc.mediawiki to spicerack class API - https://phabricator.wikimedia.org/T328908 (10Krinkle)
[18:31:20] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: March 2023 Datacenter Switchover eqiad pooling schedule - https://phabricator.wikimedia.org/T328903 (10Krinkle)
[18:31:30] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: Ensure sre.switchdc.mediawiki live test multi-DC compatibility - https://phabricator.wikimedia.org/T329065 (10Krinkle)
[18:31:41] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover Excluded services - https://phabricator.wikimedia.org/T329193 (10Krinkle)
[18:31:58] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: Post March 2023 Datacenter Switchover Tasks - https://phabricator.wikimedia.org/T328907 (10Krinkle)
[18:32:08] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: March 2023 Datacenter Switchover Blockers - https://phabricator.wikimedia.org/T328770 (10Krinkle)
[18:32:17] <wikibugs>	 10SRE, 10Data-Persistence, 10cloud-services-team, 10serviceops, and 2 others: Wikitech issues for datacentre switchover (March 2023) - https://phabricator.wikimedia.org/T328768 (10Krinkle)
[18:32:35] <wikibugs>	 (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/889189 (https://phabricator.wikimedia.org/T329633) (owner: 10CDanis)
[18:33:01] <wikibugs>	 (03CR) 10JHathaway: [C: 03+1] "looks good" [deployment-charts] - 10https://gerrit.wikimedia.org/r/889194 (owner: 10CDanis)
[18:33:30] <claime>	 Krinkle: Sorry, I should clean the tags up when creating subtask. It's kind of annoying that phabricator does that.
[18:34:00] <Krinkle>	 claime: aye, no problem. +1 at T239378 if you like :)
[18:34:00] <stashbot>	 T239378: Disable parent task metadata by default for new sub tasks - https://phabricator.wikimedia.org/T239378
[18:34:13] <wikibugs>	 10SRE, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10JMeybohm)
[18:34:30] <Krinkle>	 fwiw, it is sometimes intended, but usually not indeed. I'd say it's easy enough to set them directly when creating the subtask if/when it is intended.
[18:34:39] <icinga-wm>	 PROBLEM - Host mc-gp1002 is DOWN: PING CRITICAL - Packet loss = 100%
[18:34:41] <mutante>	 yea, I find myself just creating a task first and then linking it as subtask after the fact. because more often than not I don't want all the subscribers and tags
[18:35:54] <logmsgbot>	 !log cdanis@cumin1001 START - Cookbook sre.k8s.upgrade-cluster Upgrade K8s version: upgrade to v1.23
[18:35:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST events) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:36:54] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dns4004.wikimedia.org with OS bullseye
[18:37:03] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns4004.wikimedia.org with OS bullseye executed with errors: - dns4004 (**FAIL**)   - Downtimed o...
[18:37:20] <logmsgbot>	 !log cdanis@cumin1001 END (FAIL) - Cookbook sre.k8s.upgrade-cluster (exit_code=99) Upgrade K8s version: upgrade to v1.23
[18:37:33] <icinga-wm>	 RECOVERY - Host mc-gp1002 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[18:37:47] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['mc-gp1002']
[18:38:00] <sukhe>	 !log reimage dns4004 back to buster to resolve pdns-rec Prometheus endpoint issues: T321309
[18:38:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:38:04] <stashbot>	 T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309
[18:38:14] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns4004.wikimedia.org with OS buster
[18:38:24] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns4004.wikimedia.org with OS buster
[18:39:14] <wikibugs>	 (03PS1) 10Herron: pontoon: don't deploy benthos instances with prod config [puppet] - 10https://gerrit.wikimedia.org/r/889198
[18:39:43] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - aux-k8s-ctrl_6443: Servers aux-k8s-ctrl1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[18:40:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST events) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:41:58] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] pontoon: don't deploy benthos instances with prod config [puppet] - 10https://gerrit.wikimedia.org/r/889198 (owner: 10Herron)
[18:42:21] <wikibugs>	 (03PS2) 10Herron: pontoon: don't deploy benthos instances with prod config [puppet] - 10https://gerrit.wikimedia.org/r/889198
[18:42:29] <jinxer-wm>	 (JobUnavailable) firing: (9) Reduced availability for job calico-felix in k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:43:27] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] sre.k8s.upgrade-cluster: wrap run_sync actions with try/except [cookbooks] - 10https://gerrit.wikimedia.org/r/889151 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey)
[18:44:14] <jinxer-wm>	 (JobUnavailable) firing: (12) Reduced availability for job calico-felix in k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:45:12] <wikibugs>	 (03Merged) 10jenkins-bot: sre.k8s.upgrade-cluster: wrap run_sync actions with try/except [cookbooks] - 10https://gerrit.wikimedia.org/r/889151 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey)
[18:45:25] <wikibugs>	 (03PS1) 10Ssingh: P:dns::recursor: skip installation of prometheus-pdns-rec-exporter [puppet] - 10https://gerrit.wikimedia.org/r/889199 (https://phabricator.wikimedia.org/T321309)
[18:46:29] <icinga-wm>	 PROBLEM - Host 2620:0:863:1:198:35:26:8 is DOWN: CRITICAL - Destination Unreachable (2620:0:863:1:198:35:26:8)
[18:46:38] <logmsgbot>	 !log cdanis@cumin1001 START - Cookbook sre.k8s.upgrade-cluster Upgrade K8s version: upgrade to v1.23
[18:47:17] <sukhe>	 2620:0:863:1:198:35:26:8 down is expected (dns4004)
[18:47:37] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] role::aux_k8s: upgrade cluster settings for k8s 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/889189 (https://phabricator.wikimedia.org/T329633) (owner: 10CDanis)
[18:47:41] <mutante>	 thanks, I was just wondering why I have never seen a MAC there
[18:47:53] <mutante>	 eh, v6 IP of course
[18:47:54] <logmsgbot>	 !log cdanis@cumin1001 END (FAIL) - Cookbook sre.k8s.upgrade-cluster (exit_code=99) Upgrade K8s version: upgrade to v1.23
[18:48:10] <sukhe>	 :)
[18:48:58] <wikibugs>	 (03PS2) 10Ssingh: P:dns::recursor: skip installation of prometheus-pdns-rec-exporter [puppet] - 10https://gerrit.wikimedia.org/r/889199 (https://phabricator.wikimedia.org/T321309)
[18:49:58] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39597/console" [puppet] - 10https://gerrit.wikimedia.org/r/889199 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh)
[18:51:16] <wikibugs>	 (03PS1) 10CDanis: k8s.upgrade-cluster: fix bug in re-enabling Puppet [cookbooks] - 10https://gerrit.wikimedia.org/r/889200
[18:51:56] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] k8s.upgrade-cluster: fix bug in re-enabling Puppet [cookbooks] - 10https://gerrit.wikimedia.org/r/889200 (owner: 10CDanis)
[18:53:46] <wikibugs>	 (03Merged) 10jenkins-bot: k8s.upgrade-cluster: fix bug in re-enabling Puppet [cookbooks] - 10https://gerrit.wikimedia.org/r/889200 (owner: 10CDanis)
[18:54:15] <icinga-wm>	 RECOVERY - Host 2620:0:863:1:198:35:26:8 is UP: PING OK - Packet loss = 0%, RTA = 70.93 ms
[18:54:27] <logmsgbot>	 !log cdanis@cumin1001 START - Cookbook sre.k8s.upgrade-cluster Upgrade K8s version: upgrade to v1.23
[18:55:14] <logmsgbot>	 !log cdanis@cumin1001 START - Cookbook sre.ganeti.reimage for host aux-k8s-ctrl1001.eqiad.wmnet with OS bullseye
[18:55:49] <wikibugs>	 (03CR) 10Dzahn: ci: move lists of contint and zuul hosts to hieradata/common.yaml (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/850593 (owner: 10Dzahn)
[18:56:20] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns4004.wikimedia.org with reason: host reimage
[18:58:54] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns4004.wikimedia.org with reason: host reimage
[18:59:42] <icinga-wm>	 RECOVERY - Check systemd state on logstash1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:59:47] <wikibugs>	 (03PS1) 10Ahmon Dancy: Merge branch 'master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/889202
[19:00:05] <jouncebot>	 dduvall and ^demon: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230214T1900).
[19:00:32] <wikibugs>	 (03PS1) 10Ssingh: dnsrecursor: enable webserver for bullseye installation of pdns-rec [puppet] - 10https://gerrit.wikimedia.org/r/889203 (https://phabricator.wikimedia.org/T321309)
[19:00:42] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+2] Merge branch 'master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/889202 (owner: 10Ahmon Dancy)
[19:00:53] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] dnsrecursor: enable webserver for bullseye installation of pdns-rec [puppet] - 10https://gerrit.wikimedia.org/r/889203 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh)
[19:01:18] <wikibugs>	 (03Merged) 10jenkins-bot: Merge branch 'master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/889202 (owner: 10Ahmon Dancy)
[19:01:38] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39598/console" [puppet] - 10https://gerrit.wikimedia.org/r/889203 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh)
[19:01:50] <wikibugs>	 (03PS2) 10Ssingh: dnsrecursor: enable webserver for bullseye installation of pdns-rec [puppet] - 10https://gerrit.wikimedia.org/r/889203 (https://phabricator.wikimedia.org/T321309)
[19:01:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Faux-k8s&var-instance=aux-k8s-ctrl1001.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[19:02:11] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] dnsrecursor: enable webserver for bullseye installation of pdns-rec [puppet] - 10https://gerrit.wikimedia.org/r/889203 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh)
[19:03:20] <wikibugs>	 (03CR) 10Dzahn: "Antoine said on IRC: "I think the change to the zuul_merger_hosts variable should not be in this change,  but the jenkins_master_hosts sho" [puppet] - 10https://gerrit.wikimedia.org/r/850593 (owner: 10Dzahn)
[19:03:56] <wikibugs>	 (03CR) 10Dzahn: ci: move lists of contint and zuul hosts to hieradata/common.yaml (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/850593 (owner: 10Dzahn)
[19:06:22] <logmsgbot>	 !log cdanis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-ctrl1001.eqiad.wmnet with reason: host reimage
[19:06:53] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['mc-gp1003']
[19:07:03] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['mc-gp1003']
[19:07:23] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['mc-gp1003']
[19:07:29] <jinxer-wm>	 (JobUnavailable) firing: (12) Reduced availability for job calico-felix in k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:08:24] <logmsgbot>	 !log dduvall@deploy1002 Finished scap: testwikis wikis to 1.40.0-wmf.23  refs T325586 (duration: 51m 03s)
[19:08:28] <stashbot>	 T325586: 1.40.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T325586
[19:09:17] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[19:09:26] <logmsgbot>	 !log cdanis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-ctrl1001.eqiad.wmnet with reason: host reimage
[19:09:30] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[19:09:52] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Enable DiscussionTools on mobile at almost all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889204 (https://phabricator.wikimedia.org/T328940)
[19:09:53] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[19:09:55] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - aux-k8s-ctrl_6443: Servers aux-k8s-ctrl1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[19:09:56] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[19:10:20] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2101.codfw.wmnet with reason: Maintenance
[19:10:33] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2101.codfw.wmnet with reason: Maintenance
[19:10:53] <wikibugs>	 (03PS3) 10Ssingh: dnsrecursor: enable webserver for bullseye installation of pdns-rec [puppet] - 10https://gerrit.wikimedia.org/r/889203 (https://phabricator.wikimedia.org/T321309)
[19:11:50] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39599/console" [puppet] - 10https://gerrit.wikimedia.org/r/889203 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh)
[19:14:56] <wikibugs>	 (03PS6) 10Dzahn: ci: move lists of contint hosts to hieradata/common.yaml [puppet] - 10https://gerrit.wikimedia.org/r/850593
[19:15:20] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2111.codfw.wmnet with reason: Maintenance
[19:15:44] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2111.codfw.wmnet with reason: Maintenance
[19:15:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2111 (T328255)', diff saved to https://phabricator.wikimedia.org/P44628 and previous config saved to /var/cache/conftool/dbconfig/20230214-191550-ladsgroup.json
[19:15:54] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[19:16:40] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/889199 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh)
[19:17:05] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netbox, and 2 others: Netbox: use the netbox to also sync networks - https://phabricator.wikimedia.org/T329669 (10jbond)
[19:17:10] <papaul>	 !log upgrading firmware on mc-gp1003
[19:17:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:17:17] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netbox, and 2 others: Netbox: use the netbox to also sync networks - https://phabricator.wikimedia.org/T329669 (10jbond) p:05Triage→03Medium
[19:19:23] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netbox, and 2 others: Netbox: use the netbox to also sync networks - https://phabricator.wikimedia.org/T329669 (10jbond)
[19:20:12] <wikibugs>	 (03PS4) 10Ssingh: dnsrecursor: enable webserver for bullseye installation of pdns-rec [puppet] - 10https://gerrit.wikimedia.org/r/889203 (https://phabricator.wikimedia.org/T321309)
[19:20:47] <icinga-wm>	 PROBLEM - Host mc-gp1003 is DOWN: PING CRITICAL - Packet loss = 100%
[19:21:09] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39601/console" [puppet] - 10https://gerrit.wikimedia.org/r/889203 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh)
[19:21:43] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[19:21:56] <logmsgbot>	 !log cdanis@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host aux-k8s-ctrl1001.eqiad.wmnet with OS bullseye
[19:22:00] <wikibugs>	 (03PS2) 10Jbond: sre.puppet.sync-netbox-hiera: Add network data to the hiera files [cookbooks] - 10https://gerrit.wikimedia.org/r/889195
[19:22:06] <logmsgbot>	 !log cdanis@cumin1001 START - Cookbook sre.ganeti.reimage for host aux-k8s-ctrl1002.eqiad.wmnet with OS bullseye
[19:22:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T328255)', diff saved to https://phabricator.wikimedia.org/P44629 and previous config saved to /var/cache/conftool/dbconfig/20230214-192242-ladsgroup.json
[19:22:46] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[19:23:41] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.puppet.sync-netbox-hiera: Add network data to the hiera files [cookbooks] - 10https://gerrit.wikimedia.org/r/889195 (owner: 10Jbond)
[19:25:31] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['mc-gp1003']
[19:25:37] <icinga-wm>	 RECOVERY - Host mc-gp1003 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[19:26:57] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netbox, and 2 others: Netbox: use the netbox to also sync networks - https://phabricator.wikimedia.org/T329669 (10jbond)
[19:27:13] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['mc-gp1003']
[19:28:59] <dduvall>	 we are out of disk space on deploy1002 :/
[19:29:08] <dduvall>	 in /srv
[19:29:14] <jinxer-wm>	 (JobUnavailable) firing: (12) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:30:02] <dancy>	 doh!
[19:31:04] <icinga-wm>	 PROBLEM - Host mc-gp1003 is DOWN: PING CRITICAL - Packet loss = 100%
[19:31:11] <wikibugs>	 10SRE, 10Diff-blog, 10Technical Blog, 10HTTPS: Send HSTS header on all Wordpress VIP-hosted domains - https://phabricator.wikimedia.org/T270034 (10Varnent) a:05Varnent→03None
[19:31:31] <logmsgbot>	 !log cdanis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-ctrl1002.eqiad.wmnet with reason: host reimage
[19:31:48] <dduvall>	 !log scap sync-world failed due to lack of disk space on deploy1002 /srv (cc T325586)
[19:31:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:31:52] <stashbot>	 T325586: 1.40.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T325586
[19:31:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[19:33:23] <wikibugs>	 (03CR) 10BCornwall: [C: 03+1] P:dns::recursor: skip installation of prometheus-pdns-rec-exporter [puppet] - 10https://gerrit.wikimedia.org/r/889199 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh)
[19:33:47] <dduvall>	 !log running `docker system prune` on deploy1002 to free up disk space on /srv
[19:33:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:33:58] <dancy>	 nooo
[19:34:00] <mutante>	 dduvall: a huge chunk of space is used by "deployment.T307349" which seems like a copy of the normal deployment dir. so that ticket number is probably where we should comment
[19:34:01] <stashbot>	 T307349: Accidental removal of some files under /srv/deployment on deploy1002 - https://phabricator.wikimedia.org/T307349
[19:34:06] <dancy>	 that'll make k8s build slow
[19:34:20] <mutante>	 T307349
[19:34:27] <dduvall>	 scap is broken atm. seems like it's worth it?
[19:34:33] <logmsgbot>	 !log cdanis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-ctrl1002.eqiad.wmnet with reason: host reimage
[19:34:36] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['mc-gp1003']
[19:34:38] <icinga-wm>	 RECOVERY - Host mc-gp1003 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[19:34:38] <wikibugs>	 10SRE, 10Diff-blog, 10Technical Blog, 10HTTPS: Send HSTS header on all Wordpress VIP-hosted domains - https://phabricator.wikimedia.org/T270034 (10Varnent) I believe it is in pipeline for any requests lingering - but probably best to check with @CKoerner_WMF for diff. For the other two - while not her dire...
[19:34:40] <dancy>	 Definitely delete /srv/deployment.T307349
[19:34:40] <dduvall>	 reclaimable: 111G
[19:35:20] <wikibugs>	 10SRE, 10Deployments, 10bacula, 10Parsoid (Tracking), 10Release-Engineering-Team (Doing): Accidental removal of some files under /srv/deployment on deploy1002 - https://phabricator.wikimedia.org/T307349 (10Dzahn) There is a directory "deployment.T307349" under /srv/ on deploy1002 that uses 47GB.  And the...
[19:36:25] <dancy>	 dduvall: Did you capture a `docker system df -v` ahead of time?  I'd like to see it if so
[19:36:28] <dduvall>	 alright. but we should implement some docker clean up job or move its store elsewhere i think
[19:36:34] <mutante>	 !log root@deploy1002:/srv# rm -rf deployment.T307349/
[19:36:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:36:37] <dduvall>	 i haven't pruned
[19:36:41] <dancy>	 nod.. I added a note to self to run docker-gc on deploy1002
[19:36:52] <mutante>	 /dev/mapper/vg0-srv   277G  216G   47G  83% /srv
[19:36:54] <dancy>	 ah good. then I'll look myself.
[19:36:56] <mutante>	 here you go
[19:37:01] <dduvall>	 https://www.irccloud.com/pastebin/f9tUZ4EU/
[19:37:09] <dduvall>	 mutante: thank you :)
[19:37:13] <mutante>	 yw
[19:37:25] <dduvall>	 !log did not run `docker system prune` due to objections
[19:37:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:37:32] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[19:37:45] <wikibugs>	 10SRE, 10Deployments, 10bacula, 10Parsoid (Tracking), 10Release-Engineering-Team (Doing): Accidental removal of some files under /srv/deployment on deploy1002 - https://phabricator.wikimedia.org/T307349 (10Dzahn) I deleted it.  19:36 < mutante> !log root@deploy1002:/srv# rm -rf deployment.T307349/ 19:36...
[19:37:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P44630 and previous config saved to /var/cache/conftool/dbconfig/20230214-193748-ladsgroup.json
[19:38:41] <wikibugs>	 (03PS1) 10Gehel: conftool / cirrus: elastic2069 in wrong LVS pool [puppet] - 10https://gerrit.wikimedia.org/r/889212 (https://phabricator.wikimedia.org/T329145)
[19:38:56] <wikibugs>	 10SRE, 10ops-codfw, 10ops-eqiad, 10DC-Ops, 10serviceops: Update iDRAC and NIC firmware on mc-gp* hosts - https://phabricator.wikimedia.org/T329323 (10Papaul)
[19:39:15] <wikibugs>	 10SRE, 10ops-codfw, 10ops-eqiad, 10DC-Ops, 10serviceops: Update iDRAC and NIC firmware on mc-gp* hosts - https://phabricator.wikimedia.org/T329323 (10Papaul) 05Open→03Resolved a:03Papaul @jijiki complete
[19:39:25] <dduvall>	 dancy: i'm confused though. wouldn't today's image be the base for subsequent builds?
[19:39:33] <dduvall>	 since it's `scap stage-train -Dfull_image_build:True --yes auto`
[19:39:45] <dancy>	 oh right.. first train of the week
[19:39:48] <dancy>	 ok. objection removed.
[19:39:53] <dduvall>	 :)
[19:40:17] <dduvall>	 we have space again. i'll leave it for now
[19:40:23] <dancy>	 thx.. that'll make testing docker-gc easier 
[19:40:31] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+1] conftool / cirrus: elastic2069 in wrong LVS pool [puppet] - 10https://gerrit.wikimedia.org/r/889212 (https://phabricator.wikimedia.org/T329145) (owner: 10Gehel)
[19:40:40] <dduvall>	 w00t
[19:40:41] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] conftool / cirrus: elastic2069 in wrong LVS pool [puppet] - 10https://gerrit.wikimedia.org/r/889212 (https://phabricator.wikimedia.org/T329145) (owner: 10Gehel)
[19:40:43] <wikibugs>	 (03CR) 10Bking: [C: 03+1] conftool / cirrus: elastic2069 in wrong LVS pool [puppet] - 10https://gerrit.wikimedia.org/r/889212 (https://phabricator.wikimedia.org/T329145) (owner: 10Gehel)
[19:41:36] <wikibugs>	 (03PS3) 10BCornwall: Remove aliases 'minnan' and 'zh-cfr' [dns] - 10https://gerrit.wikimedia.org/r/529829 (https://phabricator.wikimedia.org/T230382) (owner: 10Fomafix)
[19:41:37] <logmsgbot>	 !log dduvall@deploy1002 Started scap: testwikis wikis to 1.40.0-wmf.23  refs T325586
[19:41:40] <stashbot>	 T325586: 1.40.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T325586
[19:42:31] <wikibugs>	 (03CR) 10Dzahn: ci: move lists of contint hosts to hieradata/common.yaml (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/850593 (owner: 10Dzahn)
[19:42:36] <dduvall>	 dancy: fyi i'm running `stage-train` again (but without the full image build), just because... idempotence
[19:43:02] <dancy>	 Sounds good.
[19:43:14] <logmsgbot>	 !log gehel@puppetmaster1001 conftool action : set/pooled=active; selector: name=elastic2069.cofdw.wmnet
[19:43:21] <logmsgbot>	 !log gehel@puppetmaster1001 conftool action : set/weight=10; selector: name=elastic2069.cofdw.wmnet
[19:45:02] <logmsgbot>	 !log gehel@puppetmaster1001 conftool action : set/weight=10; selector: name=elastic2069.cofdw.wmnet,service=elasticsearch-psi-ssl
[19:46:41] <logmsgbot>	 !log gehel@puppetmaster1001 conftool action : set/weight=10; selector: name=elastic2069.codfw.wmnet,service=elasticsearch-psi-ssl
[19:46:52] <logmsgbot>	 !log cdanis@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host aux-k8s-ctrl1002.eqiad.wmnet with OS bullseye
[19:46:53] <logmsgbot>	 !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=elastic2069.codfw.wmnet,service=elasticsearch-psi-ssl
[19:47:29] <jinxer-wm>	 (JobUnavailable) firing: (12) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:48:45] <wikibugs>	 (03CR) 10Volans: "post-merge comment" [cookbooks] - 10https://gerrit.wikimedia.org/r/889151 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey)
[19:50:51] <logmsgbot>	 !log dduvall@deploy1002 Finished scap: testwikis wikis to 1.40.0-wmf.23  refs T325586 (duration: 09m 14s)
[19:50:55] <stashbot>	 T325586: 1.40.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T325586
[19:51:38] <icinga-wm>	 RECOVERY - Disk space on deploy1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy1002&var-datasource=eqiad+prometheus/ops
[19:51:45] <wikibugs>	 10SRE, 10Diff-blog, 10Technical Blog, 10HTTPS: Send HSTS header on all Wordpress VIP-hosted domains - https://phabricator.wikimedia.org/T270034 (10Sbenchagra) @Dzahn Thanks for following up. I am just seeing this ticket. I will follow up and get back to you.
[19:52:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P44631 and previous config saved to /var/cache/conftool/dbconfig/20230214-195255-ladsgroup.json
[19:53:03] <logmsgbot>	 !log dduvall@deploy1002 Pruned MediaWiki: 1.40.0-wmf.21 (duration: 02m 10s)
[19:53:20] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] Remove aliases 'minnan' and 'zh-cfr' [dns] - 10https://gerrit.wikimedia.org/r/529829 (https://phabricator.wikimedia.org/T230382) (owner: 10Fomafix)
[20:03:20] <wikibugs>	 10SRE, 10ops-eqsin, 10ops-ulsfo, 10DC-Ops: eqsin & ulsfo: new R450s drawing far more power than R440s (power over contracted caps in both sites) - https://phabricator.wikimedia.org/T328957 (10wiki_willy) Tim's previous suggestion was from T315398.  However, that applies primarily to mediawiki servers and w...
[20:04:35] <jinxer-wm>	 (ConfdResourceFailed) firing: (64) confd resource _var_lib_gdnsd_discovery-apertium.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[20:04:43] <wikibugs>	 (03PS1) 10TrainBranchBot: group0 wikis to 1.40.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889217 (https://phabricator.wikimedia.org/T325586)
[20:04:45] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.40.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889217 (https://phabricator.wikimedia.org/T325586) (owner: 10TrainBranchBot)
[20:05:20] <wikibugs>	 (03Merged) 10jenkins-bot: group0 wikis to 1.40.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889217 (https://phabricator.wikimedia.org/T325586) (owner: 10TrainBranchBot)
[20:06:15] <wikibugs>	 (03PS3) 10BCornwall: Remove aliases 'minnan' and 'zh-cfr' [puppet] - 10https://gerrit.wikimedia.org/r/529830 (https://phabricator.wikimedia.org/T230382) (owner: 10Fomafix)
[20:07:51] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "valid ISO-639-3 code - https://iso639-3.sil.org/code/vro" [puppet] - 10https://gerrit.wikimedia.org/r/527915 (https://phabricator.wikimedia.org/T31186) (owner: 10Fomafix)
[20:08:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T328255)', diff saved to https://phabricator.wikimedia.org/P44632 and previous config saved to /var/cache/conftool/dbconfig/20230214-200801-ladsgroup.json
[20:08:03] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2123.codfw.wmnet with reason: Maintenance
[20:08:05] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[20:08:16] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2123.codfw.wmnet with reason: Maintenance
[20:08:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2123 (T328255)', diff saved to https://phabricator.wikimedia.org/P44633 and previous config saved to /var/cache/conftool/dbconfig/20230214-200822-ladsgroup.json
[20:09:29] <wikibugs>	 (03CR) 10Vgutierrez: service::catalog: add prometheus-https (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/863380 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron)
[20:09:35] <jinxer-wm>	 (ConfdResourceFailed) firing: (68) confd resource _var_lib_gdnsd_discovery-apertium.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[20:09:46] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.ganeti.reimage for host an-airflow1005.eqiad.wmnet with OS bullseye
[20:10:26] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39604/console" [puppet] - 10https://gerrit.wikimedia.org/r/529830 (https://phabricator.wikimedia.org/T230382) (owner: 10Fomafix)
[20:11:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T328255)', diff saved to https://phabricator.wikimedia.org/P44634 and previous config saved to /var/cache/conftool/dbconfig/20230214-201114-ladsgroup.json
[20:11:55] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39605/console" [puppet] - 10https://gerrit.wikimedia.org/r/529830 (https://phabricator.wikimedia.org/T230382) (owner: 10Fomafix)
[20:12:18] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "valid ISO-639-3 code - https://www.ethnologue.com/language/sgs" [dns] - 10https://gerrit.wikimedia.org/r/481539 (https://phabricator.wikimedia.org/T204830) (owner: 10Fomafix)
[20:12:34] <logmsgbot>	 !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.40.0-wmf.23  refs T325586
[20:12:38] <stashbot>	 T325586: 1.40.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T325586
[20:12:42] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] Add 'sgs' as alias for 'bat-smg' [dns] - 10https://gerrit.wikimedia.org/r/481539 (https://phabricator.wikimedia.org/T204830) (owner: 10Fomafix)
[20:12:46] <wikibugs>	 (03PS5) 10Dzahn: Add 'sgs' as alias for 'bat-smg' [dns] - 10https://gerrit.wikimedia.org/r/481539 (https://phabricator.wikimedia.org/T204830) (owner: 10Fomafix)
[20:14:24] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39608/console" [puppet] - 10https://gerrit.wikimedia.org/r/529830 (https://phabricator.wikimedia.org/T230382) (owner: 10Fomafix)
[20:14:35] <jinxer-wm>	 (ConfdResourceFailed) resolved: (68) confd resource _var_lib_gdnsd_discovery-apertium.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[20:15:01] <wikibugs>	 (03CR) 10Dzahn: "also bat-smg matches sgs per https://meta.wikimedia.org/wiki/Template:List_of_language_names_ordered_by_code" [dns] - 10https://gerrit.wikimedia.org/r/481539 (https://phabricator.wikimedia.org/T204830) (owner: 10Fomafix)
[20:17:13] <mutante>	 sukhe: authdns-update returns an error currently because dns4004 does not have /usr/sbin/gdnsd yet but is in the list of sync hosts
[20:17:19] <sukhe>	 yep
[20:17:23] <wikibugs>	 (03PS1) 10JHathaway: Purge unused kernels on boot [puppet] - 10https://gerrit.wikimedia.org/r/889219 (https://phabricator.wikimedia.org/T277011)
[20:17:24] <sukhe>	 trying to figure it out
[20:17:34] <mutante>	 is it ok if I keep using it though?
[20:17:34] <sukhe>	 worst case, I will just remove it from there
[20:17:35] <jinxer-wm>	 (ConfdResourceFailed) firing: (64) confd resource _var_lib_gdnsd_discovery-apertium.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[20:17:35] <sukhe>	 thanks
[20:17:49] <sukhe>	 yeah I think I am going to just remove it
[20:17:56] <mutante>	 alright
[20:18:07] <sukhe>	 mutante: are you merging a dns change?
[20:18:13] <sukhe>	 if you can wait for a bit, please do 
[20:18:15] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-MoritzMuehlenhoff: Automated removal of obsolete kernels - https://phabricator.wikimedia.org/T277011 (10jhathaway) a:03jhathaway
[20:18:20] <wikibugs>	 (03PS1) 10Sbailey: Change linter maintenance scripts to use existing config varaibles [extensions/Linter] (wmf/1.40.0-wmf.23) - 10https://gerrit.wikimedia.org/r/889220 (https://phabricator.wikimedia.org/T329342)
[20:18:25] <sukhe>	 I will either resolve it or, failing that, just remove dns4004
[20:18:34] <mutante>	 sukhe: I have like 10 of them :)
[20:18:38] <mutante>	 I will wait
[20:19:04] <sukhe>	 thanks
[20:19:04] <wikibugs>	 10SRE, 10Data-Persistence, 10cloud-services-team, 10serviceops, and 2 others: Wikitech issues for datacentre switchover (March 2023) - https://phabricator.wikimedia.org/T328768 (10Ladsgroup) Option 5 sounds good too, I think we can also reuse this solution in toolhub too (T329319)
[20:19:31] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Purge unused kernels on boot [puppet] - 10https://gerrit.wikimedia.org/r/889219 (https://phabricator.wikimedia.org/T277011) (owner: 10JHathaway)
[20:20:19] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dns4004.wikimedia.org with OS buster
[20:20:32] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns4004.wikimedia.org with OS buster executed with errors: - dns4004 (**FAIL**)   - Downtimed on...
[20:20:32] <sukhe>	 yeah this will take a while as it did last time, we have some weird Puppet dependency failure here
[20:20:36] <wikibugs>	 10SRE, 10Wikimedia-Apache-configuration, 10Wikimedia-Site-requests, 10Patch-For-Review: Temporarily redirect sgs.wikipedia.org to bat-smg.wikipedia.org until bat-smg->sgs move can be done - https://phabricator.wikimedia.org/T204830 (10Dzahn) sgs.wikpedia.org has been added to DNS now  sgs.wikipedia.org is...
[20:20:38] <sukhe>	 for now I am going to remove dns4004 from the list
[20:20:44] <wikibugs>	 (03CR) 10Sbailey: "I think this is correctly constructed" [extensions/Linter] (wmf/1.40.0-wmf.23) - 10https://gerrit.wikimedia.org/r/889220 (https://phabricator.wikimedia.org/T329342) (owner: 10Sbailey)
[20:21:01] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dns4004.wikimedia.org with reason: failure during reimaging
[20:21:05] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dns4004.wikimedia.org with reason: failure during reimaging
[20:21:24] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-airflow1005.eqiad.wmnet with reason: host reimage
[20:22:35] <jinxer-wm>	 (ConfdResourceFailed) firing: (68) confd resource _var_lib_gdnsd_discovery-apertium.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[20:22:46] <wikibugs>	 (03CR) 10Arlolra: [C: 03+1] Change linter maintenance scripts to use existing config varaibles [extensions/Linter] (wmf/1.40.0-wmf.23) - 10https://gerrit.wikimedia.org/r/889220 (https://phabricator.wikimedia.org/T329342) (owner: 10Sbailey)
[20:23:29] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39611/console" [puppet] - 10https://gerrit.wikimedia.org/r/529830 (https://phabricator.wikimedia.org/T230382) (owner: 10Fomafix)
[20:23:40] <sukhe>	 mutante: patch coming
[20:23:43] <wikibugs>	 (03PS2) 10JHathaway: Purge unused kernels on boot [puppet] - 10https://gerrit.wikimedia.org/r/889219 (https://phabricator.wikimedia.org/T277011)
[20:24:04] <mutante>	 sukhe: thank you, no rush. but let me know when it's ok to sync again
[20:24:04] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-airflow1005.eqiad.wmnet with reason: host reimage
[20:24:26] <sukhe>	 I think it should be OK
[20:24:29] <wikibugs>	 (03PS2) 10Dzahn: Add 'rup' as alias for 'roa-rup' [dns] - 10https://gerrit.wikimedia.org/r/527916 (https://phabricator.wikimedia.org/T17988) (owner: 10Fomafix)
[20:24:29] <sukhe>	 but let's make it clean
[20:25:17] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1 C: 03+1] "Sorry for the braindead PCC operations." [puppet] - 10https://gerrit.wikimedia.org/r/529830 (https://phabricator.wikimedia.org/T230382) (owner: 10Fomafix)
[20:25:31] <wikibugs>	 (03PS1) 10Ssingh: hiera: temporarily remove references to dns4004 [puppet] - 10https://gerrit.wikimedia.org/r/889221 (https://phabricator.wikimedia.org/T321309)
[20:25:50] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Purge unused kernels on boot [puppet] - 10https://gerrit.wikimedia.org/r/889219 (https://phabricator.wikimedia.org/T277011) (owner: 10JHathaway)
[20:26:20] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] hiera: temporarily remove references to dns4004 [puppet] - 10https://gerrit.wikimedia.org/r/889221 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh)
[20:26:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P44635 and previous config saved to /var/cache/conftool/dbconfig/20230214-202620-ladsgroup.json
[20:27:11] <wikibugs>	 (03CR) 10Dzahn: "needs deployment from serviceops team (might need apache restarts across cluster and syncing to k8s)" [puppet] - 10https://gerrit.wikimedia.org/r/529830 (https://phabricator.wikimedia.org/T230382) (owner: 10Fomafix)
[20:27:29] <jinxer-wm>	 (KubernetesCalicoDown) firing: (4) ml-staging-ctrl2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[20:27:47] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "valid per https://iso639-3.sil.org/code/rup | matches https://meta.wikimedia.org/wiki/Template:List_of_language_names_ordered_by_code" [dns] - 10https://gerrit.wikimedia.org/r/527916 (https://phabricator.wikimedia.org/T17988) (owner: 10Fomafix)
[20:29:48] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "valid code per https://iso639-3.sil.org/code/egl but not yet in https://meta.wikimedia.org/wiki/Template:List_of_language_names_ordered_by" [dns] - 10https://gerrit.wikimedia.org/r/527932 (https://phabricator.wikimedia.org/T36217) (owner: 10Fomafix)
[20:30:25] <wikibugs>	 (03PS4) 10BCornwall: Remove aliases 'minnan' and 'zh-cfr' [dns] - 10https://gerrit.wikimedia.org/r/529829 (https://phabricator.wikimedia.org/T230382) (owner: 10Fomafix)
[20:30:38] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "valid per https://iso639-3.sil.org/code/cbk but not yet in https://meta.wikimedia.org/wiki/Template:List_of_language_names_ordered_by_code" [dns] - 10https://gerrit.wikimedia.org/r/527911 (https://phabricator.wikimedia.org/T124657) (owner: 10Fomafix)
[20:31:36] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1 C: 03+2] Remove aliases 'minnan' and 'zh-cfr' [dns] - 10https://gerrit.wikimedia.org/r/529829 (https://phabricator.wikimedia.org/T230382) (owner: 10Fomafix)
[20:33:00] <wikibugs>	 (03CR) 10Herron: [V: 03+1] service::catalog: add prometheus-https (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/863380 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron)
[20:33:07] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "valid per https://iso639-3.sil.org/code/bho but bho not yet a comment with "bh" in https://meta.wikimedia.org/wiki/Template:List_of_langua" [dns] - 10https://gerrit.wikimedia.org/r/528781 (https://phabricator.wikimedia.org/T41968) (owner: 10Fomafix)
[20:34:28] <brett>	 sukhe: I already merged the dns changes and was running authdns-update; it failed on dns4004 with no such file for /usr/sbin/gdnsd
[20:34:52] <sukhe>	 yeah
[20:35:16] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "valid per https://iso639-3.sil.org/code/nrf" [dns] - 10https://gerrit.wikimedia.org/r/527908 (https://phabricator.wikimedia.org/T25216) (owner: 10Fomafix)
[20:35:19] <sukhe>	 should be OK, it's just good if others don't come across it. creates confusion :)
[20:35:24] <sukhe>	 removed it from the list
[20:35:31] <sukhe>	 mutante: please feel free to go ahead
[20:35:54] <mutante>	 sukhe: thanks:)
[20:36:02] <sukhe>	 thanks for waiting and sorry
[20:37:16] <icinga-wm>	 RECOVERY - Check systemd state on an-airflow1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:37:35] <jinxer-wm>	 (ConfdResourceFailed) resolved: (68) confd resource _var_lib_gdnsd_discovery-apertium.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[20:39:15] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1 C: 03+1] "@RLazarus and @joe, I'm told you should handle the merging of this. If that's true, would you be kind enough to do that? 😊" [puppet] - 10https://gerrit.wikimedia.org/r/529830 (https://phabricator.wikimedia.org/T230382) (owner: 10Fomafix)
[20:39:38] <mutante>	 no worries at all sukhe :))
[20:39:44] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] Add 'rup' as alias for 'roa-rup' [dns] - 10https://gerrit.wikimedia.org/r/527916 (https://phabricator.wikimedia.org/T17988) (owner: 10Fomafix)
[20:39:48] <wikibugs>	 (03PS3) 10Dzahn: Add 'rup' as alias for 'roa-rup' [dns] - 10https://gerrit.wikimedia.org/r/527916 (https://phabricator.wikimedia.org/T17988) (owner: 10Fomafix)
[20:41:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P44636 and previous config saved to /var/cache/conftool/dbconfig/20230214-204126-ladsgroup.json
[20:43:14] <wikibugs>	 10SRE, 10DNS, 10Traffic-Icebox, 10Wikimedia-Apache-configuration, and 2 others: Remove aliases `minnan` and `zh-cfr` for the Min Nan Wikipedia - https://phabricator.wikimedia.org/T230382 (10BCornwall) DNS merged and deployed! Now just waiting for deployment of the appserver stuff.
[20:44:38] <icinga-wm>	 RECOVERY - Recursive DNS on 2620:0:863:1:198:35:26:8 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[20:45:10] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] Add 'bho' as alias for 'bh' [dns] - 10https://gerrit.wikimedia.org/r/528781 (https://phabricator.wikimedia.org/T41968) (owner: 10Fomafix)
[20:45:12] <wikibugs>	 (03PS2) 10Dzahn: Add 'bho' as alias for 'bh' [dns] - 10https://gerrit.wikimedia.org/r/528781 (https://phabricator.wikimedia.org/T41968) (owner: 10Fomafix)
[20:46:14] <icinga-wm>	 RECOVERY - Recursive DNS on 198.35.26.8 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[20:46:56] <wikibugs>	 (03PS3) 10JHathaway: Purge unused kernels on boot [puppet] - 10https://gerrit.wikimedia.org/r/889219 (https://phabricator.wikimedia.org/T277011)
[20:49:45] <wikibugs>	 (03PS4) 10Dzahn: Add 'nrf' as alias for 'nrm' [dns] - 10https://gerrit.wikimedia.org/r/527908 (https://phabricator.wikimedia.org/T25216) (owner: 10Fomafix)
[20:50:21] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/889219 (https://phabricator.wikimedia.org/T277011) (owner: 10JHathaway)
[20:50:49] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] Add 'nrf' as alias for 'nrm' [dns] - 10https://gerrit.wikimedia.org/r/527908 (https://phabricator.wikimedia.org/T25216) (owner: 10Fomafix)
[20:53:07] <wikibugs>	 (03PS2) 10Dzahn: Add 'egl' as alias for 'eml' [dns] - 10https://gerrit.wikimedia.org/r/527932 (https://phabricator.wikimedia.org/T36217) (owner: 10Fomafix)
[20:54:07] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] Add 'egl' as alias for 'eml' [dns] - 10https://gerrit.wikimedia.org/r/527932 (https://phabricator.wikimedia.org/T36217) (owner: 10Fomafix)
[20:54:29] <wikibugs>	 (03PS1) 10Ahmon Dancy: mw-debug/values-traindev.yaml: Move mcrouter config into cache section [deployment-charts] - 10https://gerrit.wikimedia.org/r/889227
[20:56:03] <wikibugs>	 (03PS2) 10Dzahn: Add 'cbk' as alias for 'cbk-zam' [dns] - 10https://gerrit.wikimedia.org/r/527911 (https://phabricator.wikimedia.org/T124657) (owner: 10Fomafix)
[20:56:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T328255)', diff saved to https://phabricator.wikimedia.org/P44637 and previous config saved to /var/cache/conftool/dbconfig/20230214-205633-ladsgroup.json
[20:56:35] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2128.codfw.wmnet with reason: Maintenance
[20:56:37] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[20:56:49] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2128.codfw.wmnet with reason: Maintenance
[20:56:50] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance
[20:57:01] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] Add 'cbk' as alias for 'cbk-zam' [dns] - 10https://gerrit.wikimedia.org/r/527911 (https://phabricator.wikimedia.org/T124657) (owner: 10Fomafix)
[20:57:03] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance
[20:57:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2128 (T328255)', diff saved to https://phabricator.wikimedia.org/P44638 and previous config saved to /var/cache/conftool/dbconfig/20230214-205709-ladsgroup.json
[21:00:00] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+2] mw-debug/values-traindev.yaml: Move mcrouter config into cache section [deployment-charts] - 10https://gerrit.wikimedia.org/r/889227 (owner: 10Ahmon Dancy)
[21:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Time to snap out of that daydream and deploy UTC late backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230214T2100).
[21:00:05] <jouncebot>	 danisztls, nray, and sbailey: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:32] <nray>	 o/
[21:00:39] <sbailey>	 I am here
[21:01:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T328255)', diff saved to https://phabricator.wikimedia.org/P44639 and previous config saved to /var/cache/conftool/dbconfig/20230214-210102-ladsgroup.json
[21:05:18] <wikibugs>	 (03Merged) 10jenkins-bot: mw-debug/values-traindev.yaml: Move mcrouter config into cache section [deployment-charts] - 10https://gerrit.wikimedia.org/r/889227 (owner: 10Ahmon Dancy)
[21:05:53] <wikibugs>	 (03PS8) 10Herron: service::catalog: add prometheus-https [puppet] - 10https://gerrit.wikimedia.org/r/863380 (https://phabricator.wikimedia.org/T301944)
[21:07:29] <jinxer-wm>	 (JobUnavailable) firing: (11) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:07:59] <wikibugs>	 (03CR) 10Herron: [V: 03+1] "PCC SUCCESS (NOOP 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39612/console" [puppet] - 10https://gerrit.wikimedia.org/r/863380 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron)
[21:09:58] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on aux-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=aux-k8s-ctrl1001 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[21:11:21] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/889198 (owner: 10Herron)
[21:11:45] <wikibugs>	 (03CR) 10Herron: [V: 03+1] service::catalog: add prometheus-https (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/863380 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron)
[21:13:57] <nray>	 Is anyone available to do the UTC late backport window?
[21:16:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P44640 and previous config saved to /var/cache/conftool/dbconfig/20230214-211608-ladsgroup.json
[21:16:25] <RhinosF1>	 TheresNoTime: poke, you around?
[21:16:55] <dancy>	 I can help w/ deploys if nobody shows up.
[21:17:28] <RhinosF1>	 dancy: I'd assume 15 minutes is a long enough delay
[21:17:55] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install ms-be207[0-3] - https://phabricator.wikimedia.org/T326352 (10Jhancock.wm) a:03Jhancock.wm
[21:18:02] <dancy>	 Fair enough.
[21:19:25] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2440.codfw.wmnet with OS buster
[21:19:33] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2440.codfw.wmnet with OS buster
[21:20:11] <logmsgbot>	 !log dancy@deploy1002 Backport cancelled.
[21:20:23] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by dancy@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888764 (https://phabricator.wikimedia.org/T328692) (owner: 10Bernard Wang)
[21:20:50] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2441.codfw.wmnet with OS buster
[21:21:01] <wikibugs>	 (03Merged) 10jenkins-bot: Enable Page Tools for logged in users across all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888764 (https://phabricator.wikimedia.org/T328692) (owner: 10Bernard Wang)
[21:21:01] <dancy>	 danisztls are you around?
[21:21:01] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2441.codfw.wmnet with OS buster
[21:21:29] <logmsgbot>	 !log dancy@deploy1002 Started scap: Backport for [[gerrit:888764|Enable Page Tools for logged in users across all wikis (T328692)]]
[21:21:33] <stashbot>	 T328692: Enable page tools everywhere for logged in users - https://phabricator.wikimedia.org/T328692
[21:22:29] <jinxer-wm>	 (PuppetCertificateAboutToExpire) firing: (2) Puppet CA certificate labstore1006.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[21:23:20] <logmsgbot>	 !log dancy@deploy1002 dancy and bwang: Backport for [[gerrit:888764|Enable Page Tools for logged in users across all wikis (T328692)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet
[21:23:49] <dancy>	 nray: ^^ That one is yours.. Ready for testing on mwdebug
[21:24:01] <nray>	 Great, thanks @dancy . Looking now
[21:28:08] <nray>	 @dancy looks good! You can proceed
[21:28:45] <dancy>	 Proceeding
[21:30:44] <wikibugs>	 (03PS1) 10Andrea Denisse: quickdatacopy: Add option to show progress during transfer [puppet] - 10https://gerrit.wikimedia.org/r/889231 (https://phabricator.wikimedia.org/T318778)
[21:31:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P44642 and previous config saved to /var/cache/conftool/dbconfig/20230214-213115-ladsgroup.json
[21:32:19] <wikibugs>	 (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39614/console" [puppet] - 10https://gerrit.wikimedia.org/r/889231 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse)
[21:33:28] <wikibugs>	 (03PS1) 10Andrew Bogott: OpenStack nova: increase nova-api workers per node from 2 to 6 [puppet] - 10https://gerrit.wikimedia.org/r/889235 (https://phabricator.wikimedia.org/T328155)
[21:33:44] <icinga-wm>	 RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 91, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:33:50] <icinga-wm>	 RECOVERY - BFD status on cr3-ulsfo is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:34:00] <icinga-wm>	 RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:34:18] <wikibugs>	 (03PS2) 10Andrew Bogott: OpenStack nova: increase nova-api workers per node from 2 to 6 [puppet] - 10https://gerrit.wikimedia.org/r/889235 (https://phabricator.wikimedia.org/T328155)
[21:34:20] <logmsgbot>	 !log dancy@deploy1002 Finished scap: Backport for [[gerrit:888764|Enable Page Tools for logged in users across all wikis (T328692)]] (duration: 12m 50s)
[21:34:24] <stashbot>	 T328692: Enable page tools everywhere for logged in users - https://phabricator.wikimedia.org/T328692
[21:34:26] <dancy>	 nray: All done
[21:34:31] <wikibugs>	 (03PS3) 10Andrew Bogott: OpenStack nova: increase nova-api workers per node from 2 to 8 [puppet] - 10https://gerrit.wikimedia.org/r/889235 (https://phabricator.wikimedia.org/T328155)
[21:34:37] <dancy>	 sbailey: You're next
[21:34:40] <icinga-wm>	 RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 107, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:34:41] <nray>	 thanks so much @dancy ! 
[21:34:42] <sbailey>	 ok
[21:34:50] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:rack/setup/install wdqs20[13-22] - https://phabricator.wikimedia.org/T326689 (10Jhancock.wm) a:05Papaul→03Jhancock.wm
[21:34:58] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[21:34:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:35:13] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] OpenStack nova: increase nova-api workers per node from 2 to 8 [puppet] - 10https://gerrit.wikimedia.org/r/889235 (https://phabricator.wikimedia.org/T328155) (owner: 10Andrew Bogott)
[21:35:33] <logmsgbot>	 !log dancy@deploy1002 Backport cancelled.
[21:36:17] <dancy>	 sbailey: I notice there is an extension.json file in the commit.  That may imply that there's a certain sync order required. 
[21:36:46] <sbailey>	 BTW, 889220 updates maintenance code that is only run manually. No need to worry about extension.json order
[21:36:59] <dancy>	 ok great
[21:37:08] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by dancy@deploy1002 using scap backport" [extensions/Linter] (wmf/1.40.0-wmf.23) - 10https://gerrit.wikimedia.org/r/889220 (https://phabricator.wikimedia.org/T329342) (owner: 10Sbailey)
[21:37:17] <sbailey>	 This is job-executed code, so sync after merge; no way to test in a timely fashion
[21:39:08] <wikibugs>	 (03PS2) 10Andrea Denisse: quickdatacopy: Add option to show progress during transfer [puppet] - 10https://gerrit.wikimedia.org/r/889231 (https://phabricator.wikimedia.org/T329683)
[21:39:20] <wikibugs>	 (03Merged) 10jenkins-bot: Change linter maintenance scripts to use existing config varaibles [extensions/Linter] (wmf/1.40.0-wmf.23) - 10https://gerrit.wikimedia.org/r/889220 (https://phabricator.wikimedia.org/T329342) (owner: 10Sbailey)
[21:39:36] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2440.codfw.wmnet with reason: host reimage
[21:39:48] <logmsgbot>	 !log dancy@deploy1002 Started scap: Backport for [[gerrit:889220|Change linter maintenance scripts to use existing config varaibles (T329342)]]
[21:39:51] <stashbot>	 T329342: Enable maintenance Linter data migration scripts for namespace and tag and template - https://phabricator.wikimedia.org/T329342
[21:39:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:40:51] <wikibugs>	 (03CR) 10Andrea Denisse: "I added the rationale for this change in the task. 😊" [puppet] - 10https://gerrit.wikimedia.org/r/889231 (https://phabricator.wikimedia.org/T329683) (owner: 10Andrea Denisse)
[21:41:34] <logmsgbot>	 !log dancy@deploy1002 dancy and sbailey: Backport for [[gerrit:889220|Change linter maintenance scripts to use existing config varaibles (T329342)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet
[21:41:52] <dancy>	 continuing
[21:42:28] <sbailey>	 Thank you @Dancy :-)
[21:42:40] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2440.codfw.wmnet with reason: host reimage
[21:44:22] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2441.codfw.wmnet with reason: host reimage
[21:46:07] <wikibugs>	 (03PS1) 10Andrea Denisse: centrallog: Show transfer progress when using quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/889239 (https://phabricator.wikimedia.org/T318778)
[21:46:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T328255)', diff saved to https://phabricator.wikimedia.org/P44643 and previous config saved to /var/cache/conftool/dbconfig/20230214-214621-ladsgroup.json
[21:46:23] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2137.codfw.wmnet with reason: Maintenance
[21:46:26] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[21:46:37] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2137.codfw.wmnet with reason: Maintenance
[21:46:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3315 (T328255)', diff saved to https://phabricator.wikimedia.org/P44644 and previous config saved to /var/cache/conftool/dbconfig/20230214-214642-ladsgroup.json
[21:47:29] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2441.codfw.wmnet with reason: host reimage
[21:47:32] <logmsgbot>	 !log dancy@deploy1002 Finished scap: Backport for [[gerrit:889220|Change linter maintenance scripts to use existing config varaibles (T329342)]] (duration: 07m 44s)
[21:47:36] <stashbot>	 T329342: Enable maintenance Linter data migration scripts for namespace and tag and template - https://phabricator.wikimedia.org/T329342
[21:48:09] <sbailey>	 Great, thanks @dancy
[21:48:18] <dancy>	 No problem.  
[21:48:43] <wikibugs>	 (03CR) 10Andrea Denisse: "Hello, this task is possibly going to fail CI because a prerequisite for it is merging patch #889231." [puppet] - 10https://gerrit.wikimedia.org/r/889239 (https://phabricator.wikimedia.org/T318778) (owner: 10Andrea Denisse)
[21:48:52] <dancy>	 Last call for danisztls !
[21:53:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T328255)', diff saved to https://phabricator.wikimedia.org/P44645 and previous config saved to /var/cache/conftool/dbconfig/20230214-215351-ladsgroup.json
[21:53:55] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[21:57:29] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[21:57:33] <wikibugs>	 (03CR) 10JHathaway: "kindly review!" [puppet] - 10https://gerrit.wikimedia.org/r/889219 (https://phabricator.wikimedia.org/T277011) (owner: 10JHathaway)
[21:59:05] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[22:03:07] <wikibugs>	 10SRE, 10Wikimedia-Interwiki-links, 10Patch-For-Review: Please add ISO code interwikis for non-standard language codes - https://phabricator.wikimedia.org/T23915 (10Dzahn) 05Resolved→03Open Seems like it can't be both "resolved" and "needs patches" / "aliases in ticket still missing" at the same time.
[22:04:22] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[22:08:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P44646 and previous config saved to /var/cache/conftool/dbconfig/20230214-220857-ladsgroup.json
[22:10:37] <logmsgbot>	 !log eoghan@cumin2002 START - Cookbook sre.ganeti.reimage for host aphlict2001.codfw.wmnet with OS bullseye
[22:12:20] <wikibugs>	 (03PS1) 10Krinkle: mc: Add new $wgWANObjectCache setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889245 (https://phabricator.wikimedia.org/T329680)
[22:12:24] <wikibugs>	 (03PS1) 10Krinkle: mc: Remove unused $wgWANObjectCaches and $wgMainWANCache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889246 (https://phabricator.wikimedia.org/T329680)
[22:13:22] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] "valid code per https://iso639-3.sil.org/code/cmn" [dns] - 10https://gerrit.wikimedia.org/r/528831 (https://phabricator.wikimedia.org/T23915) (owner: 10Fomafix)
[22:13:26] <wikibugs>	 (03PS3) 10Dzahn: Add 'cmn' as alias for 'zh' [dns] - 10https://gerrit.wikimedia.org/r/528831 (https://phabricator.wikimedia.org/T23915) (owner: 10Fomafix)
[22:23:56] <wikibugs>	 10SRE, 10Wikimedia-Interwiki-links, 10Patch-For-Review: Please add ISO code interwikis for non-standard language codes - https://phabricator.wikimedia.org/T23915 (10Dzahn) nan = zh-min-nan - already there since https://gerrit.wikimedia.org/r/c/operations/dns/+/479890 vro = fiu-vro - added in https://gerrit.w...
[22:24:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P44647 and previous config saved to /var/cache/conftool/dbconfig/20230214-222403-ladsgroup.json
[22:24:11] <wikibugs>	 (03PS1) 10Brennen Bearnes: phabricator config: add gitlab_api_key [puppet] - 10https://gerrit.wikimedia.org/r/889248 (https://phabricator.wikimedia.org/T324149)
[22:24:33] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] phabricator config: add gitlab_api_key [puppet] - 10https://gerrit.wikimedia.org/r/889248 (https://phabricator.wikimedia.org/T324149) (owner: 10Brennen Bearnes)
[22:25:19] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[22:25:20] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2440.codfw.wmnet with OS buster
[22:25:23] <logmsgbot>	 !log eoghan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on aphlict2001.codfw.wmnet with reason: host reimage
[22:25:27] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2440.codfw.wmnet with OS buster completed: - mw2440 (**PASS**)   - Removed from Pupp...
[22:25:31] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[22:25:31] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2441.codfw.wmnet with OS buster
[22:25:38] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2441.codfw.wmnet with OS buster completed: - mw2441 (**PASS**)   - Removed from Pupp...
[22:25:43] <wikibugs>	 (03PS2) 10Dzahn: Add 'vro' as alias for 'fiu-vro' [dns] - 10https://gerrit.wikimedia.org/r/527914 (https://phabricator.wikimedia.org/T31186) (owner: 10Fomafix)
[22:25:58] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2442.codfw.wmnet with OS buster
[22:26:06] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2442.codfw.wmnet with OS buster
[22:26:11] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mw2443.codfw.wmnet with OS buster
[22:26:20] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mw2443.codfw.wmnet with OS buster
[22:26:29] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] "valid code per https://iso639-3.sil.org/code/vro | mapping also in https://meta.wikimedia.org/wiki/Template:List_of_language_names_ordered" [dns] - 10https://gerrit.wikimedia.org/r/527914 (https://phabricator.wikimedia.org/T31186) (owner: 10Fomafix)
[22:26:42] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on an-airflow1005.eqiad.wmnet with reason: new OS but some puppet stuff doesn't work yet
[22:26:45] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-airflow1005.eqiad.wmnet with reason: new OS but some puppet stuff doesn't work yet
[22:28:04] <wikibugs>	 10SRE, 10Wikimedia-Interwiki-links, 10Patch-For-Review: Please add ISO code interwikis for non-standard language codes - https://phabricator.wikimedia.org/T23915 (10Dzahn) - cmn.wikipedia.org has been added to DNS - vro.wikipedia.org has been added to DNS
[22:28:16] <wikibugs>	 10SRE, 10Wikimedia-Interwiki-links, 10Patch-For-Review: Please add ISO code interwikis for non-standard language codes - https://phabricator.wikimedia.org/T23915 (10Dzahn) 05Open→03Resolved
[22:28:31] <logmsgbot>	 !log eoghan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aphlict2001.codfw.wmnet with reason: host reimage
[22:35:40] <wikibugs>	 (03PS1) 10Dzahn: add language 'gsw' for Alemannic, Alsatian, Swiss German [dns] - 10https://gerrit.wikimedia.org/r/889250 (https://phabricator.wikimedia.org/T6793)
[22:39:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T328255)', diff saved to https://phabricator.wikimedia.org/P44648 and previous config saved to /var/cache/conftool/dbconfig/20230214-223910-ladsgroup.json
[22:39:12] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2157.codfw.wmnet with reason: Maintenance
[22:39:14] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[22:39:25] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2157.codfw.wmnet with reason: Maintenance
[22:39:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2157 (T328255)', diff saved to https://phabricator.wikimedia.org/P44649 and previous config saved to /var/cache/conftool/dbconfig/20230214-223931-ladsgroup.json
[22:39:50] <logmsgbot>	 !log eoghan@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host aphlict2001.codfw.wmnet with OS bullseye
[22:41:15] * mutante works on Bugzilla tickets (authored by bzimport)
[22:42:02] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] add language 'gsw' for Alemannic, Alsatian, Swiss German [dns] - 10https://gerrit.wikimedia.org/r/889250 (https://phabricator.wikimedia.org/T6793) (owner: 10Dzahn)
[22:42:34] <herzog>	 mutante: doh, good ol' bugzy
[22:43:46] <mutante>	 when the bug number is 4 digit
[22:44:45] <mutante>	 just recently we had some place where it was about a regex for "T" followed by numbers and it almost became "at least 5 digits".. well here are the counter examples
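Note on the regex point above: a minimal sketch of why an "at least 5 digits" rule would miss older task IDs such as T6793, which appears earlier in this log. The two patterns and the sample line are illustrative assumptions, not the actual regex from the discussion mutante refers to.

```python
import re

# Hypothetical patterns for illustration only; the real regex under
# discussion is not shown in the log.
ANY_DIGITS = re.compile(r"\bT\d+\b")      # "T" followed by one or more digits
FIVE_PLUS = re.compile(r"\bT\d{5,}\b")    # "T" followed by at least 5 digits

# Sample text built from two task IDs that appear in this log.
line = "add language 'gsw' (T6793), add language code 'syc' (T28725)"


def find_tasks(pattern, text):
    """Return every task ID in text that the given pattern matches."""
    return pattern.findall(text)


print(find_tasks(ANY_DIGITS, line))  # ['T6793', 'T28725']
print(find_tasks(FIVE_PLUS, line))   # ['T28725']
```

The stricter pattern silently drops the four-digit ID, which is the counterexample being pointed out.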
[22:45:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T328255)', diff saved to https://phabricator.wikimedia.org/P44650 and previous config saved to /var/cache/conftool/dbconfig/20230214-224519-ladsgroup.json
[22:45:21] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2442.codfw.wmnet with reason: host reimage
[22:45:23] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[22:46:22] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2443.codfw.wmnet with reason: host reimage
[22:47:31] <wikibugs>	 (03PS2) 10Brennen Bearnes: phabricator config: add gitlab_api_key [puppet] - 10https://gerrit.wikimedia.org/r/889248 (https://phabricator.wikimedia.org/T324149)
[22:48:48] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2442.codfw.wmnet with reason: host reimage
[22:50:34] <wikibugs>	 (03PS1) 10Dzahn: add language code 'syc' for Syriac [dns] - 10https://gerrit.wikimedia.org/r/889254 (https://phabricator.wikimedia.org/T28725)
[22:51:17] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2443.codfw.wmnet with reason: host reimage
[22:52:38] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] add language code 'syc' for Syriac [dns] - 10https://gerrit.wikimedia.org/r/889254 (https://phabricator.wikimedia.org/T28725) (owner: 10Dzahn)
[23:00:14] <wikibugs>	 (03PS4) 10Dzahn: Add 'egl' as alias for 'eml' [puppet] - 10https://gerrit.wikimedia.org/r/527933 (https://phabricator.wikimedia.org/T36217) (owner: 10Fomafix)
[23:00:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P44651 and previous config saved to /var/cache/conftool/dbconfig/20230214-230025-ladsgroup.json
[23:03:25] <wikibugs>	 (03CR) 10Dzahn: "do you need us to add an actual API key in private repo?" [puppet] - 10https://gerrit.wikimedia.org/r/889248 (https://phabricator.wikimedia.org/T324149) (owner: 10Brennen Bearnes)
[23:03:33] <logmsgbot>	 !log cwhite@deploy1002 Started deploy [releng/phatality@eaa4c16]: T314098
[23:03:37] <stashbot>	 T314098: Update Phatality to reference ECS fields - https://phabricator.wikimedia.org/T314098
[23:05:17] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[23:06:19] <jinxer-wm>	 (ProbeDown) firing: (2) Service kibana7:443 has failed probes (http_kibana7_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#kibana7:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:06:36] <wikibugs>	 (03CR) 10Dzahn: "removing the new host contint2002 I had added meanwhile.. amending . https://puppet-compiler.wmflabs.org/output/850593/39616/doc1002.eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/850593 (owner: 10Dzahn)
[23:06:49] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - kibana7_443: Servers logstash1032.eqiad.wmnet, logstash1031.eqiad.wmnet, logstash1025.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[23:06:58] <rzl>	 hi, looking
[23:07:05] <icinga-wm>	 PROBLEM - Check systemd state on logstash1025 is CRITICAL: CRITICAL - degraded: The following units failed: opensearch-dashboards.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:07:07] <icinga-wm>	 PROBLEM - Check systemd state on logstash2025 is CRITICAL: CRITICAL - degraded: The following units failed: opensearch-dashboards.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:07:08] <cwhite>	 o/
[23:07:09] <herron>	 🧐
[23:07:10] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[23:07:11] * cwhite on it
[23:07:15] <icinga-wm>	 PROBLEM - Check systemd state on logstash2024 is CRITICAL: CRITICAL - degraded: The following units failed: opensearch-dashboards.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:07:21] <rzl>	 cwhite: ahh was just about to ask, thanks :)
[23:07:30] <cwhite>	 deploy didn't go quite right
[23:07:34] <jinxer-wm>	 (ProbeDown) firing: (2) Service kibana7:443 has failed probes (http_kibana7_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kibana7:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:07:38] <rzl>	 need a hand with anything?
[23:07:45] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - kibana7_443: Servers logstash1032.eqiad.wmnet, logstash1031.eqiad.wmnet, logstash1025.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[23:07:49] <icinga-wm>	 PROBLEM - Check systemd state on logstash2031 is CRITICAL: CRITICAL - degraded: The following units failed: opensearch-dashboards.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:07:49] <icinga-wm>	 PROBLEM - Check systemd state on logstash1031 is CRITICAL: CRITICAL - degraded: The following units failed: opensearch-dashboards.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:08:02] * jhathaway here as well
[23:08:13] <icinga-wm>	 PROBLEM - Check systemd state on logstash2032 is CRITICAL: CRITICAL - degraded: The following units failed: opensearch-dashboards.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:08:25] <icinga-wm>	 PROBLEM - Check systemd state on logstash2030 is CRITICAL: CRITICAL - degraded: The following units failed: opensearch-dashboards.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:08:31] <icinga-wm>	 PROBLEM - Check systemd state on logstash1032 is CRITICAL: CRITICAL - degraded: The following units failed: opensearch-dashboards.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:08:33] <icinga-wm>	 PROBLEM - Check systemd state on logstash2023 is CRITICAL: CRITICAL - degraded: The following units failed: opensearch-dashboards.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:09:06] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[23:09:07] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2443.codfw.wmnet with OS buster
[23:09:14] <jinxer-wm>	 (ProbeDown) firing: (2) Service kibana7:443 has failed probes (http_kibana7_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kibana7:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:09:14] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2443.codfw.wmnet with OS buster completed: - mw2443 (**PASS**)   - Removed from Pupp...
[23:09:31] <icinga-wm>	 PROBLEM - Check systemd state on logstash1024 is CRITICAL: CRITICAL - degraded: The following units failed: opensearch-dashboards.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:10:03] <icinga-wm>	 RECOVERY - Check systemd state on logstash1032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:10:17] <icinga-wm>	 RECOVERY - Check systemd state on logstash2024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:10:37] <wikibugs>	 (03PS7) 10Dzahn: ci: move lists of contint hosts to hieradata/common.yaml [puppet] - 10https://gerrit.wikimedia.org/r/850593
[23:10:51] <icinga-wm>	 RECOVERY - Check systemd state on logstash2031 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:10:53] <icinga-wm>	 RECOVERY - Check systemd state on logstash1031 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:11:05] <icinga-wm>	 RECOVERY - Check systemd state on logstash1024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:11:17] <icinga-wm>	 RECOVERY - Check systemd state on logstash2032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:11:18] <jinxer-wm>	 (ProbeDown) resolved: (2) Service kibana7:443 has failed probes (http_kibana7_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#kibana7:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:11:22] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[23:11:23] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2442.codfw.wmnet with OS buster
[23:11:25] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:11:31] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mw2442.codfw.wmnet with OS buster completed: - mw2442 (**PASS**)   - Removed from Pupp...
[23:11:33] <icinga-wm>	 RECOVERY - Check systemd state on logstash2030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:11:41] <icinga-wm>	 RECOVERY - Check systemd state on logstash2023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:11:45] <icinga-wm>	 RECOVERY - Check systemd state on logstash1025 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:11:47] <icinga-wm>	 RECOVERY - Check systemd state on logstash2025 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:12:25] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:12:43] <jinxer-wm>	 (ProbeDown) resolved: (2) Service kibana7:443 has failed probes (http_kibana7_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kibana7:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:12:53] <wikibugs>	 (CR) Brennen Bearnes: phabricator config: add gitlab_api_key (1 comment) [puppet] - https://gerrit.wikimedia.org/r/889248 (https://phabricator.wikimedia.org/T324149) (owner: Brennen Bearnes)
[23:12:59] <wikibugs>	 (CR) Dzahn: [V: +1 C: +1] "https://puppet-compiler.wmflabs.org/output/850593/39617/" [puppet] - https://gerrit.wikimedia.org/r/850593 (owner: Dzahn)
[23:14:13] <logmsgbot>	 !log cwhite@deploy1002 Finished deploy [releng/phatality@eaa4c16]: T314098 (duration: 10m 40s)
[23:14:17] <stashbot>	 T314098: Update Phatality to reference ECS fields - https://phabricator.wikimedia.org/T314098
[23:14:24] <logmsgbot>	 !log cwhite@deploy1002 Started deploy [releng/phatality@eaa4c16]: T314098
[23:14:32] <logmsgbot>	 !log cwhite@deploy1002 Finished deploy [releng/phatality@eaa4c16]: T314098 (duration: 00m 07s)
[23:14:52] <wikibugs>	 (CR) Dzahn: phabricator config: add gitlab_api_key (1 comment) [puppet] - https://gerrit.wikimedia.org/r/889248 (https://phabricator.wikimedia.org/T324149) (owner: Brennen Bearnes)
[23:15:22] <wikibugs>	 (CR) Dzahn: "@Eoghan see the "notification servers" line in https://gitlab.wikimedia.org/repos/phabricator/deployment/-/merge_requests/2/diffs . that r" [puppet] - https://gerrit.wikimedia.org/r/889248 (https://phabricator.wikimedia.org/T324149) (owner: Brennen Bearnes)
[23:15:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P44652 and previous config saved to /var/cache/conftool/dbconfig/20230214-231531-ladsgroup.json
[23:16:31] <wikibugs>	 (CR) Dzahn: phabricator config: add gitlab_api_key (1 comment) [puppet] - https://gerrit.wikimedia.org/r/889248 (https://phabricator.wikimedia.org/T324149) (owner: Brennen Bearnes)
[23:19:36] <wikibugs>	 SRE, Traffic-Icebox, HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (SHust) @Dzahn, After I shared the issue I'm having with Brendan Campbell (who added himself as a subscriber) + a few other colleagues, he suggested that I ask you for help!  For Shopi...
[23:19:55] <wikibugs>	 (PS3) Brennen Bearnes: phabricator config: add gitlab_api_key [puppet] - https://gerrit.wikimedia.org/r/889248 (https://phabricator.wikimedia.org/T324149)
[23:22:21] <wikibugs>	 (CR) Brennen Bearnes: phabricator config: add gitlab_api_key (1 comment) [puppet] - https://gerrit.wikimedia.org/r/889248 (https://phabricator.wikimedia.org/T324149) (owner: Brennen Bearnes)
[23:29:26] <wikibugs>	 SRE, Traffic-Icebox, HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (Dzahn) Hi @SHust let me clarify, so shopify is saying that store.wikimedia.org must have a AAAA record?  The current status is that store.wikimedia.org is an alias for c.ssl.shopify.c...
[23:30:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T328255)', diff saved to https://phabricator.wikimedia.org/P44653 and previous config saved to /var/cache/conftool/dbconfig/20230214-233037-ladsgroup.json
[23:30:39] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2171.codfw.wmnet with reason: Maintenance
[23:30:42] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[23:30:53] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2171.codfw.wmnet with reason: Maintenance
[23:30:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3315 (T328255)', diff saved to https://phabricator.wikimedia.org/P44654 and previous config saved to /var/cache/conftool/dbconfig/20230214-233058-ladsgroup.json
[23:31:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[23:33:08] <wikibugs>	 (PS1) Bartosz Dziewoński: persistRevisionThreadItems: Avoid listing non-discussion pages [extensions/DiscussionTools] (wmf/1.40.0-wmf.22) - https://gerrit.wikimedia.org/r/889267 (https://phabricator.wikimedia.org/T329627)
[23:33:20] <wikibugs>	 (PS1) Bartosz Dziewoński: persistRevisionThreadItems: Avoid listing non-discussion pages [extensions/DiscussionTools] (wmf/1.40.0-wmf.23) - https://gerrit.wikimedia.org/r/889268 (https://phabricator.wikimedia.org/T329627)
[23:35:36] <danisztls>	 dancy: sry, I wasn't able today
[23:37:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T328255)', diff saved to https://phabricator.wikimedia.org/P44655 and previous config saved to /var/cache/conftool/dbconfig/20230214-233705-ladsgroup.json
[23:37:10] <stashbot>	 T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[23:37:24] <wikibugs>	 SRE, Traffic-Icebox, HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (SHust) @Dzahn, Here's a screenshot from my Shopify chat ( I hope I didn’t share anything I shouldn’t have):  {F36845020}  {F36845023}  {F36845022}
[23:38:09] <wikibugs>	 (PS1) Cwhite: dashboards: sudo set noninteractive flag [puppet] - https://gerrit.wikimedia.org/r/888740 (https://phabricator.wikimedia.org/T329688)
[23:41:45] <wikibugs>	 (PS1) Arlolra: [WIP] Enabled native gallery editing in Parsoid [mediawiki-config] - https://gerrit.wikimedia.org/r/889257 (https://phabricator.wikimedia.org/T329662)
[23:52:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P44656 and previous config saved to /var/cache/conftool/dbconfig/20230214-235211-ladsgroup.json
[23:58:54] <wikibugs>	 (PS1) Ladsgroup: [WIP] mwscript: Switch to use run.php [mediawiki-config] - https://gerrit.wikimedia.org/r/889259 (https://phabricator.wikimedia.org/T326800)
[23:59:08] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (Papaul)
[23:59:11] <icinga-wm>	 PROBLEM - Disk space on thanos-be2003 is CRITICAL: DISK CRITICAL - free space: / 1871 MB (3% inode=98%): /tmp 1871 MB (3% inode=98%): /var/tmp 1871 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2003&var-datasource=codfw+prometheus/ops