[00:23:55] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:26:15] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:42:21] <icinga-wm>	 RECOVERY - Check systemd state on apifeatureusage1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:43:25] <icinga-wm>	 RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:49:31] <icinga-wm>	 PROBLEM - Check systemd state on apifeatureusage1001 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_apifeatureusage_codfw.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:55:29] <icinga-wm>	 RECOVERY - Disk space on centrallog1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=centrallog1001&var-datasource=eqiad+prometheus/ops
[03:27:13] <jinxer-wm>	 (IcingaOverload) firing: Checks are taking long to execute on alert2001:9245  - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org
[03:37:13] <jinxer-wm>	 (IcingaOverload) resolved: Checks are taking long to execute on alert2001:9245  - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org
[04:47:53] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (bast6001), Fresh: 104 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[04:50:01] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:51:37] <icinga-wm>	 PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[05:55:20] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[05:55:21] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[05:55:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:55:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:56:33] <wikibugs>	 SRE, SRE-tools, DBA, Infrastructure-Foundations, and 2 others: Switchover m2 master (db1183 -> db1159) - https://phabricator.wikimedia.org/T300329 (Marostegui) Thanks everyone! I will get this scheduled for Thursday 3rd Feb at 9:00AM UTC
[05:56:43] <wikibugs>	 SRE, SRE-tools, DBA, Infrastructure-Foundations, and 2 others: Switchover m2 master (db1183 -> db1159) - https://phabricator.wikimedia.org/T300329 (Marostegui)
[05:58:01] <icinga-wm>	 RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.81 ms
[05:59:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1113 (s5,s6) T299479', diff saved to https://phabricator.wikimedia.org/P19570 and previous config saved to /var/cache/conftool/dbconfig/20220131-055947-marostegui.json
[05:59:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:59:52] <stashbot>	 T299479: Upgrade s6 to Bullseye - https://phabricator.wikimedia.org/T299479
[06:00:42] <wikibugs>	 (PS1) Marostegui: db1113: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/758271 (https://phabricator.wikimedia.org/T299479)
[06:02:27] <wikibugs>	 (CR) Marostegui: [C: +2] db1113: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/758271 (https://phabricator.wikimedia.org/T299479) (owner: Marostegui)
[06:03:20] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance
[06:03:22] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance
[06:03:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:03:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:03:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1123 (T298559)', diff saved to https://phabricator.wikimedia.org/P19571 and previous config saved to /var/cache/conftool/dbconfig/20220131-060326-marostegui.json
[06:03:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:03:31] <stashbot>	 T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559
[06:04:24] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1113.eqiad.wmnet with OS bullseye
[06:04:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:07:51] <wikibugs>	 (PS1) Marostegui: drop_ft_title_ft_namesapce_T297189.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/758280 (https://phabricator.wikimedia.org/T297189)
[06:11:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove logpager from s4 eqiad T263127', diff saved to https://phabricator.wikimedia.org/P19572 and previous config saved to /var/cache/conftool/dbconfig/20220131-061121-marostegui.json
[06:11:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:11:27] <stashbot>	 T263127: Remove groups from db configs - https://phabricator.wikimedia.org/T263127
[06:12:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123 (T298559)', diff saved to https://phabricator.wikimedia.org/P19573 and previous config saved to /var/cache/conftool/dbconfig/20220131-061219-marostegui.json
[06:12:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:12:23] <stashbot>	 T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559
[06:18:39] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[06:20:47] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[06:27:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123', diff saved to https://phabricator.wikimedia.org/P19574 and previous config saved to /var/cache/conftool/dbconfig/20220131-062723-marostegui.json
[06:27:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:42] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1113.eqiad.wmnet with OS bullseye
[06:29:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:56] <wikibugs>	 (PS1) Marostegui: switchover-tmpl.sh: Changed notes [software] - https://gerrit.wikimedia.org/r/758286
[06:34:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3315 (re)pooling @ 25%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19575 and previous config saved to /var/cache/conftool/dbconfig/20220131-063437-root.json
[06:34:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:34:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3316 (re)pooling @ 25%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19576 and previous config saved to /var/cache/conftool/dbconfig/20220131-063448-root.json
[06:34:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:38:59] <icinga-wm>	 RECOVERY - Disk space on prometheus1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus1003&var-datasource=eqiad+prometheus/ops
[06:42:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123', diff saved to https://phabricator.wikimedia.org/P19577 and previous config saved to /var/cache/conftool/dbconfig/20220131-064228-marostegui.json
[06:42:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:53] <wikibugs>	 (PS1) Marostegui: Revert "db1113: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/758024
[06:49:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3315 (re)pooling @ 50%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19578 and previous config saved to /var/cache/conftool/dbconfig/20220131-064941-root.json
[06:49:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:49:49] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db1113: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/758024 (owner: Marostegui)
[06:49:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3316 (re)pooling @ 50%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19579 and previous config saved to /var/cache/conftool/dbconfig/20220131-064952-root.json
[06:49:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:44] <wikibugs>	 (PS1) Marostegui: s5 codfw hosts: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/758288 (https://phabricator.wikimedia.org/T300473)
[06:52:57] <wikibugs>	 (CR) Marostegui: [C: +2] s5 codfw hosts: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/758288 (https://phabricator.wikimedia.org/T300473) (owner: Marostegui)
[06:54:27] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db2137.codfw.wmnet with OS bullseye
[06:54:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:55:55] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db2128.codfw.wmnet with OS bullseye
[06:55:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123 (T298559)', diff saved to https://phabricator.wikimedia.org/P19580 and previous config saved to /var/cache/conftool/dbconfig/20220131-065733-marostegui.json
[06:57:34] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[06:57:36] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[06:57:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:38] <stashbot>	 T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559
[06:57:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3315 (re)pooling @ 75%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19581 and previous config saved to /var/cache/conftool/dbconfig/20220131-070444-root.json
[07:04:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3316 (re)pooling @ 75%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19582 and previous config saved to /var/cache/conftool/dbconfig/20220131-070456-root.json
[07:04:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:35] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[07:05:36] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[07:05:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:56] <icinga-wm>	 PROBLEM - Disk space on elastic2035 is CRITICAL: DISK CRITICAL - free space: / 1087 MB (4% inode=94%): /tmp 1087 MB (4% inode=94%): /var/tmp 1087 MB (4% inode=94%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic2035&var-datasource=codfw+prometheus/ops
[07:13:39] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance
[07:13:41] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance
[07:13:42] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[07:13:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:46] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[07:13:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T298559)', diff saved to https://phabricator.wikimedia.org/P19583 and previous config saved to /var/cache/conftool/dbconfig/20220131-071350-marostegui.json
[07:13:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:56] <stashbot>	 T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559
[07:19:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3315 (re)pooling @ 100%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19584 and previous config saved to /var/cache/conftool/dbconfig/20220131-071948-root.json
[07:19:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3316 (re)pooling @ 100%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19585 and previous config saved to /var/cache/conftool/dbconfig/20220131-071959-root.json
[07:20:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T298559)', diff saved to https://phabricator.wikimedia.org/P19586 and previous config saved to /var/cache/conftool/dbconfig/20220131-072249-marostegui.json
[07:22:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:54] <stashbot>	 T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559
[07:28:08] <icinga-wm>	 RECOVERY - Disk space on elastic2035 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic2035&var-datasource=codfw+prometheus/ops
[07:29:22] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2137.codfw.wmnet with OS bullseye
[07:29:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:45] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db2113.codfw.wmnet with OS bullseye
[07:29:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:46] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2128.codfw.wmnet with OS bullseye
[07:32:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:03] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db2111.codfw.wmnet with OS bullseye
[07:33:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P19587 and previous config saved to /var/cache/conftool/dbconfig/20220131-073754-marostegui.json
[07:37:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:11] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db2075.codfw.wmnet with OS bullseye
[07:39:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:37] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on ganeti1010.eqiad.wmnet with reason: Remove from Ganeti cluster for reimage
[07:50:40] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on ganeti1010.eqiad.wmnet with reason: Remove from Ganeti cluster for reimage
[07:50:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:07] <wikibugs>	 SRE, ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (MoritzMuehlenhoff)
[07:52:18] <wikibugs>	 SRE, ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (MoritzMuehlenhoff) One more server is ready and downtimed; ganeti1010
[07:52:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P19588 and previous config saved to /var/cache/conftool/dbconfig/20220131-075258-marostegui.json
[07:53:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:22] <wikibugs>	 (CR) Ladsgroup: [C: +1] drop_ft_title_ft_namesapce_T297189.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/758280 (https://phabricator.wikimedia.org/T297189) (owner: Marostegui)
[08:00:04] <wikibugs>	 (PS1) Marostegui: Revert "s5 codfw hosts: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/758315
[08:01:00] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2113.codfw.wmnet with OS bullseye
[08:01:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:23] <wikibugs>	 (CR) Filippo Giunchedi: thanos::frontend: fix envoy configuration (1 comment) [puppet] - https://gerrit.wikimedia.org/r/757452 (https://phabricator.wikimedia.org/T300119) (owner: Giuseppe Lavagetto)
[08:04:35] <wikibugs>	 (CR) Filippo Giunchedi: "Untested but LGTM" [debs/prometheus-pdns-rec-exporter] - https://gerrit.wikimedia.org/r/758068 (https://phabricator.wikimedia.org/T300254) (owner: Majavah)
[08:04:52] <wikibugs>	 (CR) Marostegui: [V: +2 C: +2] drop_ft_title_ft_namesapce_T297189.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/758280 (https://phabricator.wikimedia.org/T297189) (owner: Marostegui)
[08:05:44] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "s5 codfw hosts: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/758315 (owner: Marostegui)
[08:06:02] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2111.codfw.wmnet with OS bullseye
[08:06:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:24] <wikibugs>	 (PS1) Marostegui: db2123: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/758419 (https://phabricator.wikimedia.org/T300473)
[08:08:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T298559)', diff saved to https://phabricator.wikimedia.org/P19589 and previous config saved to /var/cache/conftool/dbconfig/20220131-080803-marostegui.json
[08:08:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:08] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance
[08:08:08] <stashbot>	 T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559
[08:08:09] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance
[08:08:10] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 6 hosts with reason: Maintenance
[08:08:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:15] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 6 hosts with reason: Maintenance
[08:08:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:54] <wikibugs>	 (CR) Marostegui: [C: +2] db2123: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/758419 (https://phabricator.wikimedia.org/T300473) (owner: Marostegui)
[08:09:07] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db2123.codfw.wmnet with OS bullseye
[08:09:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:14] <wikibugs>	 (CR) Filippo Giunchedi: apifeatureusage: disable gc logging (1 comment) [puppet] - https://gerrit.wikimedia.org/r/757955 (https://phabricator.wikimedia.org/T297239) (owner: Cwhite)
[08:11:53] <wikibugs>	 (PS1) Marostegui: production.my.cnf: innodb_checksum_algorithm=full_crc32 [puppet] - https://gerrit.wikimedia.org/r/758420 (https://phabricator.wikimedia.org/T287244)
[08:12:22] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2075.codfw.wmnet with OS bullseye
[08:12:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:12] <wikibugs>	 (CR) Marostegui: [C: +2] production.my.cnf: innodb_checksum_algorithm=full_crc32 [puppet] - https://gerrit.wikimedia.org/r/758420 (https://phabricator.wikimedia.org/T287244) (owner: Marostegui)
[08:13:53] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] site: add Prometheus role to eqiad hardware [puppet] - https://gerrit.wikimedia.org/r/756604 (https://phabricator.wikimedia.org/T296199) (owner: Filippo Giunchedi)
[08:13:59] <wikibugs>	 (PS5) Filippo Giunchedi: site: add Prometheus role to eqiad hardware [puppet] - https://gerrit.wikimedia.org/r/756604 (https://phabricator.wikimedia.org/T296199)
[08:14:17] <wikibugs>	 (PS1) Gehel: Revert "cirrussearch: Reenable saneitizer" [puppet] - https://gerrit.wikimedia.org/r/758317
[08:14:53] <wikibugs>	 (CR) DCausse: [C: +1] Revert "cirrussearch: Reenable saneitizer" [puppet] - https://gerrit.wikimedia.org/r/758317 (owner: Gehel)
[08:16:48] <wikibugs>	 (CR) Gehel: [C: +2] Revert "cirrussearch: Reenable saneitizer" [puppet] - https://gerrit.wikimedia.org/r/758317 (owner: Gehel)
[08:19:29] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[08:21:05] <marostegui>	 !log Set  innodb_adaptive_hash_index=OFF on es2028, es2029, es2026 T268869
[08:21:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:10] <stashbot>	 T268869: Consider setting innodb_adaptive_hash_index=OFF by default - https://phabricator.wikimedia.org/T268869
[08:21:47] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet
[08:21:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:06] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus1006.eqiad.wmnet
[08:22:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:15] <marostegui>	 !log Set  innodb_adaptive_hash_index=OFF on es2020, es2024 T268869
[08:22:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:27] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus2006.codfw.wmnet
[08:23:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:28] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[08:25:30] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[08:25:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T298559)', diff saved to https://phabricator.wikimedia.org/P19590 and previous config saved to /var/cache/conftool/dbconfig/20220131-082534-marostegui.json
[08:25:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:39] <stashbot>	 T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559
[08:29:13] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet
[08:29:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:44] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1006.eqiad.wmnet
[08:29:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:01] <jinxer-wm>	 (CirrusSearchJVMGCOldPoolFlatlined) firing: (25) Elasticsearch instance elastic2026-production-search-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell  - https://alerts.wikimedia.org
[08:32:28] <jinxer-wm>	 (ThanosRuleHighRuleEvaluationFailures) firing: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org
[08:34:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T298559)', diff saved to https://phabricator.wikimedia.org/P19591 and previous config saved to /var/cache/conftool/dbconfig/20220131-083432-marostegui.json
[08:34:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:37] <stashbot>	 T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559
[08:34:45] <wikibugs>	 SRE, LDAP-Access-Requests, WMF-NDA-Requests: Clarify whether members of ldap/nda should be added to #WMF-NDA - https://phabricator.wikimedia.org/T299839 (Ladsgroup) p:Triage→Medium
[08:36:01] <jinxer-wm>	 (CirrusSearchJVMGCOldPoolFlatlined) firing: (70) Elasticsearch instance elastic2025-production-search-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell  - https://alerts.wikimedia.org
[08:36:53] <icinga-wm>	 PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:37:28] <jinxer-wm>	 (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org
[08:38:13] <godog>	 the thanos rule alert is me
[08:38:59] <wikibugs>	 SRE, DNS, Domains, Traffic, and 2 others: Project Unseen campaign URL redirect - https://phabricator.wikimedia.org/T300398 (Ladsgroup)
[08:41:01] <jinxer-wm>	 (CirrusSearchJVMGCOldPoolFlatlined) firing: (70) Elasticsearch instance elastic2025-production-search-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell  - https://alerts.wikimedia.org
[08:41:19] <wikibugs>	 SRE, Traffic: Serve redirect wikimediastatus.net --> www.wikimediastatus.net - https://phabricator.wikimedia.org/T300161 (Ladsgroup) p:Triage→Medium Feel free to change priority.
[08:41:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1096:3315 T297189', diff saved to https://phabricator.wikimedia.org/P19592 and previous config saved to /var/cache/conftool/dbconfig/20220131-084157-marostegui.json
[08:42:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:02] <stashbot>	 T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189
[08:43:27] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Mailing lists are not indexed by Google - https://phabricator.wikimedia.org/T299293 (Ladsgroup) p:Triage→Low
[08:43:47] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[08:44:53] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2123.codfw.wmnet with OS bullseye
[08:44:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:08] <Amir1>	 marostegui: can you change the clinic duty in the header please? :D
[08:45:16] <Amir1>	 s/header/topic
[08:46:01] <jinxer-wm>	 (CirrusSearchJVMGCOldPoolFlatlined) firing: (70) Elasticsearch instance elastic2025-production-search-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell  - https://alerts.wikimedia.org
[08:46:32] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: [DRAFT] Refactor Rakefile [deployment-charts] - 10https://gerrit.wikimedia.org/r/757977
[08:46:34] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Rakefile: switch to using the new check_charts task [deployment-charts] - 10https://gerrit.wikimedia.org/r/758423
[08:46:36] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: fixup refactor [deployment-charts] - 10https://gerrit.wikimedia.org/r/758424
[08:46:54] <wikibugs>	 (03Abandoned) 10Giuseppe Lavagetto: fixup refactor [deployment-charts] - 10https://gerrit.wikimedia.org/r/758424 (owner: 10Giuseppe Lavagetto)
[08:47:04] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Rakefile: switch to using the new check_charts task [deployment-charts] - 10https://gerrit.wikimedia.org/r/758423 (owner: 10Giuseppe Lavagetto)
[08:47:30] <wikibugs>	 10SRE, 10DNS, 10Domains, 10Traffic, and 2 others: Project Unseen campaign URL redirect - https://phabricator.wikimedia.org/T300398 (10Ladsgroup) p:05Triage→03High Given the time-pressure.
[08:49:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 25%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19593 and previous config saved to /var/cache/conftool/dbconfig/20220131-084936-root.json
[08:49:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P19594 and previous config saved to /var/cache/conftool/dbconfig/20220131-084937-marostegui.json
[08:49:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:01] <jinxer-wm>	 (CirrusSearchJVMGCOldPoolFlatlined) firing: (69) Elasticsearch instance elastic2025-production-search-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell  - https://alerts.wikimedia.org
[08:51:37] <godog>	 the many cirrus alerts are also me, fixing
[08:51:51] <dcausse>	 thanks!
[08:52:37] <godog>	 dcausse: sure np! thanks for bearing with me while I'm figuring out T296199 as I go :)
[08:52:38] <stashbot>	 T296199: Prometheus hardware refresh (+ Bullseye upgrade) - https://phabricator.wikimedia.org/T296199
[08:52:58] <dcausse>	 np! :)
[08:56:01] <jinxer-wm>	 (CirrusSearchJVMGCOldPoolFlatlined) resolved: (34) Elasticsearch instance elastic2025-production-search-omega-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell  - https://alerts.wikimedia.org
[08:57:24] <wikibugs>	 (03PS1) 10Marostegui: drop_ft_title_ft_namespace_T297189.py: Rename file [software/schema-changes] - 10https://gerrit.wikimedia.org/r/758425
[08:57:35] <wikibugs>	 (03PS1) 10JMeybohm: echoserver: Allow to easily mount external secret as certificate source [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/758426
[08:58:33] <wikibugs>	 (03PS2) 10Marostegui: drop_ft_title_ft_namespace_T297189.py: Rename file [software/schema-changes] - 10https://gerrit.wikimedia.org/r/758425
[08:58:45] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+2 C: 03+2] echoserver: Allow to easily mount external secret as certificate source [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/758426 (owner: 10JMeybohm)
[08:58:54] <wikibugs>	 (03CR) 10Marostegui: [V: 03+2 C: 03+2] drop_ft_title_ft_namespace_T297189.py: Rename file [software/schema-changes] - 10https://gerrit.wikimedia.org/r/758425 (owner: 10Marostegui)
[08:59:44] <wikibugs>	 (03PS1) 10Marostegui: Revert "db2123: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/758319
[09:00:27] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] Revert "db2123: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/758319 (owner: 10Marostegui)
[09:03:08] <jayme>	 !log published image docker-registry.discovery.wmnet/echoserver:1.10.0-2
[09:03:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 50%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19595 and previous config saved to /var/cache/conftool/dbconfig/20220131-090439-root.json
[09:04:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P19596 and previous config saved to /var/cache/conftool/dbconfig/20220131-090441-marostegui.json
[09:04:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:41] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: [DRAFT] Refactor Rakefile [deployment-charts] - 10https://gerrit.wikimedia.org/r/757977
[09:07:43] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: Rakefile: switch to using the new check_charts task [deployment-charts] - 10https://gerrit.wikimedia.org/r/758423
[09:11:28] <wikibugs>	 10SRE, 10SRE-Access-Requests: NRodriguez uses the same SSH key(s) in WMCS and production - https://phabricator.wikimedia.org/T299336 (10Ladsgroup) Hi @NRodriguez, Can you access production now? So we can close this ticket. Thanks!
[09:12:07] <icinga-wm>	 PROBLEM - Check for large files in client bucket on mwmaint1002 is CRITICAL: WARNING: large files in client bucket https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file
[09:18:19] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Setup a new build host based on bullseye - https://phabricator.wikimedia.org/T298463 (10Ladsgroup) p:05Triage→03Medium Feel free to change the priority.
[09:19:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 75%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19597 and previous config saved to /var/cache/conftool/dbconfig/20220131-091943-root.json
[09:19:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T298559)', diff saved to https://phabricator.wikimedia.org/P19598 and previous config saved to /var/cache/conftool/dbconfig/20220131-091952-marostegui.json
[09:19:54] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance
[09:19:55] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance
[09:19:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:57] <stashbot>	 T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559
[09:19:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T298559)', diff saved to https://phabricator.wikimedia.org/P19599 and previous config saved to /var/cache/conftool/dbconfig/20220131-091959-marostegui.json
[09:20:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:33] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1007.eqiad.wmnet with OS buster
[09:21:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:40] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1007.eqiad.wmnet with OS buster
[09:24:12] <wikibugs>	 (03PS1) 10Marostegui: add_linter_namespace_T300402.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/758429 (https://phabricator.wikimedia.org/T300402)
[09:25:27] <wikibugs>	 (03PS2) 10Marostegui: add_linter_namespace_T300402.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/758429 (https://phabricator.wikimedia.org/T300402)
[09:25:51] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] add_linter_namespace_T300402.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/758429 (https://phabricator.wikimedia.org/T300402) (owner: 10Marostegui)
[09:26:01] <marostegui>	 that was fast Amir1!
[09:26:06] <wikibugs>	 (03CR) 10Marostegui: [V: 03+2 C: 03+2] add_linter_namespace_T300402.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/758429 (https://phabricator.wikimedia.org/T300402) (owner: 10Marostegui)
[09:26:27] <Amir1>	 marostegui: I saw it in the IRC :P
[09:26:35] <marostegui>	 haha
[09:29:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T298559)', diff saved to https://phabricator.wikimedia.org/P19600 and previous config saved to /var/cache/conftool/dbconfig/20220131-092917-marostegui.json
[09:29:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:22] <stashbot>	 T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559
[09:34:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 100%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19601 and previous config saved to /var/cache/conftool/dbconfig/20220131-093450-root.json
[09:34:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
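The db1096:3315 repool logged above moved through 25%, 50%, 75%, and 100% at roughly 15-minute intervals (08:49:36, 09:04:40, 09:19:46, 09:34:50). The following is a minimal sketch of that kind of stepped schedule, assuming evenly spaced steps. The name `repool_schedule` and its parameters are hypothetical, and it only computes timestamps; it does not reflect how dbctl or the actual repooling tool is implemented.

```python
from datetime import datetime, timedelta


def repool_schedule(start, steps=(25, 50, 75, 100), interval=timedelta(minutes=15)):
    """Return (timestamp, percentage) pairs for a stepped repool.

    Hypothetical helper: the step values and the 15-minute spacing are
    assumptions read off the log timestamps, not documented tool defaults.
    """
    return [(start + i * interval, pct) for i, pct in enumerate(steps)]


# Start time taken from the first "(re)pooling @ 25%" entry in the log.
start = datetime(2022, 1, 31, 8, 49, 36)
for when, pct in repool_schedule(start):
    print(f"{when:%H:%M:%S} pooling @ {pct}%")
```

The computed times differ from the logged ones by a few seconds per step, which suggests the real tool waits between steps rather than following a fixed clock.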
[09:35:39] <dcausse>	 !log restart blazegraph on wdqs1012 (jvm stuck for 6hours)
[09:35:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:54] <wikibugs>	 (03CR) 10Muehlenhoff: "Looks good, two nits inline. It's quite likely that 4.4.2 received additional metrics (https://doc.powerdns.com/recursor/metrics.html), bu" [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/758068 (https://phabricator.wikimedia.org/T300254) (owner: 10Majavah)
[09:43:33] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Clarify whether members of ldap/nda should be added to #WMF-NDA - https://phabricator.wikimedia.org/T299839 (10MoritzMuehlenhoff) >>! In T299839#7649515, @Volans wrote: > Adding #WMF-NDA-Requests, @mark, @faidon and @MoritzMuehlenhoff for SRE, #security and...
[09:44:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P19602 and previous config saved to /var/cache/conftool/dbconfig/20220131-094422-marostegui.json
[09:44:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:45:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1007.eqiad.wmnet with OS buster
[09:45:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:45:25] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1007.eqiad.wmnet with OS buster completed: - ganeti1007 (**PASS**)...
[09:45:43] <wikibugs>	 (03CR) 10Kormat: [C: 03+1] switchover-tmpl.sh: Changed notes [software] - 10https://gerrit.wikimedia.org/r/758286 (owner: 10Marostegui)
[09:46:13] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] switchover-tmpl.sh: Changed notes [software] - 10https://gerrit.wikimedia.org/r/758286 (owner: 10Marostegui)
[09:46:29] <wikibugs>	 (03PS1) 10Vgutierrez: site: Reimage cp5011 as cache::text_envoy [puppet] - 10https://gerrit.wikimedia.org/r/758430 (https://phabricator.wikimedia.org/T271421)
[09:46:41] <wikibugs>	 (03Merged) 10jenkins-bot: switchover-tmpl.sh: Changed notes [software] - 10https://gerrit.wikimedia.org/r/758286 (owner: 10Marostegui)
[09:47:14] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Clarify whether members of ldap/nda should be added to #WMF-NDA - https://phabricator.wikimedia.org/T299839 (10Majavah) >>! In T299839#7649515, @Volans wrote: > One thing to clarify is how we can ensure that the off-boarding process from this group will be p...
[09:48:17] <mmandere>	 !log cp3061: upgrade varnish to 6.0.10-1wm1 T300264
[09:48:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:43] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.postgresql.postgres-init
[09:48:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:33] <wikibugs>	 (03PS5) 10Majavah: Bare minimum port to Python 3 to support Debian Bullseye [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/758068 (https://phabricator.wikimedia.org/T300254)
[09:53:18] <wikibugs>	 (03CR) 10Majavah: Bare minimum port to Python 3 to support Debian Bullseye (033 comments) [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/758068 (https://phabricator.wikimedia.org/T300254) (owner: 10Majavah)
[09:53:24] <vgutierrez>	 !log depool cp5011 to be reimaged as cache::text_envoy - T271421
[09:53:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:29] <stashbot>	 T271421: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421
[09:53:58] <mmandere>	 !log cp3062: upgrade varnish to 6.0.10-1wm1 T300264
[09:54:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:24] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] site: Reimage cp5011 as cache::text_envoy [puppet] - 10https://gerrit.wikimedia.org/r/758430 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez)
[09:54:42] <wikibugs>	 10SRE, 10MW-on-K8s, 10serviceops: Evaluate nginx-controller as an Ingress - https://phabricator.wikimedia.org/T286197 (10Joe) 05Open→03Resolved p:05Triage→03Medium a:03Joe We went with istio-ingress after some evaluation which wasn't reported here.
[09:54:48] <wikibugs>	 10SRE, 10MW-on-K8s, 10serviceops: Create a gateway in kubernetes for the execution of our "lambdas" - https://phabricator.wikimedia.org/T261277 (10Joe)
[09:54:57] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good. I can update the deb on apt.wikimedia.org in the afternoon." [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/758068 (https://phabricator.wikimedia.org/T300254) (owner: 10Majavah)
[09:55:40] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp5011.eqsin.wmnet with OS buster
[09:55:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:55:49] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp5011.eqsin.wmnet with OS buster
[09:59:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P19603 and previous config saved to /var/cache/conftool/dbconfig/20220131-095926-marostegui.json
[09:59:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:01:42] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1015.eqiad.wmnet with OS buster
[10:01:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:01:49] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1015.eqiad.wmnet with OS buster
[10:13:34] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "heads up, in recent PDNS versions, the recursor has its own REST API that includes a /metrics endpoint that generates prometheus metrics, " [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/758068 (https://phabricator.wikimedia.org/T300254) (owner: 10Majavah)
[10:13:53] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Clarify whether members of ldap/nda should be added to #WMF-NDA - https://phabricator.wikimedia.org/T299839 (10MoritzMuehlenhoff) >>! In T299839#7662797, @Majavah wrote: >>>! In T299839#7649515, @Volans wrote: >> One thing to clarify is how we can ensure tha...
[10:14:00] <wikibugs>	 10SRE, 10Traffic: Remove old and unused libvarnishapi - https://phabricator.wikimedia.org/T300247 (10MMandere) 05Open→03In progress
[10:14:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T298559)', diff saved to https://phabricator.wikimedia.org/P19604 and previous config saved to /var/cache/conftool/dbconfig/20220131-101431-marostegui.json
[10:14:33] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[10:14:34] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[10:14:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:14:37] <stashbot>	 T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559
[10:14:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:14:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T298559)', diff saved to https://phabricator.wikimedia.org/P19605 and previous config saved to /var/cache/conftool/dbconfig/20220131-101439-marostegui.json
[10:14:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:14:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:45] <mmandere>	 !log cp[6001-6016].drmrs.wmnet remove unused libvarnishapi1 T300247
[10:15:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:50] <stashbot>	 T300247: Remove old and unused libvarnishapi - https://phabricator.wikimedia.org/T300247
[10:18:20] <awight>	 I'm confused, the dewiki API is reporting an enormous maxlag of ~1.5h, but I don't see any corresponding lag in the grafana board for db replication.
[10:20:14] <wikibugs>	 (03PS5) 10Filippo Giunchedi: sslcert: additional search paths for certificates [puppet] - 10https://gerrit.wikimedia.org/r/716370 (https://phabricator.wikimedia.org/T290261)
[10:21:17] <Lucas_WMDE>	 huh. "Waiting for 10.64.0.163:3315: 5758.188369 seconds lagged."
[10:21:34] <moritzm>	 !log installing apache/apache-modsecurity2 security updates
[10:21:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:27] <Lucas_WMDE>	 awight: orchestrator.w.o also shows db1096 as lagged, “not replicating”
[10:23:20] <Lucas_WMDE>	 cc marostegui ^ you applied a schema change to that host earlier today
[10:23:28] <awight>	 Lucas_WMDE: I see that server was depooled, reimaged, and repooled just a few hours ago so maybe this is expected.
[10:23:38] <wikibugs>	 (03CR) 10ZPapierski: [C: 03+1] sre.wdqs.data-reload: few fixes and cleanups [cookbooks] - 10https://gerrit.wikimedia.org/r/753426 (owner: 10DCausse)
[10:23:39] <marostegui>	 Lucas_WMDE: checking, thanks
[10:23:42] <awight>	 The most surprising detail is that I can't see the issue from grafana, though.
[10:24:02] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "I would need to better understand the network flows involved (now and then) before merging this patch. Could you please elaborate more her" [puppet] - 10https://gerrit.wikimedia.org/r/758091 (owner: 10Majavah)
[10:24:20] <wikibugs>	 (03PS1) 10Volans: junos: catch another timeout exception on close [software/homer] - 10https://gerrit.wikimedia.org/r/758438
[10:24:25] <marostegui>	 Lucas_WMDE: that one didn't have the schema change applied, but for some reason it wasn't replicating
[10:24:28] <marostegui>	 I have started it now
[10:24:32] <marostegui>	 Going to depool it
[10:24:37] <Lucas_WMDE>	 ok thanks
[10:24:50] <marostegui>	 Lucas_WMDE: thanks a lot for the heads up
[10:24:55] <Lucas_WMDE>	 np
[10:24:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1096:3315', diff saved to https://phabricator.wikimedia.org/P19606 and previous config saved to /var/cache/conftool/dbconfig/20220131-102457-marostegui.json
[10:25:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:25:20] <Lucas_WMDE>	 https://de.wikipedia.org/w/api.php?action=query&maxlag=-1 looks better now
[10:25:41] <marostegui>	 Lucas_WMDE: will repool it back once it has caught up
[10:25:45] <awight>	 That was fast!
[10:26:21] <marostegui>	 it is now back, so repooling slowly again!
[10:26:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 25%: repooling', diff saved to https://phabricator.wikimedia.org/P19607 and previous config saved to /var/cache/conftool/dbconfig/20220131-102636-root.json
[10:26:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:52] <Lucas_WMDE>	 wow, that was fast indeed
[10:27:13] <wikibugs>	 10SRE, 10User-Ladsgroup: Adding aquhen@wikimedia.org to the analytics-alerts mailing list - https://phabricator.wikimedia.org/T298778 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup I added this but for the future please do it yourself. cc. @odimitrijevic
[10:27:15] <marostegui>	 Glad to see people using orchestrator :)
[10:27:32] <Lucas_WMDE>	 I have it bookmarked as “aka new dbtree” so I remember the name ;)
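The lag message quoted at 10:21:17 ("Waiting for 10.64.0.163:3315: 5758.188369 seconds lagged.") can be decoded to check it against the "~1.5h" figure mentioned at 10:18:20. This is a minimal sketch, assuming the message always follows the format of that single quoted example. The function name `parse_lag_message` and the regular expression are illustrative, not part of MediaWiki or any Wikimedia tool.

```python
import re

# Pattern inferred from the one message quoted in this log (an assumption).
LAG_RE = re.compile(
    r"Waiting for (?P<host>[^:\s]+)(?::(?P<port>\d+))?: "
    r"(?P<seconds>[\d.]+) seconds lagged"
)


def parse_lag_message(message):
    """Extract host, port, and lag in seconds; return None if no match."""
    match = LAG_RE.search(message)
    if match is None:
        return None
    port = match.group("port")
    return {
        "host": match.group("host"),
        "port": int(port) if port else None,
        "seconds": float(match.group("seconds")),
    }


info = parse_lag_message("Waiting for 10.64.0.163:3315: 5758.188369 seconds lagged.")
hours = info["seconds"] / 3600
print(f"{info['host']}:{info['port']} lagged {hours:.1f} h")
```

The result is about 1.6 hours on port 3315, which matches both the reported maxlag and the db1096:3315 instance that was depooled shortly afterwards.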
[10:27:36] <wikibugs>	 10SRE, 10User-Ladsgroup: Add user nmaphophe@wikimedia.org to the analytics-alerts mail alias - https://phabricator.wikimedia.org/T298770 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup T298778#7662871
[10:27:48] <mmandere>	 !log cp[5006,5012].eqsin.wmnet remove unused libvarnishapi1 T300247
[10:27:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:53] <stashbot>	 T300247: Remove old and unused libvarnishapi - https://phabricator.wikimedia.org/T300247
[10:28:14] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] junos: catch another timeout exception on close [software/homer] - 10https://gerrit.wikimedia.org/r/758438 (owner: 10Volans)
[10:29:46] <wikibugs>	 (03CR) 10Volans: [C: 03+2] junos: catch another timeout exception on close [software/homer] - 10https://gerrit.wikimedia.org/r/758438 (owner: 10Volans)
[10:31:07] <mmandere>	 !log cp[4021,4025-4026,4032-4034,4036].ulsfo.wmnet remove unused libvarnishapi1 T300247
[10:31:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:34] <wikibugs>	 (03Merged) 10jenkins-bot: junos: catch another timeout exception on close [software/homer] - 10https://gerrit.wikimedia.org/r/758438 (owner: 10Volans)
[10:33:26] <mmandere>	 !log cp[3052,3064-3065].esams.wmnet  remove unused libvarnishapi1 T300247
[10:33:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:31] <stashbot>	 T300247: Remove old and unused libvarnishapi - https://phabricator.wikimedia.org/T300247
[10:33:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T298559)', diff saved to https://phabricator.wikimedia.org/P19608 and previous config saved to /var/cache/conftool/dbconfig/20220131-103350-marostegui.json
[10:33:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:55] <stashbot>	 T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559
[10:36:04] <mmandere>	 !log cp[2041-2042] remove unused libvarnishapi1 T300247
[10:36:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:46] <awight>	 Lucas_WMDE: would be nice if it included the decoder ring to map wiki -> partition
[10:37:54] <mmandere>	 !log cp[1087,1089-1090] remove unused libvarnishapi1 T300247
[10:37:55] <awight>	 I hope that I don't have the permissions to cause any actual change by dragging and dropping replicas
[10:37:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:41:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 50%: repooling', diff saved to https://phabricator.wikimedia.org/P19609 and previous config saved to /var/cache/conftool/dbconfig/20220131-104140-root.json
[10:41:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:02] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1015.eqiad.wmnet with OS buster
[10:42:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:08] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1015.eqiad.wmnet with OS buster completed: - ganeti1015 (**PASS**)...
[10:43:04] <wikibugs>	 10ops-eqsin: cr3-eqsin:xe-0/1/1 interface errors - https://phabricator.wikimedia.org/T300485 (10ayounsi) p:05Triage→03Medium
[10:48:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P19610 and previous config saved to /var/cache/conftool/dbconfig/20220131-104855-marostegui.json
[10:48:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 75%: repooling', diff saved to https://phabricator.wikimedia.org/P19611 and previous config saved to /var/cache/conftool/dbconfig/20220131-105643-root.json
[10:56:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:57:09] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5011.eqsin.wmnet with OS buster
[10:57:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:57:17] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp5011.eqsin.wmnet with OS buster completed: - cp5011 (**WARN*...
[10:58:13] <vgutierrez>	 !log pool cp5011 running envoy as TLS terminator - T271421
[10:58:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:58:17] <stashbot>	 T271421: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421
[10:59:21] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10Vgutierrez)
[11:04:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P19612 and previous config saved to /var/cache/conftool/dbconfig/20220131-110400-marostegui.json
[11:04:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:26] <wikibugs>	 (03CR) 10Majavah: openstack encapi: Drop special treatment for puppetmasters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/758091 (owner: 10Majavah)
[11:07:54] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/M
[11:07:54] <icinga-wm>	 g/restbase
[11:08:14] <wikibugs>	 (03PS6) 10Filippo Giunchedi: sslcert: additional search paths for certificates [puppet] - 10https://gerrit.wikimedia.org/r/716370 (https://phabricator.wikimedia.org/T290261)
[11:08:23] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 15 days, 0:00:00 on restbase2009.codfw.wmnet with reason: not in restbase cluster, used for testing
[11:08:24] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 15 days, 0:00:00 on restbase2009.codfw.wmnet with reason: not in restbase cluster, used for testing
[11:08:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:08:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:09:34] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:11:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 100%: repooling', diff saved to https://phabricator.wikimedia.org/P19613 and previous config saved to /var/cache/conftool/dbconfig/20220131-111147-root.json
[11:11:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:02] <taavi>	 I'm deploying a beta only config change
[11:12:32] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] beta: READ_NEW for CentralAuth hidden level migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758038 (https://phabricator.wikimedia.org/T289068) (owner: 10Majavah)
[11:13:09] <wikibugs>	 (03Merged) 10jenkins-bot: beta: READ_NEW for CentralAuth hidden level migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758038 (https://phabricator.wikimedia.org/T289068) (owner: 10Majavah)
[11:15:49] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: openstack encapi: Drop special treatment for puppetmasters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/758091 (owner: 10Majavah)
[11:16:32] <wikibugs>	 (03CR) 10Majavah: openstack encapi: Drop special treatment for puppetmasters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/758091 (owner: 10Majavah)
[11:19:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T298559)', diff saved to https://phabricator.wikimedia.org/P19614 and previous config saved to /var/cache/conftool/dbconfig/20220131-111904-marostegui.json
[11:19:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:19:10] <stashbot>	 T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559
[11:19:35] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1024.eqiad.wmnet
[11:19:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:20:38] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[11:20:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:21:05] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1025.eqiad.wmnet with OS buster
[11:21:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:21:58] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[11:21:59] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[11:22:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:22:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:22:49] <wikibugs>	 (PS1) Majavah: prod: READ_NEW for CentralAuth hidden level migration [mediawiki-config] - https://gerrit.wikimedia.org/r/758443 (https://phabricator.wikimedia.org/T289068)
[11:23:07] <wikibugs>	 (CR) Majavah: "this worked fine in beta" [mediawiki-config] - https://gerrit.wikimedia.org/r/758443 (https://phabricator.wikimedia.org/T289068) (owner: Majavah)
[11:23:25] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[11:23:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:27] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[11:28:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:40] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] openstack encapi: Drop special treatment for puppetmasters (1 comment) [puppet] - https://gerrit.wikimedia.org/r/758091 (owner: Majavah)
[11:32:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[11:32:46] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[11:32:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:37:04] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[11:37:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:37:19] <wikibugs>	 (PS1) Majavah: P:openstack::puppetmaster: fix passing non-existent variables [puppet] - https://gerrit.wikimedia.org/r/758444
[11:39:02] <wikibugs>	 (PS1) 4nn1l2: azwikiquote: Add autopatrolled user group [mediawiki-config] - https://gerrit.wikimedia.org/r/758445 (https://phabricator.wikimedia.org/T300435)
[11:39:18] <wikibugs>	 (CR) jerkins-bot: [V: -1] P:openstack::puppetmaster: fix passing non-existent variables [puppet] - https://gerrit.wikimedia.org/r/758444 (owner: Majavah)
[11:39:53] <wikibugs>	 (PS2) Majavah: P:openstack::puppetmaster: fix passing non-existent variables [puppet] - https://gerrit.wikimedia.org/r/758444
[11:42:15] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] P:openstack::puppetmaster: fix passing non-existent variables [puppet] - https://gerrit.wikimedia.org/r/758444 (owner: Majavah)
[11:44:20] <wikibugs>	 (PS1) Ladsgroup: Add unseen.wikimedia.org DNS record [dns] - https://gerrit.wikimedia.org/r/758446 (https://phabricator.wikimedia.org/T300398)
[11:48:30] <wikibugs>	 (PS1) Ladsgroup: Add unseen.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/758447 (https://phabricator.wikimedia.org/T300398)
[11:49:33] <wikibugs>	 (CR) Ayounsi: "1 comment then LGTM!" [puppet] - https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) (owner: Jbond)
[11:50:41] <wikibugs>	 (PS1) Ladsgroup: redirects: Fix url shortener documentation [puppet] - https://gerrit.wikimedia.org/r/758448
[11:52:18] <icinga-wm>	 RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:57:46] <wikibugs>	 (CR) Ladsgroup: [V: +2 C: +2] redirects: Fix url shortener documentation [puppet] - https://gerrit.wikimedia.org/r/758448 (owner: Ladsgroup)
[11:58:20] <wikibugs>	 ops-ulsfo, Traffic: SMART errors on cp4031 - https://phabricator.wikimedia.org/T300493 (Vgutierrez)
[11:58:23] <wikibugs>	 (PS1) 4nn1l2: commonswiki: Add four domains to the wgCopyUploadsDomains allowlist [mediawiki-config] - https://gerrit.wikimedia.org/r/758449 (https://phabricator.wikimedia.org/T300375)
[11:58:57] <wikibugs>	 (PS4) Ladsgroup: Beta: maintenance: skip mediawiki::state function [puppet] - https://gerrit.wikimedia.org/r/462019 (https://phabricator.wikimedia.org/T125976) (owner: Thcipriani)
[11:59:49] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[11:59:51] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[11:59:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:59:53] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[11:59:55] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[11:59:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:59:57] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[11:59:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:59:59] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[12:00:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:00:01] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1147.eqiad.wmnet with reason: Maintenance
[12:00:03] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1147.eqiad.wmnet with reason: Maintenance
[12:00:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:00:05] <jouncebot>	 Amir1, Lucas_WMDE, awight, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220131T1200).
[12:00:06] <jouncebot>	 nn1l2: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[12:00:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:00:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T298559)', diff saved to https://phabricator.wikimedia.org/P19615 and previous config saved to /var/cache/conftool/dbconfig/20220131-120007-marostegui.json
[12:00:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:00:10] <nn1l2>	 hi
[12:00:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:00:13] <Lucas_WMDE>	 o/
[12:00:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:00:17] <stashbot>	 T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559
[12:00:36] <Lucas_WMDE>	 I can deploy today
[12:01:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T298559)', diff saved to https://phabricator.wikimedia.org/P19616 and previous config saved to /var/cache/conftool/dbconfig/20220131-120113-marostegui.json
[12:01:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:01:37] <wikibugs>	 (CR) Vgutierrez: [C: +1] Add unseen.wikimedia.org DNS record [dns] - https://gerrit.wikimedia.org/r/758446 (https://phabricator.wikimedia.org/T300398) (owner: Ladsgroup)
[12:01:58] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1025.eqiad.wmnet with OS buster
[12:02:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:02:15] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1025.eqiad.wmnet
[12:02:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:02:32] <wikibugs>	 (CR) Vgutierrez: [C: +1] "redirects under canonical domains are still handled by mediawiki, so that's why it's the place to put it :)" [puppet] - https://gerrit.wikimedia.org/r/758447 (https://phabricator.wikimedia.org/T300398) (owner: Ladsgroup)
[12:02:38] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1026.eqiad.wmnet with OS buster
[12:02:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:02:55] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +2] azwikiquote: Add autopatrolled user group [mediawiki-config] - https://gerrit.wikimedia.org/r/758445 (https://phabricator.wikimedia.org/T300435) (owner: 4nn1l2)
[12:03:42] <wikibugs>	 (Merged) jenkins-bot: azwikiquote: Add autopatrolled user group [mediawiki-config] - https://gerrit.wikimedia.org/r/758445 (https://phabricator.wikimedia.org/T300435) (owner: 4nn1l2)
[12:04:46] <wikibugs>	 (CR) Ladsgroup: "Confirming it's noop in production https://puppet-compiler.wmflabs.org/pcc-worker1002/33504/" [puppet] - https://gerrit.wikimedia.org/r/462019 (https://phabricator.wikimedia.org/T125976) (owner: Thcipriani)
[12:04:50] <wikibugs>	 (CR) Ladsgroup: [V: +2 C: +2] Beta: maintenance: skip mediawiki::state function [puppet] - https://gerrit.wikimedia.org/r/462019 (https://phabricator.wikimedia.org/T125976) (owner: Thcipriani)
[12:05:04] <Lucas_WMDE>	 nn1l2: the azwikiquote change is on mwdebug1001, please test it
[12:05:08] <wikibugs>	 (PS4) Ladsgroup: Beta: maintenance: no openldap management [puppet] - https://gerrit.wikimedia.org/r/462020 (https://phabricator.wikimedia.org/T125976) (owner: Thcipriani)
[12:05:10] <nn1l2>	 ok
[12:05:28] <taavi>	 Amir1: you're going to love the various other hacks currently only applied to deployment-puppetmaster04 /var/lib/git/operations/puppet
[12:05:32] <wikibugs>	 (PS5) Arturo Borrero Gonzalez: toolforge: automated-tests: add basic python webservice grid test [puppet] - https://gerrit.wikimedia.org/r/757697
[12:06:06] <Amir1>	 taavi: I know :( I hope this removes some of the hacks, let me know if I can clean up more
[12:06:47] <nn1l2>	 LGTM
[12:06:52] <Lucas_WMDE>	 ok
[12:07:27] <wikibugs>	 (CR) Ladsgroup: [V: +2 C: +2] "Noop in production https://puppet-compiler.wmflabs.org/pcc-worker1003/33505/" [puppet] - https://gerrit.wikimedia.org/r/462020 (https://phabricator.wikimedia.org/T125976) (owner: Thcipriani)
[12:07:28] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[12:07:30] <taavi>	 Amir1: this one is currently my favourite I think https://phabricator.wikimedia.org/P19617
[12:07:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:08:19] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:758445|azwikiquote: Add autopatrolled user group (T300435)]] (duration: 00m 50s)
[12:08:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:08:23] <stashbot>	 T300435:  Add autopatrolled user group to az.wikiquote - https://phabricator.wikimedia.org/T300435
[12:08:43] <Amir1>	 taavi: 😭
[12:08:58] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[12:09:00] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[12:09:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:09:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:09:14] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[12:09:16] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[12:09:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:09:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:09:30] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[12:09:31] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[12:09:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:09:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:09:46] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[12:09:48] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[12:09:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:09:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:09:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T298558)', diff saved to https://phabricator.wikimedia.org/P19618 and previous config saved to /var/cache/conftool/dbconfig/20220131-120952-marostegui.json
[12:09:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:09:57] <stashbot>	 T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558
[12:10:02] <wikibugs>	 (PS2) Lucas Werkmeister (WMDE): commonswiki: Add four domains to the wgCopyUploadsDomains allowlist [mediawiki-config] - https://gerrit.wikimedia.org/r/758449 (https://phabricator.wikimedia.org/T300375) (owner: 4nn1l2)
[12:11:44] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +2] commonswiki: Add four domains to the wgCopyUploadsDomains allowlist [mediawiki-config] - https://gerrit.wikimedia.org/r/758449 (https://phabricator.wikimedia.org/T300375) (owner: 4nn1l2)
[12:11:46] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[12:11:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[12:11:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:11:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:12:28] <wikibugs>	 (Merged) jenkins-bot: commonswiki: Add four domains to the wgCopyUploadsDomains allowlist [mediawiki-config] - https://gerrit.wikimedia.org/r/758449 (https://phabricator.wikimedia.org/T300375) (owner: 4nn1l2)
[12:12:58] <wikibugs>	 SRE, SRE-Access-Requests, Research: Access to analytics-privatedata-users for Research intern AniketArs - https://phabricator.wikimedia.org/T299919 (Miriam) >>! In T299919#7660039, @jhathaway wrote: > @Miriam I assume you mean 2022-06-30 😉, though with covid still with us, who knows what year it is!...
[12:13:10] <Lucas_WMDE>	 nn1l2: the four new domains are also on mwdebug1001 now, please test
[12:13:26] <nn1l2>	 give me some time please
[12:13:33] <Lucas_WMDE>	 sure
[12:15:52] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[12:15:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:16:07] <nn1l2>	 All successful :) 1) https://commons.wikimedia.org/wiki/File:Cbdg_44379-r.jpg   2) https://commons.wikimedia.org/wiki/File:Voornesduin2.jpg  3) https://commons.wikimedia.org/wiki/File:3d176240-8407-4d1b-992d-ad5f00fc2bcb.jpg   4) https://commons.wikimedia.org/wiki/File:D37D8C6587C6F637EBE60B6A151D19A51EBDA6F8E24BC4C8FF9F59E1DF2B661E.jpg
[12:16:13] <nn1l2>	 Good to go
[12:16:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P19619 and previous config saved to /var/cache/conftool/dbconfig/20220131-121618-marostegui.json
[12:16:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:16:56] <wikibugs>	 SRE, SRE-tools, DBA, Infrastructure-Foundations, and 2 others: Switchover m2 master (db1183 -> db1159) - https://phabricator.wikimedia.org/T300329 (Marostegui)
[12:17:16] <nn1l2>	 Lucas_WMDE: GTG
[12:17:20] <Lucas_WMDE>	 ok
[12:18:34] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:758449|commonswiki: Add four domains to the wgCopyUploadsDomains allowlist (T300375, T300360, T300359, T300357)]] (duration: 00m 50s)
[12:18:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:18:42] <stashbot>	 T300357: Add www.nmr-pics.nl to the wgCopyUploadsDomains allowlist of Wikimedia Commons - https://phabricator.wikimedia.org/T300357
[12:18:42] <stashbot>	 T300360: Add arter.dk to the wgCopyUploadsDomains allowlist of Wikimedia Commons - https://phabricator.wikimedia.org/T300360
[12:18:43] <stashbot>	 T300375: Add researcharchive.calacademy.org to the wgCopyUploadsDomains allowlist of Wikimedia Commons - https://phabricator.wikimedia.org/T300375
[12:18:43] <stashbot>	 T300359: Add files.plutof.ut.ee to the wgCopyUploadsDomains allowlist of Wikimedia Commons - https://phabricator.wikimedia.org/T300359
[12:20:01] <wikibugs>	 (PS3) Ayounsi: Move sandbox filter to Capirca [homer/public] - https://gerrit.wikimedia.org/r/748080 (https://phabricator.wikimedia.org/T273865)
[12:20:03] <wikibugs>	 (PS3) Ayounsi: Move core routers loopback filter to Capirca [homer/public] - https://gerrit.wikimedia.org/r/748098 (https://phabricator.wikimedia.org/T273865)
[12:20:05] <wikibugs>	 (PS3) Ayounsi: Move core routers border-in filter to Capirca [homer/public] - https://gerrit.wikimedia.org/r/748111 (https://phabricator.wikimedia.org/T273865)
[12:20:41] <Lucas_WMDE>	 !log UTC morning backport window done
[12:20:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:21:00] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[12:21:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:21:46] <nn1l2>	 hi taavi, do you have a min for a quick consultation?
[12:22:14] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[12:22:15] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[12:22:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:22:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:23:29] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[12:23:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:25:29] <wikibugs>	 (PS14) Majavah: etcd: Use cfssl for peer-to-peer communication [puppet] - https://gerrit.wikimedia.org/r/674077
[12:25:41] <taavi>	 nn1l2: very quick, I need to leave in like 5 minutes
[12:25:53] <nn1l2>	 thanks
[12:25:56] <nn1l2>	 see https://phabricator.wikimedia.org/rOMWC6dcc2c6d8db872b931e0eac4fe4e2569fc4e11d0
[12:26:00] <wikibugs>	 (PS6) Arturo Borrero Gonzalez: toolforge: automated-tests: add basic python webservice grid test [puppet] - https://gerrit.wikimedia.org/r/757697
[12:26:05] <wikibugs>	 (CR) jerkins-bot: [V: -1] etcd: Use cfssl for peer-to-peer communication [puppet] - https://gerrit.wikimedia.org/r/674077 (owner: Majavah)
[12:26:16] <nn1l2>	 patrolmarks is already implied by patrol
[12:26:48] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] toolforge: automated-tests: add basic python webservice grid test [puppet] - https://gerrit.wikimedia.org/r/757697 (owner: Arturo Borrero Gonzalez)
[12:26:48] <taavi>	 yes?
[12:27:05] <nn1l2>	 is it worth it if I clean up the InitialiseSettings.php file and remove redundant permissions?
[12:28:03] <taavi>	 what do other similar groups do?
[12:28:16] <wikibugs>	 (PS15) Majavah: etcd: Use cfssl for peer-to-peer communication [puppet] - https://gerrit.wikimedia.org/r/674077
[12:28:26] <nn1l2>	 I don't understand you
[12:29:00] <nn1l2>	 anybody who has the patrol flag does not need patrolmarks
[12:29:14] <taavi>	 yeah, I guess it can be cleaned up
[12:29:35] <taavi>	 I was wondering if other 'patroller' groups on other wikis also grant 'patrolmarks', but I guess that is a no
[12:29:46] <nn1l2>	 Thanks, I will upload a patch for the next window :)
[12:31:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P19620 and previous config saved to /var/cache/conftool/dbconfig/20220131-123123-marostegui.json
[12:31:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:41:36] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1026.eqiad.wmnet with OS buster
[12:41:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:46:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T298559)', diff saved to https://phabricator.wikimedia.org/P19621 and previous config saved to /var/cache/conftool/dbconfig/20220131-124627-marostegui.json
[12:46:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:46:33] <stashbot>	 T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559
[12:46:34] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2110.codfw.wmnet with reason: Maintenance
[12:46:36] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2110.codfw.wmnet with reason: Maintenance
[12:46:36] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1026.eqiad.wmnet
[12:46:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:46:37] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 12 hosts with reason: Maintenance
[12:46:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:46:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:46:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:46:46] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 12 hosts with reason: Maintenance
[12:46:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:46:50] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1142.eqiad.wmnet with reason: Maintenance
[12:46:51] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1142.eqiad.wmnet with reason: Maintenance
[12:46:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:46:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:46:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1142 (T298559)', diff saved to https://phabricator.wikimedia.org/P19622 and previous config saved to /var/cache/conftool/dbconfig/20220131-124655-marostegui.json
[12:46:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:47:03] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1027.eqiad.wmnet with OS buster
[12:47:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:48:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T298559)', diff saved to https://phabricator.wikimedia.org/P19623 and previous config saved to /var/cache/conftool/dbconfig/20220131-124801-marostegui.json
[12:48:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:00:10] <wikibugs>	 (CR) Michael Große: [C: +1] "checked that it is based on the most recent commit" [deployment-charts] - https://gerrit.wikimedia.org/r/757659 (https://phabricator.wikimedia.org/T296202) (owner: Lucas Werkmeister (WMDE))
[13:01:08] <wikibugs>	 (CR) Ladsgroup: [V: +2 C: +2] Add unseen.wikimedia.org DNS record [dns] - https://gerrit.wikimedia.org/r/758446 (https://phabricator.wikimedia.org/T300398) (owner: Ladsgroup)
[13:02:45] <wikibugs>	 (PS1) Marostegui: db1154: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/758465 (https://phabricator.wikimedia.org/T300473)
[13:03:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P19624 and previous config saved to /var/cache/conftool/dbconfig/20220131-130306-marostegui.json
[13:03:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:03:27] <wikibugs>	 (CR) Marostegui: [C: +2] db1154: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/758465 (https://phabricator.wikimedia.org/T300473) (owner: Marostegui)
[13:06:07] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1154.eqiad.wmnet with OS bullseye
[13:06:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:06:25] <wikibugs>	 (PS2) Ladsgroup: Add unseen.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/758447 (https://phabricator.wikimedia.org/T300398)
[13:06:41] <wikibugs>	 (CR) Ladsgroup: [V: +2 C: +2] Add unseen.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/758447 (https://phabricator.wikimedia.org/T300398) (owner: Ladsgroup)
[13:08:07] <wikibugs>	 SRE, DNS, Domains, Traffic, and 4 others: Project Unseen campaign URL redirect - https://phabricator.wikimedia.org/T300398 (Ladsgroup) Open→Resolved a:Ladsgroup It'll take a bit but it will be there. Ping me if it doesn't work.
[13:10:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T298558)', diff saved to https://phabricator.wikimedia.org/P19625 and previous config saved to /var/cache/conftool/dbconfig/20220131-131011-marostegui.json
[13:10:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1007.eqiad.wmnet
[13:10:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:10:16] <stashbot>	 T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558
[13:10:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:10:20] <wikibugs>	 SRE, Beta-Cluster-Infrastructure, Wikidata, serviceops, and 2 others: Run mediawiki::maintenance scripts in Beta Cluster - https://phabricator.wikimedia.org/T125976 (Ladsgroup)
[13:11:03] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s8 on clouddb1020 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1154.eqiad.wmnet:3318 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1154.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:11:53] <RhinosF1>	 marostegui: are you aware of ^
[13:14:07] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1007.eqiad.wmnet
[13:14:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:07] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1007.eqiad.wmnet to ganeti01.svc.eqiad.wmnet
[13:15:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:16:10] <wikibugs>	 SRE, Infrastructure-Foundations, Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (MoritzMuehlenhoff)
[13:16:41] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s5 on clouddb1020 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1154.eqiad.wmnet:3315 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1154.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:16:51] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti1007.eqiad.wmnet to ganeti01.svc.eqiad.wmnet
[13:16:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:18:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P19626 and previous config saved to /var/cache/conftool/dbconfig/20220131-131811-marostegui.json
[13:18:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:19:15] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s5 on clouddb1016 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1154.eqiad.wmnet:3315 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1154.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:19:21] <marostegui>	 ^
[13:19:27] <marostegui>	 me, silencing
[13:20:53] <wikibugs>	 SRE, Python3-Porting: git-fat needs to be ported to Python 3 - https://phabricator.wikimedia.org/T279509 (Ladsgroup) >>! In T279509#6979904, @MoritzMuehlenhoff wrote: > git-fat is the only package requiring Python 2 in a base bullseye setup at this point.  Is there a way to migrate to git-lfs instead?
[13:22:13] <wikibugs>	 (PS1) JMeybohm: Add hostname-override and cluster-cidr to kube-proxy [puppet] - https://gerrit.wikimedia.org/r/758466 (https://phabricator.wikimedia.org/T300500)
[13:22:34] <wikibugs>	 (PS2) JMeybohm: Add hostname-override and cluster-cidr to kube-proxy [puppet] - https://gerrit.wikimedia.org/r/758466 (https://phabricator.wikimedia.org/T300500)
[13:23:45] <wikibugs>	 (PS7) Filippo Giunchedi: sslcert: additional search paths for certificates [puppet] - https://gerrit.wikimedia.org/r/716370 (https://phabricator.wikimedia.org/T290261)
[13:25:07] <wikibugs>	 (CR) Filippo Giunchedi: "Tested validate_cmd in Pontoon and works as expected" [puppet] - https://gerrit.wikimedia.org/r/716370 (https://phabricator.wikimedia.org/T290261) (owner: Filippo Giunchedi)
[13:25:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P19627 and previous config saved to /var/cache/conftool/dbconfig/20220131-132516-marostegui.json
[13:25:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:29:42] <wikibugs>	 (PS3) JMeybohm: Add hostname-override and cluster-cidr to kube-proxy [puppet] - https://gerrit.wikimedia.org/r/758466 (https://phabricator.wikimedia.org/T300500)
[13:31:15] <wikibugs>	 (PS4) JMeybohm: Add hostname-override and cluster-cidr to kube-proxy [puppet] - https://gerrit.wikimedia.org/r/758466 (https://phabricator.wikimedia.org/T300500)
[13:31:48] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s5 on clouddb1020 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:33:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T298559)', diff saved to https://phabricator.wikimedia.org/P19628 and previous config saved to /var/cache/conftool/dbconfig/20220131-133316-marostegui.json
[13:33:18] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1141.eqiad.wmnet with reason: Maintenance
[13:33:19] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1141.eqiad.wmnet with reason: Maintenance
[13:33:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:33:21] <stashbot>	 T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559
[13:33:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:33:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T298559)', diff saved to https://phabricator.wikimedia.org/P19629 and previous config saved to /var/cache/conftool/dbconfig/20220131-133323-marostegui.json
[13:33:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:33:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:34:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T298559)', diff saved to https://phabricator.wikimedia.org/P19630 and previous config saved to /var/cache/conftool/dbconfig/20220131-133430-marostegui.json
[13:34:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:36:23] <wikibugs>	 (PS3) Ssingh: site: add role for durum hosts in drmrs [puppet] - https://gerrit.wikimedia.org/r/757741 (https://phabricator.wikimedia.org/T300158)
[13:36:26] <wikibugs>	 (PS5) JMeybohm: Add hostname-override and cluster-cidr to kube-proxy [puppet] - https://gerrit.wikimedia.org/r/758466 (https://phabricator.wikimedia.org/T300500)
[13:36:40] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s5 on clouddb1016 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:36:50] <wikibugs>	 SRE, Icinga, Observability-Alerting, Scap, observability: expose hosts in maintenance state so we can prevent scap from running on them - https://phabricator.wikimedia.org/T100777 (lmata)
[13:37:01] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1154.eqiad.wmnet with OS bullseye
[13:37:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:37:11] <wikibugs>	 (CR) JMeybohm: [V: +1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33509/console" [puppet] - https://gerrit.wikimedia.org/r/758466 (https://phabricator.wikimedia.org/T300500) (owner: JMeybohm)
[13:37:26] <wikibugs>	 SRE, Icinga, Observability-Alerting, observability, and 2 others: Icinga check for sysctl settings - https://phabricator.wikimedia.org/T160060 (lmata)
[13:37:40] <wikibugs>	 SRE, Traffic: cp_upload @ eqsin cascading failures, February 2021 - https://phabricator.wikimedia.org/T274888 (Ladsgroup)
[13:37:52] <wikibugs>	 (CR) MMandere: [C: +2] site: add role for durum hosts in drmrs [puppet] - https://gerrit.wikimedia.org/r/757741 (https://phabricator.wikimedia.org/T300158) (owner: Ssingh)
[13:37:54] <wikibugs>	 SRE, Icinga, Observability-Alerting, observability: icinga login case mismatch - https://phabricator.wikimedia.org/T275920 (lmata)
[13:38:17] <wikibugs>	 SRE, Traffic, Patch-For-Review: cp_upload @ eqsin cascading failures, February 2021 - https://phabricator.wikimedia.org/T274888 (Ladsgroup)
[13:39:35] <wikibugs>	 SRE, Traffic, Patch-For-Review: Create Ganeti VMs for durum in drmrs - https://phabricator.wikimedia.org/T300158 (ssingh)
[13:40:06] <wikibugs>	 SRE, Observability-Metrics, Traffic-Icebox: varnishmtail silently stops working if varnishncsa crashes - https://phabricator.wikimedia.org/T259020 (lmata)
[13:40:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P19631 and previous config saved to /var/cache/conftool/dbconfig/20220131-134021-marostegui.json
[13:40:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:40:40] <wikibugs>	 SRE, Observability-Logging, Wikimedia-Logstash, observability, and 2 others: Ship host syslogs to ELK - https://phabricator.wikimedia.org/T193766 (lmata)
[13:41:08] <wikibugs>	 SRE, Icinga, Observability-Alerting, User-CDanis: CLI script for manual paging - https://phabricator.wikimedia.org/T82937 (lmata)
[13:41:08] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s8 on clouddb1020 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:41:35] <wikibugs>	 SRE, Observability-Metrics: Grafana share button drops duplicate URL params - https://phabricator.wikimedia.org/T292606 (lmata)
[13:43:17] <wikibugs>	 SRE, Observability-Metrics, serviceops, Patch-For-Review: Strongswan Icinga check: do not report issues about depooled hosts - https://phabricator.wikimedia.org/T148976 (lmata)
[13:45:00] <wikibugs>	 SRE, Observability-Alerting, Documentation, Service-Architecture: Create a doc explaining the SLA between services and the monitoring tool - https://phabricator.wikimedia.org/T105780 (lmata)
[13:45:18] <wikibugs>	 SRE, Icinga, Observability-Alerting, observability: Aggregate Proton, Restbase and mobileapps icinga alerts - https://phabricator.wikimedia.org/T250017 (lmata)
[13:45:33] <wikibugs>	 (CR) Giuseppe Lavagetto: "LGTM overall, see two comments mostly about style." [puppet] - https://gerrit.wikimedia.org/r/758466 (https://phabricator.wikimedia.org/T300500) (owner: JMeybohm)
[13:45:49] <wikibugs>	 SRE, SRE-tools, Icinga, Infrastructure-Foundations, and 2 others: ops-monitoring-bot creating dupes - https://phabricator.wikimedia.org/T226908 (lmata)
[13:45:55] <wikibugs>	 SRE, Icinga, Observability-Alerting, observability: icinga-wm bot truncating long messages - https://phabricator.wikimedia.org/T230799 (lmata)
[13:46:07] <wikibugs>	 SRE, Icinga, Observability-Alerting, observability: Icinga notifications didn't get applied after a puppet run - https://phabricator.wikimedia.org/T251407 (lmata)
[13:46:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[13:47:29] <wikibugs>	 SRE, Observability-Metrics, Traffic-Icebox, User-ema: Multiple ATS HTTP2 stats missing from Prometheus - https://phabricator.wikimedia.org/T292817 (lmata)
[13:47:51] <wikibugs>	 SRE, Observability-Alerting: Icinga check for ipv6 host reachability - https://phabricator.wikimedia.org/T163996 (lmata)
[13:47:54] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 111 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:48:05] <wikibugs>	 SRE, Icinga, Observability-Alerting, observability, and 2 others: incident 20170323-wikibase did not trigger Icinga paging - https://phabricator.wikimedia.org/T161528 (lmata)
[13:48:08] <wikibugs>	 (CR) Ayounsi: [C: +2] Move sandbox filter to Capirca [homer/public] - https://gerrit.wikimedia.org/r/748080 (https://phabricator.wikimedia.org/T273865) (owner: Ayounsi)
[13:48:42] <wikibugs>	 (Merged) jenkins-bot: Move sandbox filter to Capirca [homer/public] - https://gerrit.wikimedia.org/r/748080 (https://phabricator.wikimedia.org/T273865) (owner: Ayounsi)
[13:49:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P19632 and previous config saved to /var/cache/conftool/dbconfig/20220131-134934-marostegui.json
[13:49:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:49:40] <wikibugs>	 Puppet, SRE, Infrastructure-Foundations, Observability-Alerting: Icinga alert for hosts with no Puppet roles - https://phabricator.wikimedia.org/T238006 (lmata)
[13:50:11] <wikibugs>	 SRE, Icinga, Observability-Alerting, observability, User-jbond: Monitoring for puppetdb queue size - https://phabricator.wikimedia.org/T236707 (lmata)
[13:50:32] <wikibugs>	 SRE, Icinga, Infrastructure-Foundations, Mail, and 2 others: fix/streamline mail routing off of neon - https://phabricator.wikimedia.org/T80890 (lmata)
[13:50:41] <wikibugs>	 SRE, Icinga, Observability-Alerting, observability: Icinga monitoring for Yubikey components - https://phabricator.wikimedia.org/T151048 (lmata)
[13:51:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[13:52:19] <XioNoX>	 !log Move sandbox filter to Capirca on all core routers
[13:52:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T298558)', diff saved to https://phabricator.wikimedia.org/P19633 and previous config saved to /var/cache/conftool/dbconfig/20220131-135525-marostegui.json
[13:55:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:30] <stashbot>	 T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558
[13:55:31] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance
[13:55:32] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance
[13:55:33] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance
[13:55:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:40] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance
[13:55:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:56:04] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance
[13:56:05] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance
[13:56:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:56:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:56:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T298558)', diff saved to https://phabricator.wikimedia.org/P19634 and previous config saved to /var/cache/conftool/dbconfig/20220131-135610-marostegui.json
[13:56:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:01:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298558)', diff saved to https://phabricator.wikimedia.org/P19635 and previous config saved to /var/cache/conftool/dbconfig/20220131-140127-marostegui.json
[14:01:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:01:34] <stashbot>	 T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558
[14:04:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P19636 and previous config saved to /var/cache/conftool/dbconfig/20220131-140439-marostegui.json
[14:04:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:44] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:07:50] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1015.eqiad.wmnet
[14:07:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:38] <wikibugs>	 SRE, DBA, Epic, Performance-Team (Radar), Sustainability (Incident Followup): Decide how to improve parsercache replication, sharding and HA - https://phabricator.wikimedia.org/T133523 (LSobanski) a:Marostegui→None Removing assignment as I don't believe Manuel will be looking into th...
[14:09:42] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.postgresql.postgres-init (exit_code=0)
[14:09:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:04] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1262184 and 2310 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[14:10:08] <logmsgbot>	 !log filippo@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host prometheus2006.codfw.wmnet
[14:10:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:11] <wikibugs>	 (CR) BBlack: "LGTM on all the interrelated changes to the socket path / install_from_component stuff.  Inline question about the last bit for the owner " [puppet] - https://gerrit.wikimedia.org/r/758063 (https://phabricator.wikimedia.org/T300254) (owner: Majavah)
[14:10:46] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet
[14:10:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:11:08] <wikibugs>	 (PS1) Ayounsi: Delete now unused analytics policy file [homer/public] - https://gerrit.wikimedia.org/r/758470
[14:13:20] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1015.eqiad.wmnet
[14:13:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:10] <wikibugs>	 (CR) Ottomata: "Luca, what if someone wants to spin up a new Kafka cluster in Cloud with TLS that does not use the certs John is going to create?  Is ther" [puppet] - https://gerrit.wikimedia.org/r/757800 (https://phabricator.wikimedia.org/T291905) (owner: Elukey)
[14:14:19] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1015.eqiad.wmnet to ganeti01.svc.eqiad.wmnet
[14:14:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:15:12] <wikibugs>	 (PS2) Lucas Werkmeister (WMDE): Update termbox to 2022-01-25-175409-production [deployment-charts] - https://gerrit.wikimedia.org/r/757659 (https://phabricator.wikimedia.org/T296202)
[14:15:28] <Lucas_WMDE>	 jouncebot: nowandnext
[14:15:28] <jouncebot>	 No deployments scheduled for the next 2 hour(s) and 14 minute(s)
[14:15:28] <jouncebot>	 In 2 hour(s) and 14 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220131T1630)
[14:15:39] <Lucas_WMDE>	 alright, I’ll probably deploy ^ that termbox update in deployment-charts
[14:15:41] <wikibugs>	 SRE, Icinga, Observability-Alerting, observability: Icinga monitoring for Yubikey components - https://phabricator.wikimedia.org/T151048 (MoritzMuehlenhoff) Open→Declined This is no longer needed, we no longer use the YubiHSM stack, closing
[14:15:43] <wikibugs>	 SRE: Extending Yubico 2FA for production use (meta bug) - https://phabricator.wikimedia.org/T151045 (MoritzMuehlenhoff)
[14:16:13] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +2] Update termbox to 2022-01-25-175409-production [deployment-charts] - https://gerrit.wikimedia.org/r/757659 (https://phabricator.wikimedia.org/T296202) (owner: Lucas Werkmeister (WMDE))
[14:16:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P19637 and previous config saved to /var/cache/conftool/dbconfig/20220131-141633-marostegui.json
[14:16:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:16:37] <wikibugs>	 SRE, Infrastructure-Foundations, Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (MoritzMuehlenhoff)
[14:17:01] <wikibugs>	 (CR) Majavah: [V: +1] pdns: support bullseye (1 comment) [puppet] - https://gerrit.wikimedia.org/r/758063 (https://phabricator.wikimedia.org/T300254) (owner: Majavah)
[14:17:13] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti1015.eqiad.wmnet to ganeti01.svc.eqiad.wmnet
[14:17:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:17:18] <moritzm>	 !log draining ganeti1008 for eventual reimage
[14:17:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:17:45] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet
[14:17:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:18:23] <logmsgbot>	 !log mmandere@cumin1001 START - Cookbook sre.ganeti.makevm for new host durum6001.drmrs.wmnet
[14:18:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:19:28] <jinxer-wm>	 (ThanosRuleHighRuleEvaluationFailures) firing: Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org
[14:19:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T298559)', diff saved to https://phabricator.wikimedia.org/P19638 and previous config saved to /var/cache/conftool/dbconfig/20220131-141943-marostegui.json
[14:19:45] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[14:19:47] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[14:19:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:19:49] <stashbot>	 T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559
[14:19:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:19:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T298559)', diff saved to https://phabricator.wikimedia.org/P19639 and previous config saved to /var/cache/conftool/dbconfig/20220131-141951-marostegui.json
[14:19:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:19:57] <wikibugs>	 (Merged) jenkins-bot: Update termbox to 2022-01-25-175409-production [deployment-charts] - https://gerrit.wikimedia.org/r/757659 (https://phabricator.wikimedia.org/T296202) (owner: Lucas Werkmeister (WMDE))
[14:19:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:20:32] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 helmfile [staging] START helmfile.d/services/termbox: apply on staging
[14:20:32] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 helmfile [staging] START helmfile.d/services/termbox: apply on test
[14:20:35] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 helmfile [staging] DONE helmfile.d/services/termbox: apply on production
[14:20:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:20:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:20:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:20:53] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1027.eqiad.wmnet with OS buster
[14:20:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:20:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T298559)', diff saved to https://phabricator.wikimedia.org/P19640 and previous config saved to /var/cache/conftool/dbconfig/20220131-142057-marostegui.json
[14:21:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:21:12] <Lucas_WMDE>	 hm, the chart label changed from termbox-0.0.20 to termbox-0.1.1
[14:21:17] <Lucas_WMDE>	 I assume that’s fine to apply
[14:22:51] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 helmfile [staging] DONE helmfile.d/services/termbox: sync on test
[14:22:51] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 helmfile [staging] DONE helmfile.d/services/termbox: sync on staging
[14:22:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:23:19] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1027.eqiad.wmnet
[14:23:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:23:44] <wikibugs>	 (PS1) Ladsgroup: db2107: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/758472 (https://phabricator.wikimedia.org/T300510)
[14:24:05] <wikibugs>	 (CR) Ladsgroup: [V: +2 C: +2] db2107: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/758472 (https://phabricator.wikimedia.org/T300510) (owner: Ladsgroup)
[14:24:28] <jinxer-wm>	 (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org
[14:24:38] <Lucas_WMDE>	 seems to work fine on test.wikidata.org (staging cluster), proceeding with sync to codfw and eqiad
[14:24:44] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 helmfile [codfw] START helmfile.d/services/termbox: apply on production
[14:24:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:24:48] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 helmfile [codfw] DONE helmfile.d/services/termbox: apply on staging
[14:24:49] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 helmfile [codfw] DONE helmfile.d/services/termbox: apply on test
[14:24:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:24:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:25:44] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2107.codfw.wmnet with reason: Maintenance
[14:25:46] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2107.codfw.wmnet with reason: Maintenance
[14:25:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:25:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:25:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2107 (T300510)', diff saved to https://phabricator.wikimedia.org/P19641 and previous config saved to /var/cache/conftool/dbconfig/20220131-142550-ladsgroup.json
[14:25:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:25:55] <stashbot>	 T300510: Upgrade s2 to Bullseye - https://phabricator.wikimedia.org/T300510
[14:27:25] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db2107.codfw.wmnet with OS bullseye
[14:27:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:28:00] <wikibugs>	 SRE, Traffic: Serve redirect wikimediastatus.net --> www.wikimediastatus.net - https://phabricator.wikimedia.org/T300161 (CDanis) After discussing with @BBlack and @Vgutierrez it seems that this isn't a good use case for ncredir as ncredir only supports dns-01 challenges.  So we need to find some other e...
[14:28:14] <logmsgbot>	 !log filippo@deploy1002 Started deploy [librenms/librenms@f049593]: Add custom patches to librenms 21.4.0
[14:28:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:28:24] <logmsgbot>	 !log filippo@deploy1002 Finished deploy [librenms/librenms@f049593]: Add custom patches to librenms 21.4.0 (duration: 00m 10s)
[14:28:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:29:03] <Lucas_WMDE>	 hm, my helmfile apply has been running for a few minutes now… I hope everything’s alright there
[14:29:20] <Lucas_WMDE>	 I’ll wait a bit longer though
[14:31:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P19642 and previous config saved to /var/cache/conftool/dbconfig/20220131-143138-marostegui.json
[14:31:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:34:57] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 helmfile [codfw] DONE helmfile.d/services/termbox: sync on production
[14:35:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:35:10] <Lucas_WMDE>	 well, it timed out
[14:35:10] <wikibugs>	 (PS3) Majavah: pdns: support bullseye [puppet] - https://gerrit.wikimedia.org/r/758063 (https://phabricator.wikimedia.org/T300254)
[14:35:14] <Lucas_WMDE>	 after, I think, ten minutes
[14:36:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P19643 and previous config saved to /var/cache/conftool/dbconfig/20220131-143602-marostegui.json
[14:36:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:36:21] <perryprog>	 Huh, wikimedia.org went down for me for a hot second
[14:36:33] <perryprog>	 It's back up now though; might be a DNS thing on my end
[14:39:13] <Lucas_WMDE>	 if anyone with [[wikitech:Kubernetes/Deployments]] expertise is around, I’d appreciate some help
[14:39:30] <Lucas_WMDE>	 it’s probably nothing serious but I’m not very confident on my own ^^
[14:39:49] <taavi>	 Lucas_WMDE: what do you need?
[14:40:00] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 2 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33510/console" [puppet] - https://gerrit.wikimedia.org/r/758063 (https://phabricator.wikimedia.org/T300254) (owner: Majavah)
[14:40:02] <Lucas_WMDE>	 I ran
[14:40:05] <Lucas_WMDE>	 lucaswerkmeister-wmde@deploy1002 /srv/deployment-charts/helmfile.d/services/termbox (master $ u=) $ helmfile -e codfw -i apply
[14:40:15] <Lucas_WMDE>	 and the production release failed with a timeout
[14:40:29] <Lucas_WMDE>	 I’m guessing I should retry and hope for the best, but I’m not sure ^^
[14:40:38] <Lucas_WMDE>	 as far as I can tell there’s no other output indicating what went wrong
[14:40:44] <Lucas_WMDE>	 Error: UPGRADE FAILED: release production failed, and has been rolled back due to atomic being set: timed out waiting for the condition
[14:41:09] <Lucas_WMDE>	 (if I understand correctly, there are three releases(?) in the codfw(?) cluster, and only the production one failed, and the other two – staging and test? – went through)
[14:41:28] <Lucas_WMDE>	 *in the codfw cluster(?), to put the question mark on the right word that I’m uncertain about ^^
[14:42:02] <taavi>	 `kube_env termbox codfw; kubectl get pod` shows one new pod and 3 old ones
[14:42:22] <wikibugs>	 (PS1) Filippo Giunchedi: o11y: tweak IcingaOverload alert [alerts] - https://gerrit.wikimedia.org/r/758474
[14:43:33] <Lucas_WMDE>	 hm, the new pod still has the old image AFAICT
[14:43:39] <Lucas_WMDE>	 from 2021 instead of 2022
[14:43:59] <Lucas_WMDE>	 all four of them have the same image
[14:46:06] <Lucas_WMDE>	 kubectl get events has two errors about failing to pull the image o_O
[14:46:17] <Lucas_WMDE>	 (in the same kube_env)
[14:46:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298558)', diff saved to https://phabricator.wikimedia.org/P19644 and previous config saved to /var/cache/conftool/dbconfig/20220131-144642-marostegui.json
[14:46:44] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[14:46:46] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[14:46:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:46:48] <stashbot>	 T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558
[14:46:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:46:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1162 (T298558)', diff saved to https://phabricator.wikimedia.org/P19645 and previous config saved to /var/cache/conftool/dbconfig/20220131-144650-marostegui.json
[14:46:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:46:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:46:59] <taavi>	 I'm not sure what exactly happened, and apparently the kubernetes user does not have enough permissions for any manual helm operations (even those listed on the wikitech page)
[14:47:50] <Lucas_WMDE>	 I think I’ll try the command again
[14:48:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T298558)', diff saved to https://phabricator.wikimedia.org/P19646 and previous config saved to /var/cache/conftool/dbconfig/20220131-144806-marostegui.json
[14:48:09] <Lucas_WMDE>	 the new image is definitely working in the staging release (powering test.wikidata.org), I can see the differences in the SSR HTML
[14:48:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:48:19] <taavi>	 what if you try again and see what happens?
[14:48:22] <Lucas_WMDE>	 in the staging *cluster (I think)
[14:48:28] <Lucas_WMDE>	 yeah, let’s do that
[14:48:35] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 helmfile [codfw] START helmfile.d/services/termbox: apply on production
[14:48:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:48:39] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 helmfile [codfw] DONE helmfile.d/services/termbox: apply on test
[14:48:40] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 helmfile [codfw] DONE helmfile.d/services/termbox: apply on staging
[14:48:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:48:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:49:07] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Analytics Private Data Users for Tanja Andic - https://phabricator.wikimedia.org/T300383 (jhathaway)
[14:50:21] <Lucas_WMDE>	 looks like it’s waiting again
[14:50:44] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 helmfile [codfw] DONE helmfile.d/services/termbox: sync on production
[14:50:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:51] <Lucas_WMDE>	 yay!
[14:51:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P19647 and previous config saved to /var/cache/conftool/dbconfig/20220131-145107-marostegui.json
[14:51:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:20] <Lucas_WMDE>	 now there are three new running pods (and a fourth one ContainerCreating)
[14:51:28] <Lucas_WMDE>	 ok let’s go for eqiad
[14:51:33] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 helmfile [eqiad] START helmfile.d/services/termbox: apply on production
[14:51:36] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 helmfile [eqiad] DONE helmfile.d/services/termbox: apply on staging
[14:51:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:37] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 helmfile [eqiad] DONE helmfile.d/services/termbox: apply on test
[14:51:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:53:10] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 helmfile [eqiad] DONE helmfile.d/services/termbox: sync on production
[14:53:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:53:32] <Lucas_WMDE>	 yup, new SSR is running on www.wikidata.org
[14:53:37] <Lucas_WMDE>	 thanks taavi!
[14:53:46] <Lucas_WMDE>	 (well, on m.wikidata.org ^^)
[14:55:24] <wikibugs>	 (PS6) JMeybohm: Add hostname-override and cluster-cidr to kube-proxy [puppet] - https://gerrit.wikimedia.org/r/758466 (https://phabricator.wikimedia.org/T300500)
[14:56:27] <wikibugs>	 (CR) JMeybohm: [V: +1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33511/console" [puppet] - https://gerrit.wikimedia.org/r/758466 (https://phabricator.wikimedia.org/T300500) (owner: JMeybohm)
[14:58:05] <jelto>	 !log update scap to 4.2.2 on A:mw-canary or A:parsoid-canary or A:mw-jobrunner-canary - T300392
[14:58:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:58:10] <stashbot>	 T300392: Deploy Scap version 4.2.2 - https://phabricator.wikimedia.org/T300392
[14:58:13] <wikibugs>	 (CR) JMeybohm: [V: +1] Add hostname-override and cluster-cidr to kube-proxy (2 comments) [puppet] - https://gerrit.wikimedia.org/r/758466 (https://phabricator.wikimedia.org/T300500) (owner: JMeybohm)
[14:58:46] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2107.codfw.wmnet with OS bullseye
[14:58:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:03:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P19648 and previous config saved to /var/cache/conftool/dbconfig/20220131-150311-marostegui.json
[15:03:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:05:10] <jelto>	 !log update scap to 4.2.2 on A:restbase-canary - T300392
[15:05:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:05:15] <stashbot>	 T300392: Deploy Scap version 4.2.2 - https://phabricator.wikimedia.org/T300392
[15:06:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T298559)', diff saved to https://phabricator.wikimedia.org/P19649 and previous config saved to /var/cache/conftool/dbconfig/20220131-150611-marostegui.json
[15:06:13] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1143.eqiad.wmnet with reason: Maintenance
[15:06:15] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1143.eqiad.wmnet with reason: Maintenance
[15:06:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:06:16] <stashbot>	 T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559
[15:06:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:06:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T298559)', diff saved to https://phabricator.wikimedia.org/P19650 and previous config saved to /var/cache/conftool/dbconfig/20220131-150619-marostegui.json
[15:06:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:06:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:07:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T298559)', diff saved to https://phabricator.wikimedia.org/P19651 and previous config saved to /var/cache/conftool/dbconfig/20220131-150725-marostegui.json
[15:07:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:15:13] <wikibugs>	 (CR) Vgutierrez: [C: +1] sslcert: additional search paths for certificates [puppet] - https://gerrit.wikimedia.org/r/716370 (https://phabricator.wikimedia.org/T290261) (owner: Filippo Giunchedi)
[15:18:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P19652 and previous config saved to /var/cache/conftool/dbconfig/20220131-151816-marostegui.json
[15:18:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:22:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P19653 and previous config saved to /var/cache/conftool/dbconfig/20220131-152230-marostegui.json
[15:22:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:24:05] <logmsgbot>	 !log hnowlan@deploy1002 Started deploy [restbase/deploy@0848b15] (dev-cluster): (no justification provided)
[15:24:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:24:19] <logmsgbot>	 !log hnowlan@deploy1002 Finished deploy [restbase/deploy@0848b15] (dev-cluster): (no justification provided) (duration: 00m 13s)
[15:24:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:25] <wikibugs>	 SRE, Commons, Data-Persistence (Consultation), MediaWiki-extensions-WikibaseClient, and 4 others: Enable statement usage tracking on Commons and Co - https://phabricator.wikimedia.org/T188730 (Lucas_Werkmeister_WMDE) That sounds like a very cumbersome hack to me, and I also think it’s too early t...
[15:33:16] <logmsgbot>	 !log jelto@deploy1002 Started deploy [restbase/deploy@0848b15] (dev-cluster): (no justification provided)
[15:33:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:33:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T298558)', diff saved to https://phabricator.wikimedia.org/P19654 and previous config saved to /var/cache/conftool/dbconfig/20220131-153320-marostegui.json
[15:33:22] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance
[15:33:24] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance
[15:33:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:33:26] <stashbot>	 T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558
[15:33:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T298558)', diff saved to https://phabricator.wikimedia.org/P19655 and previous config saved to /var/cache/conftool/dbconfig/20220131-153328-marostegui.json
[15:33:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:33:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:33:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:34:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T298558)', diff saved to https://phabricator.wikimedia.org/P19656 and previous config saved to /var/cache/conftool/dbconfig/20220131-153446-marostegui.json
[15:34:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:37:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P19657 and previous config saved to /var/cache/conftool/dbconfig/20220131-153734-marostegui.json
[15:37:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:37:51] <logmsgbot>	 !log jelto@deploy1002 Finished deploy [restbase/deploy@0848b15] (dev-cluster): (no justification provided) (duration: 04m 34s)
[15:37:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:38:17] <wikibugs>	 SRE, SRE-Access-Requests, Research: Access to analytics-privatedata-users for Research intern AniketArs - https://phabricator.wikimedia.org/T299919 (Miriam) @jhathaway  could you double check that @AniketArs has LDAP access? They are not able to access the notebooks. He is  able to access the stat ma...
[15:45:49] <jayme>	 Lucas_WMDE: sorry, didn't spot your messages here. Reading the backlog it seems that at least one node had/has issues pulling the image (wikibase-termbox:2022-01-25-175409-production)
[15:46:08] <wikibugs>	 (CR) Herron: [C: +1] "LGTM!  I'm on the fence about page defaulting to true, but let's try it" [puppet] - https://gerrit.wikimedia.org/r/757447 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[15:46:33] <Lucas_WMDE>	 jayme: it seemed to work fine on the second attempt, do you think any further action is necessary?
[15:48:06] <jayme>	 Lucas_WMDE: there is still one pod failing (at least in codfw). Its events (kubectl describe po termbox-production-6f5b9d8cf-hclqs) show a "context canceled" error pulling the image
[15:48:17] <Lucas_WMDE>	 oh
[15:48:22] <jayme>	 that usually means that docker was unable to pull the image in 2m
[15:48:37] <jayme>	 pull & extract that is
[15:48:43] <_joe_>	 which is strange indeed
[15:49:04] <jayme>	 it's an HDD node...so maybe termbox image grew?
[15:49:20] <Lucas_WMDE>	 possibly, though not by very much I would’ve thought
[15:49:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P19658 and previous config saved to /var/cache/conftool/dbconfig/20220131-154950-marostegui.json
[15:49:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:49:57] <wikibugs>	 (PS1) Ladsgroup: db2125: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/758506 (https://phabricator.wikimedia.org/T300510)
[15:50:05] <wikibugs>	 (CR) BBlack: [C: +1] "The recursors parts seem fine for traffic's use (should be nop on buster and work fine for our own bullseye transition), and we don't use " [puppet] - https://gerrit.wikimedia.org/r/758063 (https://phabricator.wikimedia.org/T300254) (owner: Majavah)
[15:50:18] <wikibugs>	 (PS1) Ladsgroup: Revert "db2107: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/758489
[15:50:26] <jayme>	 just ~60MB compared to the version from 2021-12-06
[15:50:26] <wikibugs>	 (CR) Andrew Bogott: pdns: support bullseye (1 comment) [puppet] - https://gerrit.wikimedia.org/r/758063 (https://phabricator.wikimedia.org/T300254) (owner: Majavah)
[15:50:41] <wikibugs>	 (CR) Ladsgroup: [V: +2 C: +2] db2125: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/758506 (https://phabricator.wikimedia.org/T300510) (owner: Ladsgroup)
[15:50:47] <jayme>	 but >500MB compared to 2021-03-09 :)
[15:52:18] <wikibugs>	 (CR) Majavah: [V: +1] pdns: support bullseye (1 comment) [puppet] - https://gerrit.wikimedia.org/r/758063 (https://phabricator.wikimedia.org/T300254) (owner: Majavah)
[15:52:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T298559)', diff saved to https://phabricator.wikimedia.org/P19659 and previous config saved to /var/cache/conftool/dbconfig/20220131-155239-marostegui.json
[15:52:41] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[15:52:42] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[15:52:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:52:45] <stashbot>	 T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559
[15:52:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:52:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T298559)', diff saved to https://phabricator.wikimedia.org/P19660 and previous config saved to /var/cache/conftool/dbconfig/20220131-155246-marostegui.json
[15:52:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:52:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:53:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T298559)', diff saved to https://phabricator.wikimedia.org/P19661 and previous config saved to /var/cache/conftool/dbconfig/20220131-155353-marostegui.json
[15:54:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:54:39] <wikibugs>	 (CR) Andrew Bogott: pdns: support bullseye (1 comment) [puppet] - https://gerrit.wikimedia.org/r/758063 (https://phabricator.wikimedia.org/T300254) (owner: Majavah)
[15:54:50] <wikibugs>	 (CR) Herron: [C: +1] "LGTM, have not been using these logs either." [puppet] - https://gerrit.wikimedia.org/r/757955 (https://phabricator.wikimedia.org/T297239) (owner: Cwhite)
[15:55:20] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: toolforge: automated tests: schedule webgen tool in the correct grid [puppet] - https://gerrit.wikimedia.org/r/758509 (https://phabricator.wikimedia.org/T300501)
[15:55:30] <jayme>	 Lucas_WMDE: no immediate action required from your side. I'll cycle back to this (potentially implementing a workaround) after a meeting
[15:55:34] <wikibugs>	 (CR) Majavah: [V: +1] pdns: support bullseye (1 comment) [puppet] - https://gerrit.wikimedia.org/r/758063 (https://phabricator.wikimedia.org/T300254) (owner: Majavah)
[15:55:35] <Lucas_WMDE>	 looks like there’s a new pod that successfully pulled the image now
[15:55:39] <Lucas_WMDE>	 ok!
[15:55:56] <wikibugs>	 (PS1) Majavah: O:openstack::services: don't use pdns prometheus exporters on bullseye [puppet] - https://gerrit.wikimedia.org/r/758510
[15:56:15] <wikibugs>	 (PS2) Majavah: O:openstack::services: don't use pdns prometheus exporters on bullseye [puppet] - https://gerrit.wikimedia.org/r/758510 (https://phabricator.wikimedia.org/T300254)
[15:56:28] <wikibugs>	 (CR) Ssingh: [C: +1] "No change for doh*  hosts." [puppet] - https://gerrit.wikimedia.org/r/758063 (https://phabricator.wikimedia.org/T300254) (owner: Majavah)
[15:57:46] <jayme>	 Lucas_WMDE: yeah, I've killed the other one, which came with a good chance of the new pod being scheduled on a node with SSDs instead of HDDs
[15:58:22] <Lucas_WMDE>	 ah ok
[15:58:27] <Lucas_WMDE>	 so it wasn’t a coincidence ^^
[15:58:46] <wikibugs>	 (CR) Ssingh: [C: +1] pdns: support bullseye (1 comment) [puppet] - https://gerrit.wikimedia.org/r/758063 (https://phabricator.wikimedia.org/T300254) (owner: Majavah)
[15:59:05] <wikibugs>	 (CR) Herron: [C: +1] o11y: tweak IcingaOverload alert [alerts] - https://gerrit.wikimedia.org/r/758474 (owner: Filippo Giunchedi)
[15:59:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2107 (T300510)', diff saved to https://phabricator.wikimedia.org/P19662 and previous config saved to /var/cache/conftool/dbconfig/20220131-155905-ladsgroup.json
[15:59:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:59:13] <stashbot>	 T300510: Upgrade s2 to Bullseye - https://phabricator.wikimedia.org/T300510
[16:00:48] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2125.codfw.wmnet with reason: Maintenance
[16:00:50] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2125.codfw.wmnet with reason: Maintenance
[16:00:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:00:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:00:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2125 (T300510)', diff saved to https://phabricator.wikimedia.org/P19663 and previous config saved to /var/cache/conftool/dbconfig/20220131-160054-ladsgroup.json
[16:00:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:01:59] <XioNoX>	 !log Move core routers loopback filter to Capirca
[16:02:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:02:40] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db2125.codfw.wmnet with OS bullseye
[16:02:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:03:10] <wikibugs>	 (PS2) Ladsgroup: Revert "db2107: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/758489
[16:03:16] <wikibugs>	 (CR) Ladsgroup: [V: +2 C: +2] Revert "db2107: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/758489 (owner: Ladsgroup)
[16:04:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P19664 and previous config saved to /var/cache/conftool/dbconfig/20220131-160456-marostegui.json
[16:04:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:06:26] <wikibugs>	 SRE, Data-Engineering, Data-Engineering-Kanban, Infrastructure-Foundations, and 3 others: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (JAllemandou) Open→Resolved
[16:06:34] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic-Icebox, netops, Epic: Capacity planning for (& optimization of) transport backhaul vs edge egress - https://phabricator.wikimedia.org/T263275 (JAllemandou)
[16:06:39] <icinga-wm>	 PROBLEM - Host asw2-ulsfo is DOWN: PING CRITICAL - Packet loss = 100%
[16:06:54] <icinga-wm>	 PROBLEM - Host ncredir-lb.ulsfo.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[16:06:54] <icinga-wm>	 PROBLEM - Host ncredir4001 is DOWN: PING CRITICAL - Packet loss = 100%
[16:06:54] <icinga-wm>	 PROBLEM - Host ncredir4002 is DOWN: PING CRITICAL - Packet loss = 100%
[16:06:57] <icinga-wm>	 PROBLEM - Host netflow4002 is DOWN: PING CRITICAL - Packet loss = 100%
[16:07:46] <icinga-wm>	 PROBLEM - Host cr3-ulsfo is DOWN: PING CRITICAL - Packet loss = 100%
[16:07:47] <wikibugs>	 (PS5) Herron: centrallog: clean up old /srv/syslog/host directories after grace period [puppet] - https://gerrit.wikimedia.org/r/757498 (https://phabricator.wikimedia.org/T300056)
[16:08:01] <icinga-wm>	 PROBLEM - BFD status on cr2-eqord is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:08:05] <RhinosF1>	 XioNoX: ^
[16:08:15] <XioNoX>	 er
[16:08:17] <icinga-wm>	 PROBLEM - Host bast4003 is DOWN: PING CRITICAL - Packet loss = 100%
[16:08:28] * Emperor is here. Have we a problem?
[16:08:31] * volans here
[16:08:33] <bblack>	 that doesn't look good
[16:08:36] <cdanis>	 here
[16:08:39] <godog>	 I'm here too
[16:08:41] <XioNoX>	 rolling back my change
[16:08:55] <icinga-wm>	 PROBLEM - Host install4001 is DOWN: PING CRITICAL - Packet loss = 100%
[16:08:57] <icinga-wm>	 PROBLEM - Host doh4001 is DOWN: PING CRITICAL - Packet loss = 100%
[16:08:57] <icinga-wm>	 PROBLEM - Host doh4002 is DOWN: PING CRITICAL - Packet loss = 100%
[16:08:59] <icinga-wm>	 PROBLEM - Host durum4001 is DOWN: PING CRITICAL - Packet loss = 100%
[16:08:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P19665 and previous config saved to /var/cache/conftool/dbconfig/20220131-160859-marostegui.json
[16:09:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:09:09] <rzl>	 here but in meeting, watching and can help if needed
[16:09:11] <icinga-wm>	 PROBLEM - Host prometheus4001 is DOWN: PING CRITICAL - Packet loss = 100%
[16:09:17] <jhathaway>	 here as well
[16:09:19] <_joe_>	 should we depool ulsfo?
[16:09:23] <icinga-wm>	 RECOVERY - Host doh4001 is UP: PING OK - Packet loss = 0%, RTA = 68.62 ms
[16:09:23] <icinga-wm>	 RECOVERY - Host durum4001 is UP: PING OK - Packet loss = 0%, RTA = 68.55 ms
[16:09:23] <icinga-wm>	 RECOVERY - Host asw2-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 69.04 ms
[16:09:25] <icinga-wm>	 RECOVERY - Host doh4002 is UP: PING OK - Packet loss = 0%, RTA = 68.61 ms
[16:09:26] <icinga-wm>	 RECOVERY - Host cr3-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 69.14 ms
[16:09:26] <XioNoX>	 nah
[16:09:27] <icinga-wm>	 RECOVERY - Host bast4003 is UP: PING OK - Packet loss = 0%, RTA = 68.51 ms
[16:09:27] <icinga-wm>	 RECOVERY - Host install4001 is UP: PING OK - Packet loss = 0%, RTA = 68.48 ms
[16:09:28] <_joe_>	 I guess not
[16:09:34] <wikibugs>	 (PS1) BBlack: Depool ulsfo [dns] - https://gerrit.wikimedia.org/r/758511
[16:09:39] <icinga-wm>	 RECOVERY - Host prometheus4001 is UP: PING OK - Packet loss = 0%, RTA = 68.60 ms
[16:09:39] <icinga-wm>	 RECOVERY - Host ncredir4001 is UP: PING OK - Packet loss = 0%, RTA = 68.62 ms
[16:09:41] <icinga-wm>	 RECOVERY - Host netflow4002 is UP: PING OK - Packet loss = 0%, RTA = 68.55 ms
[16:09:50] <icinga-wm>	 RECOVERY - Host ncredir-lb.ulsfo.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 68.23 ms
[16:09:53] <icinga-wm>	 RECOVERY - Host ncredir4002 is UP: PING OK - Packet loss = 0%, RTA = 68.61 ms
[16:09:59] <bblack>	 I see recovs, I was off uploading that patch.  will hold for now :)
[16:10:29] <icinga-wm>	 RECOVERY - BFD status on cr2-eqord is OK: OK: UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:10:30] <jynus>	 I was looking at the status page to see impact and saw an increase in global latency, but it happened hours ago
[16:10:50] <XioNoX>	 yeah it's fully rolled back
[16:11:36] <wikibugs>	 (CR) Herron: centrallog: clean up old /srv/syslog/host directories after grace period (2 comments) [puppet] - https://gerrit.wikimedia.org/r/757498 (https://phabricator.wikimedia.org/T300056) (owner: Herron)
[16:11:37] <cdanis>	 we served a bunch of 5xx from ulsfo, looks to have recovered now
[16:12:12] <_joe_>	 unavoidable I guess
[16:13:06] <bblack>	 yeah
[16:14:33] <cdanis>	 jynus: the status page is not quite realtime, it often lags by 5-10 minutes
[16:14:43] <cdanis>	 https://i.imgur.com/hAImtqm.png
[16:15:09] <jynus>	 cdanis: yeah, I noticed the opposite, a clear latency increase, but a long time ago
[16:15:14] <XioNoX>	 I think I found the issue, in my patch
[16:15:59] <cdanis>	 XioNoX: interesting that your patch seemed to cause a traffic spillover to other links too
[16:16:07] <jynus>	 cdanis: see the increase at 14:05- but it is better to use grafana for this, if it is available
[16:16:18] <cdanis>	 jynus: yes
[16:16:20] <XioNoX>	 cdanis: what do you mean?
[16:16:28] <cdanis>	 XioNoX: https://librenms.wikimedia.org/graphs/to=1643645700/id=7220/type=port_bits/from=1643624100/
[16:16:28] <wikibugs>	 (PS1) Vgutierrez: site: Reimage cp3062 as cache::text_envoy [puppet] - https://gerrit.wikimedia.org/r/758512 (https://phabricator.wikimedia.org/T271421)
[16:16:35] <cdanis>	 maybe it is unrelated
[16:16:41] <jynus>	 yeah
[16:17:08] <cdanis>	 jynus: you can plug these queries into grafana explore https://gerrit.wikimedia.org/g/operations/puppet/+/e2b942b78e3d909fc2074e6b1eb80fc01761b8c0/hieradata/common/profile/statograph.yaml#14
[16:17:29] <jynus>	 I mentioned it to research it more, as the net issue seemed to be recovering
[16:17:32] <jynus>	 maybe a deploy or something
[16:18:54] <jynus>	 confirming BTW ulsfo availability looking good too
[16:19:07] <XioNoX>	 cdanis: it caused ulsfo to be isolated from the rest of the other sites (lost ospf sessions), it could be that traffic briefly went through the other redundant link
[16:20:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T298558)', diff saved to https://phabricator.wikimedia.org/P19666 and previous config saved to /var/cache/conftool/dbconfig/20220131-162000-marostegui.json
[16:20:03] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[16:20:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:20:05] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[16:20:06] <stashbot>	 T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558
[16:20:06] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[16:20:07] <jynus>	 from traffic server side, there was at first a spike of 503s, then of 502s
[16:20:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:20:10] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[16:20:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:20:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T298558)', diff saved to https://phabricator.wikimedia.org/P19667 and previous config saved to /var/cache/conftool/dbconfig/20220131-162014-marostegui.json
[16:20:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:20:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:20:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:20:53] <jynus>	 (talking about the recent net issue, still researching the older thingy)
[16:21:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T298558)', diff saved to https://phabricator.wikimedia.org/P19668 and previous config saved to /var/cache/conftool/dbconfig/20220131-162132-marostegui.json
[16:21:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:21:49] <wikibugs>	 (CR) Eevans: [C: +1] "Ready to go!" [puppet] - https://gerrit.wikimedia.org/r/757999 (https://phabricator.wikimedia.org/T298516) (owner: Eevans)
[16:22:29] <wikibugs>	 (CR) ArielGlenn: [C: +2] Add siteinfo data in formatversion=2 too [dumps] - https://gerrit.wikimedia.org/r/747987 (owner: Legoktm)
[16:23:21] <jynus>	 oh, cdanis- status page is local time, right?
[16:24:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P19669 and previous config saved to /var/cache/conftool/dbconfig/20220131-162403-marostegui.json
[16:24:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:24:14] <jynus>	 there was a traffic pattern change, but it was at 13:10 UTC: https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?viewPanel=62&orgId=1&from=1643624625604&to=1643646225604&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&var-method=GET&var-code=200
[16:25:13] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] toolforge: automated tests: schedule webgen tool in the correct grid [puppet] - https://gerrit.wikimedia.org/r/758509 (https://phabricator.wikimedia.org/T300501) (owner: Arturo Borrero Gonzalez)
[16:25:16] <wikibugs>	 (Merged) jenkins-bot: Add siteinfo data in formatversion=2 too [dumps] - https://gerrit.wikimedia.org/r/747987 (owner: Legoktm)
[16:25:23] <wikibugs>	 (PS1) Ladsgroup: db2148: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/758513 (https://phabricator.wikimedia.org/T300510)
[16:25:33] <jynus>	 I am going to confirm it was not self-inflicted, and then we can probably ignore it
[16:25:55] <logmsgbot>	 !log ariel@deploy1002 Started deploy [dumps/dumps@8820784]: add dump of siteinfo in format version 2
[16:25:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:25:58] <logmsgbot>	 !log ariel@deploy1002 Finished deploy [dumps/dumps@8820784]: add dump of siteinfo in format version 2 (duration: 00m 03s)
[16:26:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:26:11] <icinga-wm>	 PROBLEM - Check systemd state on prometheus2006 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:26:23] <cdanis>	 jynus: yes the graphs are always local time, despite the TZ of the rest of the page
[16:26:29] <wikibugs>	 SRE, SRE-Access-Requests, Research: Access to analytics-privatedata-users for Research intern AniketArs - https://phabricator.wikimedia.org/T299919 (jhathaway) @Miriam & @AniketArs they were not part of the `nda` group, they are added now, please try again.
[16:26:40] <jynus>	 cdanis: my fault, as I am at UTC+1, it was difficult to notice it at first :-)
[16:26:54] <jynus>	 "off by one errors" :-)
[16:27:43] <jynus>	 nothing ongoing on SAL at 13:06- only db maintenance, which doesn't create more GET traffic :-), so just traffic dependent
[16:29:00] <wikibugs>	 SRE, Infrastructure-Foundations, Patch-For-Review: Upgrade eqiad/codfw Ganeti clusters to Buster - https://phabricator.wikimedia.org/T284811 (Ladsgroup) I don't know if this is result of this ticket or something unrelated but there is a lot of root@ spam with: ` Cluster configuration incomplete: 'Can...
[16:29:45] <wikibugs>	 (PS2) Ladsgroup: db2148: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/758513 (https://phabricator.wikimedia.org/T300510)
[16:29:49] <wikibugs>	 (CR) Ladsgroup: [V: +2 C: +2] db2148: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/758513 (https://phabricator.wikimedia.org/T300510) (owner: Ladsgroup)
[16:30:04] <jouncebot>	 jan_drewniak: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220131T1630).
[16:34:03] <wikibugs>	 SRE, ops-codfw, DC-Ops: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 (Papaul) Please power down the servers and let me know when this is done
[16:34:38] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2125.codfw.wmnet with OS bullseye
[16:34:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:35:05] <wikibugs>	 SRE, ops-codfw, DC-Ops: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 (Papaul) p:Triage→Medium
[16:36:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P19670 and previous config saved to /var/cache/conftool/dbconfig/20220131-163637-marostegui.json
[16:36:37] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] centrallog: clean up old /srv/syslog/host directories after grace period [puppet] - https://gerrit.wikimedia.org/r/757498 (https://phabricator.wikimedia.org/T300056) (owner: Herron)
[16:36:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:37:28] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] o11y: tweak IcingaOverload alert [alerts] - https://gerrit.wikimedia.org/r/758474 (owner: Filippo Giunchedi)
[16:38:56] <wikibugs>	 (PS1) Hashar: ci: Qemu image and snapshot creation [puppet] - https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774)
[16:39:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T298559)', diff saved to https://phabricator.wikimedia.org/P19671 and previous config saved to /var/cache/conftool/dbconfig/20220131-163908-marostegui.json
[16:39:11] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1121.eqiad.wmnet with reason: Maintenance
[16:39:12] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1121.eqiad.wmnet with reason: Maintenance
[16:39:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:39:13] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[16:39:14] <stashbot>	 T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559
[16:39:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:39:17] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[16:39:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:39:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1121 (T298559)', diff saved to https://phabricator.wikimedia.org/P19672 and previous config saved to /var/cache/conftool/dbconfig/20220131-163921-marostegui.json
[16:39:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:39:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:39:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:39:33] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: openstack: networktests: update external domain name [puppet] - https://gerrit.wikimedia.org/r/758515
[16:39:52] <wikibugs>	 (CR) jerkins-bot: [V: -1] ci: Qemu image and snapshot creation [puppet] - https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) (owner: Hashar)
[16:40:01] <logmsgbot>	 !log mmandere@cumin1001 END (ERROR) - Cookbook sre.ganeti.makevm (exit_code=97) for new host durum6001.drmrs.wmnet
[16:40:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:41:02] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] openstack: networktests: update external domain name [puppet] - https://gerrit.wikimedia.org/r/758515 (owner: Arturo Borrero Gonzalez)
[16:43:38] <wikibugs>	 (CR) Hashar: [C: -1] "I can not ssh into the running vm so went with a passwordless root account to at least login via the console." [puppet] - https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) (owner: Hashar)
[16:45:11] <wikibugs>	 (CR) Btullis: [C: +2] Upgrade remaining aqs_next nodes to 'dev' (Cassandra 3.11.11) [puppet] - https://gerrit.wikimedia.org/r/757999 (https://phabricator.wikimedia.org/T298516) (owner: Eevans)
[16:45:24] <wikibugs>	 (PS1) 4nn1l2: Revert "commonswiki: Add leg.journals.isu.ac.ir to the wgCopyUploadsDomains allowlist" [mediawiki-config] - https://gerrit.wikimedia.org/r/758490 (https://phabricator.wikimedia.org/T300217)
[16:45:35] <wikibugs>	 (CR) jerkins-bot: [V: -1] Revert "commonswiki: Add leg.journals.isu.ac.ir to the wgCopyUploadsDomains allowlist" [mediawiki-config] - https://gerrit.wikimedia.org/r/758490 (https://phabricator.wikimedia.org/T300217) (owner: 4nn1l2)
[16:45:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T300510)', diff saved to https://phabricator.wikimedia.org/P19673 and previous config saved to /var/cache/conftool/dbconfig/20220131-164550-ladsgroup.json
[16:45:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:45:56] <stashbot>	 T300510: Upgrade s2 to Bullseye - https://phabricator.wikimedia.org/T300510
[16:46:19] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Active - kubernetes-ml-eqiad, AS64606/IPv6: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:47:07] <icinga-wm>	 PROBLEM - cassandra-c CQL 10.64.48.186:9042 on restbase1027 is CRITICAL: connect to address 10.64.48.186 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[16:47:35] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on prometheus2006 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[16:47:35] <icinga-wm>	 PROBLEM - cassandra-c SSL 10.64.48.186:7001 on restbase1027 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[16:47:43] <icinga-wm>	 PROBLEM - cassandra-c service on restbase1027 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:47:43] <icinga-wm>	 PROBLEM - Check systemd state on restbase1027 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service,cassandra-b.service,cassandra-c.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:47:49] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.64.48.185:9042 on restbase1027 is CRITICAL: connect to address 10.64.48.185 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[16:47:55] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.64.48.184:7001 on restbase1027 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[16:48:25] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.64.48.184:9042 on restbase1027 is CRITICAL: connect to address 10.64.48.184 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[16:48:31] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.64.48.185:7001 on restbase1027 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[16:48:59] <icinga-wm>	 PROBLEM - cassandra-a service on restbase1027 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:49:01] <icinga-wm>	 PROBLEM - cassandra-b service on restbase1027 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:50:51] <wikibugs>	 ops-eqiad, DC-Ops, Discovery-Search (Current work): hw troubleshooting: PS Redundancy for elastic1077.eqiad.wmnet - https://phabricator.wikimedia.org/T300315 (Gehel)
[16:51:07] <wikibugs>	 ops-eqiad, DC-Ops, Discovery-Search (Current work): hw troubleshooting: PS Redundancy for elastic1080.eqiad.wmnet - https://phabricator.wikimedia.org/T300317 (Gehel)
[16:51:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P19674 and previous config saved to /var/cache/conftool/dbconfig/20220131-165141-marostegui.json
[16:51:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:53:18] <urandom>	 !log restarting Cassandra, aqs1011-{a,b}, to apply upgrade to 3.11.11 -- T298516
[16:53:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:53:23] <stashbot>	 T298516: Investigate high levels of garbage collection on new AQS nodes - https://phabricator.wikimedia.org/T298516
[16:53:58] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] "please collect +1 from andrew as well." [puppet] - https://gerrit.wikimedia.org/r/758510 (https://phabricator.wikimedia.org/T300254) (owner: Majavah)
[16:55:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T298559)', diff saved to https://phabricator.wikimedia.org/P19675 and previous config saved to /var/cache/conftool/dbconfig/20220131-165531-marostegui.json
[16:55:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:55:37] <stashbot>	 T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559
[16:57:41] <icinga-wm>	 RECOVERY - Wikidough DoH Check on doh6001 is OK: OK - Certificate wikimedia-dns.org will expire on Fri 15 Apr 2022 01:00:09 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Wikidough
[16:57:41] <icinga-wm>	 RECOVERY - Wikidough DoT Check on doh6001 is OK: TCP OK - 0.209 second response time on 185.15.58.11 port 853 https://wikitech.wikimedia.org/wiki/Wikidough
[16:57:50] <wikibugs>	 (CR) Andrew Bogott: [C: +2] O:openstack::services: don't use pdns prometheus exporters on bullseye [puppet] - https://gerrit.wikimedia.org/r/758510 (https://phabricator.wikimedia.org/T300254) (owner: Majavah)
[17:00:28] <wikibugs>	 (PS1) Andrew Bogott: codfw1dev network tests: update to reflect that proxy-02 is now active [puppet] - https://gerrit.wikimedia.org/r/758520 (https://phabricator.wikimedia.org/T297627)
[17:01:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[17:02:23] <wikibugs>	 (CR) Andrew Bogott: [C: +2] codfw1dev network tests: update to reflect that proxy-02 is now active [puppet] - https://gerrit.wikimedia.org/r/758520 (https://phabricator.wikimedia.org/T297627) (owner: Andrew Bogott)
[17:03:57] <urandom>	 !log restarting Cassandra, aqs1012-{a,b}, to apply upgrade to 3.11.11 -- T298516
[17:04:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:04:02] <stashbot>	 T298516: Investigate high levels of garbage collection on new AQS nodes - https://phabricator.wikimedia.org/T298516
[17:06:46] <icinga-wm>	 RECOVERY - Wikidough DoH Check on doh6002 is OK: OK - Certificate wikimedia-dns.org will expire on Fri 15 Apr 2022 01:00:09 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Wikidough
[17:06:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T298558)', diff saved to https://phabricator.wikimedia.org/P19676 and previous config saved to /var/cache/conftool/dbconfig/20220131-170646-marostegui.json
[17:06:48] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[17:06:49] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[17:06:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:06:51] <stashbot>	 T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558
[17:06:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:06:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T298558)', diff saved to https://phabricator.wikimedia.org/P19677 and previous config saved to /var/cache/conftool/dbconfig/20220131-170653-marostegui.json
[17:06:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[17:06:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:07:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:07:31] <wikibugs>	 (PS2) 4nn1l2: Revert "commonswiki: Add leg.journals.isu.ac.ir to the wgCopyUploadsDomains allowlist" [mediawiki-config] - https://gerrit.wikimedia.org/r/758490 (https://phabricator.wikimedia.org/T300217)
[17:08:02] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2148.codfw.wmnet with reason: Maintenance
[17:08:04] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2148.codfw.wmnet with reason: Maintenance
[17:08:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:08:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:08:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2148 (T300510)', diff saved to https://phabricator.wikimedia.org/P19678 and previous config saved to /var/cache/conftool/dbconfig/20220131-170808-ladsgroup.json
[17:08:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:08:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T298558)', diff saved to https://phabricator.wikimedia.org/P19679 and previous config saved to /var/cache/conftool/dbconfig/20220131-170812-marostegui.json
[17:08:13] <stashbot>	 T300510: Upgrade s2 to Bullseye - https://phabricator.wikimedia.org/T300510
[17:08:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:09:05] <wikibugs>	 (CR) Andrew Bogott: [C: +2] "thanks!" [puppet] - https://gerrit.wikimedia.org/r/758052 (https://phabricator.wikimedia.org/T300254) (owner: Majavah)
[17:09:06] <icinga-wm>	 RECOVERY - Wikidough DoT Check on doh6002 is OK: TCP OK - 0.208 second response time on 185.15.58.41 port 853 https://wikitech.wikimedia.org/wiki/Wikidough
[17:10:24] <wikibugs>	 SRE, ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (Cmjohnson)
[17:10:31] <wikibugs>	 (PS3) 4nn1l2: Revert "commonswiki: Add leg.journals.isu.ac.ir to the wgCopyUploadsDomains allowlist" [mediawiki-config] - https://gerrit.wikimedia.org/r/758490 (https://phabricator.wikimedia.org/T300217)
[17:10:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P19680 and previous config saved to /var/cache/conftool/dbconfig/20220131-171036-marostegui.json
[17:10:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:10:57] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db2148.codfw.wmnet with OS bullseye
[17:11:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:11:06] <urandom>	 !log restarting Cassandra, aqs1012-{a,b}, to apply upgrade to 3.11.11 -- T298516
[17:11:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:11:12] <stashbot>	 T298516: Investigate high levels of garbage collection on new AQS nodes - https://phabricator.wikimedia.org/T298516
[17:11:19] <urandom>	 !log restarting Cassandra, aqs1013-{a,b}, to apply upgrade to 3.11.11 -- T298516
[17:11:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:11:37] <wikibugs>	 (CR) Hashar: [C: -1] "The ssh host key can be generated by reconfiguring the ssh server using:" [puppet] - https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) (owner: Hashar)
[17:12:03] <wikibugs>	 SRE, ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (Cmjohnson) 1010 is updated, 1019 is locking up, I will need to power off and unplug
[17:13:16] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudbackup1003.eqiad.wmnet with OS buster
[17:13:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:13:42] <wikibugs>	 (PS2) Cwhite: apifeatureusage: disable gc logging [puppet] - https://gerrit.wikimedia.org/r/757955 (https://phabricator.wikimedia.org/T297239)
[17:14:29] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudbackup1003.eqiad.wmnet with OS buster
[17:15:45] <urandom>	 !log restarting Cassandra, aqs1014-{a,b}, to apply upgrade to 3.11.11 -- T298516
[17:15:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:16:59] <wikibugs>	 (PS1) Ottomata: Set spark maxPartitionBytes to hadoop dfs block size [puppet] - https://gerrit.wikimedia.org/r/758529 (https://phabricator.wikimedia.org/T300299)
[17:17:25] <wikibugs>	 (PS1) Eigyan: [wmf-config]: Undeploy gdi survey from cawiki in production [mediawiki-config] - https://gerrit.wikimedia.org/r/758530 (https://phabricator.wikimedia.org/T300544)
[17:19:01] <wikibugs>	 (CR) Andrew Bogott: [C: +2] P:openstack: fix novaenv path [puppet] - https://gerrit.wikimedia.org/r/758049 (https://phabricator.wikimedia.org/T300254) (owner: Majavah)
[17:19:30] <wikibugs>	 (CR) Cwhite: [C: +2] apifeatureusage: disable gc logging [puppet] - https://gerrit.wikimedia.org/r/757955 (https://phabricator.wikimedia.org/T297239) (owner: Cwhite)
[17:21:14] <wikibugs>	 (CR) Scardenasmolinar: [C: +1] "LGTM!" [mediawiki-config] - https://gerrit.wikimedia.org/r/758530 (https://phabricator.wikimedia.org/T300544) (owner: Eigyan)
[17:22:57] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudbackup1004.eqiad.wmnet with OS buster
[17:23:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:23:03] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudbackup1004.eqiad.wmnet with OS buster
[17:23:05] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudbackup1004.eqiad.wmnet with OS buster
[17:23:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:23:10] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudbackup1004.eqiad.wmnet with OS buster...
[17:23:14] <wikibugs>	 (PS1) Ladsgroup: Revert "db2125: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/758491
[17:23:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P19681 and previous config saved to /var/cache/conftool/dbconfig/20220131-172317-marostegui.json
[17:23:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:23:23] <wikibugs>	 (PS2) Ladsgroup: Revert "db2125: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/758491
[17:23:50] <urandom>	 !log restarting Cassandra, aqs1015-{a,b}, to apply upgrade to 3.11.11 -- T298516
[17:23:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:23:54] <stashbot>	 T298516: Investigate high levels of garbage collection on new AQS nodes - https://phabricator.wikimedia.org/T298516
[17:23:59] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudbackup1004.eqiad.wmnet with OS buster
[17:24:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:24:06] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudbackup1004.eqiad.wmnet with OS buster
[17:24:06] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudbackup1004.eqiad.wmnet with OS buster
[17:24:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:24:10] <wikibugs>	 (CR) Ladsgroup: [V: +2 C: +2] Revert "db2125: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/758491 (owner: Ladsgroup)
[17:24:14] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudbackup1004.eqiad.wmnet with OS buster...
[17:25:46] <wikibugs>	 (PS2) Ottomata: Set spark maxPartitionBytes to hadoop dfs block size [puppet] - https://gerrit.wikimedia.org/r/758529 (https://phabricator.wikimedia.org/T300299)
[17:25:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P19682 and previous config saved to /var/cache/conftool/dbconfig/20220131-172547-marostegui.json
[17:25:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:26:34] <wikibugs>	 (CR) Ottomata: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33514/console" [puppet] - https://gerrit.wikimedia.org/r/758529 (https://phabricator.wikimedia.org/T300299) (owner: Ottomata)
[17:30:35] <wikibugs>	 (CR) Ottomata: [V: +1] "https://puppet-compiler.wmflabs.org/pcc-worker1003/33514/stat1004.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/758529 (https://phabricator.wikimedia.org/T300299) (owner: Ottomata)
[17:32:27] <wikibugs>	 SRE, ops-codfw, DC-Ops, Machine-Learning-Team: (Need By: TBD) rack/setup/install ml-staging200[12] - https://phabricator.wikimedia.org/T294946 (Papaul)
[17:33:47] <wikibugs>	 (PS1) Cwhite: logstash: move safepoint logging flag inside gc_log gate [puppet] - https://gerrit.wikimedia.org/r/758533 (https://phabricator.wikimedia.org/T297239)
[17:38:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P19683 and previous config saved to /var/cache/conftool/dbconfig/20220131-173821-marostegui.json
[17:38:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:40:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T298559)', diff saved to https://phabricator.wikimedia.org/P19684 and previous config saved to /var/cache/conftool/dbconfig/20220131-174052-marostegui.json
[17:40:53] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1160.eqiad.wmnet with reason: Maintenance
[17:40:55] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1160.eqiad.wmnet with reason: Maintenance
[17:40:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:40:58] <stashbot>	 T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559
[17:41:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1160 (T298559)', diff saved to https://phabricator.wikimedia.org/P19685 and previous config saved to /var/cache/conftool/dbconfig/20220131-174059-marostegui.json
[17:41:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:41:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:41:05] <sukhe>	 !log disable puppet on A:rec-dns for T758063
[17:41:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:41:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:42:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T298559)', diff saved to https://phabricator.wikimedia.org/P19686 and previous config saved to /var/cache/conftool/dbconfig/20220131-174206-marostegui.json
[17:42:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:42:12] <wikibugs>	 (CR) Ssingh: [C: +2] pdns: support bullseye [puppet] - https://gerrit.wikimedia.org/r/758063 (https://phabricator.wikimedia.org/T300254) (owner: Majavah)
[17:44:27] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2148.codfw.wmnet with OS bullseye
[17:44:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:45:29] <icinga-wm>	 PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:47:26] <wikibugs>	 (CR) Herron: [C: +2] centrallog: clean up old /srv/syslog/host directories after grace period [puppet] - https://gerrit.wikimedia.org/r/757498 (https://phabricator.wikimedia.org/T300056) (owner: Herron)
[17:48:43] <wikibugs>	 (PS1) Cmjohnson: Didn't see a site.pp entry for new cloudbackup servers [puppet] - https://gerrit.wikimedia.org/r/758537 (https://phabricator.wikimedia.org/T293934)
[17:49:28] <wikibugs>	 (CR) jerkins-bot: [V: -1] Didn't see a site.pp entry for new cloudbackup servers [puppet] - https://gerrit.wikimedia.org/r/758537 (https://phabricator.wikimedia.org/T293934) (owner: Cmjohnson)
[17:51:39] <wikibugs>	 (PS1) Andrew Bogott: Openstack cloudservies: stop installing python2 git [puppet] - https://gerrit.wikimedia.org/r/758538
[17:51:49] <wikibugs>	 (PS2) Cmjohnson: Didn't see a site.pp entry for new cloudbackup servers [puppet] - https://gerrit.wikimedia.org/r/758537 (https://phabricator.wikimedia.org/T293934)
[17:52:51] <wikibugs>	 (CR) jerkins-bot: [V: -1] Didn't see a site.pp entry for new cloudbackup servers [puppet] - https://gerrit.wikimedia.org/r/758537 (https://phabricator.wikimedia.org/T293934) (owner: Cmjohnson)
[17:53:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T300510)', diff saved to https://phabricator.wikimedia.org/P19687 and previous config saved to /var/cache/conftool/dbconfig/20220131-175304-ladsgroup.json
[17:53:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:53:09] <stashbot>	 T300510: Upgrade s2 to Bullseye - https://phabricator.wikimedia.org/T300510
[17:53:21] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Openstack cloudservies: stop installing python2 git [puppet] - https://gerrit.wikimedia.org/r/758538 (owner: Andrew Bogott)
[17:53:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T298558)', diff saved to https://phabricator.wikimedia.org/P19688 and previous config saved to /var/cache/conftool/dbconfig/20220131-175326-marostegui.json
[17:53:28] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[17:53:29] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[17:53:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:53:31] <stashbot>	 T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558
[17:53:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T298558)', diff saved to https://phabricator.wikimedia.org/P19689 and previous config saved to /var/cache/conftool/dbconfig/20220131-175333-marostegui.json
[17:53:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:53:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:53:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:54:04] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase2017.codfw.wmnet with reason: Firmware upgrades
[17:54:06] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase2017.codfw.wmnet with reason: Firmware upgrades
[17:54:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:54:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:54:18] <icinga-wm>	 PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:54:23] <wikibugs>	 (PS3) Cmjohnson: Didn't see a site.pp entry for new cloudbackup servers [puppet] - https://gerrit.wikimedia.org/r/758537 (https://phabricator.wikimedia.org/T293934)
[17:54:45] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase2017.codfw.wmnet
[17:54:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:54:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T298558)', diff saved to https://phabricator.wikimedia.org/P19690 and previous config saved to /var/cache/conftool/dbconfig/20220131-175452-marostegui.json
[17:54:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:55:09] <wikibugs>	 (CR) Cmjohnson: [C: +2] Didn't see a site.pp entry for new cloudbackup servers [puppet] - https://gerrit.wikimedia.org/r/758537 (https://phabricator.wikimedia.org/T293934) (owner: Cmjohnson)
[17:55:16] <wikibugs>	 SRE, ops-codfw, DC-Ops: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 (hnowlan) >>! In T299652#7664448, @Papaul wrote: > Please power down the servers and let me now when this is done  Ideally I'd like to do thi...
[17:57:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P19691 and previous config saved to /var/cache/conftool/dbconfig/20220131-175710-marostegui.json
[17:57:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:58:10] <wikibugs>	 SRE, SRE-Access-Requests, Research: Access to analytics-privatedata-users for Research intern AniketArs - https://phabricator.wikimedia.org/T299919 (AniketArs) Thanks @jhathaway , Now I'm able to login Finally thanks @Miriam
[18:00:04] <jouncebot>	 ryankemper: May I have your attention please! Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220131T1800)
[18:01:11] <logmsgbot>	 !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudbackup1003.eqiad.wmnet with OS buster
[18:01:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:01:18] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudbackup1003.eqiad...
[18:01:31] <wikibugs>	 SRE, SRE-Access-Requests, Research: Access to analytics-privatedata-users for Research intern AniketArs - https://phabricator.wikimedia.org/T299919 (jhathaway) Open→Resolved great, marking as resolved, please reopen if you discover any new issues.
[18:01:36] <moritzm>	 !log installing NSS security updates
[18:01:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:02:00] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudbackup1003.eqiad.wmnet with OS buster
[18:02:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:02:08] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudbackup1003.e...
[18:03:05] <wikibugs>	 (PS1) Ssingh: pdns: update config file to remove deprecated option [puppet] - https://gerrit.wikimedia.org/r/758540
[18:04:30] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[18:04:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:05:20] <wikibugs>	 (CR) Mepps: [C: +1] [wmf-config]: Undeploy gdi survey from cawiki in production [mediawiki-config] - https://gerrit.wikimedia.org/r/758530 (https://phabricator.wikimedia.org/T300544) (owner: Eigyan)
[18:05:34] <wikibugs>	 (CR) Cwhite: [C: +1] "PCC checks out: https://puppet-compiler.wmflabs.org/pcc-worker1003/33517/" [puppet] - https://gerrit.wikimedia.org/r/758533 (https://phabricator.wikimedia.org/T297239) (owner: Cwhite)
[18:06:41] <wikibugs>	 SRE, Traffic: Problem loading thumbnail images due to Envoy (426 Upgrade Required) - https://phabricator.wikimedia.org/T300366 (Esanders) This appears to be affecting Patch demo instances too: https://github.com/MatmaRex/patchdemo/issues/422
[18:06:56] <wikibugs>	 (CR) Andrew Bogott: "won't this break on everything pre-bullseye? The -content option wasn't added until 4.4" [puppet] - https://gerrit.wikimedia.org/r/758540 (owner: Ssingh)
[18:07:47] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:07:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:09:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P19693 and previous config saved to /var/cache/conftool/dbconfig/20220131-180956-marostegui.json
[18:10:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:12:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P19694 and previous config saved to /var/cache/conftool/dbconfig/20220131-181215-marostegui.json
[18:12:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:12:39] <wikibugs>	 (CR) Andrew Bogott: [C: +2] P:toolforge::redis_sentinel: fix hardcoded interface [puppet] - https://gerrit.wikimedia.org/r/758090 (https://phabricator.wikimedia.org/T153810) (owner: Majavah)
[18:13:54] <wikibugs>	 (CR) Andrew Bogott: pdns: update config file to remove deprecated option (1 comment) [puppet] - https://gerrit.wikimedia.org/r/758540 (owner: Ssingh)
[18:17:51] <wikibugs>	 (CR) Herron: [C: +1] logstash: move safepoint logging flag inside gc_log gate [puppet] - https://gerrit.wikimedia.org/r/758533 (https://phabricator.wikimedia.org/T297239) (owner: Cwhite)
[18:25:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P19695 and previous config saved to /var/cache/conftool/dbconfig/20220131-182501-marostegui.json
[18:25:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:27:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T298559)', diff saved to https://phabricator.wikimedia.org/P19696 and previous config saved to /var/cache/conftool/dbconfig/20220131-182719-marostegui.json
[18:27:22] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1149.eqiad.wmnet with reason: Maintenance
[18:27:23] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1149.eqiad.wmnet with reason: Maintenance
[18:27:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:27:26] <stashbot>	 T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559
[18:27:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T298559)', diff saved to https://phabricator.wikimedia.org/P19697 and previous config saved to /var/cache/conftool/dbconfig/20220131-182728-marostegui.json
[18:27:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:27:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:27:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:27:38] <wikibugs>	 (CR) Cwhite: [C: +2] logstash: move safepoint logging flag inside gc_log gate [puppet] - https://gerrit.wikimedia.org/r/758533 (https://phabricator.wikimedia.org/T297239) (owner: Cwhite)
[18:28:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T298559)', diff saved to https://phabricator.wikimedia.org/P19698 and previous config saved to /var/cache/conftool/dbconfig/20220131-182834-marostegui.json
[18:28:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:28:48] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ml-staging2001.mgmt.codfw.wmnet with reboot policy FORCED
[18:28:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:35:06] <wikibugs>	 (PS1) Cwhite: logstash: disable gc logging on logstash collectors [puppet] - https://gerrit.wikimedia.org/r/758541 (https://phabricator.wikimedia.org/T288258)
[18:35:13] <wikibugs>	 (PS1) Muehlenhoff: Update logstash Cumin aliases [puppet] - https://gerrit.wikimedia.org/r/758542
[18:37:04] <wikibugs>	 (CR) Cwhite: [C: +1] "Looks right!  Thanks!" [puppet] - https://gerrit.wikimedia.org/r/758542 (owner: Muehlenhoff)
[18:37:55] <wikibugs>	 SRE, Commons, Data-Persistence (Consultation), MediaWiki-extensions-WikibaseClient, and 4 others: Enable statement usage tracking on Commons and Co - https://phabricator.wikimedia.org/T188730 (Ladsgroup) The problem is that if we want to have the long-term vision in mind, we need to move towards...
[18:39:51] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Update logstash Cumin aliases [puppet] - https://gerrit.wikimedia.org/r/758542 (owner: Muehlenhoff)
[18:40:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T298558)', diff saved to https://phabricator.wikimedia.org/P19699 and previous config saved to /var/cache/conftool/dbconfig/20220131-184006-marostegui.json
[18:40:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:40:11] <stashbot>	 T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558
[18:40:24] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudbackup1003.eqiad.wmnet with OS buster
[18:40:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:40:30] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudbackup1003.eqiad.wmnet with OS buster...
[18:41:17] <wikibugs>	 (CR) Herron: [C: +1] logstash: disable gc logging on logstash collectors [puppet] - https://gerrit.wikimedia.org/r/758541 (https://phabricator.wikimedia.org/T288258) (owner: Cwhite)
[18:41:17] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudbackup1004.eqiad.wmnet with OS buster
[18:41:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:41:24] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudbackup1004.eqiad.wmnet with OS buster
[18:41:24] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudbackup1004.eqiad.wmnet with OS buster
[18:41:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:41:29] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudbackup1004.eqiad.wmnet with OS buster...
[18:41:53] <wikibugs>	 (CR) Cwhite: [C: +2] logstash: disable gc logging on logstash collectors [puppet] - https://gerrit.wikimedia.org/r/758541 (https://phabricator.wikimedia.org/T288258) (owner: Cwhite)
[18:43:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P19700 and previous config saved to /var/cache/conftool/dbconfig/20220131-184339-marostegui.json
[18:43:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:43:50] <wikibugs>	 (PS4) Clare Ming: Update config for idwiki: [mediawiki-config] - https://gerrit.wikimedia.org/r/757500 (https://phabricator.wikimedia.org/T299676)
[18:43:53] <wikibugs>	 (PS4) Giuseppe Lavagetto: Refactor Rakefile [deployment-charts] - https://gerrit.wikimedia.org/r/757977
[18:43:55] <wikibugs>	 (PS3) Giuseppe Lavagetto: Rakefile: switch to using the new check_charts task [deployment-charts] - https://gerrit.wikimedia.org/r/758423
[18:46:08] <icinga-wm>	 RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:46:55] <wikibugs>	 (PS5) Clare Ming: Update config for idwiki: [mediawiki-config] - https://gerrit.wikimedia.org/r/757500 (https://phabricator.wikimedia.org/T299676)
[18:52:11] <icinga-wm>	 PROBLEM - Cassandra instance data free space on restbase2010 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7411 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[18:52:21] <icinga-wm>	 PROBLEM - Disk space on centrallog1001 is CRITICAL: DISK CRITICAL - free space: /srv 32699 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=centrallog1001&var-datasource=eqiad+prometheus/ops
[18:54:15] <icinga-wm>	 PROBLEM - Cassandra instance data free space on restbase2010 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7407 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[18:54:19] <wikibugs>	 SRE, Performance-Team, Traffic, serviceops: Decide on details of progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (Krinkle) a:Krinkle→None
[18:57:54] <wikibugs>	 (CR) Jdlrobson: [C: +1] Update config for idwiki: [mediawiki-config] - https://gerrit.wikimedia.org/r/757500 (https://phabricator.wikimedia.org/T299676) (owner: Clare Ming)
[18:58:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P19701 and previous config saved to /var/cache/conftool/dbconfig/20220131-185843-marostegui.json
[18:58:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:00:04] <jouncebot>	 RoanKattouw and Urbanecm: That opportune time is upon us again. Time for a UTC evening backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220131T1900).
[19:00:04] <jouncebot>	 cjming, nn1l2, and eigyan: A patch you scheduled for UTC evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[19:00:08] <nn1l2>	 hi
[19:00:16] <cjming>	 hello
[19:00:32] <urbanecm>	 hey
[19:00:53] <urbanecm>	 cjming: hi, do you want to deploy today? Or should I?
[19:01:25] <cjming>	 urbanecm - do you mind doing it?  i'm trying to multitask atm which is not my strong suit
[19:01:31] <urbanecm>	 sure
[19:02:10] <urbanecm>	 cjming: should i put your patches at the end? or reviewing doesn't hurt your multitasking that much?
[19:02:32] <cjming>	 that's fine too - and i can take care of them then
[19:02:54] <wikibugs>	 (CR) Urbanecm: [C: +2] Revert "commonswiki: Add leg.journals.isu.ac.ir to the wgCopyUploadsDomains allowlist" [mediawiki-config] - https://gerrit.wikimedia.org/r/758490 (https://phabricator.wikimedia.org/T300217) (owner: 4nn1l2)
[19:03:02] <urbanecm>	 cjming: okay, will ping you when done
[19:03:05] <wikibugs>	 (CR) EllenR: "looks like 2 already, but since it is showing up in my dashboard I will answer" [mediawiki-config] - https://gerrit.wikimedia.org/r/758530 (https://phabricator.wikimedia.org/T300544) (owner: Eigyan)
[19:03:13] <cjming>	 urbanecm: ty!
[19:03:40] <wikibugs>	 (Merged) jenkins-bot: Revert "commonswiki: Add leg.journals.isu.ac.ir to the wgCopyUploadsDomains allowlist" [mediawiki-config] - https://gerrit.wikimedia.org/r/758490 (https://phabricator.wikimedia.org/T300217) (owner: 4nn1l2)
[19:04:05] <eigyan>	 Greetings
[19:04:12] <urbanecm>	 nn1l2: I'll just sync this one, since it's a revert
[19:04:19] <nn1l2>	 thanks!
[19:06:00] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: a659cb0089da0c6d501263c19dd692a286601d2c: Revert "commonswiki: Add leg.journals.isu.ac.ir to the wgCopyUploadsDomains allowlist" (T300217) (duration: 00m 50s)
[19:06:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:06:05] <stashbot>	 T300217: Add leg.journals.isu.ac.ir to the wgCopyUploadsDomains allowlist of Wikimedia Commons - https://phabricator.wikimedia.org/T300217
[19:06:10] <urbanecm>	 nn1l2: live :)
[19:06:20] <wikibugs>	 (PS2) Urbanecm: [wmf-config]: Undeploy gdi survey from cawiki in production [mediawiki-config] - https://gerrit.wikimedia.org/r/758530 (https://phabricator.wikimedia.org/T300544) (owner: Eigyan)
[19:06:20] <nn1l2>	 Thank you!
[19:06:33] <wikibugs>	 (CR) Urbanecm: [C: +2] [wmf-config]: Undeploy gdi survey from cawiki in production [mediawiki-config] - https://gerrit.wikimedia.org/r/758530 (https://phabricator.wikimedia.org/T300544) (owner: Eigyan)
[19:06:54] <eigyan>	 thank you
[19:07:00] <wikibugs>	 (PS2) Andrew Bogott: pdns: update config file to remove deprecated option [puppet] - https://gerrit.wikimedia.org/r/758540 (owner: Ssingh)
[19:07:05] <urbanecm>	 hi eigyan, do you want to test it at mwdebug1001 (once it's there)?
[19:07:28] <urbanecm>	 (as far as i know surveys, it can't be reasonably tested, but i'm not 100% sure)
[19:07:35] <wikibugs>	 (Merged) jenkins-bot: [wmf-config]: Undeploy gdi survey from cawiki in production [mediawiki-config] - https://gerrit.wikimedia.org/r/758530 (https://phabricator.wikimedia.org/T300544) (owner: Eigyan)
[19:07:57] <urbanecm>	 eigyan: it's at mwdebug1001 if you want to test.
[19:08:24] <eigyan>	 Will do urbanecm
[19:08:27] <urbanecm>	 thanks
[19:08:41] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[19:08:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:09:09] <eigyan>	 urbanecm VERIFIED! thank you
[19:09:12] <urbanecm>	 syncing
[19:10:15] <icinga-wm>	 PROBLEM - Cassandra instance data free space on restbase2010 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7371 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[19:10:24] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 411af378c606c0f987679a1eebd901326dd5db18: [wmf-config]: Undeploy gdi survey from cawiki in production (T300544) (duration: 00m 50s)
[19:10:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:10:29] <stashbot>	 T300544: Undeploy the cawiki test survey from production - https://phabricator.wikimedia.org/T300544
[19:10:38] <urbanecm>	 eigyan: and, live
[19:10:57] <urbanecm>	 cjming: I'm done. I can do yours now, or you can self-serve -- up to you.
[19:11:18] <cjming>	 i can self-serve - thanks urbanecm!
[19:11:27] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[19:11:27] <urbanecm>	 great! Ping me if I'm needed then :)
[19:11:28] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[19:11:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:11:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:11:42] <wikibugs>	 (PS3) Clare Ming: Disable A/B test [mediawiki-config] - https://gerrit.wikimedia.org/r/757735 (https://phabricator.wikimedia.org/T297924) (owner: Jdlrobson)
[19:11:53] <eigyan>	 and live ✅
[19:11:55] <wikibugs>	 (PS6) Clare Ming: Update config for idwiki: [mediawiki-config] - https://gerrit.wikimedia.org/r/757500 (https://phabricator.wikimedia.org/T299676)
[19:13:04] <wikibugs>	 (CR) Clare Ming: [C: +2] Update config for idwiki: [mediawiki-config] - https://gerrit.wikimedia.org/r/757500 (https://phabricator.wikimedia.org/T299676) (owner: Clare Ming)
[19:13:45] <wikibugs>	 (Merged) jenkins-bot: Update config for idwiki: [mediawiki-config] - https://gerrit.wikimedia.org/r/757500 (https://phabricator.wikimedia.org/T299676) (owner: Clare Ming)
[19:13:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T298559)', diff saved to https://phabricator.wikimedia.org/P19702 and previous config saved to /var/cache/conftool/dbconfig/20220131-191348-marostegui.json
[19:13:50] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1148.eqiad.wmnet with reason: Maintenance
[19:13:51] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1148.eqiad.wmnet with reason: Maintenance
[19:13:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:13:54] <stashbot>	 T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559
[19:13:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1148 (T298559)', diff saved to https://phabricator.wikimedia.org/P19703 and previous config saved to /var/cache/conftool/dbconfig/20220131-191356-marostegui.json
[19:13:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:13:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:14:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:14:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[19:14:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:17:37] <wikibugs>	 (PS4) Clare Ming: Disable A/B test [mediawiki-config] - https://gerrit.wikimedia.org/r/757735 (https://phabricator.wikimedia.org/T297924) (owner: Jdlrobson)
[19:17:55] <logmsgbot>	 !log cjming@deploy1002 Synchronized wmf-config/config: Config: [[gerrit:757500|Update config for idwiki: (T299676)]] (duration: 00m 50s)
[19:17:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:18:00] <stashbot>	 T299676: Turn on desktop improvements by default on idwiki - https://phabricator.wikimedia.org/T299676
[19:19:12] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[19:19:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:19:51] <wikibugs>	 (CR) Clare Ming: [C: +2] Disable A/B test [mediawiki-config] - https://gerrit.wikimedia.org/r/757735 (https://phabricator.wikimedia.org/T297924) (owner: Jdlrobson)
[19:20:29] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[19:20:30] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[19:20:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:20:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:20:43] <wikibugs>	 (PS3) Andrew Bogott: pdns: update config file to remove deprecated option [puppet] - https://gerrit.wikimedia.org/r/758540 (owner: Ssingh)
[19:21:23] <wikibugs>	 (Merged) jenkins-bot: Disable A/B test [mediawiki-config] - https://gerrit.wikimedia.org/r/757735 (https://phabricator.wikimedia.org/T297924) (owner: Jdlrobson)
[19:21:42] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[19:21:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:21:47] <wikibugs>	 (CR) Andrew Bogott: [C: +2] pdns: update config file to remove deprecated option [puppet] - https://gerrit.wikimedia.org/r/758540 (owner: Ssingh)
[19:24:51] <logmsgbot>	 !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:757735|Disable A/B test (T297924)]] (duration: 00m 49s)
[19:24:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:24:56] <stashbot>	 T297924: Turn A/B test enrollment off and deploy sticky header everywhere - https://phabricator.wikimedia.org/T297924
[19:25:45] <cjming>	 urbanecm: my changes are live - shall i go ahead and close the deployment window?
[19:26:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T298559)', diff saved to https://phabricator.wikimedia.org/P19704 and previous config saved to /var/cache/conftool/dbconfig/20220131-192604-marostegui.json
[19:26:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:26:10] <stashbot>	 T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559
[19:26:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[19:26:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:26:58] <urbanecm>	 cjming: yes please -- you were the last one.
[19:27:19] <cjming>	 !log end of UTC evening backport & config window
[19:27:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:27:49] <urbanecm>	 thanks!
[19:28:05] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[19:28:06] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[19:28:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:28:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:29:22] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[19:29:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:41:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P19705 and previous config saved to /var/cache/conftool/dbconfig/20220131-194109-marostegui.json
[19:41:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:41:57] <icinga-wm>	 PROBLEM - DNS on thumbor2005.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.193.0.182 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:42:46] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-staging2001.mgmt.codfw.wmnet with reboot policy FORCED
[19:42:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:44:58] <wikibugs>	 (PS16) Gehel: elastic: install elasticsearch-oss from component [puppet] - https://gerrit.wikimedia.org/r/757700 (owner: Ryan Kemper)
[19:48:49] <icinga-wm>	 PROBLEM - Cassandra instance data free space on restbase2010 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7267 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[19:55:11] <wikibugs>	 SRE, ops-codfw, DC-Ops, Machine-Learning-Team: (Need By: TBD) rack/setup/install ml-serve200[5-8] - https://phabricator.wikimedia.org/T294945 (Papaul)
[19:55:35] <icinga-wm>	 RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:56:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P19706 and previous config saved to /var/cache/conftool/dbconfig/20220131-195614-marostegui.json
[19:56:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:57:11] <wikibugs>	 (PS3) Jdlrobson: Enable migration mode on all group 0, group 1 and desktop-improvement wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/757733 (https://phabricator.wikimedia.org/T299927)
[20:02:31] <wikibugs>	 SRE, Traffic: Serve redirect wikimediastatus.net --> www.wikimediastatus.net - https://phabricator.wikimedia.org/T300161 (CDanis) Open→Resolved a:CDanis @Volans made the suggestion of using wikitech-static.  Given that status.wikipedia.org is currently served from there, this seems quite reas...
[20:02:38] <wikibugs>	 SRE, SRE-OnFire (FY2021/2022-Q3), Infrastructure-Foundations, SRE Observability (FY2021/2022-Q3): Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (CDanis)
[20:05:25] <icinga-wm>	 PROBLEM - Cassandra instance data free space on restbase2010 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7068 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[20:07:23] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ml-staging2001.mgmt.codfw.wmnet with reboot policy FORCED
[20:07:24] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-staging2001.mgmt.codfw.wmnet with reboot policy FORCED
[20:07:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:07:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:08:51] <wikibugs>	 (PS1) JHathaway: ferm: replace systemd unit to ensure success on boot [puppet] - https://gerrit.wikimedia.org/r/758548
[20:09:11] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ml-staging2001.mgmt.codfw.wmnet with reboot policy FORCED
[20:09:12] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-staging2001.mgmt.codfw.wmnet with reboot policy FORCED
[20:09:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:09:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:10:18] <wikibugs>	 (CR) JHathaway: "Would love a review. I hit this problem on mx1001, but I would love to understand if it is a problem on all ferm hosts with @resolve rules" [puppet] - https://gerrit.wikimedia.org/r/758548 (owner: JHathaway)
[20:10:24] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ml-staging2001.mgmt.codfw.wmnet with reboot policy FORCED
[20:10:24] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-staging2001.mgmt.codfw.wmnet with reboot policy FORCED
[20:10:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:10:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:11:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T298559)', diff saved to https://phabricator.wikimedia.org/P19707 and previous config saved to /var/cache/conftool/dbconfig/20220131-201118-marostegui.json
[20:11:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:11:23] <stashbot>	 T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559
[20:12:59] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ml-staging2001.mgmt.codfw.wmnet with reboot policy FORCED
[20:12:59] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-staging2001.mgmt.codfw.wmnet with reboot policy FORCED
[20:13:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:13:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:14:07] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ml-staging2001.mgmt.codfw.wmnet with reboot policy FORCED
[20:14:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:18:53] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on prometheus2006 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[20:21:13] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-staging2001.mgmt.codfw.wmnet with reboot policy FORCED
[20:21:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:24:30] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ml-staging2002.mgmt.codfw.wmnet with reboot policy FORCED
[20:24:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:26:27] <icinga-wm>	 PROBLEM - Cassandra instance data free space on restbase2010 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7402 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[20:27:45] <wikibugs>	 SRE, ops-codfw, DC-Ops, Machine-Learning-Team: (Need By: TBD) rack/setup/install ml-staging200[12] - https://phabricator.wikimedia.org/T294946 (Papaul) Can someone please update this task with the Partitioning/Raid information?  Thanks.
[20:29:13] <wikibugs>	 (CR) JHathaway: [C: +1] O:mail::mx: Add mx specific block list (1 comment) [puppet] - https://gerrit.wikimedia.org/r/757517 (https://phabricator.wikimedia.org/T270618) (owner: Jbond)
[20:29:48] <wikibugs>	 (CR) Jcrespo: [C: +1] "Will merge tomorrow." [puppet] - https://gerrit.wikimedia.org/r/757509 (https://phabricator.wikimedia.org/T300171) (owner: Dzahn)
[20:31:26] <wikibugs>	 (CR) JHathaway: P:installserver::proxy: Add domain whitelist to proxy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T298087) (owner: Jbond)
[20:31:55] <wikibugs>	 (CR) JHathaway: [C: +1] C:mw_rc_irc::ircserver: Refresh ircd services on config changes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/753046 (https://phabricator.wikimedia.org/T284052) (owner: Jbond)
[20:33:07] <wikibugs>	 (CR) Dzahn: [C: +1] "ok, cool, will let you merge it. thank you!" [puppet] - https://gerrit.wikimedia.org/r/757509 (https://phabricator.wikimedia.org/T300171) (owner: Dzahn)
[20:33:25] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-staging2002.mgmt.codfw.wmnet with reboot policy FORCED
[20:33:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:36:03] <wikibugs>	 (PS1) Volans: dhcp: case-insensitive match if Dell serial number [software/spicerack] - https://gerrit.wikimedia.org/r/758558
[20:39:48] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host etherpad1003.eqiad.wmnet
[20:39:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:41:47] <wikibugs>	 SRE, Wikimedia-Etherpad, serviceops, vm-requests: create bullseye VM for Etherpad upgrade - https://phabricator.wikimedia.org/T300568 (Dzahn)
[20:43:30] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on restbase1027 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service,cassandra-b.service,cassandra-c.service eevans File permissions are preventing Cassandra startup. Fallout from Buster migration? https://phabricator.wikimedia.org/T295375#7663733 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:43:30] <icinga-wm>	 ACKNOWLEDGEMENT - cassandra-a CQL 10.64.48.184:9042 on restbase1027 is CRITICAL: connect to address 10.64.48.184 and port 9042: Connection refused eevans File permissions are preventing Cassandra startup. Fallout from Buster migration? https://phabricator.wikimedia.org/T295375#7663733 https://phabricator.wikimedia.org/T93886
[20:43:30] <icinga-wm>	 ACKNOWLEDGEMENT - cassandra-a SSL 10.64.48.184:7001 on restbase1027 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused eevans File permissions are preventing Cassandra startup. Fallout from Buster migration? https://phabricator.wikimedia.org/T295375#7663733 https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[20:43:30] <icinga-wm>	 ACKNOWLEDGEMENT - cassandra-a service on restbase1027 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed eevans File permissions are preventing Cassandra startup. Fallout from Buster migration? https://phabricator.wikimedia.org/T295375#7663733 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:43:30] <icinga-wm>	 ACKNOWLEDGEMENT - cassandra-b CQL 10.64.48.185:9042 on restbase1027 is CRITICAL: connect to address 10.64.48.185 and port 9042: Connection refused eevans File permissions are preventing Cassandra startup. Fallout from Buster migration? https://phabricator.wikimedia.org/T295375#7663733 https://phabricator.wikimedia.org/T93886
[20:43:30] <icinga-wm>	 ACKNOWLEDGEMENT - cassandra-b SSL 10.64.48.185:7001 on restbase1027 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused eevans File permissions are preventing Cassandra startup. Fallout from Buster migration? https://phabricator.wikimedia.org/T295375#7663733 https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[20:43:30] <icinga-wm>	 ACKNOWLEDGEMENT - cassandra-b service on restbase1027 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed eevans File permissions are preventing Cassandra startup. Fallout from Buster migration? https://phabricator.wikimedia.org/T295375#7663733 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:43:31] <icinga-wm>	 ACKNOWLEDGEMENT - cassandra-c CQL 10.64.48.186:9042 on restbase1027 is CRITICAL: connect to address 10.64.48.186 and port 9042: Connection refused eevans File permissions are preventing Cassandra startup. Fallout from Buster migration? https://phabricator.wikimedia.org/T295375#7663733 https://phabricator.wikimedia.org/T93886
[20:43:32] <icinga-wm>	 ACKNOWLEDGEMENT - cassandra-c SSL 10.64.48.186:7001 on restbase1027 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused eevans File permissions are preventing Cassandra startup. Fallout from Buster migration? https://phabricator.wikimedia.org/T295375#7663733 https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[20:43:32] <icinga-wm>	 ACKNOWLEDGEMENT - cassandra-c service on restbase1027 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed eevans File permissions are preventing Cassandra startup. Fallout from Buster migration? https://phabricator.wikimedia.org/T295375#7663733 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:43:49] <mutante>	 scary but nice:) ty
[20:44:05] <wikibugs>	 SRE, ops-codfw, DC-Ops, Machine-Learning-Team: (Need By: TBD) rack/setup/install ml-staging200[12] - https://phabricator.wikimedia.org/T294946 (Papaul)
[20:50:12] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host etherpad1003.eqiad.wmnet
[20:50:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:52:27] <icinga-wm>	 PROBLEM - Cassandra instance data free space on restbase2010 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7407 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[20:52:29] <wikibugs>	 (CR) JHathaway: [C: +1] "looks good, would be great to get this merged into extlib" [puppet] - https://gerrit.wikimedia.org/r/753786 (owner: Jbond)
[20:54:39] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[20:54:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:54:55] <wikibugs>	 (PS1) Dzahn: DHCP: add MAC address for etherpad1003 [puppet] - https://gerrit.wikimedia.org/r/758559 (https://phabricator.wikimedia.org/T300568)
[20:56:12] <wikibugs>	 (PS2) Dzahn: DHCP: add MAC address for etherpad1003, use bullseye installer [puppet] - https://gerrit.wikimedia.org/r/758559 (https://phabricator.wikimedia.org/T300568)
[20:57:05] <wikibugs>	 (CR) Dzahn: [C: +2] DHCP: add MAC address for etherpad1003, use bullseye installer [puppet] - https://gerrit.wikimedia.org/r/758559 (https://phabricator.wikimedia.org/T300568) (owner: Dzahn)
[21:00:05] <jouncebot>	 chrisalbon and accraze: My dear minions, it's time we take the moon! Just kidding. Time for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220131T2100).
[21:01:00] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:01:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:01:33] <wikibugs>	 SRE, ops-codfw, DC-Ops: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 (Papaul)
[21:01:41] <wikibugs>	 SRE, ops-codfw, DC-Ops: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 (Papaul)
[21:12:15] <wikibugs>	 SRE, serviceops, Patch-For-Review: Remove mediawiki::packages::fonts from non thumbor servers - https://phabricator.wikimedia.org/T294378 (Dzahn) @Joe The one package I was talking about is "ttf-bitstream-vera" which gets installed when you remove "fonts-dejacu-core*" and since I did "--purge fonts*"...
[21:12:35] <icinga-wm>	 RECOVERY - Check systemd state on restbase1027 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:15:39] <mutante>	 !log installed bullseye on new VM etherpad1003, signing puppet certs for etherpad1003.eqiad.wmnet - puppet error expected until we add the role (T300568)
[21:15:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:15:45] <stashbot>	 T300568: create bullseye VM for Etherpad upgrade - https://phabricator.wikimedia.org/T300568
[21:17:10] <wikibugs>	 (PS1) Dzahn: site: add etherpad1003 with insetup role [puppet] - https://gerrit.wikimedia.org/r/758560 (https://phabricator.wikimedia.org/T300568)
[21:17:25] <wikibugs>	 SRE, ops-codfw, DC-Ops, Machine-Learning-Team: (Need By: TBD) rack/setup/install ml-serve200[5-8] - https://phabricator.wikimedia.org/T294945 (Papaul)
[21:17:46] <wikibugs>	 (CR) Dzahn: [C: +2] site: add etherpad1003 with insetup role [puppet] - https://gerrit.wikimedia.org/r/758560 (https://phabricator.wikimedia.org/T300568) (owner: Dzahn)
[21:19:35] <icinga-wm>	 PROBLEM - Check systemd state on restbase1027 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service,cassandra-b.service,cassandra-c.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:22:38] <wikibugs>	 (PS1) Dzahn: switch etherpad.discovery.wmnet to etherpad1003 [dns] - https://gerrit.wikimedia.org/r/758561 (https://phabricator.wikimedia.org/T300568)
[21:23:19] <wikibugs>	 (PS1) Dzahn: site: add etherpad role to etherpad1003 [puppet] - https://gerrit.wikimedia.org/r/758562 (https://phabricator.wikimedia.org/T300568)
[21:23:50] <wikibugs>	 SRE, Wikimedia-Etherpad, serviceops, vm-requests, Patch-For-Review: create bullseye VM for Etherpad upgrade (and upgrade it:) - https://phabricator.wikimedia.org/T300568 (Dzahn)
[21:25:32] <wikibugs>	 SRE, Wikimedia-Etherpad, serviceops, vm-requests, Patch-For-Review: create bullseye VM for Etherpad upgrade (and upgrade it:) - https://phabricator.wikimedia.org/T300568 (Dzahn) a:Dzahn→None
[21:25:34] <wikibugs>	 (Abandoned) Majavah: Bare minimum port to Python 3 to support Debian Bullseye [debs/prometheus-pdns-rec-exporter] - https://gerrit.wikimedia.org/r/758068 (https://phabricator.wikimedia.org/T300254) (owner: Majavah)
[21:28:09] <wikibugs>	 SRE, Wikimedia-Etherpad, serviceops, vm-requests, Patch-For-Review: create bullseye VM for Etherpad upgrade (and upgrade it:) - https://phabricator.wikimedia.org/T300568 (Dzahn)
[21:28:54] <wikibugs>	 (CR) Dzahn: [C: -2] "Have to be careful because I don't want to repeat what happened last time, got reminded when I read the old ticket: T224580#5828883" [puppet] - https://gerrit.wikimedia.org/r/758562 (https://phabricator.wikimedia.org/T300568) (owner: Dzahn)
[21:30:27] <wikibugs>	 (CR) Dzahn: [C: -2] "so yea, we need to coordinate and mask the service on one server before starting it on the other etc..." [puppet] - https://gerrit.wikimedia.org/r/758562 (https://phabricator.wikimedia.org/T300568) (owner: Dzahn)
[21:31:40] <wikibugs>	 (CR) Dzahn: [C: -2] "this will be the last step after everything is confirmed working. just pre-created it but not ready" [dns] - https://gerrit.wikimedia.org/r/758561 (https://phabricator.wikimedia.org/T300568) (owner: Dzahn)
[21:35:14] <wikibugs>	 SRE, Wikimedia-Etherpad, serviceops, vm-requests: create bullseye VM for Etherpad upgrade (and upgrade it:) - https://phabricator.wikimedia.org/T300568 (Dzahn)
[21:41:41] <wikibugs>	 SRE, Wikimedia-Etherpad, serviceops, vm-requests: create bullseye VM for Etherpad upgrade (and upgrade it:) - https://phabricator.wikimedia.org/T300568 (Dzahn) @akosiaris   Looking at the old ticket when we upgraded to buster, I don't want to repeat the mistake and run Etherpad on 2 servers at a...
[21:46:26] <wikibugs>	 (PS5) JHathaway: [WIP] team-sre: add hardware-related checks [alerts] - https://gerrit.wikimedia.org/r/757489 (https://phabricator.wikimedia.org/T294564) (owner: Volans)
[21:46:57] <wikibugs>	 (CR) JHathaway: [WIP] team-sre: add hardware-related checks (1 comment) [alerts] - https://gerrit.wikimedia.org/r/757489 (https://phabricator.wikimedia.org/T294564) (owner: Volans)
[21:57:40] <wikibugs>	 SRE, SRE-Access-Requests, Analytics: Add dhorn to analytics-privatedata-users - https://phabricator.wikimedia.org/T300579 (DannyH)
[21:58:21] <wikibugs>	 SRE, SRE-Access-Requests, Analytics: Add dhorn to analytics-privatedata-users - https://phabricator.wikimedia.org/T300579 (DannyH) I hope that I've done this correctly; please let me know if I've made a mistake. Thanks!
[22:00:05] <jouncebot>	 Reedy and sbassett: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220131T2200).
[22:11:58] <icinga-wm>	 PROBLEM - Cassandra instance data free space on restbase2010 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7124 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[22:12:29] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet2004-dev.codfw.wmnet with OS bullseye
[22:12:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:18:11] <wikibugs>	 (CR) Dzahn: [C: -1] "This change is ready for review." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/736074 (owner: Dzahn)
[22:19:16] <wikibugs>	 (PS11) Dzahn: snapshot: replace the word cron everywhere [puppet] - https://gerrit.wikimedia.org/r/736074
[22:19:52] <icinga-wm>	 PROBLEM - Cassandra instance data free space on restbase2010 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7137 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[22:19:53] <wikibugs>	 (CR) jerkins-bot: [V: -1] snapshot: replace the word cron everywhere [puppet] - https://gerrit.wikimedia.org/r/736074 (owner: Dzahn)
[22:21:53] <wikibugs>	 (PS12) Dzahn: snapshot: replace the word cron everywhere [puppet] - https://gerrit.wikimedia.org/r/736074
[22:21:59] <urbanecm>	 sbassett: Reedy: hi, if none of you is deploying something, is it ok for me to roll https://phabricator.wikimedia.org/T298312#7663152 out?
[22:22:13] <wikibugs>	 SRE, SRE-Access-Requests, Analytics: Add dhorn to analytics-privatedata-users - https://phabricator.wikimedia.org/T300579 (RhinosF1) Adding Andrew & Olja as they normally approve for this group.  @DannyH: it looks good. @Ladsgroup is on clinic duty this week and will pick it up for you! Please get yo...
[22:22:46] <sbassett>	 urbanecm: Yep, feel free.  Thanks.
[22:22:52] <urbanecm>	 thanks sbassett 
[22:26:15] <wikibugs>	 (PS13) Dzahn: snapshot: replace the word cron everywhere [puppet] - https://gerrit.wikimedia.org/r/736074
[22:29:00] <wikibugs>	 (CR) Dzahn: [V: -1] "https://puppet-compiler.wmflabs.org/pcc-worker1003/33523/snapshot1008.eqiad.wmnet/index.html compiles now but there is still something bad" [puppet] - https://gerrit.wikimedia.org/r/736074 (owner: Dzahn)
[22:32:22] <wikibugs>	 (PS14) Dzahn: snapshot: replace the word cron everywhere [puppet] - https://gerrit.wikimedia.org/r/736074
[22:32:59] <wikibugs>	 (CR) jerkins-bot: [V: -1] snapshot: replace the word cron everywhere [puppet] - https://gerrit.wikimedia.org/r/736074 (owner: Dzahn)
[22:36:29] <wikibugs>	 (PS15) Dzahn: snapshot: replace the word cron everywhere [puppet] - https://gerrit.wikimedia.org/r/736074
[22:38:46] <urbanecm>	 !log Deploy security patch for T298312
[22:38:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:40:36] <wikibugs>	 (CR) Dzahn: [V: +1] "@ArielGlenn finally got back to this to get it done with. now it passes jenkins and compiles and I see no changes anymore INSIDE files/tem" [puppet] - https://gerrit.wikimedia.org/r/736074 (owner: Dzahn)
[22:40:58] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[22:41:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:42:01] <icinga-wm>	 RECOVERY - cassandra-c service on restbase1027 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:42:04] <urbanecm>	 sbassett: all done. 
[22:42:12] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[22:42:13] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[22:42:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:42:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:42:50] <wikibugs>	 SRE, SRE-Access-Requests, Analytics: Add dhorn to analytics-privatedata-users - https://phabricator.wikimedia.org/T300579 (Ottomata) Approved!
[22:43:23] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[22:43:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:43:28] <wikibugs>	 SRE, SRE-Access-Requests, Analytics: Add dhorn to analytics-privatedata-users - https://phabricator.wikimedia.org/T300579 (Ottomata) Looks like Danny will not need shell access, just ssh-keyless group membership.
[22:43:36] <wikibugs>	 (CR) Ebernhardson: [C: +1] rdf-streaming-updater: add the reconciliation stream (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/753788 (https://phabricator.wikimedia.org/T279541) (owner: DCausse)
[22:47:19] <icinga-wm>	 PROBLEM - cassandra-c service on restbase1027 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:50:27] <wikibugs>	 (PS3) Ebernhardson: Provide a specific user agent when checking servers [debs/pybal] - https://gerrit.wikimedia.org/r/743222
[22:54:40] <wikibugs>	 (PS1) PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/758575
[23:03:44] <logmsgbot>	 !log bking@deploy1002 Started deploy [wdqs/wdqs@f0287fb]: 0.3.101
[23:03:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:04:41] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudnet2004-dev.codfw.wmnet with OS bullseye
[23:04:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:06:57] <inflatador>	 !log [WDQS Deploy] Tests passing following deploy of 0.3.101 on canary `wdqs1003`; proceeding to rest of fleet
[23:06:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:12:02] <logmsgbot>	 !log bking@deploy1002 Finished deploy [wdqs/wdqs@f0287fb]: 0.3.101 (duration: 08m 18s)
[23:12:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:14:20] <wikibugs>	 (PS1) Andrew Bogott: OpenStack Neutron: install iptables on Bullseye [puppet] - https://gerrit.wikimedia.org/r/758578
[23:15:21] <wikibugs>	 (CR) Andrew Bogott: [C: +2] OpenStack Neutron: install iptables on Bullseye [puppet] - https://gerrit.wikimedia.org/r/758578 (owner: Andrew Bogott)
[23:16:30] <inflatador>	 !log [WDQS Deploy] Restarted `wdqs-updater` across all hosts, 4 hosts at a time: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'`
[23:16:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:16:51] <inflatador>	 !log [WDQS Deploy] Restarted `wdqs-categories` across all test hosts simultaneously: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'`
[23:16:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:17:04] <inflatador>	 !log [WDQS Deploy] Restarting `wdqs-categories` across lvs-managed hosts, one node at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'`
[23:17:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:26:35] <logmsgbot>	 !log bking@deploy1002 Started deploy [wdqs/wdqs@f0287fb] (wcqs): Deploy 0.3.101 to WCQS
[23:26:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:26:41] <icinga-wm>	 PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7370 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[23:28:14] <inflatador>	 !log [WCQS Deploy] Tests look good following deploy of `0.3.101` to canary `wcqs1002.eqiad.wmnet`, proceeding to rest of fleet
[23:28:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:29:15] <logmsgbot>	 !log bking@deploy1002 Finished deploy [wdqs/wdqs@f0287fb] (wcqs): Deploy 0.3.101 to WCQS (duration: 02m 39s)
[23:29:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:31:17] <inflatador>	 !log [WCQS Deploy] Restarted `wcqs-updater` across all hosts: `sudo cumin -b 6 'wcqs*' 'sudo systemctl restart wcqs-updater'`
[23:31:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:31:37] <wikibugs>	 (PS2) Ryan Kemper: Add cname for commons-query.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/717606 (https://phabricator.wikimedia.org/T282117) (owner: Ebernhardson)
[23:39:12] <wikibugs>	 (CR) Dduvall: [C: +2] blubberoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/758575 (owner: PipelineBot)
[23:42:53] <wikibugs>	 (Merged) jenkins-bot: blubberoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/758575 (owner: PipelineBot)
[23:43:29] <icinga-wm>	 PROBLEM - Check systemd state on doc1001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc1002.eqiad.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:44:10] <logmsgbot>	 !log dduvall@deploy1002 helmfile [staging] START helmfile.d/services/blubberoid: apply on staging
[23:44:12] <logmsgbot>	 !log dduvall@deploy1002 helmfile [staging] DONE helmfile.d/services/blubberoid: apply on production
[23:44:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:44:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:44:27] <logmsgbot>	 !log dduvall@deploy1002 helmfile [staging] DONE helmfile.d/services/blubberoid: sync on staging
[23:44:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:48:02] <wikibugs>	 SRE, SRE-Access-Requests, Analytics: Add dhorn to analytics-privatedata-users - https://phabricator.wikimedia.org/T300579 (CDunn) Approved
[23:49:12] <logmsgbot>	 !log dduvall@deploy1002 helmfile [codfw] START helmfile.d/services/blubberoid: apply on production
[23:49:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:49:15] <logmsgbot>	 !log dduvall@deploy1002 helmfile [codfw] DONE helmfile.d/services/blubberoid: apply on staging
[23:49:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:49:53] <logmsgbot>	 !log dduvall@deploy1002 helmfile [codfw] DONE helmfile.d/services/blubberoid: sync on production
[23:49:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:50:16] <logmsgbot>	 !log dduvall@deploy1002 helmfile [eqiad] START helmfile.d/services/blubberoid: apply on production
[23:50:18] <logmsgbot>	 !log dduvall@deploy1002 helmfile [eqiad] DONE helmfile.d/services/blubberoid: apply on staging
[23:50:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:50:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:50:43] <logmsgbot>	 !log dduvall@deploy1002 helmfile [eqiad] DONE helmfile.d/services/blubberoid: sync on production
[23:50:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:56:03] <icinga-wm>	 PROBLEM - Cassandra instance data free space on restbase2010 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7256 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[23:56:46] <wikibugs>	 (PS1) Gergő Tisza: Beta: Replace mediawiki11 with mediawiki12 [puppet] - https://gerrit.wikimedia.org/r/758584 (https://phabricator.wikimedia.org/T300591)
[23:58:11] <tgr>	 hello! any sre around for a quick beta-only puppet patch review? The whole beta cluster is broken: https://phabricator.wikimedia.org/T300591