[00:23:55] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:26:15] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:42:21] RECOVERY - Check systemd state on apifeatureusage1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:43:25] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:49:31] PROBLEM - Check systemd state on apifeatureusage1001 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_apifeatureusage_codfw.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:55:29] RECOVERY - Disk space on centrallog1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=centrallog1001&var-datasource=eqiad+prometheus/ops [03:27:13] (IcingaOverload) firing: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [03:37:13] (IcingaOverload) resolved: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [04:47:53] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (bast6001), Fresh: 104 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [04:50:01] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL 
- AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:51:37] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [05:55:20] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [05:55:21] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [05:55:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:55:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:56:33] SRE, SRE-tools, DBA, Infrastructure-Foundations, and 2 others: Switchover m2 master (db1183 -> db1159) - https://phabricator.wikimedia.org/T300329 (Marostegui) Thanks everyone! I will get this scheduled for Thursday 3rd Feb at 9:00AM UTC [05:56:43] SRE, SRE-tools, DBA, Infrastructure-Foundations, and 2 others: Switchover m2 master (db1183 -> db1159) - https://phabricator.wikimedia.org/T300329 (Marostegui) [05:58:01] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.81 ms [05:59:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1113 (s5,s6) T299479', diff saved to https://phabricator.wikimedia.org/P19570 and previous config saved to /var/cache/conftool/dbconfig/20220131-055947-marostegui.json [05:59:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:59:52] T299479: Upgrade s6 to Bullseye - https://phabricator.wikimedia.org/T299479 [06:00:42] (PS1) Marostegui: db1113: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/758271 (https://phabricator.wikimedia.org/T299479) [06:02:27] (CR) Marostegui: [C: +2] db1113: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/758271 (https://phabricator.wikimedia.org/T299479) (owner: Marostegui) [06:03:20] !log marostegui@cumin1001 START - Cookbook 
sre.hosts.downtime for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [06:03:22] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [06:03:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:03:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:03:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1123 (T298559)', diff saved to https://phabricator.wikimedia.org/P19571 and previous config saved to /var/cache/conftool/dbconfig/20220131-060326-marostegui.json [06:03:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:03:31] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559 [06:04:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1113.eqiad.wmnet with OS bullseye [06:04:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:07:51] (PS1) Marostegui: drop_ft_title_ft_namesapce_T297189.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/758280 (https://phabricator.wikimedia.org/T297189) [06:11:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove logpager from s4 eqiad T263127', diff saved to https://phabricator.wikimedia.org/P19572 and previous config saved to /var/cache/conftool/dbconfig/20220131-061121-marostegui.json [06:11:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:11:27] T263127: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 [06:12:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123 (T298559)', diff saved to https://phabricator.wikimedia.org/P19573 and previous config saved to /var/cache/conftool/dbconfig/20220131-061219-marostegui.json [06:12:23] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:12:23] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559 [06:18:39] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [06:20:47] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [06:27:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123', diff saved to https://phabricator.wikimedia.org/P19574 and previous config saved to /var/cache/conftool/dbconfig/20220131-062723-marostegui.json [06:27:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:29:42] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1113.eqiad.wmnet with OS bullseye [06:29:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:29:56] (PS1) Marostegui: switchover-tmpl.sh: Changed notes [software] - https://gerrit.wikimedia.org/r/758286 [06:34:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3315 (re)pooling @ 25%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19575 and previous config saved to /var/cache/conftool/dbconfig/20220131-063437-root.json [06:34:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:34:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3316 (re)pooling @ 25%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19576 and previous config saved to /var/cache/conftool/dbconfig/20220131-063448-root.json [06:34:52] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:38:59] RECOVERY - Disk space on prometheus1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus1003&var-datasource=eqiad+prometheus/ops [06:42:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123', diff saved to https://phabricator.wikimedia.org/P19577 and previous config saved to /var/cache/conftool/dbconfig/20220131-064228-marostegui.json [06:42:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:48:53] (PS1) Marostegui: Revert "db1113: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/758024 [06:49:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3315 (re)pooling @ 50%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19578 and previous config saved to /var/cache/conftool/dbconfig/20220131-064941-root.json [06:49:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:49:49] (CR) Marostegui: [C: +2] Revert "db1113: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/758024 (owner: Marostegui) [06:49:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3316 (re)pooling @ 50%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19579 and previous config saved to /var/cache/conftool/dbconfig/20220131-064952-root.json [06:49:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:51:44] (PS1) Marostegui: s5 codfw hosts: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/758288 (https://phabricator.wikimedia.org/T300473) [06:52:57] (CR) Marostegui: [C: +2] s5 codfw hosts: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/758288 (https://phabricator.wikimedia.org/T300473) (owner: Marostegui) [06:54:27] !log marostegui@cumin1001 
START - Cookbook sre.hosts.reimage for host db2137.codfw.wmnet with OS bullseye [06:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:55:55] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db2128.codfw.wmnet with OS bullseye [06:55:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:57:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123 (T298559)', diff saved to https://phabricator.wikimedia.org/P19580 and previous config saved to /var/cache/conftool/dbconfig/20220131-065733-marostegui.json [06:57:34] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [06:57:36] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [06:57:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:57:38] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559 [06:57:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:57:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:04:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3315 (re)pooling @ 75%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19581 and previous config saved to /var/cache/conftool/dbconfig/20220131-070444-root.json [07:04:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:04:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3316 (re)pooling @ 75%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19582 and previous config saved to /var/cache/conftool/dbconfig/20220131-070456-root.json [07:04:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:05:35] 
!log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [07:05:36] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [07:05:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:56] PROBLEM - Disk space on elastic2035 is CRITICAL: DISK CRITICAL - free space: / 1087 MB (4% inode=94%): /tmp 1087 MB (4% inode=94%): /var/tmp 1087 MB (4% inode=94%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic2035&var-datasource=codfw+prometheus/ops [07:13:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [07:13:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [07:13:42] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [07:13:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [07:13:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T298559)', diff saved to https://phabricator.wikimedia.org/P19583 and previous config saved to /var/cache/conftool/dbconfig/20220131-071350-marostegui.json [07:13:51] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:56] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559 [07:19:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3315 (re)pooling @ 100%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19584 and previous config saved to /var/cache/conftool/dbconfig/20220131-071948-root.json [07:19:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1113:3316 (re)pooling @ 100%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19585 and previous config saved to /var/cache/conftool/dbconfig/20220131-071959-root.json [07:20:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:22:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T298559)', diff saved to https://phabricator.wikimedia.org/P19586 and previous config saved to /var/cache/conftool/dbconfig/20220131-072249-marostegui.json [07:22:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:22:54] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559 [07:28:08] RECOVERY - Disk space on elastic2035 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic2035&var-datasource=codfw+prometheus/ops [07:29:22] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2137.codfw.wmnet with OS bullseye [07:29:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db2113.codfw.wmnet 
with OS bullseye [07:29:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:32:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2128.codfw.wmnet with OS bullseye [07:32:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db2111.codfw.wmnet with OS bullseye [07:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P19587 and previous config saved to /var/cache/conftool/dbconfig/20220131-073754-marostegui.json [07:37:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:11] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db2075.codfw.wmnet with OS bullseye [07:39:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:37] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on ganeti1010.eqiad.wmnet with reason: Remove from Ganeti cluster for reimage [07:50:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on ganeti1010.eqiad.wmnet with reason: Remove from Ganeti cluster for reimage [07:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:07] SRE, ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (MoritzMuehlenhoff) [07:52:18] SRE, ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (MoritzMuehlenhoff) One more server is ready and 
downtimed; ganeti1010 [07:52:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P19588 and previous config saved to /var/cache/conftool/dbconfig/20220131-075258-marostegui.json [07:53:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:22] (CR) Ladsgroup: [C: +1] drop_ft_title_ft_namesapce_T297189.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/758280 (https://phabricator.wikimedia.org/T297189) (owner: Marostegui) [08:00:04] (PS1) Marostegui: Revert "s5 codfw hosts: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/758315 [08:01:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2113.codfw.wmnet with OS bullseye [08:01:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:23] (CR) Filippo Giunchedi: thanos::frontend: fix envoy configuration (1 comment) [puppet] - https://gerrit.wikimedia.org/r/757452 (https://phabricator.wikimedia.org/T300119) (owner: Giuseppe Lavagetto) [08:04:35] (CR) Filippo Giunchedi: "Untested but LGTM" [debs/prometheus-pdns-rec-exporter] - https://gerrit.wikimedia.org/r/758068 (https://phabricator.wikimedia.org/T300254) (owner: Majavah) [08:04:52] (CR) Marostegui: [V: +2 C: +2] drop_ft_title_ft_namesapce_T297189.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/758280 (https://phabricator.wikimedia.org/T297189) (owner: Marostegui) [08:05:44] (CR) Marostegui: [C: +2] Revert "s5 codfw hosts: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/758315 (owner: Marostegui) [08:06:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2111.codfw.wmnet with OS bullseye [08:06:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:24] (PS1) 
Marostegui: db2123: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/758419 (https://phabricator.wikimedia.org/T300473) [08:08:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T298559)', diff saved to https://phabricator.wikimedia.org/P19589 and previous config saved to /var/cache/conftool/dbconfig/20220131-080803-marostegui.json [08:08:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:08] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [08:08:08] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559 [08:08:09] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [08:08:10] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 6 hosts with reason: Maintenance [08:08:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 6 hosts with reason: Maintenance [08:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:54] (CR) Marostegui: [C: +2] db2123: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/758419 (https://phabricator.wikimedia.org/T300473) (owner: Marostegui) [08:09:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db2123.codfw.wmnet with OS bullseye [08:09:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:14] (CR) Filippo Giunchedi: apifeatureusage: disable gc logging (1 comment) [puppet] - 
https://gerrit.wikimedia.org/r/757955 (https://phabricator.wikimedia.org/T297239) (owner: Cwhite) [08:11:53] (PS1) Marostegui: production.my.cnf: innodb_checksum_algorithm=full_crc32 [puppet] - https://gerrit.wikimedia.org/r/758420 (https://phabricator.wikimedia.org/T287244) [08:12:22] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2075.codfw.wmnet with OS bullseye [08:12:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:12] (CR) Marostegui: [C: +2] production.my.cnf: innodb_checksum_algorithm=full_crc32 [puppet] - https://gerrit.wikimedia.org/r/758420 (https://phabricator.wikimedia.org/T287244) (owner: Marostegui) [08:13:53] (CR) Filippo Giunchedi: [C: +2] site: add Prometheus role to eqiad hardware [puppet] - https://gerrit.wikimedia.org/r/756604 (https://phabricator.wikimedia.org/T296199) (owner: Filippo Giunchedi) [08:13:59] (PS5) Filippo Giunchedi: site: add Prometheus role to eqiad hardware [puppet] - https://gerrit.wikimedia.org/r/756604 (https://phabricator.wikimedia.org/T296199) [08:14:17] (PS1) Gehel: Revert "cirrussearch: Reenable saneitizer" [puppet] - https://gerrit.wikimedia.org/r/758317 [08:14:53] (CR) DCausse: [C: +1] Revert "cirrussearch: Reenable saneitizer" [puppet] - https://gerrit.wikimedia.org/r/758317 (owner: Gehel) [08:16:48] (CR) Gehel: [C: +2] Revert "cirrussearch: Reenable saneitizer" [puppet] - https://gerrit.wikimedia.org/r/758317 (owner: Gehel) [08:19:29] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:21:05] !log Set innodb_adaptive_hash_index=OFF on es2028, es2029, es2026 T268869 [08:21:09] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:10] T268869: Consider setting innodb_adaptive_hash_index=OFF by default - https://phabricator.wikimedia.org/T268869 [08:21:47] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet [08:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:06] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus1006.eqiad.wmnet [08:22:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:15] !log Set innodb_adaptive_hash_index=OFF on es2020, es2024 T268869 [08:22:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:27] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus2006.codfw.wmnet [08:23:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [08:25:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [08:25:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T298559)', diff saved to https://phabricator.wikimedia.org/P19590 and previous config saved to /var/cache/conftool/dbconfig/20220131-082534-marostegui.json [08:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:39] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559 [08:29:13] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet [08:29:16] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:44] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1006.eqiad.wmnet [08:29:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:01] (CirrusSearchJVMGCOldPoolFlatlined) firing: (25) Elasticsearch instance elastic2026-production-search-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org [08:32:28] (ThanosRuleHighRuleEvaluationFailures) firing: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org [08:34:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T298559)', diff saved to https://phabricator.wikimedia.org/P19591 and previous config saved to /var/cache/conftool/dbconfig/20220131-083432-marostegui.json [08:34:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:37] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559 [08:34:45] SRE, LDAP-Access-Requests, WMF-NDA-Requests: Clarify whether members of ldap/nda should be added to #WMF-NDA - https://phabricator.wikimedia.org/T299839 (Ladsgroup) p:Triage→Medium [08:36:01] (CirrusSearchJVMGCOldPoolFlatlined) firing: (70) Elasticsearch instance elastic2025-production-search-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org [08:36:53] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:37:28] (ThanosRuleHighRuleEvaluationFailures) resolved: (2) 
Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org [08:38:13] the thanos rule alert is me [08:38:59] SRE, DNS, Domains, Traffic, and 2 others: Project Unseen campaign URL redirect - https://phabricator.wikimedia.org/T300398 (Ladsgroup) [08:41:01] (CirrusSearchJVMGCOldPoolFlatlined) firing: (70) Elasticsearch instance elastic2025-production-search-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org [08:41:19] SRE, Traffic: Serve redirect wikimediastatus.net --> www.wikimediastatus.net - https://phabricator.wikimedia.org/T300161 (Ladsgroup) p:Triage→Medium Feel free to change priority. [08:41:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1096:3315 T297189', diff saved to https://phabricator.wikimedia.org/P19592 and previous config saved to /var/cache/conftool/dbconfig/20220131-084157-marostegui.json [08:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:02] T297189: Schema change for dropping ft_title and ft_namesapce - https://phabricator.wikimedia.org/T297189 [08:43:27] SRE, Wikimedia-Mailing-lists: Mailing lists are not indexed by Google - https://phabricator.wikimedia.org/T299293 (Ladsgroup) p:Triage→Low [08:43:47] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:44:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2123.codfw.wmnet with OS bullseye [08:44:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:08] marostegui: can you change the clinic duty in the header please? 
:D [08:45:16] s/header/topic [08:46:01] (CirrusSearchJVMGCOldPoolFlatlined) firing: (70) Elasticsearch instance elastic2025-production-search-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org [08:46:32] (PS2) Giuseppe Lavagetto: [DRAFT] Refactor Rakefile [deployment-charts] - https://gerrit.wikimedia.org/r/757977 [08:46:34] (PS1) Giuseppe Lavagetto: Rakefile: switch to using the new check_charts task [deployment-charts] - https://gerrit.wikimedia.org/r/758423 [08:46:36] (PS1) Giuseppe Lavagetto: fixup refactor [deployment-charts] - https://gerrit.wikimedia.org/r/758424 [08:46:54] (Abandoned) Giuseppe Lavagetto: fixup refactor [deployment-charts] - https://gerrit.wikimedia.org/r/758424 (owner: Giuseppe Lavagetto) [08:47:04] (CR) jerkins-bot: [V: -1] Rakefile: switch to using the new check_charts task [deployment-charts] - https://gerrit.wikimedia.org/r/758423 (owner: Giuseppe Lavagetto) [08:47:30] SRE, DNS, Domains, Traffic, and 2 others: Project Unseen campaign URL redirect - https://phabricator.wikimedia.org/T300398 (Ladsgroup) p:Triage→High Given the time-pressure. 
[08:49:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 25%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19593 and previous config saved to /var/cache/conftool/dbconfig/20220131-084936-root.json [08:49:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P19594 and previous config saved to /var/cache/conftool/dbconfig/20220131-084937-marostegui.json [08:49:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:01] (CirrusSearchJVMGCOldPoolFlatlined) firing: (69) Elasticsearch instance elastic2025-production-search-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org [08:51:37] the many cirrus alerts are also me, fixing [08:51:51] thanks! [08:52:37] dcausse: sure np! thanks for bearing with me while I'm figuring out T296199 as I go :) [08:52:38] T296199: Prometheus hardware refresh (+ Bullseye upgrade) - https://phabricator.wikimedia.org/T296199 [08:52:58] np! 
:) [08:56:01] (CirrusSearchJVMGCOldPoolFlatlined) resolved: (34) Elasticsearch instance elastic2025-production-search-omega-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org [08:57:24] (03PS1) 10Marostegui: drop_ft_title_ft_namespace_T297189.py: Rename file [software/schema-changes] - 10https://gerrit.wikimedia.org/r/758425 [08:57:35] (03PS1) 10JMeybohm: echoserver: Allow to easily mount external secret as certificate source [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/758426 [08:58:33] (03PS2) 10Marostegui: drop_ft_title_ft_namespace_T297189.py: Rename file [software/schema-changes] - 10https://gerrit.wikimedia.org/r/758425 [08:58:45] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] echoserver: Allow to easily mount external secret as certificate source [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/758426 (owner: 10JMeybohm) [08:58:54] (03CR) 10Marostegui: [V: 03+2 C: 03+2] drop_ft_title_ft_namespace_T297189.py: Rename file [software/schema-changes] - 10https://gerrit.wikimedia.org/r/758425 (owner: 10Marostegui) [08:59:44] (03PS1) 10Marostegui: Revert "db2123: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/758319 [09:00:27] (03CR) 10Marostegui: [C: 03+2] Revert "db2123: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/758319 (owner: 10Marostegui) [09:03:08] !log published image docker-registry.discovery.wmnet/echoserver:1.10.0-2 [09:03:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 50%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19595 and previous config saved to /var/cache/conftool/dbconfig/20220131-090439-root.json [09:04:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to 
https://phabricator.wikimedia.org/P19596 and previous config saved to /var/cache/conftool/dbconfig/20220131-090441-marostegui.json [09:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:41] (03PS3) 10Giuseppe Lavagetto: [DRAFT] Refactor Rakefile [deployment-charts] - 10https://gerrit.wikimedia.org/r/757977 [09:07:43] (03PS2) 10Giuseppe Lavagetto: Rakefile: switch to using the new check_charts task [deployment-charts] - 10https://gerrit.wikimedia.org/r/758423 [09:11:28] 10SRE, 10SRE-Access-Requests: NRodriguez uses the same SSH key(s) in WMCS and production - https://phabricator.wikimedia.org/T299336 (10Ladsgroup) Hi @NRodriguez, Can you access production now? So we can close this ticket. Thanks! [09:12:07] PROBLEM - Check for large files in client bucket on mwmaint1002 is CRITICAL: WARNING: large files in client bucket https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [09:18:19] 10SRE, 10Infrastructure-Foundations: Setup a new build host based on bullseye - https://phabricator.wikimedia.org/T298463 (10Ladsgroup) p:05Triage→03Medium Feel free to change the priority. 
[09:19:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 75%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19597 and previous config saved to /var/cache/conftool/dbconfig/20220131-091943-root.json [09:19:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T298559)', diff saved to https://phabricator.wikimedia.org/P19598 and previous config saved to /var/cache/conftool/dbconfig/20220131-091952-marostegui.json [09:19:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [09:19:55] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [09:19:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:57] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559 [09:19:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T298559)', diff saved to https://phabricator.wikimedia.org/P19599 and previous config saved to /var/cache/conftool/dbconfig/20220131-091959-marostegui.json [09:20:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:33] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1007.eqiad.wmnet with OS buster [09:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:40] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1007.eqiad.wmnet with OS buster [09:24:12] (03PS1) 10Marostegui: add_linter_namespace_T300402.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/758429 (https://phabricator.wikimedia.org/T300402) [09:25:27] (03PS2) 10Marostegui: add_linter_namespace_T300402.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/758429 (https://phabricator.wikimedia.org/T300402) [09:25:51] (03CR) 10Ladsgroup: [C: 03+1] add_linter_namespace_T300402.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/758429 (https://phabricator.wikimedia.org/T300402) (owner: 10Marostegui) [09:26:01] that was fast Amir1! [09:26:06] (03CR) 10Marostegui: [V: 03+2 C: 03+2] add_linter_namespace_T300402.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/758429 (https://phabricator.wikimedia.org/T300402) (owner: 10Marostegui) [09:26:27] marostegui: I saw it in the IRC :P [09:26:35] haha [09:29:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T298559)', diff saved to https://phabricator.wikimedia.org/P19600 and previous config saved to /var/cache/conftool/dbconfig/20220131-092917-marostegui.json [09:29:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:22] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559 [09:34:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 100%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19601 and previous config saved to /var/cache/conftool/dbconfig/20220131-093450-root.json [09:34:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:39] !log restart blazegraph on wdqs1012 (jvm stuck for 6hours) [09:35:42] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:54] (03CR) 10Muehlenhoff: "Looks good, two nits inline. It's quite likely that 4.4.2 received additional metrics (https://doc.powerdns.com/recursor/metrics.html), bu" [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/758068 (https://phabricator.wikimedia.org/T300254) (owner: 10Majavah) [09:43:33] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Clarify whether members of ldap/nda should be added to #WMF-NDA - https://phabricator.wikimedia.org/T299839 (10MoritzMuehlenhoff) >>! In T299839#7649515, @Volans wrote: > Adding #WMF-NDA-Requests, @mark, @faidon and @MoritzMuehlenhoff for SRE, #security and... [09:44:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P19602 and previous config saved to /var/cache/conftool/dbconfig/20220131-094422-marostegui.json [09:44:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1007.eqiad.wmnet with OS buster [09:45:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:25] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1007.eqiad.wmnet with OS buster completed: - ganeti1007 (**PASS**)... 
[09:45:43] (03CR) 10Kormat: [C: 03+1] switchover-tmpl.sh: Changed notes [software] - 10https://gerrit.wikimedia.org/r/758286 (owner: 10Marostegui) [09:46:13] (03CR) 10Marostegui: [C: 03+2] switchover-tmpl.sh: Changed notes [software] - 10https://gerrit.wikimedia.org/r/758286 (owner: 10Marostegui) [09:46:29] (03PS1) 10Vgutierrez: site: Reimage cp5011 as cache::text_envoy [puppet] - 10https://gerrit.wikimedia.org/r/758430 (https://phabricator.wikimedia.org/T271421) [09:46:41] (03Merged) 10jenkins-bot: switchover-tmpl.sh: Changed notes [software] - 10https://gerrit.wikimedia.org/r/758286 (owner: 10Marostegui) [09:47:14] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Clarify whether members of ldap/nda should be added to #WMF-NDA - https://phabricator.wikimedia.org/T299839 (10Majavah) >>! In T299839#7649515, @Volans wrote: > One thing to clarify is how we can ensure that the off-boarding process from this group will be p... [09:48:17] !log cp3061: upgrade varnish to 6.0.10-1wm1 T300264 [09:48:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:43] !log hnowlan@cumin1001 START - Cookbook sre.postgresql.postgres-init [09:48:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:33] (03PS5) 10Majavah: Bare minimum port to Python 3 to support Debian Bullseye [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/758068 (https://phabricator.wikimedia.org/T300254) [09:53:18] (03CR) 10Majavah: Bare minimum port to Python 3 to support Debian Bullseye (033 comments) [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/758068 (https://phabricator.wikimedia.org/T300254) (owner: 10Majavah) [09:53:24] !log depool cp5011 to be reimaged as cache::text_envoy - T271421 [09:53:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:29] T271421: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 
[09:53:58] !log cp3062: upgrade varnish to 6.0.10-1wm1 T300264 [09:54:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:24] (03CR) 10Vgutierrez: [C: 03+2] site: Reimage cp5011 as cache::text_envoy [puppet] - 10https://gerrit.wikimedia.org/r/758430 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [09:54:42] 10SRE, 10MW-on-K8s, 10serviceops: Evaluate nginx-controller as an Ingress - https://phabricator.wikimedia.org/T286197 (10Joe) 05Open→03Resolved p:05Triage→03Medium a:03Joe We went with istio-ingress after some evaluation which wasn't reported here. [09:54:48] 10SRE, 10MW-on-K8s, 10serviceops: Create a gateway in kubernetes for the execution of our "lambdas" - https://phabricator.wikimedia.org/T261277 (10Joe) [09:54:57] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good. I can update the deb on apt.wikimedia.org in the afternoon." [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/758068 (https://phabricator.wikimedia.org/T300254) (owner: 10Majavah) [09:55:40] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp5011.eqsin.wmnet with OS buster [09:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:49] 10SRE, 10Traffic, 10Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp5011.eqsin.wmnet with OS buster [09:59:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P19603 and previous config saved to /var/cache/conftool/dbconfig/20220131-095926-marostegui.json [09:59:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:42] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1015.eqiad.wmnet with OS buster [10:01:45] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:49] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1015.eqiad.wmnet with OS buster [10:13:34] (03CR) 10Arturo Borrero Gonzalez: "heads up, in recent PDNS versions, the recursor has its own REST API that includes a /metrics endpoint that generates prometheus metrics, " [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/758068 (https://phabricator.wikimedia.org/T300254) (owner: 10Majavah) [10:13:53] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Clarify whether members of ldap/nda should be added to #WMF-NDA - https://phabricator.wikimedia.org/T299839 (10MoritzMuehlenhoff) >>! In T299839#7662797, @Majavah wrote: >>>! In T299839#7649515, @Volans wrote: >> One thing to clarify is how we can ensure tha... [10:14:00] 10SRE, 10Traffic: Remove old and unused libvarnishapi - https://phabricator.wikimedia.org/T300247 (10MMandere) 05Open→03In progress [10:14:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T298559)', diff saved to https://phabricator.wikimedia.org/P19604 and previous config saved to /var/cache/conftool/dbconfig/20220131-101431-marostegui.json [10:14:33] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [10:14:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [10:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:37] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559 [10:14:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:39] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T298559)', diff saved to https://phabricator.wikimedia.org/P19605 and previous config saved to /var/cache/conftool/dbconfig/20220131-101439-marostegui.json [10:14:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:45] !log cp[6001-6016].drmrs.wmnet remove unused libvarnishapi1 T300247 [10:15:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:50] T300247: Remove old and unused libvarnishapi - https://phabricator.wikimedia.org/T300247 [10:18:20] I'm confused, the dewiki API is reporting an enormous maxlag of ~1.5h, but I don't see any corresponding lag in the grafana board for db replication. [10:20:14] (03PS5) 10Filippo Giunchedi: sslcert: additional search paths for certificates [puppet] - 10https://gerrit.wikimedia.org/r/716370 (https://phabricator.wikimedia.org/T290261) [10:21:17] huh. "Waiting for 10.64.0.163:3315: 5758.188369 seconds lagged." [10:21:34] !log installing apache/apache-modsecurity2 security updates [10:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:27] awight: orchestrator.w.o also shows db1096 as lagged, “not replicating” [10:23:20] cc marostegui ^ you applied a schema change to that host earlier today [10:23:28] Lucas_WMDE: I see that server was depooled, reimaged, and repooled just a few hours ago so maybe this is expected. [10:23:38] (03CR) 10ZPapierski: [C: 03+1] sre.wdqs.data-reload: few fixes and cleanups [cookbooks] - 10https://gerrit.wikimedia.org/r/753426 (owner: 10DCausse) [10:23:39] Lucas_WMDE: checking, thanks [10:23:42] The most surprising detail is that I can't see the issue from grafana, though. [10:24:02] (03CR) 10Arturo Borrero Gonzalez: "I would need to better understand the network flows involved (now and then) before merging this patch. 
Could you please elaborate more her" [puppet] - 10https://gerrit.wikimedia.org/r/758091 (owner: 10Majavah) [10:24:20] (03PS1) 10Volans: junos: catch another timeout exception on close [software/homer] - 10https://gerrit.wikimedia.org/r/758438 [10:24:25] Lucas_WMDE: that one didn't have the schema change applied, but for some reason it wasn't replicating [10:24:28] I have started it now [10:24:32] Going to depool it [10:24:37] ok thanks [10:24:50] Lucas_WMDE: thanks a lot for the heads up [10:24:55] np [10:24:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1096:3315', diff saved to https://phabricator.wikimedia.org/P19606 and previous config saved to /var/cache/conftool/dbconfig/20220131-102457-marostegui.json [10:25:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:20] https://de.wikipedia.org/w/api.php?action=query&maxlag=-1 looks better now [10:25:41] Lucas_WMDE: will repool it back once it has caught up [10:25:45] That was fast! [10:26:21] it is now back, so repooling slowly again! [10:26:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 25%: repooling', diff saved to https://phabricator.wikimedia.org/P19607 and previous config saved to /var/cache/conftool/dbconfig/20220131-102636-root.json [10:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:52] wow, that was fast indeed [10:27:13] 10SRE, 10User-Ladsgroup: Adding aquhen@wikimedia.org to the analytics-alerts mailing list - https://phabricator.wikimedia.org/T298778 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup I added this but for the future please do it yourself. cc. 
@odimitrijevic [10:27:15] Glad to see people using orchestrator :) [10:27:32] I have it bookmarked as “aka new dbtree” so I remember the name ;) [10:27:36] 10SRE, 10User-Ladsgroup: Add user nmaphophe@wikimedia.org to the analytics-alerts mail alias - https://phabricator.wikimedia.org/T298770 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup T298778#7662871 [10:27:48] !log cp[5006,5012].eqsin.wmnet remove unused libvarnishapi1 T300247 [10:27:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:53] T300247: Remove old and unused libvarnishapi - https://phabricator.wikimedia.org/T300247 [10:28:14] (03CR) 10Ayounsi: [C: 03+1] junos: catch another timeout exception on close [software/homer] - 10https://gerrit.wikimedia.org/r/758438 (owner: 10Volans) [10:29:46] (03CR) 10Volans: [C: 03+2] junos: catch another timeout exception on close [software/homer] - 10https://gerrit.wikimedia.org/r/758438 (owner: 10Volans) [10:31:07] !log cp[4021,4025-4026,4032-4034,4036].ulsfo.wmnet remove unused libvarnishapi1 T300247 [10:31:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:34] (03Merged) 10jenkins-bot: junos: catch another timeout exception on close [software/homer] - 10https://gerrit.wikimedia.org/r/758438 (owner: 10Volans) [10:33:26] !log cp[3052,3064-3065].esams.wmnet remove unused libvarnishapi1 T300247 [10:33:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:31] T300247: Remove old and unused libvarnishapi - https://phabricator.wikimedia.org/T300247 [10:33:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T298559)', diff saved to https://phabricator.wikimedia.org/P19608 and previous config saved to /var/cache/conftool/dbconfig/20220131-103350-marostegui.json [10:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:55] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - 
https://phabricator.wikimedia.org/T298559 [10:36:04] !log cp[2041-2042] remove unused libvarnishapi1 T300247 [10:36:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:46] Lucas_WMDE: would be nice if it included the decoder ring to map wiki -> partition [10:37:54] !log cp[1087,1089-1090] remove unused libvarnishapi1 T300247 [10:37:55] I hope that I don't have the permissions to cause any actual change by dragging and dropping replicas [10:37:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 50%: repooling', diff saved to https://phabricator.wikimedia.org/P19609 and previous config saved to /var/cache/conftool/dbconfig/20220131-104140-root.json [10:41:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1015.eqiad.wmnet with OS buster [10:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:08] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1015.eqiad.wmnet with OS buster completed: - ganeti1015 (**PASS**)... 
[10:43:04] 10ops-eqsin: cr3-eqsin:xe-0/1/1 interface errors - https://phabricator.wikimedia.org/T300485 (10ayounsi) p:05Triage→03Medium [10:48:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P19610 and previous config saved to /var/cache/conftool/dbconfig/20220131-104855-marostegui.json [10:48:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 75%: repooling', diff saved to https://phabricator.wikimedia.org/P19611 and previous config saved to /var/cache/conftool/dbconfig/20220131-105643-root.json [10:56:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:09] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5011.eqsin.wmnet with OS buster [10:57:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:17] 10SRE, 10Traffic, 10Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp5011.eqsin.wmnet with OS buster completed: - cp5011 (**WARN*... 
[10:58:13] !log pool cp5011 running envoy as TLS terminator - T271421 [10:58:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:17] T271421: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 [10:59:21] 10SRE, 10Traffic, 10Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10Vgutierrez) [11:04:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P19612 and previous config saved to /var/cache/conftool/dbconfig/20220131-110400-marostegui.json [11:04:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:26] (03CR) 10Majavah: openstack encapi: Drop special treatment for puppetmasters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/758091 (owner: 10Majavah) [11:07:54] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/M [11:07:54] g/restbase [11:08:14] (03PS6) 10Filippo Giunchedi: sslcert: additional search paths for certificates [puppet] - 10https://gerrit.wikimedia.org/r/716370 (https://phabricator.wikimedia.org/T290261) [11:08:23] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 15 days, 0:00:00 on restbase2009.codfw.wmnet with reason: not in restbase cluster, used for testing [11:08:24] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 15 days, 0:00:00 on restbase2009.codfw.wmnet with reason: not in restbase cluster, used for testing [11:08:25] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:34] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:11:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 100%: repooling', diff saved to https://phabricator.wikimedia.org/P19613 and previous config saved to /var/cache/conftool/dbconfig/20220131-111147-root.json [11:11:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:02] I'm deploying a beta only config change [11:12:32] (03CR) 10Majavah: [C: 03+2] beta: READ_NEW for CentralAuth hidden level migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758038 (https://phabricator.wikimedia.org/T289068) (owner: 10Majavah) [11:13:09] (03Merged) 10jenkins-bot: beta: READ_NEW for CentralAuth hidden level migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758038 (https://phabricator.wikimedia.org/T289068) (owner: 10Majavah) [11:15:49] (03CR) 10Arturo Borrero Gonzalez: openstack encapi: Drop special treatment for puppetmasters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/758091 (owner: 10Majavah) [11:16:32] (03CR) 10Majavah: openstack encapi: Drop special treatment for puppetmasters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/758091 (owner: 10Majavah) [11:19:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T298559)', diff saved to https://phabricator.wikimedia.org/P19614 and previous config saved to /var/cache/conftool/dbconfig/20220131-111904-marostegui.json [11:19:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:10] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559 [11:19:35] !log 
hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1024.eqiad.wmnet [11:19:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [11:20:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:05] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1025.eqiad.wmnet with OS buster [11:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [11:21:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [11:22:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:49] (03PS1) 10Majavah: prod: READ_NEW for CentralAuth hidden level migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758443 (https://phabricator.wikimedia.org/T289068) [11:23:07] (03CR) 10Majavah: "this worked fine in beta" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758443 (https://phabricator.wikimedia.org/T289068) (owner: 10Majavah) [11:23:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [11:23:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [11:28:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:40] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack encapi: Drop special treatment for puppetmasters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/758091 (owner: 10Majavah) [11:32:45] !log mwdebug-deploy@deploy1002 
helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [11:32:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [11:32:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [11:37:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:19] (03PS1) 10Majavah: P:openstack::puppetmaster: fix passing non-existent variables [puppet] - 10https://gerrit.wikimedia.org/r/758444 [11:39:02] (03PS1) 104nn1l2: azwikiquote: Add autopatrolled user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758445 (https://phabricator.wikimedia.org/T300435) [11:39:18] (03CR) 10jerkins-bot: [V: 04-1] P:openstack::puppetmaster: fix passing non-existent variables [puppet] - 10https://gerrit.wikimedia.org/r/758444 (owner: 10Majavah) [11:39:53] (03PS2) 10Majavah: P:openstack::puppetmaster: fix passing non-existent variables [puppet] - 10https://gerrit.wikimedia.org/r/758444 [11:42:15] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] P:openstack::puppetmaster: fix passing non-existent variables [puppet] - 10https://gerrit.wikimedia.org/r/758444 (owner: 10Majavah) [11:44:20] (03PS1) 10Ladsgroup: Add unseen.wikimedia.org DNS record [dns] - 10https://gerrit.wikimedia.org/r/758446 (https://phabricator.wikimedia.org/T300398) [11:48:30] (03PS1) 10Ladsgroup: Add unseen.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/758447 (https://phabricator.wikimedia.org/T300398) [11:49:33] (03CR) 10Ayounsi: "1 comment then LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) (owner: 10Jbond) [11:50:41] (03PS1) 10Ladsgroup: redirects: Fix url shortener documentation [puppet] - 10https://gerrit.wikimedia.org/r/758448 [11:52:18] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:57:46] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] redirects: Fix url shortener documentation [puppet] - 10https://gerrit.wikimedia.org/r/758448 (owner: 10Ladsgroup) [11:58:20] 10ops-ulsfo, 10Traffic: SMART errors on cp4031 - https://phabricator.wikimedia.org/T300493 (10Vgutierrez) [11:58:23] (03PS1) 104nn1l2: commonswiki: Add four domains to the wgCopyUploadsDomains allowlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758449 (https://phabricator.wikimedia.org/T300375) [11:58:57] (03PS4) 10Ladsgroup: Beta: maintenance: skip mediawiki::state function [puppet] - 10https://gerrit.wikimedia.org/r/462019 (https://phabricator.wikimedia.org/T125976) (owner: 10Thcipriani) [11:59:49] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [11:59:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [11:59:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [11:59:55] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [11:59:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [11:59:58] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [12:00:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1147.eqiad.wmnet with reason: Maintenance [12:00:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1147.eqiad.wmnet with reason: Maintenance [12:00:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220131T1200). [12:00:06] nn1l2: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[12:00:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T298559)', diff saved to https://phabricator.wikimedia.org/P19615 and previous config saved to /var/cache/conftool/dbconfig/20220131-120007-marostegui.json [12:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:10] hi [12:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:13] o/ [12:00:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:17] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559 [12:00:36] I can deploy today [12:01:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T298559)', diff saved to https://phabricator.wikimedia.org/P19616 and previous config saved to /var/cache/conftool/dbconfig/20220131-120113-marostegui.json [12:01:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:37] (03CR) 10Vgutierrez: [C: 03+1] Add unseen.wikimedia.org DNS record [dns] - 10https://gerrit.wikimedia.org/r/758446 (https://phabricator.wikimedia.org/T300398) (owner: 10Ladsgroup) [12:01:58] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1025.eqiad.wmnet with OS buster [12:02:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:15] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1025.eqiad.wmnet [12:02:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:32] (03CR) 10Vgutierrez: [C: 03+1] "redirects under canonical domains are still handled by mediawiki, so that's why it's the place to put it :)" [puppet] - 10https://gerrit.wikimedia.org/r/758447 (https://phabricator.wikimedia.org/T300398) (owner: 10Ladsgroup) 
[12:02:38] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1026.eqiad.wmnet with OS buster [12:02:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:55] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] azwikiquote: Add autopatrolled user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758445 (https://phabricator.wikimedia.org/T300435) (owner: 104nn1l2) [12:03:42] (03Merged) 10jenkins-bot: azwikiquote: Add autopatrolled user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758445 (https://phabricator.wikimedia.org/T300435) (owner: 104nn1l2) [12:04:46] (03CR) 10Ladsgroup: "Confirming it's noop in production https://puppet-compiler.wmflabs.org/pcc-worker1002/33504/" [puppet] - 10https://gerrit.wikimedia.org/r/462019 (https://phabricator.wikimedia.org/T125976) (owner: 10Thcipriani) [12:04:50] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Beta: maintenance: skip mediawiki::state function [puppet] - 10https://gerrit.wikimedia.org/r/462019 (https://phabricator.wikimedia.org/T125976) (owner: 10Thcipriani) [12:05:04] nn1l2: the azwikiquote change is on mwdebug1001, please test it [12:05:08] (03PS4) 10Ladsgroup: Beta: maintenance: no openldap management [puppet] - 10https://gerrit.wikimedia.org/r/462020 (https://phabricator.wikimedia.org/T125976) (owner: 10Thcipriani) [12:05:10] ok [12:05:28] Amir1: you're going to love the various other hacks currently only applied to deployment-puppetmaster04 /var/lib/git/operations/puppet [12:05:32] (03PS5) 10Arturo Borrero Gonzalez: toolforge: automated-tests: add basic python webservice grid test [puppet] - 10https://gerrit.wikimedia.org/r/757697 [12:06:06] taavi: I know :( I hope this removes some of the hacks, let me know if I can clean up more [12:06:47] LGTM [12:06:52] ok [12:07:27] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] "Noop in production https://puppet-compiler.wmflabs.org/pcc-worker1003/33505/" [puppet] - 10https://gerrit.wikimedia.org/r/462020 
(https://phabricator.wikimedia.org/T125976) (owner: 10Thcipriani) [12:07:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:07:30] Amir1: this one is currently my favourite I think https://phabricator.wikimedia.org/P19617 [12:07:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:19] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:758445|azwikiquote: Add autopatrolled user group (T300435)]] (duration: 00m 50s) [12:08:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:23] T300435: Add autopatrolled user group to az.wikiquote - https://phabricator.wikimedia.org/T300435 [12:08:43] taavi: 😭 [12:08:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [12:09:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [12:09:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [12:09:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [12:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:30] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [12:09:31] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [12:09:32] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [12:09:48] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [12:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T298558)', diff saved to https://phabricator.wikimedia.org/P19618 and previous config saved to /var/cache/conftool/dbconfig/20220131-120952-marostegui.json [12:09:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:57] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [12:10:02] (03PS2) 10Lucas Werkmeister (WMDE): commonswiki: Add four domains to the wgCopyUploadsDomains allowlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758449 (https://phabricator.wikimedia.org/T300375) (owner: 104nn1l2) [12:11:44] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] commonswiki: Add four domains to the wgCopyUploadsDomains allowlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758449 (https://phabricator.wikimedia.org/T300375) (owner: 104nn1l2) [12:11:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:11:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:11:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:28] (03Merged) 10jenkins-bot: 
commonswiki: Add four domains to the wgCopyUploadsDomains allowlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758449 (https://phabricator.wikimedia.org/T300375) (owner: 104nn1l2) [12:12:58] 10SRE, 10SRE-Access-Requests, 10Research: Access to analytics-privatedata-users for Research intern AniketArs - https://phabricator.wikimedia.org/T299919 (10Miriam) >>! In T299919#7660039, @jhathaway wrote: > @Miriam I assume you mean 2022-06-30 😉, though with covid still with us, who knows what year it is!... [12:13:10] nn1l2: the four new domains are also on mwdebug1001 now, please test [12:13:26] give me some time please [12:13:33] sure [12:15:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:07] All successful :) 1) https://commons.wikimedia.org/wiki/File:Cbdg_44379-r.jpg 2) https://commons.wikimedia.org/wiki/File:Voornesduin2.jpg 3) https://commons.wikimedia.org/wiki/File:3d176240-8407-4d1b-992d-ad5f00fc2bcb.jpg 4) https://commons.wikimedia.org/wiki/File:D37D8C6587C6F637EBE60B6A151D19A51EBDA6F8E24BC4C8FF9F59E1DF2B661E.jpg [12:16:13] Good to go [12:16:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P19619 and previous config saved to /var/cache/conftool/dbconfig/20220131-121618-marostegui.json [12:16:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:56] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: Switchover m2 master (db1183 -> db1159) - https://phabricator.wikimedia.org/T300329 (10Marostegui) [12:17:16] Lucas_WMDE: GTG [12:17:20] ok [12:18:34] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:758449|commonswiki: Add four domains to the wgCopyUploadsDomains allowlist (T300375, T300360, T300359, T300357)]] 
(duration: 00m 50s) [12:18:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:42] T300357: Add www.nmr-pics.nl to the wgCopyUploadsDomains allowlist of Wikimedia Commons - https://phabricator.wikimedia.org/T300357 [12:18:42] T300360: Add arter.dk to the wgCopyUploadsDomains allowlist of Wikimedia Commons - https://phabricator.wikimedia.org/T300360 [12:18:43] T300375: Add researcharchive.calacademy.org to the wgCopyUploadsDomains allowlist of Wikimedia Commons - https://phabricator.wikimedia.org/T300375 [12:18:43] T300359: Add files.plutof.ut.ee to the wgCopyUploadsDomains allowlist of Wikimedia Commons - https://phabricator.wikimedia.org/T300359 [12:20:01] (03PS3) 10Ayounsi: Move sandbox filter to Capirca [homer/public] - 10https://gerrit.wikimedia.org/r/748080 (https://phabricator.wikimedia.org/T273865) [12:20:03] (03PS3) 10Ayounsi: Move core routers loopback filter to Capirca [homer/public] - 10https://gerrit.wikimedia.org/r/748098 (https://phabricator.wikimedia.org/T273865) [12:20:05] (03PS3) 10Ayounsi: Move core routers border-in filter to Capirca [homer/public] - 10https://gerrit.wikimedia.org/r/748111 (https://phabricator.wikimedia.org/T273865) [12:20:41] !log UTC morning backport window done [12:20:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:21:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:46] hi taavi, do you have a min for a quick consultation? 
[12:22:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:22:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:22:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:23:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:29] (03PS14) 10Majavah: etcd: Use cfssl for peer-to-peer communication [puppet] - 10https://gerrit.wikimedia.org/r/674077 [12:25:41] nn1l2: very quick, I need to leave in like 5 minutes [12:25:53] thanks [12:25:56] see https://phabricator.wikimedia.org/rOMWC6dcc2c6d8db872b931e0eac4fe4e2569fc4e11d0 [12:26:00] (03PS6) 10Arturo Borrero Gonzalez: toolforge: automated-tests: add basic python webservice grid test [puppet] - 10https://gerrit.wikimedia.org/r/757697 [12:26:05] (03CR) 10jerkins-bot: [V: 04-1] etcd: Use cfssl for peer-to-peer communication [puppet] - 10https://gerrit.wikimedia.org/r/674077 (owner: 10Majavah) [12:26:16] patrolmarks is already implied by patrol [12:26:48] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: automated-tests: add basic python webservice grid test [puppet] - 10https://gerrit.wikimedia.org/r/757697 (owner: 10Arturo Borrero Gonzalez) [12:26:48] yes? [12:27:05] is it worth if I clean up the InitialiseSettings.php file and remove redundant permissions? [12:28:03] what do other similar groups do? 
[12:28:16] (03PS15) 10Majavah: etcd: Use cfssl for peer-to-peer communication [puppet] - 10https://gerrit.wikimedia.org/r/674077 [12:28:26] I don't understand you [12:28:43] anybody who has patrol flag does no need patrolmarks [12:29:00] anybody who has patrol flag does not need patrolmarks [12:29:14] yeah, I guess it can be cleaned up [12:29:35] I was wondering if other 'patroller' groups on other wikis also grant 'patrolmarks', but I guess that is a no [12:29:46] Thanks, I will upload a patch for the next window :) [12:31:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P19620 and previous config saved to /var/cache/conftool/dbconfig/20220131-123123-marostegui.json [12:31:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:36] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1026.eqiad.wmnet with OS buster [12:41:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T298559)', diff saved to https://phabricator.wikimedia.org/P19621 and previous config saved to /var/cache/conftool/dbconfig/20220131-124627-marostegui.json [12:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:33] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559 [12:46:34] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2110.codfw.wmnet with reason: Maintenance [12:46:36] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2110.codfw.wmnet with reason: Maintenance [12:46:36] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1026.eqiad.wmnet [12:46:37] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 12 hosts with reason: Maintenance [12:46:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 12 hosts with reason: Maintenance [12:46:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1142.eqiad.wmnet with reason: Maintenance [12:46:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1142.eqiad.wmnet with reason: Maintenance [12:46:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1142 (T298559)', diff saved to https://phabricator.wikimedia.org/P19622 and previous config saved to /var/cache/conftool/dbconfig/20220131-124655-marostegui.json [12:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:03] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1027.eqiad.wmnet with OS buster [12:47:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T298559)', diff saved to https://phabricator.wikimedia.org/P19623 and previous config saved to /var/cache/conftool/dbconfig/20220131-124801-marostegui.json [12:48:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:10] (03CR) 
10Michael Große: [C: 03+1] "checked that it is based on the most recent commit" [deployment-charts] - 10https://gerrit.wikimedia.org/r/757659 (https://phabricator.wikimedia.org/T296202) (owner: 10Lucas Werkmeister (WMDE)) [13:01:08] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Add unseen.wikimedia.org DNS record [dns] - 10https://gerrit.wikimedia.org/r/758446 (https://phabricator.wikimedia.org/T300398) (owner: 10Ladsgroup) [13:02:45] (03PS1) 10Marostegui: db1154: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/758465 (https://phabricator.wikimedia.org/T300473) [13:03:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P19624 and previous config saved to /var/cache/conftool/dbconfig/20220131-130306-marostegui.json [13:03:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:27] (03CR) 10Marostegui: [C: 03+2] db1154: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/758465 (https://phabricator.wikimedia.org/T300473) (owner: 10Marostegui) [13:06:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1154.eqiad.wmnet with OS bullseye [13:06:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:25] (03PS2) 10Ladsgroup: Add unseen.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/758447 (https://phabricator.wikimedia.org/T300398) [13:06:41] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Add unseen.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/758447 (https://phabricator.wikimedia.org/T300398) (owner: 10Ladsgroup) [13:08:07] 10SRE, 10DNS, 10Domains, 10Traffic, and 4 others: Project Unseen campaign URL redirect - https://phabricator.wikimedia.org/T300398 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup It'll take a bit but it will be there. Ping me if it doesn't work. 
[13:10:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T298558)', diff saved to https://phabricator.wikimedia.org/P19625 and previous config saved to /var/cache/conftool/dbconfig/20220131-131011-marostegui.json [13:10:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1007.eqiad.wmnet [13:10:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:16] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [13:10:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:20] 10SRE, 10Beta-Cluster-Infrastructure, 10Wikidata, 10serviceops, and 2 others: Run mediawiki::maintenance scripts in Beta Cluster - https://phabricator.wikimedia.org/T125976 (10Ladsgroup) [13:11:03] PROBLEM - MariaDB Replica IO: s8 on clouddb1020 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1154.eqiad.wmnet:3318 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1154.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:11:53] marostegui: are you aware of ^ [13:14:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1007.eqiad.wmnet [13:14:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:07] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1007.eqiad.wmnet to ganeti01.svc.eqiad.wmnet [13:15:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:10] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10MoritzMuehlenhoff) [13:16:41] PROBLEM - MariaDB Replica IO: s5 on clouddb1020 is CRITICAL: CRITICAL 
slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1154.eqiad.wmnet:3315 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1154.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:16:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti1007.eqiad.wmnet to ganeti01.svc.eqiad.wmnet [13:16:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P19626 and previous config saved to /var/cache/conftool/dbconfig/20220131-131811-marostegui.json [13:18:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:15] PROBLEM - MariaDB Replica IO: s5 on clouddb1016 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1154.eqiad.wmnet:3315 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1154.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:19:21] ^ [13:19:27] me, silencing [13:20:53] 10SRE, 10Python3-Porting: git-fat needs to be ported to Python 3 - https://phabricator.wikimedia.org/T279509 (10Ladsgroup) >>! In T279509#6979904, @MoritzMuehlenhoff wrote: > git-fat is the only package requiring Python 2 in a base bullseye setup at this point. Is there a way to migrate to git-lfs instead? 
[13:22:13] (03PS1) 10JMeybohm: Add hostname-override and cluster-cidr to kube-proxy [puppet] - 10https://gerrit.wikimedia.org/r/758466 (https://phabricator.wikimedia.org/T300500) [13:22:34] (03PS2) 10JMeybohm: Add hostname-override and cluster-cidr to kube-proxy [puppet] - 10https://gerrit.wikimedia.org/r/758466 (https://phabricator.wikimedia.org/T300500) [13:23:45] (03PS7) 10Filippo Giunchedi: sslcert: additional search paths for certificates [puppet] - 10https://gerrit.wikimedia.org/r/716370 (https://phabricator.wikimedia.org/T290261) [13:25:07] (03CR) 10Filippo Giunchedi: "Tested validate_cmd in Pontoon and works as expected" [puppet] - 10https://gerrit.wikimedia.org/r/716370 (https://phabricator.wikimedia.org/T290261) (owner: 10Filippo Giunchedi) [13:25:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P19627 and previous config saved to /var/cache/conftool/dbconfig/20220131-132516-marostegui.json [13:25:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:42] (03PS3) 10JMeybohm: Add hostname-override and cluster-cidr to kube-proxy [puppet] - 10https://gerrit.wikimedia.org/r/758466 (https://phabricator.wikimedia.org/T300500) [13:31:15] (03PS4) 10JMeybohm: Add hostname-override and cluster-cidr to kube-proxy [puppet] - 10https://gerrit.wikimedia.org/r/758466 (https://phabricator.wikimedia.org/T300500) [13:31:48] RECOVERY - MariaDB Replica IO: s5 on clouddb1020 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:33:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T298559)', diff saved to https://phabricator.wikimedia.org/P19628 and previous config saved to /var/cache/conftool/dbconfig/20220131-133316-marostegui.json [13:33:18] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on 
db1141.eqiad.wmnet with reason: Maintenance [13:33:19] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1141.eqiad.wmnet with reason: Maintenance [13:33:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:21] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559 [13:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T298559)', diff saved to https://phabricator.wikimedia.org/P19629 and previous config saved to /var/cache/conftool/dbconfig/20220131-133323-marostegui.json [13:33:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T298559)', diff saved to https://phabricator.wikimedia.org/P19630 and previous config saved to /var/cache/conftool/dbconfig/20220131-133430-marostegui.json [13:34:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:23] (03PS3) 10Ssingh: site: add role for durum hosts in drmrs [puppet] - 10https://gerrit.wikimedia.org/r/757741 (https://phabricator.wikimedia.org/T300158) [13:36:26] (03PS5) 10JMeybohm: Add hostname-override and cluster-cidr to kube-proxy [puppet] - 10https://gerrit.wikimedia.org/r/758466 (https://phabricator.wikimedia.org/T300500) [13:36:40] RECOVERY - MariaDB Replica IO: s5 on clouddb1016 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:36:50] 10SRE, 10Icinga, 10Observability-Alerting, 10Scap, 10observability: expose hosts in maintenance state so we can prevent scap from running on them - https://phabricator.wikimedia.org/T100777 (10lmata) 
[13:37:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1154.eqiad.wmnet with OS bullseye [13:37:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:11] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33509/console" [puppet] - 10https://gerrit.wikimedia.org/r/758466 (https://phabricator.wikimedia.org/T300500) (owner: 10JMeybohm) [13:37:26] 10SRE, 10Icinga, 10Observability-Alerting, 10observability, and 2 others: Icinga check for sysctl settings - https://phabricator.wikimedia.org/T160060 (10lmata) [13:37:40] 10SRE, 10Traffic: cp_upload @ eqsin cascading failures, February 2021 - https://phabricator.wikimedia.org/T274888 (10Ladsgroup) [13:37:52] (03CR) 10MMandere: [C: 03+2] site: add role for durum hosts in drmrs [puppet] - 10https://gerrit.wikimedia.org/r/757741 (https://phabricator.wikimedia.org/T300158) (owner: 10Ssingh) [13:37:54] 10SRE, 10Icinga, 10Observability-Alerting, 10observability: icinga login case mismatch - https://phabricator.wikimedia.org/T275920 (10lmata) [13:38:17] 10SRE, 10Traffic, 10Patch-For-Review: cp_upload @ eqsin cascading failures, February 2021 - https://phabricator.wikimedia.org/T274888 (10Ladsgroup) [13:39:35] 10SRE, 10Traffic, 10Patch-For-Review: Create Ganeti VMs for durum in drmrs - https://phabricator.wikimedia.org/T300158 (10ssingh) [13:40:06] 10SRE, 10Observability-Metrics, 10Traffic-Icebox: varnishmtail silently stops working if varnishncsa crashes - https://phabricator.wikimedia.org/T259020 (10lmata) [13:40:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P19631 and previous config saved to /var/cache/conftool/dbconfig/20220131-134021-marostegui.json [13:40:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:40] 10SRE, 10Observability-Logging, 
10Wikimedia-Logstash, 10observability, and 2 others: Ship host syslogs to ELK - https://phabricator.wikimedia.org/T193766 (10lmata) [13:41:08] 10SRE, 10Icinga, 10Observability-Alerting, 10User-CDanis: CLI script for manual paging - https://phabricator.wikimedia.org/T82937 (10lmata) [13:41:08] RECOVERY - MariaDB Replica IO: s8 on clouddb1020 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:41:35] 10SRE, 10Observability-Metrics: Grafana share button drops duplicate URL params - https://phabricator.wikimedia.org/T292606 (10lmata) [13:43:17] 10SRE, 10Observability-Metrics, 10serviceops, 10Patch-For-Review: Strongswan Icinga check: do not report issues about depooled hosts - https://phabricator.wikimedia.org/T148976 (10lmata) [13:45:00] 10SRE, 10Observability-Alerting, 10Documentation, 10Service-Architecture: Create a doc explaining the SLA between services and the monitoring tool - https://phabricator.wikimedia.org/T105780 (10lmata) [13:45:18] 10SRE, 10Icinga, 10Observability-Alerting, 10observability: Aggregate Proton, Restbase and mobileapps icinga alerts - https://phabricator.wikimedia.org/T250017 (10lmata) [13:45:33] (03CR) 10Giuseppe Lavagetto: "LGTM overall, see two comments mostly about style." 
[puppet] - 10https://gerrit.wikimedia.org/r/758466 (https://phabricator.wikimedia.org/T300500) (owner: 10JMeybohm) [13:45:49] 10SRE, 10SRE-tools, 10Icinga, 10Infrastructure-Foundations, and 2 others: ops-monitoring-bot creating dupes - https://phabricator.wikimedia.org/T226908 (10lmata) [13:45:55] 10SRE, 10Icinga, 10Observability-Alerting, 10observability: icinga-wm bot truncating long messages - https://phabricator.wikimedia.org/T230799 (10lmata) [13:46:07] 10SRE, 10Icinga, 10Observability-Alerting, 10observability: Icinga notifications didn't get applied after a puppet run - https://phabricator.wikimedia.org/T251407 (10lmata) [13:46:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [13:47:29] 10SRE, 10Observability-Metrics, 10Traffic-Icebox, 10User-ema: Multiple ATS HTTP2 stats missing from Prometheus - https://phabricator.wikimedia.org/T292817 (10lmata) [13:47:51] 10SRE, 10Observability-Alerting: Icinga check for ipv6 host reachability - https://phabricator.wikimedia.org/T163996 (10lmata) [13:47:54] PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 111 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:48:05] 10SRE, 10Icinga, 10Observability-Alerting, 10observability, and 2 others: incident 20170323-wikibase did not trigger Icinga paging - https://phabricator.wikimedia.org/T161528 (10lmata) [13:48:08] (03CR) 10Ayounsi: [C: 03+2] Move sandbox filter to Capirca [homer/public] - 10https://gerrit.wikimedia.org/r/748080 (https://phabricator.wikimedia.org/T273865) (owner: 10Ayounsi) [13:48:42] (03Merged) 10jenkins-bot: Move sandbox filter to Capirca [homer/public] - 
10https://gerrit.wikimedia.org/r/748080 (https://phabricator.wikimedia.org/T273865) (owner: 10Ayounsi) [13:49:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P19632 and previous config saved to /var/cache/conftool/dbconfig/20220131-134934-marostegui.json [13:49:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:40] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Observability-Alerting: Icinga alert for hosts with no Puppet roles - https://phabricator.wikimedia.org/T238006 (10lmata) [13:50:11] 10SRE, 10Icinga, 10Observability-Alerting, 10observability, 10User-jbond: Monitoring for puppetdb queue size - https://phabricator.wikimedia.org/T236707 (10lmata) [13:50:32] 10SRE, 10Icinga, 10Infrastructure-Foundations, 10Mail, and 2 others: fix/streamline mail routing off of neon - https://phabricator.wikimedia.org/T80890 (10lmata) [13:50:41] 10SRE, 10Icinga, 10Observability-Alerting, 10observability: Icinga monitoring for Yubikey components - https://phabricator.wikimedia.org/T151048 (10lmata) [13:51:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [13:52:19] !log Move sandbox filter to Capirca on all core routers [13:52:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T298558)', diff saved to https://phabricator.wikimedia.org/P19633 and previous config saved to /var/cache/conftool/dbconfig/20220131-135525-marostegui.json [13:55:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:30] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - 
https://phabricator.wikimedia.org/T298558 [13:55:31] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [13:55:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [13:55:33] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [13:55:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [13:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [13:56:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [13:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T298558)', diff saved to https://phabricator.wikimedia.org/P19634 and previous config saved to /var/cache/conftool/dbconfig/20220131-135610-marostegui.json [13:56:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298558)', diff saved to https://phabricator.wikimedia.org/P19635 and previous config saved to /var/cache/conftool/dbconfig/20220131-140127-marostegui.json [14:01:33] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:34] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [14:04:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P19636 and previous config saved to /var/cache/conftool/dbconfig/20220131-140439-marostegui.json [14:04:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:44] RECOVERY - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:07:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1015.eqiad.wmnet [14:07:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:38] 10SRE, 10DBA, 10Epic, 10Performance-Team (Radar), 10Sustainability (Incident Followup): Decide how to improve parsercache replication, sharding and HA - https://phabricator.wikimedia.org/T133523 (10LSobanski) a:05Marostegui→03None Removing assignment as I don't believe Manuel will be looking into th... 
[14:09:42] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.postgresql.postgres-init (exit_code=0) [14:09:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:04] RECOVERY - Postgres Replication Lag on maps1010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1262184 and 2310 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:10:08] !log filippo@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host prometheus2006.codfw.wmnet [14:10:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:11] (03CR) 10BBlack: "LGTM on all the interrelated changes to the socket path / install_from_component stuff. Inline question about the last bit for the owner " [puppet] - 10https://gerrit.wikimedia.org/r/758063 (https://phabricator.wikimedia.org/T300254) (owner: 10Majavah) [14:10:46] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet [14:10:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:08] (03PS1) 10Ayounsi: Delete now unused analytics policy file [homer/public] - 10https://gerrit.wikimedia.org/r/758470 [14:13:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1015.eqiad.wmnet [14:13:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:10] (03CR) 10Ottomata: "Luca, what if someone wants to spin up a new Kafka cluster in Cloud with TLS that does not use the certs John is going to create? 
Is ther" [puppet] - 10https://gerrit.wikimedia.org/r/757800 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [14:14:19] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1015.eqiad.wmnet to ganeti01.svc.eqiad.wmnet [14:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:12] (03PS2) 10Lucas Werkmeister (WMDE): Update termbox to 2022-01-25-175409-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/757659 (https://phabricator.wikimedia.org/T296202) [14:15:28] jouncebot: nowandnext [14:15:28] No deployments scheduled for the next 2 hour(s) and 14 minute(s) [14:15:28] In 2 hour(s) and 14 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220131T1630) [14:15:39] alright, I’ll probably deploy ^ that termbox update in deployment-charts [14:15:41] 10SRE, 10Icinga, 10Observability-Alerting, 10observability: Icinga monitoring for Yubikey components - https://phabricator.wikimedia.org/T151048 (10MoritzMuehlenhoff) 05Open→03Declined This is no longer needed, we no longer use the YubiHSM stack, closing [14:15:43] 10SRE: Extending Yubico 2FA for production use (meta bug) - https://phabricator.wikimedia.org/T151045 (10MoritzMuehlenhoff) [14:16:13] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Update termbox to 2022-01-25-175409-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/757659 (https://phabricator.wikimedia.org/T296202) (owner: 10Lucas Werkmeister (WMDE)) [14:16:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P19637 and previous config saved to /var/cache/conftool/dbconfig/20220131-141633-marostegui.json [14:16:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:37] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 
(10MoritzMuehlenhoff) [14:17:01] (03CR) 10Majavah: [V: 03+1] pdns: support bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/758063 (https://phabricator.wikimedia.org/T300254) (owner: 10Majavah) [14:17:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti1015.eqiad.wmnet to ganeti01.svc.eqiad.wmnet [14:17:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:18] !log draining ganeti1008 for eventual reimage [14:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:45] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet [14:17:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:23] !log mmandere@cumin1001 START - Cookbook sre.ganeti.makevm for new host durum6001.drmrs.wmnet [14:18:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:28] (ThanosRuleHighRuleEvaluationFailures) firing: Thanos Rule is failing to evaluate rules. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org [14:19:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T298559)', diff saved to https://phabricator.wikimedia.org/P19638 and previous config saved to /var/cache/conftool/dbconfig/20220131-141943-marostegui.json [14:19:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [14:19:47] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [14:19:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:49] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559 [14:19:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T298559)', diff saved to https://phabricator.wikimedia.org/P19639 and previous config saved to /var/cache/conftool/dbconfig/20220131-141951-marostegui.json [14:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:57] (03Merged) 10jenkins-bot: Update termbox to 2022-01-25-175409-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/757659 (https://phabricator.wikimedia.org/T296202) (owner: 10Lucas Werkmeister (WMDE)) [14:19:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:32] !log lucaswerkmeister-wmde@deploy1002 helmfile [staging] START helmfile.d/services/termbox: apply on staging [14:20:32] !log lucaswerkmeister-wmde@deploy1002 helmfile [staging] START helmfile.d/services/termbox: apply on test [14:20:35] !log lucaswerkmeister-wmde@deploy1002 helmfile [staging] DONE helmfile.d/services/termbox: apply on 
production [14:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:53] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1027.eqiad.wmnet with OS buster [14:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T298559)', diff saved to https://phabricator.wikimedia.org/P19640 and previous config saved to /var/cache/conftool/dbconfig/20220131-142057-marostegui.json [14:21:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:12] hm, the chart label changed from termbox-0.0.20 to termbox-0.1.1 [14:21:17] I assume that’s fine to apply [14:22:51] !log lucaswerkmeister-wmde@deploy1002 helmfile [staging] DONE helmfile.d/services/termbox: sync on test [14:22:51] !log lucaswerkmeister-wmde@deploy1002 helmfile [staging] DONE helmfile.d/services/termbox: sync on staging [14:22:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:19] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1027.eqiad.wmnet [14:23:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:44] (03PS1) 10Ladsgroup: db2107: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/758472 (https://phabricator.wikimedia.org/T300510) [14:24:05] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] db2107: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/758472 (https://phabricator.wikimedia.org/T300510) (owner: 10Ladsgroup) [14:24:28] (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing 
to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org [14:24:38] seems to work fine on test.wikidata.org (staging cluster), proceeding with sync to codfw and eqiad [14:24:44] !log lucaswerkmeister-wmde@deploy1002 helmfile [codfw] START helmfile.d/services/termbox: apply on production [14:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:48] !log lucaswerkmeister-wmde@deploy1002 helmfile [codfw] DONE helmfile.d/services/termbox: apply on staging [14:24:49] !log lucaswerkmeister-wmde@deploy1002 helmfile [codfw] DONE helmfile.d/services/termbox: apply on test [14:24:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2107.codfw.wmnet with reason: Maintenance [14:25:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2107.codfw.wmnet with reason: Maintenance [14:25:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2107 (T300510)', diff saved to https://phabricator.wikimedia.org/P19641 and previous config saved to /var/cache/conftool/dbconfig/20220131-142550-ladsgroup.json [14:25:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:55] T300510: Upgrade s2 to Bullseye - https://phabricator.wikimedia.org/T300510 [14:27:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db2107.codfw.wmnet with OS bullseye [14:27:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:00] 10SRE, 
10Traffic: Serve redirect wikimediastatus.net --> www.wikimediastatus.net - https://phabricator.wikimedia.org/T300161 (10CDanis) After discussing with @BBlack and @Vgutierrez it seems that this isn't a good use case for ncredir as ncredir only supports dns-01 challenges. So we need to find some other e... [14:28:14] !log filippo@deploy1002 Started deploy [librenms/librenms@f049593]: Add custom patches to librenms 21.4.0 [14:28:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:24] !log filippo@deploy1002 Finished deploy [librenms/librenms@f049593]: Add custom patches to librenms 21.4.0 (duration: 00m 10s) [14:28:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:03] hm, my helmfile apply has been running for a few minutes now… I hope everything’s alright there [14:29:20] I’ll wait a bit longer though [14:31:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P19642 and previous config saved to /var/cache/conftool/dbconfig/20220131-143138-marostegui.json [14:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:57] !log lucaswerkmeister-wmde@deploy1002 helmfile [codfw] DONE helmfile.d/services/termbox: sync on production [14:35:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:10] well, it timed out [14:35:10] (03PS3) 10Majavah: pdns: support bullseye [puppet] - 10https://gerrit.wikimedia.org/r/758063 (https://phabricator.wikimedia.org/T300254) [14:35:14] after, I think, ten minutes [14:36:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P19643 and previous config saved to /var/cache/conftool/dbconfig/20220131-143602-marostegui.json [14:36:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:21] Huh, wikimedia.org 
went down for me for a hot second [14:36:33] It's back up now though; might be a DNS thing on my end [14:39:13] if anyone with [[wikitech:Kubernetes/Deployments]] expertise is around, I’d appreciate some help [14:39:30] it’s probably nothing serious but I’m not very confident on my own ^^ [14:39:49] Lucas_WMDE: what do you need? [14:40:00] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33510/console" [puppet] - 10https://gerrit.wikimedia.org/r/758063 (https://phabricator.wikimedia.org/T300254) (owner: 10Majavah) [14:40:02] I ran [14:40:05] lucaswerkmeister-wmde@deploy1002 /srv/deployment-charts/helmfile.d/services/termbox (master $ u=) $ helmfile -e codfw -i apply [14:40:15] and the production release failed with a timeout [14:40:29] I’m guessing I should retry and hope for the best, but I’m not sure ^^ [14:40:38] as far as I can tell there’s no other output indicating what went wrong [14:40:44] Error: UPGRADE FAILED: release production failed, and has been rolled back due to atomic being set: timed out waiting for the condition [14:41:09] (if I understand correctly, there are three releases(?) in the codfw(?) cluster, and only the production one failed, and the other two – staging and test? 
– went through) [14:41:28] *in the codfw cluster(?), to put the question mark on the right word that I’m uncertain about ^^ [14:42:02] `kube_env termbox codfw; kubectl get pod` shows one new pod and 3 old ones [14:42:22] (03PS1) 10Filippo Giunchedi: o11y: tweak IcingaOverload alert [alerts] - 10https://gerrit.wikimedia.org/r/758474 [14:43:33] hm, the new pod still has the old image AFAICT [14:43:39] from 2021 instead of 2022 [14:43:59] all four of them have the same image [14:46:06] kubectl get events has two errors about failing to pull the image o_O [14:46:17] (in the same kube_env) [14:46:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298558)', diff saved to https://phabricator.wikimedia.org/P19644 and previous config saved to /var/cache/conftool/dbconfig/20220131-144642-marostegui.json [14:46:44] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [14:46:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [14:46:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:48] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [14:46:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1162 (T298558)', diff saved to https://phabricator.wikimedia.org/P19645 and previous config saved to /var/cache/conftool/dbconfig/20220131-144650-marostegui.json [14:46:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:59] I'm not sure what exactly happened, and apparently the kubernetes user does not have enough permissions for any manual helm operations (even 
those listed on the wikitech page) [14:47:50] I think I’ll try the command again [14:48:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T298558)', diff saved to https://phabricator.wikimedia.org/P19646 and previous config saved to /var/cache/conftool/dbconfig/20220131-144806-marostegui.json [14:48:09] the new image is definitely working in the staging release (powering test.wikidata.org), I can see the differences in the SSR HTML [14:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:19] what if you try again and see what happens? [14:48:22] in the staging *cluster (I think) [14:48:28] yeah, let’s do that [14:48:35] !log lucaswerkmeister-wmde@deploy1002 helmfile [codfw] START helmfile.d/services/termbox: apply on production [14:48:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:39] !log lucaswerkmeister-wmde@deploy1002 helmfile [codfw] DONE helmfile.d/services/termbox: apply on test [14:48:40] !log lucaswerkmeister-wmde@deploy1002 helmfile [codfw] DONE helmfile.d/services/termbox: apply on staging [14:48:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:07] 10SRE, 10SRE-Access-Requests: Requesting access to Analytics Private Data Users for Tanja Andic - https://phabricator.wikimedia.org/T300383 (10jhathaway) [14:50:21] looks like it’s waiting again [14:50:44] !log lucaswerkmeister-wmde@deploy1002 helmfile [codfw] DONE helmfile.d/services/termbox: sync on production [14:50:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:51] yay! 
[14:51:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P19647 and previous config saved to /var/cache/conftool/dbconfig/20220131-145107-marostegui.json [14:51:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:20] now there are three new running pods (and a fourth one ContainerCreating) [14:51:28] ok let’s go for eqiad [14:51:33] !log lucaswerkmeister-wmde@deploy1002 helmfile [eqiad] START helmfile.d/services/termbox: apply on production [14:51:36] !log lucaswerkmeister-wmde@deploy1002 helmfile [eqiad] DONE helmfile.d/services/termbox: apply on staging [14:51:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:37] !log lucaswerkmeister-wmde@deploy1002 helmfile [eqiad] DONE helmfile.d/services/termbox: apply on test [14:51:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:10] !log lucaswerkmeister-wmde@deploy1002 helmfile [eqiad] DONE helmfile.d/services/termbox: sync on production [14:53:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:32] yup, new SSR is running on www.wikidata.org [14:53:37] thanks taavi! 
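The pod checks in the exchange above (one new pod next to three old ones, then a look at which image each pod runs) can be sketched as a small helper. This is only an illustrative sketch, assuming the parsed output of `kubectl get pod -o json`; the function name `pods_by_image`, the pod names, and the `old-tag` image tag are hypothetical and do not come from the log.

```python
import json
from collections import defaultdict


def pods_by_image(pod_list):
    """Map each container image to the sorted names of the pods that use it.

    `pod_list` is assumed to be the parsed JSON of `kubectl get pod -o json`.
    """
    groups = defaultdict(list)
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        for container in pod["spec"]["containers"]:
            groups[container["image"]].append(name)
    return {image: sorted(names) for image, names in groups.items()}


# Hypothetical sample data: pod names and the "old-tag" image tag are made up.
SAMPLE_PODS = {
    "items": [
        {"metadata": {"name": "pod-old-1"},
         "spec": {"containers": [{"image": "wikibase-termbox:old-tag"}]}},
        {"metadata": {"name": "pod-old-2"},
         "spec": {"containers": [{"image": "wikibase-termbox:old-tag"}]}},
        {"metadata": {"name": "pod-old-3"},
         "spec": {"containers": [{"image": "wikibase-termbox:old-tag"}]}},
        {"metadata": {"name": "pod-new-1"},
         "spec": {"containers": [
             {"image": "wikibase-termbox:2022-01-25-175409-production"}]}},
    ]
}

if __name__ == "__main__":
    print(json.dumps(pods_by_image(SAMPLE_PODS), indent=2))
```

Feeding it real data would look something like `kubectl get pod -o json | python3 script.py` with the sample replaced by `json.load(sys.stdin)`; that wiring is left out here to keep the sketch self-contained.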
[14:53:46] (well, on m.wikidata.org ^^) [14:55:24] (03PS6) 10JMeybohm: Add hostname-override and cluster-cidr to kube-proxy [puppet] - 10https://gerrit.wikimedia.org/r/758466 (https://phabricator.wikimedia.org/T300500) [14:56:27] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33511/console" [puppet] - 10https://gerrit.wikimedia.org/r/758466 (https://phabricator.wikimedia.org/T300500) (owner: 10JMeybohm) [14:58:05] !log update scap to 4.2.2 on A:mw-canary or A:parsoid-canary or A:mw-jobrunner-canary - T300392 [14:58:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:10] T300392: Deploy Scap version 4.2.2 - https://phabricator.wikimedia.org/T300392 [14:58:13] (03CR) 10JMeybohm: [V: 03+1] Add hostname-override and cluster-cidr to kube-proxy (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/758466 (https://phabricator.wikimedia.org/T300500) (owner: 10JMeybohm) [14:58:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2107.codfw.wmnet with OS bullseye [14:58:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P19648 and previous config saved to /var/cache/conftool/dbconfig/20220131-150311-marostegui.json [15:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:10] !log update scap to 4.2.2 on A:restbase-canary - T300392 [15:05:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:15] T300392: Deploy Scap version 4.2.2 - https://phabricator.wikimedia.org/T300392 [15:06:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T298559)', diff saved to https://phabricator.wikimedia.org/P19649 and previous config saved to 
/var/cache/conftool/dbconfig/20220131-150611-marostegui.json [15:06:13] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1143.eqiad.wmnet with reason: Maintenance [15:06:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1143.eqiad.wmnet with reason: Maintenance [15:06:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:16] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559 [15:06:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T298559)', diff saved to https://phabricator.wikimedia.org/P19650 and previous config saved to /var/cache/conftool/dbconfig/20220131-150619-marostegui.json [15:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T298559)', diff saved to https://phabricator.wikimedia.org/P19651 and previous config saved to /var/cache/conftool/dbconfig/20220131-150725-marostegui.json [15:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:13] (03CR) 10Vgutierrez: [C: 03+1] sslcert: additional search paths for certificates [puppet] - 10https://gerrit.wikimedia.org/r/716370 (https://phabricator.wikimedia.org/T290261) (owner: 10Filippo Giunchedi) [15:18:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P19652 and previous config saved to /var/cache/conftool/dbconfig/20220131-151816-marostegui.json [15:18:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:30] !log marostegui@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P19653 and previous config saved to /var/cache/conftool/dbconfig/20220131-152230-marostegui.json [15:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:05] !log hnowlan@deploy1002 Started deploy [restbase/deploy@0848b15] (dev-cluster): (no justification provided) [15:24:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:19] !log hnowlan@deploy1002 Finished deploy [restbase/deploy@0848b15] (dev-cluster): (no justification provided) (duration: 00m 13s) [15:24:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:25] 10SRE, 10Commons, 10Data-Persistence (Consultation), 10MediaWiki-extensions-WikibaseClient, and 4 others: Enable statement usage tracking on Commons and Co - https://phabricator.wikimedia.org/T188730 (10Lucas_Werkmeister_WMDE) That sounds like a very cumbersome hack to me, and I also think it’s too early t... 
[15:33:16] !log jelto@deploy1002 Started deploy [restbase/deploy@0848b15] (dev-cluster): (no justification provided) [15:33:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T298558)', diff saved to https://phabricator.wikimedia.org/P19654 and previous config saved to /var/cache/conftool/dbconfig/20220131-153320-marostegui.json [15:33:22] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [15:33:24] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [15:33:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:26] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [15:33:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T298558)', diff saved to https://phabricator.wikimedia.org/P19655 and previous config saved to /var/cache/conftool/dbconfig/20220131-153328-marostegui.json [15:33:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T298558)', diff saved to https://phabricator.wikimedia.org/P19656 and previous config saved to /var/cache/conftool/dbconfig/20220131-153446-marostegui.json [15:34:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P19657 and previous config saved to 
/var/cache/conftool/dbconfig/20220131-153734-marostegui.json [15:37:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:51] !log jelto@deploy1002 Finished deploy [restbase/deploy@0848b15] (dev-cluster): (no justification provided) (duration: 04m 34s) [15:37:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:17] 10SRE, 10SRE-Access-Requests, 10Research: Access to analytics-privatedata-users for Research intern AniketArs - https://phabricator.wikimedia.org/T299919 (10Miriam) @jhathaway could you double check that @AniketArs has LDAP access? They are not able to access the notebooks. He is able to access the stat ma... [15:45:49] Lucas_WMDE: sorry, didn't spot your messages here. Reading the backlog it seems that at least one node had/has issues pulling the image (wikibase-termbox:2022-01-25-175409-production) [15:46:08] (03CR) 10Herron: [C: 03+1] "LGTM! I'm on the fence about page defaulting to true, but let's try it" [puppet] - 10https://gerrit.wikimedia.org/r/757447 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [15:46:33] jayme: it seemed to work fine on the second attempt, do you think any further action is necessary? [15:48:06] Lucas_WMDE: there is still one pod failing (at least in codfw). Its events (kubectl describe po termbox-production-6f5b9d8cf-hclqs) show a "context canceled" error pulling the image [15:48:17] oh [15:48:22] that usually means that docker was unable to pull the image in 2m [15:48:37] pull & extract that is [15:48:43] <_joe_> which is strange indeed [15:49:04] it's an HDD node...so maybe termbox image grew? 
[15:49:20] possibly, though not by very much I would’ve thought [15:49:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P19658 and previous config saved to /var/cache/conftool/dbconfig/20220131-154950-marostegui.json [15:49:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:57] (03PS1) 10Ladsgroup: db2125: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/758506 (https://phabricator.wikimedia.org/T300510) [15:50:05] (03CR) 10BBlack: [C: 03+1] "The recursors parts seem fine for traffic's use (should be nop on buster and work fine for our own bullseye transition), and we don't use " [puppet] - 10https://gerrit.wikimedia.org/r/758063 (https://phabricator.wikimedia.org/T300254) (owner: 10Majavah) [15:50:18] (03PS1) 10Ladsgroup: Revert "db2107: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/758489 [15:50:26] just ~60MB compared to the version from 2021-12-06 [15:50:26] (03CR) 10Andrew Bogott: pdns: support bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/758063 (https://phabricator.wikimedia.org/T300254) (owner: 10Majavah) [15:50:41] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] db2125: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/758506 (https://phabricator.wikimedia.org/T300510) (owner: 10Ladsgroup) [15:50:47] but >500MB compared to 2021-03-09 :) [15:52:18] (03CR) 10Majavah: [V: 03+1] pdns: support bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/758063 (https://phabricator.wikimedia.org/T300254) (owner: 10Majavah) [15:52:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T298559)', diff saved to https://phabricator.wikimedia.org/P19659 and previous config saved to /var/cache/conftool/dbconfig/20220131-155239-marostegui.json [15:52:41] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on 
db1146.eqiad.wmnet with reason: Maintenance [15:52:42] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [15:52:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:45] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559 [15:52:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T298559)', diff saved to https://phabricator.wikimedia.org/P19660 and previous config saved to /var/cache/conftool/dbconfig/20220131-155246-marostegui.json [15:52:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T298559)', diff saved to https://phabricator.wikimedia.org/P19661 and previous config saved to /var/cache/conftool/dbconfig/20220131-155353-marostegui.json [15:54:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:39] (03CR) 10Andrew Bogott: pdns: support bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/758063 (https://phabricator.wikimedia.org/T300254) (owner: 10Majavah) [15:54:50] (03CR) 10Herron: [C: 03+1] "LGTM, have not been using these logs either." [puppet] - 10https://gerrit.wikimedia.org/r/757955 (https://phabricator.wikimedia.org/T297239) (owner: 10Cwhite) [15:55:20] (03PS1) 10Arturo Borrero Gonzalez: toolforge: automated tests: schedule webgen tool in the correct grid [puppet] - 10https://gerrit.wikimedia.org/r/758509 (https://phabricator.wikimedia.org/T300501) [15:55:30] Lucas_WMDE: no immediate action required from your side. 
I'll cycle back to this (potentially implementing a workaround) after a meeting [15:55:34] (03CR) 10Majavah: [V: 03+1] pdns: support bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/758063 (https://phabricator.wikimedia.org/T300254) (owner: 10Majavah) [15:55:35] looks like there’s a new pod that successfully pulled the image now [15:55:39] ok! [15:55:56] (03PS1) 10Majavah: O:openstack::services: don't use pdns prometheus exporters on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/758510 [15:56:15] (03PS2) 10Majavah: O:openstack::services: don't use pdns prometheus exporters on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/758510 (https://phabricator.wikimedia.org/T300254) [15:56:28] (03CR) 10Ssingh: [C: 03+1] "No change for doh* hosts." [puppet] - 10https://gerrit.wikimedia.org/r/758063 (https://phabricator.wikimedia.org/T300254) (owner: 10Majavah) [15:57:46] Lucas_WMDE: yeah, I've killed the other one which came with a good chance of the new pod being scheduled on a node with SSD's instead of HDD's [15:58:22] ah ok [15:58:27] so it wasn’t a coincidence ^^ [15:58:46] (03CR) 10Ssingh: [C: 03+1] pdns: support bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/758063 (https://phabricator.wikimedia.org/T300254) (owner: 10Majavah) [15:59:05] (03CR) 10Herron: [C: 03+1] o11y: tweak IcingaOverload alert [alerts] - 10https://gerrit.wikimedia.org/r/758474 (owner: 10Filippo Giunchedi) [15:59:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2107 (T300510)', diff saved to https://phabricator.wikimedia.org/P19662 and previous config saved to /var/cache/conftool/dbconfig/20220131-155905-ladsgroup.json [15:59:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:13] T300510: Upgrade s2 to Bullseye - https://phabricator.wikimedia.org/T300510 [16:00:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2125.codfw.wmnet with reason: 
Maintenance [16:00:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2125.codfw.wmnet with reason: Maintenance [16:00:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2125 (T300510)', diff saved to https://phabricator.wikimedia.org/P19663 and previous config saved to /var/cache/conftool/dbconfig/20220131-160054-ladsgroup.json [16:00:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:59] !log Move core routers loopback filter to Capirca [16:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db2125.codfw.wmnet with OS bullseye [16:02:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:10] (03PS2) 10Ladsgroup: Revert "db2107: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/758489 [16:03:16] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Revert "db2107: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/758489 (owner: 10Ladsgroup) [16:04:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P19664 and previous config saved to /var/cache/conftool/dbconfig/20220131-160456-marostegui.json [16:04:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:26] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10Infrastructure-Foundations, and 3 others: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10JAllemandou) 05Open→03Resolved [16:06:34] 10SRE, 10Infrastructure-Foundations, 10Traffic-Icebox, 10netops, 10Epic: Capacity planning for (& optimization of) transport backhaul vs 
edge egress - https://phabricator.wikimedia.org/T263275 (10JAllemandou) [16:06:39] PROBLEM - Host asw2-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [16:06:54] PROBLEM - Host ncredir-lb.ulsfo.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [16:06:54] PROBLEM - Host ncredir4001 is DOWN: PING CRITICAL - Packet loss = 100% [16:06:54] PROBLEM - Host ncredir4002 is DOWN: PING CRITICAL - Packet loss = 100% [16:06:57] PROBLEM - Host netflow4002 is DOWN: PING CRITICAL - Packet loss = 100% [16:07:46] PROBLEM - Host cr3-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [16:07:47] (03PS5) 10Herron: centrallog: clean up old /srv/syslog/host directories after grace period [puppet] - 10https://gerrit.wikimedia.org/r/757498 (https://phabricator.wikimedia.org/T300056) [16:08:01] PROBLEM - BFD status on cr2-eqord is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:08:05] XioNoX: ^ [16:08:15] er [16:08:17] PROBLEM - Host bast4003 is DOWN: PING CRITICAL - Packet loss = 100% [16:08:28] * Emperor is here. Have we a problem? 
[16:08:31] * volans here [16:08:33] that doesn't look good [16:08:36] here [16:08:39] I'm here too [16:08:41] rolling back my chane [16:08:42] change [16:08:55] PROBLEM - Host install4001 is DOWN: PING CRITICAL - Packet loss = 100% [16:08:57] PROBLEM - Host doh4001 is DOWN: PING CRITICAL - Packet loss = 100% [16:08:57] PROBLEM - Host doh4002 is DOWN: PING CRITICAL - Packet loss = 100% [16:08:59] PROBLEM - Host durum4001 is DOWN: PING CRITICAL - Packet loss = 100% [16:08:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P19665 and previous config saved to /var/cache/conftool/dbconfig/20220131-160859-marostegui.json [16:09:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:09] here but in meeting, watching and can help if needed [16:09:11] PROBLEM - Host prometheus4001 is DOWN: PING CRITICAL - Packet loss = 100% [16:09:17] here as well [16:09:19] <_joe_> should we depool ulsfo? 
[16:09:23] RECOVERY - Host doh4001 is UP: PING OK - Packet loss = 0%, RTA = 68.62 ms [16:09:23] RECOVERY - Host durum4001 is UP: PING OK - Packet loss = 0%, RTA = 68.55 ms [16:09:23] RECOVERY - Host asw2-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 69.04 ms [16:09:25] RECOVERY - Host doh4002 is UP: PING OK - Packet loss = 0%, RTA = 68.61 ms [16:09:26] RECOVERY - Host cr3-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 69.14 ms [16:09:26] nah [16:09:27] RECOVERY - Host bast4003 is UP: PING OK - Packet loss = 0%, RTA = 68.51 ms [16:09:27] RECOVERY - Host install4001 is UP: PING OK - Packet loss = 0%, RTA = 68.48 ms [16:09:28] <_joe_> I guess not [16:09:34] (03PS1) 10BBlack: Depool ulsfo [dns] - 10https://gerrit.wikimedia.org/r/758511 [16:09:39] RECOVERY - Host prometheus4001 is UP: PING OK - Packet loss = 0%, RTA = 68.60 ms [16:09:39] RECOVERY - Host ncredir4001 is UP: PING OK - Packet loss = 0%, RTA = 68.62 ms [16:09:41] RECOVERY - Host netflow4002 is UP: PING OK - Packet loss = 0%, RTA = 68.55 ms [16:09:50] RECOVERY - Host ncredir-lb.ulsfo.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 68.23 ms [16:09:53] RECOVERY - Host ncredir4002 is UP: PING OK - Packet loss = 0%, RTA = 68.61 ms [16:09:59] I see recovs, I was off uploading that patch. 
will hold for now :) [16:10:29] RECOVERY - BFD status on cr2-eqord is OK: OK: UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:10:30] I was looking at the status page to see impact and saw an increase in global latency, but it happened hours ago [16:10:50] yeah it's fully rolled back [16:11:36] (03CR) 10Herron: centrallog: clean up old /srv/syslog/host directories after grace period (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/757498 (https://phabricator.wikimedia.org/T300056) (owner: 10Herron) [16:11:37] we served a bunch of 5xx from ulsfo, looks to have recovered now [16:12:12] <_joe_> unavoidable I guess [16:13:06] yeah [16:14:33] jynus: the status page is not quite realtime, it often lags by 5-10 minutes [16:14:43] https://i.imgur.com/hAImtqm.png [16:15:09] cdanis: yeah, I noticed the opposite, a clear latency increase, but a long time ago [16:15:14] I think I found the issue, in my patch [16:15:59] XioNoX: interesting that your patch seemed to cause a traffic spillover to other links too [16:16:07] cdanis: see the increase at 14:05- but it is better to use grafana for this, if it is available [16:16:18] jynus: yes [16:16:20] cdanis: what do you mean? 
[16:16:28] XioNoX: https://librenms.wikimedia.org/graphs/to=1643645700/id=7220/type=port_bits/from=1643624100/ [16:16:28] (03PS1) 10Vgutierrez: site: Reimage cp3062 as cache::text_envoy [puppet] - 10https://gerrit.wikimedia.org/r/758512 (https://phabricator.wikimedia.org/T271421) [16:16:35] maybe it is unrelated [16:16:41] yeah [16:17:08] jynus: you can plug these queries into grafana explore https://gerrit.wikimedia.org/g/operations/puppet/+/e2b942b78e3d909fc2074e6b1eb80fc01761b8c0/hieradata/common/profile/statograph.yaml#14 [16:17:29] I mentioned it to research it more, as the net issue seemed recovering [16:17:32] maybe a deploy or something [16:18:54] confirming BTW ulsfo availability looking good too [16:19:07] cdanis: it caused ulsfo to be isolated from the rest of the other sites (lost ospf sessions), it could be that traffic briefly went through the other redundant link [16:20:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T298558)', diff saved to https://phabricator.wikimedia.org/P19666 and previous config saved to /var/cache/conftool/dbconfig/20220131-162000-marostegui.json [16:20:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [16:20:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [16:20:06] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [16:20:06] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [16:20:07] from traffic server side, there was at first a spike of 503s, then of 502s [16:20:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[16:20:10] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [16:20:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T298558)', diff saved to https://phabricator.wikimedia.org/P19667 and previous config saved to /var/cache/conftool/dbconfig/20220131-162014-marostegui.json [16:20:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:53] (talking about the recent net issue, still researching the older thingy) [16:21:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T298558)', diff saved to https://phabricator.wikimedia.org/P19668 and previous config saved to /var/cache/conftool/dbconfig/20220131-162132-marostegui.json [16:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:49] (03CR) 10Eevans: [C: 03+1] "Ready to go!" [puppet] - 10https://gerrit.wikimedia.org/r/757999 (https://phabricator.wikimedia.org/T298516) (owner: 10Eevans) [16:22:29] (03CR) 10ArielGlenn: [C: 03+2] Add siteinfo data in formatversion=2 too [dumps] - 10https://gerrit.wikimedia.org/r/747987 (owner: 10Legoktm) [16:23:21] oh, cdanis- status page is local time, right? 
[16:24:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P19669 and previous config saved to /var/cache/conftool/dbconfig/20220131-162403-marostegui.json [16:24:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:14] there was a traffic pattern change, but it was at 13:10 UTC: https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?viewPanel=62&orgId=1&from=1643624625604&to=1643646225604&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&var-method=GET&var-code=200 [16:25:13] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: automated tests: schedule webgen tool in the correct grid [puppet] - 10https://gerrit.wikimedia.org/r/758509 (https://phabricator.wikimedia.org/T300501) (owner: 10Arturo Borrero Gonzalez) [16:25:16] (03Merged) 10jenkins-bot: Add siteinfo data in formatversion=2 too [dumps] - 10https://gerrit.wikimedia.org/r/747987 (owner: 10Legoktm) [16:25:23] (03PS1) 10Ladsgroup: db2148: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/758513 (https://phabricator.wikimedia.org/T300510) [16:25:33] I am going to discard it was not self-inflicted and then probably we can ignore it [16:25:55] !log ariel@deploy1002 Started deploy [dumps/dumps@8820784]: add dump of siteinfo in format version 2 [16:25:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:58] !log ariel@deploy1002 Finished deploy [dumps/dumps@8820784]: add dump of siteinfo in format version 2 (duration: 00m 03s) [16:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:11] PROBLEM - Check systemd state on prometheus2006 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:26:23] jynus: yes the graphs are always local time, despite the TZ of the rest of the page 
[16:26:29] 10SRE, 10SRE-Access-Requests, 10Research: Access to analytics-privatedata-users for Research intern AniketArs - https://phabricator.wikimedia.org/T299919 (10jhathaway) @Miriam & @AniketArs they were not part of the `nda` group, they are added now, please try again. [16:26:40] cdanis: my fault, as I am +1, it was difficult to notice it at first :-) [16:26:54] "off by one errors" :-) [16:27:43] nothing ongoing on SAL at 13:06- only db maintenance, which doesn't create more GET traffic :-), so just traffic dependent [16:29:00] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Upgrade eqiad/codfw Ganeti clusters to Buster - https://phabricator.wikimedia.org/T284811 (10Ladsgroup) I don't know if this is result of this ticket or something unrelated but there is a lot of root@ spam with: ` Cluster configuration incomplete: 'Can... [16:29:45] (03PS2) 10Ladsgroup: db2148: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/758513 (https://phabricator.wikimedia.org/T300510) [16:29:49] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] db2148: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/758513 (https://phabricator.wikimedia.org/T300510) (owner: 10Ladsgroup) [16:30:04] jan_drewniak: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220131T1630). 
[16:34:03] 10SRE, 10ops-codfw, 10DC-Ops: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 (10Papaul) Please power down the servers and let me know when this is done [16:34:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2125.codfw.wmnet with OS bullseye [16:34:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:05] 10SRE, 10ops-codfw, 10DC-Ops: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 (10Papaul) p:05Triage→03Medium [16:36:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P19670 and previous config saved to /var/cache/conftool/dbconfig/20220131-163637-marostegui.json [16:36:37] (03CR) 10Filippo Giunchedi: [C: 03+1] centrallog: clean up old /srv/syslog/host directories after grace period [puppet] - 10https://gerrit.wikimedia.org/r/757498 (https://phabricator.wikimedia.org/T300056) (owner: 10Herron) [16:36:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:28] (03CR) 10Filippo Giunchedi: [C: 03+2] o11y: tweak IcingaOverload alert [alerts] - 10https://gerrit.wikimedia.org/r/758474 (owner: 10Filippo Giunchedi) [16:38:56] (03PS1) 10Hashar: ci: Qemu image and snapshot creation [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) [16:39:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T298559)', diff saved to https://phabricator.wikimedia.org/P19671 and previous config saved to /var/cache/conftool/dbconfig/20220131-163908-marostegui.json [16:39:11] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1121.eqiad.wmnet with reason: Maintenance [16:39:12] !log marostegui@cumin1001 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1121.eqiad.wmnet with reason: Maintenance [16:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:13] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [16:39:14] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559 [16:39:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [16:39:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1121 (T298559)', diff saved to https://phabricator.wikimedia.org/P19672 and previous config saved to /var/cache/conftool/dbconfig/20220131-163921-marostegui.json [16:39:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:33] (03PS1) 10Arturo Borrero Gonzalez: openstack: networktests: update external domain name [puppet] - 10https://gerrit.wikimedia.org/r/758515 [16:39:52] (03CR) 10jerkins-bot: [V: 04-1] ci: Qemu image and snapshot creation [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) (owner: 10Hashar) [16:40:01] !log mmandere@cumin1001 END (ERROR) - Cookbook sre.ganeti.makevm (exit_code=97) for new host durum6001.drmrs.wmnet [16:40:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:02] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: networktests: update 
external domain name [puppet] - 10https://gerrit.wikimedia.org/r/758515 (owner: 10Arturo Borrero Gonzalez) [16:43:38] (03CR) 10Hashar: [C: 04-1] "I can not ssh into the running vm so went with a passwordless root account to at least login via the console." [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) (owner: 10Hashar) [16:45:11] (03CR) 10Btullis: [C: 03+2] Upgrade remaining aqs_next nodes to 'dev' (Cassandra 3.11.11) [puppet] - 10https://gerrit.wikimedia.org/r/757999 (https://phabricator.wikimedia.org/T298516) (owner: 10Eevans) [16:45:24] (03PS1) 104nn1l2: Revert "commonswiki: Add leg.journals.isu.ac.ir to the wgCopyUploadsDomains allowlist" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758490 (https://phabricator.wikimedia.org/T300217) [16:45:35] (03CR) 10jerkins-bot: [V: 04-1] Revert "commonswiki: Add leg.journals.isu.ac.ir to the wgCopyUploadsDomains allowlist" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758490 (https://phabricator.wikimedia.org/T300217) (owner: 104nn1l2) [16:45:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T300510)', diff saved to https://phabricator.wikimedia.org/P19673 and previous config saved to /var/cache/conftool/dbconfig/20220131-164550-ladsgroup.json [16:45:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:56] T300510: Upgrade s2 to Bullseye - https://phabricator.wikimedia.org/T300510 [16:46:19] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Active - kubernetes-ml-eqiad, AS64606/IPv6: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:47:07] PROBLEM - cassandra-c CQL 10.64.48.186:9042 on restbase1027 is CRITICAL: connect to address 10.64.48.186 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [16:47:35] PROBLEM - Check whether ferm is active by checking the default input chain on 
prometheus2006 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:47:35] PROBLEM - cassandra-c SSL 10.64.48.186:7001 on restbase1027 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [16:47:43] PROBLEM - cassandra-c service on restbase1027 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:47:43] PROBLEM - Check systemd state on restbase1027 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service,cassandra-b.service,cassandra-c.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:47:49] PROBLEM - cassandra-b CQL 10.64.48.185:9042 on restbase1027 is CRITICAL: connect to address 10.64.48.185 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [16:47:55] PROBLEM - cassandra-a SSL 10.64.48.184:7001 on restbase1027 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [16:48:25] PROBLEM - cassandra-a CQL 10.64.48.184:9042 on restbase1027 is CRITICAL: connect to address 10.64.48.184 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [16:48:31] PROBLEM - cassandra-b SSL 10.64.48.185:7001 on restbase1027 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [16:48:59] PROBLEM - cassandra-a service on restbase1027 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:49:01] PROBLEM - cassandra-b service on restbase1027 is CRITICAL: CRITICAL - 
Expecting active but unit cassandra-b is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:50:51] 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): hw troubleshooting: PS Redundancy for elastic1077.eqiad.wmnet - https://phabricator.wikimedia.org/T300315 (10Gehel) [16:51:07] 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): hw troubleshooting: PS Redundancy for elastic1080.eqiad.wmnet - https://phabricator.wikimedia.org/T300317 (10Gehel) [16:51:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P19674 and previous config saved to /var/cache/conftool/dbconfig/20220131-165141-marostegui.json [16:51:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:18] !log restarting Cassandra, aqs1011-{a,b}, to apply upgrade to 3.11.11 -- T298516 [16:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:23] T298516: Investigate high levels of garbage collection on new AQS nodes - https://phabricator.wikimedia.org/T298516 [16:53:58] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "please collect +1 from andrew as well." [puppet] - 10https://gerrit.wikimedia.org/r/758510 (https://phabricator.wikimedia.org/T300254) (owner: 10Majavah) [16:55:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T298559)', diff saved to https://phabricator.wikimedia.org/P19675 and previous config saved to /var/cache/conftool/dbconfig/20220131-165531-marostegui.json [16:55:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:37] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559 [16:57:41] RECOVERY - Wikidough DoH Check on doh6001 is OK: OK - Certificate wikimedia-dns.org will expire on Fri 15 Apr 2022 01:00:09 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Wikidough [16:57:41] RECOVERY - Wikidough DoT Check on doh6001 is OK: TCP OK - 0.209 second response time on 185.15.58.11 port 853 https://wikitech.wikimedia.org/wiki/Wikidough [16:57:50] (03CR) 10Andrew Bogott: [C: 03+2] O:openstack::services: don't use pdns prometheus exporters on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/758510 (https://phabricator.wikimedia.org/T300254) (owner: 10Majavah) [17:00:28] (03PS1) 10Andrew Bogott: codfw1dev network tests: update to reflect that proxy-02 is now active [puppet] - 10https://gerrit.wikimedia.org/r/758520 (https://phabricator.wikimedia.org/T297627) [17:01:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [17:02:23] (03CR) 10Andrew Bogott: [C: 03+2] codfw1dev network tests: update to reflect that proxy-02 is now active [puppet] - 10https://gerrit.wikimedia.org/r/758520 (https://phabricator.wikimedia.org/T297627) (owner: 10Andrew Bogott) [17:03:57] !log restarting Cassandra, aqs1012-{a,b}, to apply upgrade to 3.11.11 -- T298516 [17:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:02] T298516: Investigate high levels of garbage collection on new AQS nodes - https://phabricator.wikimedia.org/T298516 [17:06:46] RECOVERY - Wikidough DoH Check on doh6002 is OK: OK - Certificate wikimedia-dns.org will expire on Fri 15 Apr 2022 01:00:09 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Wikidough [17:06:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T298558)', diff saved to https://phabricator.wikimedia.org/P19676 and previous config saved to /var/cache/conftool/dbconfig/20220131-170646-marostegui.json [17:06:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [17:06:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [17:06:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:51] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [17:06:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T298558)', diff saved to https://phabricator.wikimedia.org/P19677 and previous config saved to /var/cache/conftool/dbconfig/20220131-170653-marostegui.json [17:06:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [17:06:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:31] (03PS2) 104nn1l2: Revert "commonswiki: Add leg.journals.isu.ac.ir to the wgCopyUploadsDomains allowlist" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758490 (https://phabricator.wikimedia.org/T300217) [17:08:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2148.codfw.wmnet with reason: Maintenance [17:08:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2148.codfw.wmnet with reason: Maintenance [17:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2148 (T300510)', diff saved to https://phabricator.wikimedia.org/P19678 and previous config saved to /var/cache/conftool/dbconfig/20220131-170808-ladsgroup.json [17:08:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T298558)', diff saved to https://phabricator.wikimedia.org/P19679 and previous config saved to /var/cache/conftool/dbconfig/20220131-170812-marostegui.json [17:08:13] T300510: Upgrade s2 to Bullseye - https://phabricator.wikimedia.org/T300510 [17:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:05] (03CR) 10Andrew Bogott: [C: 03+2] "thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/758052 (https://phabricator.wikimedia.org/T300254) (owner: 10Majavah) [17:09:06] RECOVERY - Wikidough DoT Check on doh6002 is OK: TCP OK - 0.208 second response time on 185.15.58.41 port 853 https://wikitech.wikimedia.org/wiki/Wikidough [17:10:24] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10Cmjohnson) [17:10:31] (03PS3) 104nn1l2: Revert "commonswiki: Add leg.journals.isu.ac.ir to the wgCopyUploadsDomains allowlist" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758490 (https://phabricator.wikimedia.org/T300217) [17:10:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P19680 and previous config saved to /var/cache/conftool/dbconfig/20220131-171036-marostegui.json [17:10:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db2148.codfw.wmnet with OS bullseye [17:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:06] !log restarting Cassandra, aqs1012-{a,b}, to apply upgrade to 3.11.11 -- T298516 [17:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:12] T298516: Investigate high levels of garbage collection on new AQS nodes - https://phabricator.wikimedia.org/T298516 [17:11:19] !log restarting Cassandra, aqs1013-{a,b}, to apply upgrade to 3.11.11 -- T298516 [17:11:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:37] (03CR) 10Hashar: [C: 04-1] "The ssh host key can be generated by reconfiguring the ssh server using:" [puppet] - 10https://gerrit.wikimedia.org/r/758514 (https://phabricator.wikimedia.org/T284774) (owner: 10Hashar) [17:12:03] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 
eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10Cmjohnson) 1010 is updated, 1019 is locking up, I will need to power off and unplug [17:13:16] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudbackup1003.eqiad.wmnet with OS buster [17:13:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:42] (03PS2) 10Cwhite: apifeatureusage: disable gc logging [puppet] - 10https://gerrit.wikimedia.org/r/757955 (https://phabricator.wikimedia.org/T297239) [17:14:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudbackup1003.eqiad.wmnet with OS buster [17:15:45] !log restarting Cassandra, aqs1014-{a,b}, to apply upgrade to 3.11.11 -- T298516 [17:15:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:59] (03PS1) 10Ottomata: Set spark maxPartitionBytes to hadoop dfs block size [puppet] - 10https://gerrit.wikimedia.org/r/758529 (https://phabricator.wikimedia.org/T300299) [17:17:25] (03PS1) 10Eigyan: [wmf-config]: Undeploy gdi survey from cawiki in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758530 (https://phabricator.wikimedia.org/T300544) [17:19:01] (03CR) 10Andrew Bogott: [C: 03+2] P:openstack: fix novaenv path [puppet] - 10https://gerrit.wikimedia.org/r/758049 (https://phabricator.wikimedia.org/T300254) (owner: 10Majavah) [17:19:30] (03CR) 10Cwhite: [C: 03+2] apifeatureusage: disable gc logging [puppet] - 10https://gerrit.wikimedia.org/r/757955 (https://phabricator.wikimedia.org/T297239) (owner: 10Cwhite) [17:21:14] (03CR) 10Scardenasmolinar: [C: 03+1] "LGTM!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/758530 (https://phabricator.wikimedia.org/T300544) (owner: 10Eigyan) [17:22:57] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudbackup1004.eqiad.wmnet with OS buster [17:23:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudbackup1004.eqiad.wmnet with OS buster [17:23:05] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudbackup1004.eqiad.wmnet with OS buster [17:23:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudbackup1004.eqiad.wmnet with OS buster... 
[17:23:14] (03PS1) 10Ladsgroup: Revert "db2125: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/758491 [17:23:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P19681 and previous config saved to /var/cache/conftool/dbconfig/20220131-172317-marostegui.json [17:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:23] (03PS2) 10Ladsgroup: Revert "db2125: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/758491 [17:23:50] !log restarting Cassandra, aqs1015-{a,b}, to apply upgrade to 3.11.11 -- T298516 [17:23:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:54] T298516: Investigate high levels of garbage collection on new AQS nodes - https://phabricator.wikimedia.org/T298516 [17:23:59] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudbackup1004.eqiad.wmnet with OS buster [17:24:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudbackup1004.eqiad.wmnet with OS buster [17:24:06] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudbackup1004.eqiad.wmnet with OS buster [17:24:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:10] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Revert "db2125: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/758491 (owner: 10Ladsgroup) [17:24:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudbackup1004.eqiad.wmnet with OS buster... [17:25:46] (03PS2) 10Ottomata: Set spark maxPartitionBytes to hadoop dfs block size [puppet] - 10https://gerrit.wikimedia.org/r/758529 (https://phabricator.wikimedia.org/T300299) [17:25:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P19682 and previous config saved to /var/cache/conftool/dbconfig/20220131-172547-marostegui.json [17:25:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:34] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33514/console" [puppet] - 10https://gerrit.wikimedia.org/r/758529 (https://phabricator.wikimedia.org/T300299) (owner: 10Ottomata) [17:30:35] (03CR) 10Ottomata: [V: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1003/33514/stat1004.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/758529 (https://phabricator.wikimedia.org/T300299) (owner: 10Ottomata) [17:32:27] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team: (Need By: TBD) rack/setup/install ml-staging200[12] - https://phabricator.wikimedia.org/T294946 (10Papaul) [17:33:47] (03PS1) 10Cwhite: logstash: move safepoint logging flag inside gc_log gate [puppet] - 10https://gerrit.wikimedia.org/r/758533 (https://phabricator.wikimedia.org/T297239) [17:38:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P19683 and previous config saved to /var/cache/conftool/dbconfig/20220131-173821-marostegui.json [17:38:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T298559)', diff saved to 
https://phabricator.wikimedia.org/P19684 and previous config saved to /var/cache/conftool/dbconfig/20220131-174052-marostegui.json [17:40:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1160.eqiad.wmnet with reason: Maintenance [17:40:55] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1160.eqiad.wmnet with reason: Maintenance [17:40:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:58] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559 [17:41:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1160 (T298559)', diff saved to https://phabricator.wikimedia.org/P19685 and previous config saved to /var/cache/conftool/dbconfig/20220131-174059-marostegui.json [17:41:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:05] !log disable puppet on A:rec-dns for T758063 [17:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T298559)', diff saved to https://phabricator.wikimedia.org/P19686 and previous config saved to /var/cache/conftool/dbconfig/20220131-174206-marostegui.json [17:42:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:12] (03CR) 10Ssingh: [C: 03+2] pdns: support bullseye [puppet] - 10https://gerrit.wikimedia.org/r/758063 (https://phabricator.wikimedia.org/T300254) (owner: 10Majavah) [17:44:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2148.codfw.wmnet with OS bullseye [17:44:30] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:29] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:47:26] (03CR) 10Herron: [C: 03+2] centrallog: clean up old /srv/syslog/host directories after grace period [puppet] - 10https://gerrit.wikimedia.org/r/757498 (https://phabricator.wikimedia.org/T300056) (owner: 10Herron) [17:48:43] (03PS1) 10Cmjohnson: Didn't see a site.pp entry for new cloudbackup servers [puppet] - 10https://gerrit.wikimedia.org/r/758537 (https://phabricator.wikimedia.org/T293934) [17:49:28] (03CR) 10jerkins-bot: [V: 04-1] Didn't see a site.pp entry for new cloudbackup servers [puppet] - 10https://gerrit.wikimedia.org/r/758537 (https://phabricator.wikimedia.org/T293934) (owner: 10Cmjohnson) [17:51:39] (03PS1) 10Andrew Bogott: Openstack cloudservies: stop installing python2 git [puppet] - 10https://gerrit.wikimedia.org/r/758538 [17:51:49] (03PS2) 10Cmjohnson: Didn't see a site.pp entry for new cloudbackup servers [puppet] - 10https://gerrit.wikimedia.org/r/758537 (https://phabricator.wikimedia.org/T293934) [17:52:51] (03CR) 10jerkins-bot: [V: 04-1] Didn't see a site.pp entry for new cloudbackup servers [puppet] - 10https://gerrit.wikimedia.org/r/758537 (https://phabricator.wikimedia.org/T293934) (owner: 10Cmjohnson) [17:53:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T300510)', diff saved to https://phabricator.wikimedia.org/P19687 and previous config saved to /var/cache/conftool/dbconfig/20220131-175304-ladsgroup.json [17:53:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:09] T300510: Upgrade s2 to Bullseye - https://phabricator.wikimedia.org/T300510 [17:53:21] (03CR) 10Andrew Bogott: [C: 03+2] Openstack cloudservies: stop installing python2 git [puppet] - 10https://gerrit.wikimedia.org/r/758538 (owner: 10Andrew Bogott) [17:53:27] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T298558)', diff saved to https://phabricator.wikimedia.org/P19688 and previous config saved to /var/cache/conftool/dbconfig/20220131-175326-marostegui.json [17:53:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [17:53:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [17:53:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:31] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [17:53:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T298558)', diff saved to https://phabricator.wikimedia.org/P19689 and previous config saved to /var/cache/conftool/dbconfig/20220131-175333-marostegui.json [17:53:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:04] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase2017.codfw.wmnet with reason: Firmware upgrades [17:54:06] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase2017.codfw.wmnet with reason: Firmware upgrades [17:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:18] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:54:23] (03PS3) 10Cmjohnson: Didn't see a site.pp entry for new cloudbackup servers 
[puppet] - 10https://gerrit.wikimedia.org/r/758537 (https://phabricator.wikimedia.org/T293934) [17:54:45] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase2017.wmnet [17:54:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T298558)', diff saved to https://phabricator.wikimedia.org/P19690 and previous config saved to /var/cache/conftool/dbconfig/20220131-175452-marostegui.json [17:54:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:09] (03CR) 10Cmjohnson: [C: 03+2] Didn't see a site.pp entry for new cloudbackup servers [puppet] - 10https://gerrit.wikimedia.org/r/758537 (https://phabricator.wikimedia.org/T293934) (owner: 10Cmjohnson) [17:55:16] 10SRE, 10ops-codfw, 10DC-Ops: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 (10hnowlan) >>! In T299652#7664448, @Papaul wrote: > Please power down the servers and let me now when this is done Ideally I'd like to do thi... [17:57:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P19691 and previous config saved to /var/cache/conftool/dbconfig/20220131-175710-marostegui.json [17:57:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:10] 10SRE, 10SRE-Access-Requests, 10Research: Access to analytics-privatedata-users for Research intern AniketArs - https://phabricator.wikimedia.org/T299919 (10AniketArs) Thanks @jhathaway , Now I'm able to login Finally thanks @Miriam [18:00:04] ryankemper: May I have your attention please! Wikidata Query Service weekly deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220131T1800) [18:01:11] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudbackup1003.eqiad.wmnet with OS buster [18:01:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudbackup1003.eqiad... [18:01:31] 10SRE, 10SRE-Access-Requests, 10Research: Access to analytics-privatedata-users for Research intern AniketArs - https://phabricator.wikimedia.org/T299919 (10jhathaway) 05Open→03Resolved great, marking as resolved, please reopen if you discover any new issues. [18:01:36] !log installing NSS security updates [18:01:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:00] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudbackup1003.eqiad.wmnet with OS buster [18:02:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudbackup1003.e... 
[18:03:05] (03PS1) 10Ssingh: pdns: update config file to remove deprecated option [puppet] - 10https://gerrit.wikimedia.org/r/758540 [18:04:30] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [18:04:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:20] (03CR) 10Mepps: [C: 03+1] [wmf-config]: Undeploy gdi survey from cawiki in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758530 (https://phabricator.wikimedia.org/T300544) (owner: 10Eigyan) [18:05:34] (03CR) 10Cwhite: [C: 03+1] "PCC checks out: https://puppet-compiler.wmflabs.org/pcc-worker1003/33517/" [puppet] - 10https://gerrit.wikimedia.org/r/758533 (https://phabricator.wikimedia.org/T297239) (owner: 10Cwhite) [18:06:41] 10SRE, 10Traffic: Problem loading thumbnail images due to Envoy (426 Upgrade Required) - https://phabricator.wikimedia.org/T300366 (10Esanders) This appears to be affecting Patch demo instances too: https://github.com/MatmaRex/patchdemo/issues/422 [18:06:56] (03CR) 10Andrew Bogott: "won't this break on everything pre-bullseye? 
The -content option wasn't added until 4.4" [puppet] - 10https://gerrit.wikimedia.org/r/758540 (owner: 10Ssingh) [18:07:47] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:07:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P19693 and previous config saved to /var/cache/conftool/dbconfig/20220131-180956-marostegui.json [18:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P19694 and previous config saved to /var/cache/conftool/dbconfig/20220131-181215-marostegui.json [18:12:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:39] (03CR) 10Andrew Bogott: [C: 03+2] P:toolforge::redis_sentinel: fix hardcoded interface [puppet] - 10https://gerrit.wikimedia.org/r/758090 (https://phabricator.wikimedia.org/T153810) (owner: 10Majavah) [18:13:54] (03CR) 10Andrew Bogott: pdns: update config file to remove deprecated option (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/758540 (owner: 10Ssingh) [18:17:51] (03CR) 10Herron: [C: 03+1] logstash: move safepoint logging flag inside gc_log gate [puppet] - 10https://gerrit.wikimedia.org/r/758533 (https://phabricator.wikimedia.org/T297239) (owner: 10Cwhite) [18:25:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P19695 and previous config saved to /var/cache/conftool/dbconfig/20220131-182501-marostegui.json [18:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T298559)', diff saved to 
https://phabricator.wikimedia.org/P19696 and previous config saved to /var/cache/conftool/dbconfig/20220131-182719-marostegui.json [18:27:22] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1149.eqiad.wmnet with reason: Maintenance [18:27:23] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1149.eqiad.wmnet with reason: Maintenance [18:27:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:26] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559 [18:27:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T298559)', diff saved to https://phabricator.wikimedia.org/P19697 and previous config saved to /var/cache/conftool/dbconfig/20220131-182728-marostegui.json [18:27:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:38] (03CR) 10Cwhite: [C: 03+2] logstash: move safepoint logging flag inside gc_log gate [puppet] - 10https://gerrit.wikimedia.org/r/758533 (https://phabricator.wikimedia.org/T297239) (owner: 10Cwhite) [18:28:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T298559)', diff saved to https://phabricator.wikimedia.org/P19698 and previous config saved to /var/cache/conftool/dbconfig/20220131-182834-marostegui.json [18:28:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:48] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ml-staging2001.mgmt.codfw.wmnet with reboot policy FORCED [18:28:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:06] (03PS1) 10Cwhite: logstash: disable gc logging on logstash collectors [puppet] - 
10https://gerrit.wikimedia.org/r/758541 (https://phabricator.wikimedia.org/T288258) [18:35:13] (03PS1) 10Muehlenhoff: Update logstash Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/758542 [18:37:04] (03CR) 10Cwhite: [C: 03+1] "Looks right! Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/758542 (owner: 10Muehlenhoff) [18:37:55] 10SRE, 10Commons, 10Data-Persistence (Consultation), 10MediaWiki-extensions-WikibaseClient, and 4 others: Enable statement usage tracking on Commons and Co - https://phabricator.wikimedia.org/T188730 (10Ladsgroup) The problem is that if we want to have the long-term vision in mind, we need to move towards... [18:39:51] (03CR) 10Muehlenhoff: [C: 03+2] Update logstash Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/758542 (owner: 10Muehlenhoff) [18:40:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T298558)', diff saved to https://phabricator.wikimedia.org/P19699 and previous config saved to /var/cache/conftool/dbconfig/20220131-184006-marostegui.json [18:40:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:11] T298558: Fix mismatching field type of protected_titles.pt_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298558 [18:40:24] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudbackup1003.eqiad.wmnet with OS buster [18:40:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudbackup1003.eqiad.wmnet with OS buster... 
[18:41:17] (03CR) 10Herron: [C: 03+1] logstash: disable gc logging on logstash collectors [puppet] - 10https://gerrit.wikimedia.org/r/758541 (https://phabricator.wikimedia.org/T288258) (owner: 10Cwhite) [18:41:17] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudbackup1004.eqiad.wmnet with OS buster [18:41:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudbackup1004.eqiad.wmnet with OS buster [18:41:24] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudbackup1004.eqiad.wmnet with OS buster [18:41:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudbackup1004.eqiad.wmnet with OS buster... 
[18:41:53] (03CR) 10Cwhite: [C: 03+2] logstash: disable gc logging on logstash collectors [puppet] - 10https://gerrit.wikimedia.org/r/758541 (https://phabricator.wikimedia.org/T288258) (owner: 10Cwhite) [18:43:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P19700 and previous config saved to /var/cache/conftool/dbconfig/20220131-184339-marostegui.json [18:43:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:50] (03PS4) 10Clare Ming: Update config for idwiki: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757500 (https://phabricator.wikimedia.org/T299676) [18:43:53] (03PS4) 10Giuseppe Lavagetto: Refactor Rakefile [deployment-charts] - 10https://gerrit.wikimedia.org/r/757977 [18:43:55] (03PS3) 10Giuseppe Lavagetto: Rakefile: switch to using the new check_charts task [deployment-charts] - 10https://gerrit.wikimedia.org/r/758423 [18:46:08] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:46:55] (03PS5) 10Clare Ming: Update config for idwiki: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757500 (https://phabricator.wikimedia.org/T299676) [18:52:11] PROBLEM - Cassandra instance data free space on restbase2010 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7411 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [18:52:21] PROBLEM - Disk space on centrallog1001 is CRITICAL: DISK CRITICAL - free space: /srv 32699 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=centrallog1001&var-datasource=eqiad+prometheus/ops [18:54:15] PROBLEM - Cassandra instance data free space on restbase2010 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7407 MB (20% inode=99%): 
https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [18:54:19] 10SRE, 10Performance-Team, 10Traffic, 10serviceops: Decide on details of progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10Krinkle) a:05Krinkle→03None [18:57:54] (03CR) 10Jdlrobson: [C: 03+1] Update config for idwiki: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757500 (https://phabricator.wikimedia.org/T299676) (owner: 10Clare Ming) [18:58:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P19701 and previous config saved to /var/cache/conftool/dbconfig/20220131-185843-marostegui.json [18:58:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] RoanKattouw and Urbanecm: That opportune time is upon us again. Time for a UTC evening backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220131T1900). [19:00:04] cjming, nn1l2, and eigyan: A patch you scheduled for UTC evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:08] hi [19:00:16] hello [19:00:32] hey [19:00:53] cjming: hi, do you want to deploy today? Or should I? [19:01:25] urbanecm - do you mind doing it? i'm trying to multitask atm which is not my strong suit [19:01:31] sure [19:02:10] cjming: should i put your patches at the end? or reviewing doesn't hurt your multitasking that much? 
[19:02:32] that's fine too - and i can take care of them then [19:02:54] (03CR) 10Urbanecm: [C: 03+2] Revert "commonswiki: Add leg.journals.isu.ac.ir to the wgCopyUploadsDomains allowlist" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758490 (https://phabricator.wikimedia.org/T300217) (owner: 104nn1l2) [19:03:02] cjming: okay, will ping you when done [19:03:05] (03CR) 10EllenR: "looks like 2 already, but since it is showing up in my dashboard I will answer" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758530 (https://phabricator.wikimedia.org/T300544) (owner: 10Eigyan) [19:03:13] urbanecm: ty! [19:03:40] (03Merged) 10jenkins-bot: Revert "commonswiki: Add leg.journals.isu.ac.ir to the wgCopyUploadsDomains allowlist" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758490 (https://phabricator.wikimedia.org/T300217) (owner: 104nn1l2) [19:04:05] Greetings [19:04:12] nn1l2: I'll just sync this one, since it's a revert [19:04:19] thanks! [19:06:00] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: a659cb0089da0c6d501263c19dd692a286601d2c: Revert "commonswiki: Add leg.journals.isu.ac.ir to the wgCopyUploadsDomains allowlist" (T300217) (duration: 00m 50s) [19:06:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:05] T300217: Add leg.journals.isu.ac.ir to the wgCopyUploadsDomains allowlist of Wikimedia Commons - https://phabricator.wikimedia.org/T300217 [19:06:10] nn1l2: live :) [19:06:20] (03PS2) 10Urbanecm: [wmf-config]: Undeploy gdi survey from cawiki in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758530 (https://phabricator.wikimedia.org/T300544) (owner: 10Eigyan) [19:06:20] Thank you! 
[19:06:33] (03CR) 10Urbanecm: [C: 03+2] [wmf-config]: Undeploy gdi survey from cawiki in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758530 (https://phabricator.wikimedia.org/T300544) (owner: 10Eigyan) [19:06:54] thank you [19:07:00] (03PS2) 10Andrew Bogott: pdns: update config file to remove deprecated option [puppet] - 10https://gerrit.wikimedia.org/r/758540 (owner: 10Ssingh) [19:07:05] hi eigyan, do you want to test it at mwdebug1001 (once it's there)? [19:07:28] (as far as i know surveys, it can't be reasonably tested, but i'm not 100% sure) [19:07:35] (03Merged) 10jenkins-bot: [wmf-config]: Undeploy gdi survey from cawiki in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758530 (https://phabricator.wikimedia.org/T300544) (owner: 10Eigyan) [19:07:57] eigyan: it's at mwdebug1001 if you want to test. [19:08:24] Will do urbanecm [19:08:27] thanks [19:08:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:08:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:09] urbanecm VERIFIED! thank you [19:09:12] syncing [19:10:15] PROBLEM - Cassandra instance data free space on restbase2010 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7371 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [19:10:24] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 411af378c606c0f987679a1eebd901326dd5db18: [wmf-config]: Undeploy gdi survey from cawiki in production (T300544) (duration: 00m 50s) [19:10:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:29] T300544: Undeploy the cawiki test survey from production - https://phabricator.wikimedia.org/T300544 [19:10:38] eigyan: and, live [19:10:57] cjming: I'm done. I can do yours now, or you can self-serve -- up to you. [19:11:18] i can self-serve - thanks urbanecm! 
[19:11:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:11:27] great! Ping me if I'm needed then :) [19:11:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:42] (03PS3) 10Clare Ming: Disable A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757735 (https://phabricator.wikimedia.org/T297924) (owner: 10Jdlrobson) [19:11:53] and live ✅ [19:11:55] (03PS6) 10Clare Ming: Update config for idwiki: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757500 (https://phabricator.wikimedia.org/T299676) [19:13:04] (03CR) 10Clare Ming: [C: 03+2] Update config for idwiki: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757500 (https://phabricator.wikimedia.org/T299676) (owner: 10Clare Ming) [19:13:45] (03Merged) 10jenkins-bot: Update config for idwiki: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757500 (https://phabricator.wikimedia.org/T299676) (owner: 10Clare Ming) [19:13:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T298559)', diff saved to https://phabricator.wikimedia.org/P19702 and previous config saved to /var/cache/conftool/dbconfig/20220131-191348-marostegui.json [19:13:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1148.eqiad.wmnet with reason: Maintenance [19:13:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1148.eqiad.wmnet with reason: Maintenance [19:13:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:54] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559 [19:13:56] !log marostegui@cumin1001 dbctl 
commit (dc=all): 'Depooling db1148 (T298559)', diff saved to https://phabricator.wikimedia.org/P19703 and previous config saved to /var/cache/conftool/dbconfig/20220131-191356-marostegui.json [19:13:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:14:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:37] (03PS4) 10Clare Ming: Disable A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757735 (https://phabricator.wikimedia.org/T297924) (owner: 10Jdlrobson) [19:17:55] !log cjming@deploy1002 Synchronized wmf-config/config: Config: [[gerrit:757500|Update config for idwiki: (T299676)]] (duration: 00m 50s) [19:17:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:00] T299676: Turn on desktop improvements by default on idwiki - https://phabricator.wikimedia.org/T299676 [19:19:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:19:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:51] (03CR) 10Clare Ming: [C: 03+2] Disable A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757735 (https://phabricator.wikimedia.org/T297924) (owner: 10Jdlrobson) [19:20:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:20:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:20:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:43] (03PS3) 10Andrew Bogott: 
pdns: update config file to remove deprecated option [puppet] - 10https://gerrit.wikimedia.org/r/758540 (owner: 10Ssingh) [19:21:23] (03Merged) 10jenkins-bot: Disable A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757735 (https://phabricator.wikimedia.org/T297924) (owner: 10Jdlrobson) [19:21:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:21:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:47] (03CR) 10Andrew Bogott: [C: 03+2] pdns: update config file to remove deprecated option [puppet] - 10https://gerrit.wikimedia.org/r/758540 (owner: 10Ssingh) [19:24:51] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:757735|Disable A/B test (T297924)]] (duration: 00m 49s) [19:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:56] T297924: Turn A/B test enrollment off and deploy sticky header everywhere - https://phabricator.wikimedia.org/T297924 [19:25:45] urbanecm: my changes are live - shall i go ahead and close the deployment window? [19:26:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T298559)', diff saved to https://phabricator.wikimedia.org/P19704 and previous config saved to /var/cache/conftool/dbconfig/20220131-192604-marostegui.json [19:26:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:10] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559 [19:26:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:26:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:58] cjming: yes please -- you were the last one. 
[19:27:19] !log end of UTC evening backport & config window [19:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:49] thanks! [19:28:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:28:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:28:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:29:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P19705 and previous config saved to /var/cache/conftool/dbconfig/20220131-194109-marostegui.json [19:41:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:57] PROBLEM - DNS on thumbor2005.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.193.0.182 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:42:46] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-staging2001.mgmt.codfw.wmnet with reboot policy FORCED [19:42:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:58] (PS16) Gehel: elastic: install elasticsearch-oss from component [puppet] - https://gerrit.wikimedia.org/r/757700 (owner: Ryan Kemper) [19:48:49] PROBLEM - Cassandra instance data free space on restbase2010 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7267 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [19:55:11] SRE, ops-codfw, DC-Ops, Machine-Learning-Team: (Need By: TBD) 
rack/setup/install ml-serve200[5-8] - https://phabricator.wikimedia.org/T294945 (Papaul) [19:55:35] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:56:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P19706 and previous config saved to /var/cache/conftool/dbconfig/20220131-195614-marostegui.json [19:56:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:11] (PS3) Jdlrobson: Enable migration mode on all group 0, group 1 and desktop-improvement wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/757733 (https://phabricator.wikimedia.org/T299927) [20:02:31] SRE, Traffic: Serve redirect wikimediastatus.net --> www.wikimediastatus.net - https://phabricator.wikimedia.org/T300161 (CDanis) Open→Resolved a:CDanis @Volans made the suggestion of using wikitech-static. Given that status.wikipedia.org is currently served from there, this seems quite reas... 
[20:02:38] 10SRE, 10SRE-OnFire (FY2021/2022-Q3), 10Infrastructure-Foundations, 10SRE Observability (FY2021/2022-Q3): Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (10CDanis) [20:05:25] PROBLEM - Cassandra instance data free space on restbase2010 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7068 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [20:07:23] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ml-staging2001.mgmt.codfw.wmnet with reboot policy FORCED [20:07:24] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-staging2001.mgmt.codfw.wmnet with reboot policy FORCED [20:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:51] (03PS1) 10JHathaway: ferm: replace systemd unit to ensure success on boot [puppet] - 10https://gerrit.wikimedia.org/r/758548 [20:09:11] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ml-staging2001.mgmt.codfw.wmnet with reboot policy FORCED [20:09:12] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-staging2001.mgmt.codfw.wmnet with reboot policy FORCED [20:09:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:18] (03CR) 10JHathaway: "Would love a review. 
I hit this problem on mx1001, but I would love to understand if it is a problem on all ferm hosts with @resolve rules" [puppet] - 10https://gerrit.wikimedia.org/r/758548 (owner: 10JHathaway) [20:10:24] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ml-staging2001.mgmt.codfw.wmnet with reboot policy FORCED [20:10:24] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-staging2001.mgmt.codfw.wmnet with reboot policy FORCED [20:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T298559)', diff saved to https://phabricator.wikimedia.org/P19707 and previous config saved to /var/cache/conftool/dbconfig/20220131-201118-marostegui.json [20:11:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:23] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559 [20:12:59] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ml-staging2001.mgmt.codfw.wmnet with reboot policy FORCED [20:12:59] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ml-staging2001.mgmt.codfw.wmnet with reboot policy FORCED [20:13:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:07] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ml-staging2001.mgmt.codfw.wmnet with reboot policy FORCED [20:14:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:53] RECOVERY - Check whether ferm is active by checking the default input chain on prometheus2006 is OK: OK ferm input default policy is set 
https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:21:13] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-staging2001.mgmt.codfw.wmnet with reboot policy FORCED [20:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:30] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host ml-staging2002.mgmt.codfw.wmnet with reboot policy FORCED [20:24:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:27] PROBLEM - Cassandra instance data free space on restbase2010 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7402 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [20:27:45] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team: (Need By: TBD) rack/setup/install ml-staging200[12] - https://phabricator.wikimedia.org/T294946 (10Papaul) Can someone please update this task with the Partitioning/Raid information? Thanks. [20:29:13] (03CR) 10JHathaway: [C: 03+1] O:mail::mx: Add mx specific block list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/757517 (https://phabricator.wikimedia.org/T270618) (owner: 10Jbond) [20:29:48] (03CR) 10Jcrespo: [C: 03+1] "Will merge tomorrow." [puppet] - 10https://gerrit.wikimedia.org/r/757509 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [20:31:26] (03CR) 10JHathaway: P:installserver::proxy: Add domain whitelist to proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T298087) (owner: 10Jbond) [20:31:55] (03CR) 10JHathaway: [C: 03+1] C:mw_rc_irc::ircserver: Refresh ircd services on config changes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/753046 (https://phabricator.wikimedia.org/T284052) (owner: 10Jbond) [20:33:07] (03CR) 10Dzahn: [C: 03+1] "ok, cool, will let you merge it. thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/757509 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [20:33:25] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-staging2002.mgmt.codfw.wmnet with reboot policy FORCED [20:33:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:03] (03PS1) 10Volans: dhcp: case-insensitive match if Dell serial number [software/spicerack] - 10https://gerrit.wikimedia.org/r/758558 [20:39:48] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host etherpad1003.eqiad.wmnet [20:39:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:47] 10SRE, 10Wikimedia-Etherpad, 10serviceops, 10vm-requests: create bullseye VM for Etherpad upgrade - https://phabricator.wikimedia.org/T300568 (10Dzahn) [20:43:30] ACKNOWLEDGEMENT - Check systemd state on restbase1027 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service,cassandra-b.service,cassandra-c.service eevans File permissions are preventing Cassandra startup. Fallout from Buster migration? https://phabricator.wikimedia.org/T295375#7663733 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:43:30] ACKNOWLEDGEMENT - cassandra-a CQL 10.64.48.184:9042 on restbase1027 is CRITICAL: connect to address 10.64.48.184 and port 9042: Connection refused eevans File permissions are preventing Cassandra startup. Fallout from Buster migration? https://phabricator.wikimedia.org/T295375#7663733 https://phabricator.wikimedia.org/T93886 [20:43:30] ACKNOWLEDGEMENT - cassandra-a SSL 10.64.48.184:7001 on restbase1027 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused eevans File permissions are preventing Cassandra startup. Fallout from Buster migration? 
https://phabricator.wikimedia.org/T295375#7663733 https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [20:43:30] ACKNOWLEDGEMENT - cassandra-a service on restbase1027 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed eevans File permissions are preventing Cassandra startup. Fallout from Buster migration? https://phabricator.wikimedia.org/T295375#7663733 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:43:30] ACKNOWLEDGEMENT - cassandra-b CQL 10.64.48.185:9042 on restbase1027 is CRITICAL: connect to address 10.64.48.185 and port 9042: Connection refused eevans File permissions are preventing Cassandra startup. Fallout from Buster migration? https://phabricator.wikimedia.org/T295375#7663733 https://phabricator.wikimedia.org/T93886 [20:43:30] ACKNOWLEDGEMENT - cassandra-b SSL 10.64.48.185:7001 on restbase1027 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused eevans File permissions are preventing Cassandra startup. Fallout from Buster migration? https://phabricator.wikimedia.org/T295375#7663733 https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [20:43:30] ACKNOWLEDGEMENT - cassandra-b service on restbase1027 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed eevans File permissions are preventing Cassandra startup. Fallout from Buster migration? https://phabricator.wikimedia.org/T295375#7663733 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:43:31] ACKNOWLEDGEMENT - cassandra-c CQL 10.64.48.186:9042 on restbase1027 is CRITICAL: connect to address 10.64.48.186 and port 9042: Connection refused eevans File permissions are preventing Cassandra startup. Fallout from Buster migration? 
https://phabricator.wikimedia.org/T295375#7663733 https://phabricator.wikimedia.org/T93886 [20:43:32] ACKNOWLEDGEMENT - cassandra-c SSL 10.64.48.186:7001 on restbase1027 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused eevans File permissions are preventing Cassandra startup. Fallout from Buster migration? https://phabricator.wikimedia.org/T295375#7663733 https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [20:43:32] ACKNOWLEDGEMENT - cassandra-c service on restbase1027 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed eevans File permissions are preventing Cassandra startup. Fallout from Buster migration? https://phabricator.wikimedia.org/T295375#7663733 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:43:49] scary but nice:) ty [20:44:05] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team: (Need By: TBD) rack/setup/install ml-staging200[12] - https://phabricator.wikimedia.org/T294946 (10Papaul) [20:50:12] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host etherpad1003.eqiad.wmnet [20:50:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:27] PROBLEM - Cassandra instance data free space on restbase2010 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7407 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [20:52:29] (03CR) 10JHathaway: [C: 03+1] "looks good, would be great to get this merged into extlib" [puppet] - 10https://gerrit.wikimedia.org/r/753786 (owner: 10Jbond) [20:54:39] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [20:54:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:55] (03PS1) 10Dzahn: DHCP: add MAC address for etherpad1003 [puppet] - 10https://gerrit.wikimedia.org/r/758559 (https://phabricator.wikimedia.org/T300568) [20:56:12] (03PS2) 10Dzahn: DHCP: add MAC address for 
etherpad1003, use bullseye installer [puppet] - https://gerrit.wikimedia.org/r/758559 (https://phabricator.wikimedia.org/T300568) [20:57:05] (CR) Dzahn: [C: +2] DHCP: add MAC address for etherpad1003, use bullseye installer [puppet] - https://gerrit.wikimedia.org/r/758559 (https://phabricator.wikimedia.org/T300568) (owner: Dzahn) [21:00:05] chrisalbon and accraze: My dear minions, it's time we take the moon! Just kidding. Time for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220131T2100). [21:01:00] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:01:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:33] SRE, ops-codfw, DC-Ops: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 (Papaul) [21:01:41] SRE, ops-codfw, DC-Ops: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 (Papaul) [21:12:15] SRE, serviceops, Patch-For-Review: Remove mediawiki::packages::fonts from non thumbor servers - https://phabricator.wikimedia.org/T294378 (Dzahn) @Joe The one package I was talking about is "ttf-bitstream-vera" which gets installed when you remove "fonts-dejacu-core*" and since I did "--purge fonts*"... 
[21:12:35] RECOVERY - Check systemd state on restbase1027 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:15:39] !log installed bullseye on new VM etherpad1003, signing puppet certs for etherpad1003.eqiad.wmnet - puppet error expected until we add the role (T300568) [21:15:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:45] T300568: create bullseye VM for Etherpad upgrade - https://phabricator.wikimedia.org/T300568 [21:17:10] (PS1) Dzahn: site: add etherpad1003 with insetup role [puppet] - https://gerrit.wikimedia.org/r/758560 (https://phabricator.wikimedia.org/T300568) [21:17:25] SRE, ops-codfw, DC-Ops, Machine-Learning-Team: (Need By: TBD) rack/setup/install ml-serve200[5-8] - https://phabricator.wikimedia.org/T294945 (Papaul) [21:17:46] (CR) Dzahn: [C: +2] site: add etherpad1003 with insetup role [puppet] - https://gerrit.wikimedia.org/r/758560 (https://phabricator.wikimedia.org/T300568) (owner: Dzahn) [21:19:35] PROBLEM - Check systemd state on restbase1027 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service,cassandra-b.service,cassandra-c.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:22:38] (PS1) Dzahn: switch etherpad.discovery.wmnet to etherpad1003 [dns] - https://gerrit.wikimedia.org/r/758561 (https://phabricator.wikimedia.org/T300568) [21:23:19] (PS1) Dzahn: site: add etherpad role to etherpad1003 [puppet] - https://gerrit.wikimedia.org/r/758562 (https://phabricator.wikimedia.org/T300568) [21:23:50] SRE, Wikimedia-Etherpad, serviceops, vm-requests, Patch-For-Review: create bullseye VM for Etherpad upgrade (and upgrade it:) - https://phabricator.wikimedia.org/T300568 (Dzahn) [21:25:32] SRE, Wikimedia-Etherpad, serviceops, vm-requests, Patch-For-Review: create bullseye VM for Etherpad upgrade (and upgrade it:) - 
https://phabricator.wikimedia.org/T300568 (Dzahn) a:Dzahn→None [21:25:34] (Abandoned) Majavah: Bare minimum port to Python 3 to support Debian Bullseye [debs/prometheus-pdns-rec-exporter] - https://gerrit.wikimedia.org/r/758068 (https://phabricator.wikimedia.org/T300254) (owner: Majavah) [21:28:09] SRE, Wikimedia-Etherpad, serviceops, vm-requests, Patch-For-Review: create bullseye VM for Etherpad upgrade (and upgrade it:) - https://phabricator.wikimedia.org/T300568 (Dzahn) [21:28:54] (CR) Dzahn: [C: -2] "Have to be careful because I don't want to repeat what happened last time, got reminded when I read the old ticket: T224580#5828883" [puppet] - https://gerrit.wikimedia.org/r/758562 (https://phabricator.wikimedia.org/T300568) (owner: Dzahn) [21:30:27] (CR) Dzahn: [C: -2] "so yea, we need to coordinate and mask the service on one server before starting it on the other etc..." [puppet] - https://gerrit.wikimedia.org/r/758562 (https://phabricator.wikimedia.org/T300568) (owner: Dzahn) [21:31:40] (CR) Dzahn: [C: -2] "this will be the last step after everything is confirmed working. just pre-created it but not ready" [dns] - https://gerrit.wikimedia.org/r/758561 (https://phabricator.wikimedia.org/T300568) (owner: Dzahn) [21:35:14] SRE, Wikimedia-Etherpad, serviceops, vm-requests: create bullseye VM for Etherpad upgrade (and upgrade it:) - https://phabricator.wikimedia.org/T300568 (Dzahn) [21:41:41] SRE, Wikimedia-Etherpad, serviceops, vm-requests: create bullseye VM for Etherpad upgrade (and upgrade it:) - https://phabricator.wikimedia.org/T300568 (Dzahn) @akosiaris Looking at the old ticket when we upgraded to buster, I don't want to repeat the mistake and run Etherpad on 2 servers at a... 
[21:46:26] (PS5) JHathaway: [WIP] team-sre: add hardware-related checks [alerts] - https://gerrit.wikimedia.org/r/757489 (https://phabricator.wikimedia.org/T294564) (owner: Volans) [21:46:57] (CR) JHathaway: [WIP] team-sre: add hardware-related checks (1 comment) [alerts] - https://gerrit.wikimedia.org/r/757489 (https://phabricator.wikimedia.org/T294564) (owner: Volans) [21:57:40] SRE, SRE-Access-Requests, Analytics: Add dhorn to analytics-privatedata-users - https://phabricator.wikimedia.org/T300579 (DannyH) [21:58:21] SRE, SRE-Access-Requests, Analytics: Add dhorn to analytics-privatedata-users - https://phabricator.wikimedia.org/T300579 (DannyH) I hope that I've done this correctly; please let me know if I've made a mistake. Thanks! [22:00:05] Reedy and sbassett: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220131T2200). [22:11:58] PROBLEM - Cassandra instance data free space on restbase2010 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7124 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [22:12:29] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet2004-dev.codfw.wmnet with OS bullseye [22:12:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:11] (CR) Dzahn: [C: -1] "This change is ready for review." 
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/736074 (owner: Dzahn) [22:19:16] (PS11) Dzahn: snapshot: replace the word cron everywhere [puppet] - https://gerrit.wikimedia.org/r/736074 [22:19:52] PROBLEM - Cassandra instance data free space on restbase2010 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7137 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [22:19:53] (CR) jerkins-bot: [V: -1] snapshot: replace the word cron everywhere [puppet] - https://gerrit.wikimedia.org/r/736074 (owner: Dzahn) [22:21:53] (PS12) Dzahn: snapshot: replace the word cron everywhere [puppet] - https://gerrit.wikimedia.org/r/736074 [22:21:59] sbassett: Reedy: hi, if none of you is deploying something, is it ok for me to roll https://phabricator.wikimedia.org/T298312#7663152 out? [22:22:13] SRE, SRE-Access-Requests, Analytics: Add dhorn to analytics-privatedata-users - https://phabricator.wikimedia.org/T300579 (RhinosF1) Adding Andrew & Olja as they normally approve for this group. @DannyH: it looks good. @Ladsgroup is on clinic duty this week and will pick it up for you! Please get yo... [22:22:46] urbanecm: Yep, feel free. Thanks. 
[22:22:52] thanks sbassett [22:26:15] (PS13) Dzahn: snapshot: replace the word cron everywhere [puppet] - https://gerrit.wikimedia.org/r/736074 [22:29:00] (CR) Dzahn: [V: -1] "https://puppet-compiler.wmflabs.org/pcc-worker1003/33523/snapshot1008.eqiad.wmnet/index.html compiles now but there is still something bad" [puppet] - https://gerrit.wikimedia.org/r/736074 (owner: Dzahn) [22:32:22] (PS14) Dzahn: snapshot: replace the word cron everywhere [puppet] - https://gerrit.wikimedia.org/r/736074 [22:32:59] (CR) jerkins-bot: [V: -1] snapshot: replace the word cron everywhere [puppet] - https://gerrit.wikimedia.org/r/736074 (owner: Dzahn) [22:36:29] (PS15) Dzahn: snapshot: replace the word cron everywhere [puppet] - https://gerrit.wikimedia.org/r/736074 [22:38:46] !log Deploy security patch for T298312 [22:38:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:40:36] (CR) Dzahn: [V: +1] "@ArielGlenn finally got back to this to get it done with. now it passes jenkins and compiles and I see no changes anymore INSIDE files/tem" [puppet] - https://gerrit.wikimedia.org/r/736074 (owner: Dzahn) [22:40:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [22:41:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:01] RECOVERY - cassandra-c service on restbase1027 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:42:04] sbassett: all done. 
[22:42:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [22:42:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [22:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:50] 10SRE, 10SRE-Access-Requests, 10Analytics: Add dhorn to analytics-privatedata-users - https://phabricator.wikimedia.org/T300579 (10Ottomata) Approved! [22:43:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [22:43:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:28] 10SRE, 10SRE-Access-Requests, 10Analytics: Add dhorn to analytics-privatedata-users - https://phabricator.wikimedia.org/T300579 (10Ottomata) Looks like Danny will not need shell access, just ssh-keyless group membership. [22:43:36] (03CR) 10Ebernhardson: [C: 03+1] rdf-streaming-updater: add the reconciliation stream (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753788 (https://phabricator.wikimedia.org/T279541) (owner: 10DCausse) [22:47:19] PROBLEM - cassandra-c service on restbase1027 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:50:27] (03PS3) 10Ebernhardson: Provide a specific user agent when checking servers [debs/pybal] - 10https://gerrit.wikimedia.org/r/743222 [22:54:40] (03PS1) 10PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/758575 [23:03:44] !log bking@deploy1002 Started deploy [wdqs/wdqs@f0287fb]: 0.3.101 [23:03:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:04:41] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudnet2004-dev.codfw.wmnet with OS bullseye 
[23:04:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:57] !log [WDQS Deploy] Tests passing following deploy of 0.3.101 on canary `wdqs1003`; proceeding to rest of fleet [23:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:02] !log bking@deploy1002 Finished deploy [wdqs/wdqs@f0287fb]: 0.3.101 (duration: 08m 18s) [23:12:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:14:20] (PS1) Andrew Bogott: OpenStack Neutron: install iptables on Bullseye [puppet] - https://gerrit.wikimedia.org/r/758578 [23:15:21] (CR) Andrew Bogott: [C: +2] OpenStack Neutron: install iptables on Bullseye [puppet] - https://gerrit.wikimedia.org/r/758578 (owner: Andrew Bogott) [23:16:30] !log [WDQS Deploy] Restarted `wdqs-updater` across all hosts, 4 hosts at a time: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'` [23:16:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:51] !log [WDQS Deploy] Restarted `wdqs-categories` across all test hosts simultaneously: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'` [23:16:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:04] !log [WDQS Deploy] Restarting `wdqs-categories` across lvs-managed hosts, one node at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'` [23:17:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:26:35] !log bking@deploy1002 Started deploy [wdqs/wdqs@f0287fb] (wcqs): Deploy 0.3.101 to WCQS [23:26:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:26:41] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7370 MB (20% inode=99%):
https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [23:28:14] !log [WCQS Deploy] Tests look good following deploy of `0.3.101` to canary `wcqs1002.eqiad.wmnet`, proceeding to rest of fleet [23:28:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:15] !log bking@deploy1002 Finished deploy [wdqs/wdqs@f0287fb] (wcqs): Deploy 0.3.101 to WCQS (duration: 02m 39s) [23:29:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:31:17] !log [WCQS Deploy] Restarted `wcqs-updater` across all hosts: `sudo cumin -b 6 'wcqs*' 'sudo systemctl restart wcqs-updater'` [23:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:31:37] (PS2) Ryan Kemper: Add cname for commons-query.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/717606 (https://phabricator.wikimedia.org/T282117) (owner: Ebernhardson) [23:39:12] (CR) Dduvall: [C: +2] blubberoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/758575 (owner: PipelineBot) [23:42:53] (Merged) jenkins-bot: blubberoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/758575 (owner: PipelineBot) [23:43:29] PROBLEM - Check systemd state on doc1001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc1002.eqiad.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:44:10] !log dduvall@deploy1002 helmfile [staging] START helmfile.d/services/blubberoid: apply on staging [23:44:12] !log dduvall@deploy1002 helmfile [staging] DONE helmfile.d/services/blubberoid: apply on production [23:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:27] !log dduvall@deploy1002 helmfile [staging] DONE helmfile.d/services/blubberoid: sync on staging [23:44:30] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:48:02] SRE, SRE-Access-Requests, Analytics: Add dhorn to analytics-privatedata-users - https://phabricator.wikimedia.org/T300579 (CDunn) Approved [23:49:12] !log dduvall@deploy1002 helmfile [codfw] START helmfile.d/services/blubberoid: apply on production [23:49:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:15] !log dduvall@deploy1002 helmfile [codfw] DONE helmfile.d/services/blubberoid: apply on staging [23:49:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:53] !log dduvall@deploy1002 helmfile [codfw] DONE helmfile.d/services/blubberoid: sync on production [23:49:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:50:16] !log dduvall@deploy1002 helmfile [eqiad] START helmfile.d/services/blubberoid: apply on production [23:50:18] !log dduvall@deploy1002 helmfile [eqiad] DONE helmfile.d/services/blubberoid: apply on staging [23:50:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:50:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:50:43] !log dduvall@deploy1002 helmfile [eqiad] DONE helmfile.d/services/blubberoid: sync on production [23:50:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:56:03] PROBLEM - Cassandra instance data free space on restbase2010 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7256 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [23:56:46] (PS1) Gergő Tisza: Beta: Replace mediawiki11 with mediawiki12 [puppet] - https://gerrit.wikimedia.org/r/758584 (https://phabricator.wikimedia.org/T300591) [23:58:11] hello! any sre around for a quick beta-only puppet patch review? The whole beta cluster is broken: https://phabricator.wikimedia.org/T300591