[00:07:47] (PS2) Krinkle: noc: Improve wiki.php diff by using wikidiff [mediawiki-config] - https://gerrit.wikimedia.org/r/885897
[00:07:49] (PS1) Krinkle: speed-tests: Add captureSpeedtest.php script and publish 2023 snapshot [mediawiki-config] - https://gerrit.wikimedia.org/r/888105
[00:08:26] (CR) CI reject: [V: -1] speed-tests: Add captureSpeedtest.php script and publish 2023 snapshot [mediawiki-config] - https://gerrit.wikimedia.org/r/888105 (owner: Krinkle)
[00:08:59] (CR) Krinkle: [C: +2] noc: Improve wiki.php diff by using wikidiff [mediawiki-config] - https://gerrit.wikimedia.org/r/885897 (owner: Krinkle)
[00:09:43] (Merged) jenkins-bot: noc: Improve wiki.php diff by using wikidiff [mediawiki-config] - https://gerrit.wikimedia.org/r/885897 (owner: Krinkle)
[00:12:47] (PS2) Krinkle: speed-tests: Add captureSpeedtest.php script and publish 2023 snapshot [mediawiki-config] - https://gerrit.wikimedia.org/r/888105
[00:24:12] (PS4) Krinkle: multiversion: Create dblist-manage command for easy add/delete [mediawiki-config] - https://gerrit.wikimedia.org/r/885064 (https://phabricator.wikimedia.org/T308932)
[00:24:16] (CR) Krinkle: [C: +2] multiversion: Create dblist-manage command for easy add/delete [mediawiki-config] - https://gerrit.wikimedia.org/r/885064 (https://phabricator.wikimedia.org/T308932) (owner: Krinkle)
[00:24:20] (PS4) Krinkle: logos: Exclude logos/index.html from Git [mediawiki-config] - https://gerrit.wikimedia.org/r/885065
[00:24:24] (CR) Krinkle: [C: +2] logos: Exclude logos/index.html from Git [mediawiki-config] - https://gerrit.wikimedia.org/r/885065 (owner: Krinkle)
[00:25:07] (Merged) jenkins-bot: multiversion: Create dblist-manage command for easy add/delete [mediawiki-config] - https://gerrit.wikimedia.org/r/885064 (https://phabricator.wikimedia.org/T308932) (owner: Krinkle)
[00:25:30] (Merged) jenkins-bot: logos: Exclude logos/index.html from Git [mediawiki-config] - https://gerrit.wikimedia.org/r/885065 (owner: Krinkle)
[00:33:41] PROBLEM - Check systemd state on logstash2023 is CRITICAL: CRITICAL - degraded: The following units failed: run-dashboards-backup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:34:05] PROBLEM - Check systemd state on logstash1023 is CRITICAL: CRITICAL - degraded: The following units failed: run-dashboards-backup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:45:46] SRE, Infrastructure-Foundations, cloud-services-team, netops: CloudVPS: IPv6 early PoC - https://phabricator.wikimedia.org/T245495 (bd808)
[00:55:04] SRE, Infrastructure-Foundations, netops: Cloud IPv6 subnets - https://phabricator.wikimedia.org/T187929 (bd808) >>! In T187929#7623061, @cmooney wrote: > This task more relates to allocating blocks of IPv6 for Toolforge/Cloud. As per the above discussion there are some small open questions, but I've...
[00:55:25] (PS1) Andrew Bogott: OpenStack: Standardize and templatize database 'config' section [puppet] - https://gerrit.wikimedia.org/r/888107
[01:08:33] ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T328420 (Papaul) Open→Resolved Interface was showing some errors. I clear the statistic on the interface. Before ` {master:2} papaul@asw-c-codfw> show interfaces ge-6/0/6 extensive | match error BPDU Error: None, MAC-RE...
[01:24:55] (PS2) Andrew Bogott: OpenStack: Standardize and templatize database 'config' section [puppet] - https://gerrit.wikimedia.org/r/888107
[01:29:11] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[01:31:13] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for new mw nodes in row C - pt1979@cumin2002"
[01:31:55] (CR) Reedy: [C: -1] "indenting needs to be in tabs" [mediawiki-config] - https://gerrit.wikimedia.org/r/850547 (https://phabricator.wikimedia.org/T321967) (owner: Varnent)
[01:32:14] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for new mw nodes in row C - pt1979@cumin2002"
[01:32:14] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[01:33:30] SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (Papaul)
[01:34:35] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2436.mgmt.codfw.wmnet with reboot policy FORCED
[01:35:34] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2437.mgmt.codfw.wmnet with reboot policy FORCED
[01:37:49] (PS3) Andrew Bogott: OpenStack: Standardize and templatize database 'config' section [puppet] - https://gerrit.wikimedia.org/r/888107
[01:42:37] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2436.mgmt.codfw.wmnet with reboot policy FORCED
[01:42:43] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2437.mgmt.codfw.wmnet with reboot policy FORCED
[01:43:01] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2438.mgmt.codfw.wmnet with reboot policy FORCED
[01:43:16] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2439.mgmt.codfw.wmnet with reboot policy FORCED
[01:47:26] (CR) Andrew Bogott: [C: +2] OpenStack: Standardize and templatize database 'config' section [puppet] - https://gerrit.wikimedia.org/r/888107 (owner: Andrew Bogott)
[01:49:17] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mw2438.mgmt.codfw.wmnet with reboot policy FORCED
[01:49:28] !log creating wbc_entity_usage on foundationwiki - T321967
[01:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:49:31] T321967: Enable Wikibase on Wikimedia Foundation Governance Wiki - https://phabricator.wikimedia.org/T321967
[01:50:22] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2440.mgmt.codfw.wmnet with reboot policy FORCED
[01:51:53] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2439.mgmt.codfw.wmnet with reboot policy FORCED
[01:53:44] (PS1) Sbailey: Enable Linter migration scripts for namespace and tag and template [mediawiki-config] - https://gerrit.wikimedia.org/r/888111 (https://phabricator.wikimedia.org/T329342)
[01:54:44] (PS1) Raymond Ndibe: puppet: adapt replica_cnf_api to python3.5 [puppet] - https://gerrit.wikimedia.org/r/888112 (https://phabricator.wikimedia.org/T304040)
[01:56:06] (CR) Sbailey: "Enable the linter maintenance scripts so they can be manually run in production" [mediawiki-config] - https://gerrit.wikimedia.org/r/888111 (https://phabricator.wikimedia.org/T329342) (owner: Sbailey)
[01:56:22] (PS1) Andrew Bogott: Neutron: define db_name for the common database template [puppet] - https://gerrit.wikimedia.org/r/888113
[01:57:02] (CR) Andrew Bogott: [C: +2] Neutron: define db_name for the common database template [puppet] - https://gerrit.wikimedia.org/r/888113 (owner: Andrew Bogott)
[01:58:45] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2440.mgmt.codfw.wmnet with reboot policy FORCED
[01:58:58] (03PS3) 10Reedy: Added Wikimedia Foundation Governance Wiki to Wikibase setup and enabled extension on wiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850547 (https://phabricator.wikimedia.org/T321967) (owner: 10Varnent) [01:59:31] (03CR) 10CI reject: [V: 04-1] Added Wikimedia Foundation Governance Wiki to Wikibase setup and enabled extension on wiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850547 (https://phabricator.wikimedia.org/T321967) (owner: 10Varnent) [02:00:03] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2441.mgmt.codfw.wmnet with reboot policy FORCED [02:00:22] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2441.mgmt.codfw.wmnet with reboot policy FORCED [02:00:23] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mw2441.mgmt.codfw.wmnet with reboot policy FORCED [02:01:39] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2442.mgmt.codfw.wmnet with reboot policy FORCED [02:01:49] (03PS4) 10Reedy: Added Wikimedia Foundation Governance Wiki to Wikibase setup and enabled extension on wiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850547 (https://phabricator.wikimedia.org/T321967) (owner: 10Varnent) [02:03:40] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [02:04:51] (03PS5) 10Reedy: Added Wikimedia Foundation Governance Wiki to Wikibase setup and enabled extension on wiki. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/850547 (https://phabricator.wikimedia.org/T321967) (owner: 10Varnent) [02:05:40] !log deployed mitigations for T326691 [02:05:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:06:09] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2441.mgmt.codfw.wmnet with reboot policy FORCED [02:06:13] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for new mw nodes row D - pt1979@cumin2002" [02:06:24] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2443.mgmt.codfw.wmnet with reboot policy FORCED [02:07:14] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for new mw nodes row D - pt1979@cumin2002" [02:07:14] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [02:07:15] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2442.mgmt.codfw.wmnet with reboot policy FORCED [02:07:36] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2444.mgmt.codfw.wmnet with reboot policy FORCED [02:08:21] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:09:31] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:10:46] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:12:05] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2443.mgmt.codfw.wmnet with reboot policy FORCED [02:16:51] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2444.mgmt.codfw.wmnet with reboot policy FORCED [02:17:15] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10Papaul) [02:20:46] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:48:18] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2445.mgmt.codfw.wmnet with reboot policy FORCED [02:55:51] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2445.mgmt.codfw.wmnet with reboot policy FORCED [03:08:12] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10Papaul) [06:25:01] (03PS3) 10Hokwelum: use lbzip2 instead of bzcat to decompress blocks in parallel [puppet] - 10https://gerrit.wikimedia.org/r/887776 (https://phabricator.wikimedia.org/T328804) [06:27:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2113.codfw.wmnet with reason: Maintenance [06:27:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2113.codfw.wmnet with reason: Maintenance [06:29:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2129.codfw.wmnet with reason: Maintenance [06:29:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 
db2129.codfw.wmnet with reason: Maintenance [06:31:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1100.eqiad.wmnet with reason: Maintenance [06:31:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1100.eqiad.wmnet with reason: Maintenance [06:32:29] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1130.eqiad.wmnet with reason: Maintenance [06:32:43] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1130.eqiad.wmnet with reason: Maintenance [06:32:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1130 (T328817)', diff saved to https://phabricator.wikimedia.org/P44115 and previous config saved to /var/cache/conftool/dbconfig/20230210-063249-marostegui.json [06:32:53] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [06:34:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1131.eqiad.wmnet with reason: Maintenance [06:34:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1131.eqiad.wmnet with reason: Maintenance [06:35:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1113.eqiad.wmnet with reason: Maintenance [06:35:37] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1113.eqiad.wmnet with reason: Maintenance [06:35:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T329203)', diff saved to https://phabricator.wikimedia.org/P44116 and previous config saved to /var/cache/conftool/dbconfig/20230210-063543-marostegui.json [06:35:47] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [06:38:13] !log marostegui@cumin1001 dbctl commit (dc=all): 
'db1130 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P44117 and previous config saved to /var/cache/conftool/dbconfig/20230210-063812-root.json [06:38:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2098.codfw.wmnet with reason: Maintenance [06:38:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2098.codfw.wmnet with reason: Maintenance [06:40:02] (03PS1) 10Marostegui: mariadb: Decommission db1098 [puppet] - 10https://gerrit.wikimedia.org/r/888128 (https://phabricator.wikimedia.org/T329171) [06:40:08] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1098.eqiad.wmnet [06:41:22] (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission db1098 [puppet] - 10https://gerrit.wikimedia.org/r/888128 (https://phabricator.wikimedia.org/T329171) (owner: 10Marostegui) [06:43:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2100.codfw.wmnet with reason: Maintenance [06:43:14] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2100.codfw.wmnet with reason: Maintenance [06:44:17] !log marostegui@cumin1001 START - Cookbook sre.dns.netbox [06:46:16] !log marostegui@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1098.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001" [06:47:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2108.codfw.wmnet with reason: Maintenance [06:47:22] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2108.codfw.wmnet with reason: Maintenance [06:47:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2108 (T328817)', diff saved to https://phabricator.wikimedia.org/P44118 and previous config saved to 
/var/cache/conftool/dbconfig/20230210-064728-marostegui.json [06:47:33] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [06:53:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P44119 and previous config saved to /var/cache/conftool/dbconfig/20230210-065317-root.json [06:53:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T328817)', diff saved to https://phabricator.wikimedia.org/P44120 and previous config saved to /var/cache/conftool/dbconfig/20230210-065322-marostegui.json [06:53:26] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [06:57:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1098.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001" [06:57:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [06:57:55] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1098.eqiad.wmnet [06:58:29] 10ops-eqiad, 10DBA, 10decommission-hardware: decommission db1098.eqiad.wmnet - https://phabricator.wikimedia.org/T329171 (10Marostegui) a:05Marostegui→03None [06:58:33] 10ops-eqiad, 10DBA, 10decommission-hardware: decommission db1098.eqiad.wmnet - https://phabricator.wikimedia.org/T329171 (10Marostegui) This is ready for DC-Ops [06:59:02] 10ops-eqiad, 10decommission-hardware: decommission db1098.eqiad.wmnet - https://phabricator.wikimedia.org/T329171 (10Marostegui) [07:00:05] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230210T0700) [07:07:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db1113:3316 (T329203)', diff saved to https://phabricator.wikimedia.org/P44121 and previous config saved to /var/cache/conftool/dbconfig/20230210-070755-marostegui.json [07:07:59] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [07:08:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P44122 and previous config saved to /var/cache/conftool/dbconfig/20230210-070822-root.json [07:08:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P44123 and previous config saved to /var/cache/conftool/dbconfig/20230210-070829-marostegui.json [07:23:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P44124 and previous config saved to /var/cache/conftool/dbconfig/20230210-072301-marostegui.json [07:23:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P44125 and previous config saved to /var/cache/conftool/dbconfig/20230210-072327-root.json [07:23:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P44126 and previous config saved to /var/cache/conftool/dbconfig/20230210-072335-marostegui.json [07:27:13] 10SRE, 10Infrastructure-Foundations, 10netops: Cloud IPv6 subnets - https://phabricator.wikimedia.org/T187929 (10ayounsi) a:05faidon→03None My understanding is that priorities shifted and other WMCS projects (joint with Netops) are being worked on. Allocating space can be done in a relatively short time... [07:34:18] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" 
[software/debmonitor] - 10https://gerrit.wikimedia.org/r/888095 (owner: 10Volans) [07:38:01] !log installing wireshark security updates [07:38:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P44127 and previous config saved to /var/cache/conftool/dbconfig/20230210-073808-marostegui.json [07:38:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P44128 and previous config saved to /var/cache/conftool/dbconfig/20230210-073831-root.json [07:38:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T328817)', diff saved to https://phabricator.wikimedia.org/P44129 and previous config saved to /var/cache/conftool/dbconfig/20230210-073841-marostegui.json [07:38:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2120.codfw.wmnet with reason: Maintenance [07:38:45] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [07:38:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2120.codfw.wmnet with reason: Maintenance [07:39:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2120 (T328817)', diff saved to https://phabricator.wikimedia.org/P44130 and previous config saved to /var/cache/conftool/dbconfig/20230210-073902-marostegui.json [07:39:38] (03CR) 10Elukey: [V: 03+1] role::ml_k8s::staging: upgrade cluster settings for k8s 1.23 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/884034 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [07:39:43] (03PS2) 10Elukey: role::ml_k8s::staging: upgrade cluster settings for k8s 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/884034 
(https://phabricator.wikimedia.org/T327767) [07:40:39] (03CR) 10Nicolas Fraison: [C: 03+2] fix(varnishkafka): add alert duration of 5m to avoid false positive [alerts] - 10https://gerrit.wikimedia.org/r/887966 (https://phabricator.wikimedia.org/T324522) (owner: 10Nicolas Fraison) [07:41:41] !log elukey@cumin1001 START - Cookbook sre.k8s.upgrade-cluster Upgrade K8s version: Upgrade ml-staging-codfw cluster to 1.23 [07:42:21] (03Merged) 10jenkins-bot: fix(varnishkafka): add alert duration of 5m to avoid false positive [alerts] - 10https://gerrit.wikimedia.org/r/887966 (https://phabricator.wikimedia.org/T324522) (owner: 10Nicolas Fraison) [07:43:05] !log elukey@cumin1001 END (FAIL) - Cookbook sre.k8s.upgrade-cluster (exit_code=99) Upgrade K8s version: Upgrade ml-staging-codfw cluster to 1.23 [07:43:34] lovely [07:43:39] didn't start in the best way [07:46:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T328817)', diff saved to https://phabricator.wikimedia.org/P44131 and previous config saved to /var/cache/conftool/dbconfig/20230210-074600-marostegui.json [07:46:04] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [07:48:53] (03PS3) 10Nicolas Fraison: chore(varnishkafa): add site to VarnishkafkaNoMessages alerts [alerts] - 10https://gerrit.wikimedia.org/r/887784 [07:53:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T329203)', diff saved to https://phabricator.wikimedia.org/P44132 and previous config saved to /var/cache/conftool/dbconfig/20230210-075314-marostegui.json [07:53:16] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1140.eqiad.wmnet with reason: Maintenance [07:53:19] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [07:53:30] !log marostegui@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 12:00:00 on db1140.eqiad.wmnet with reason: Maintenance [07:53:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P44133 and previous config saved to /var/cache/conftool/dbconfig/20230210-075336-root.json [07:54:08] (03PS1) 10Elukey: sre.k8s.upgrade-cluster: fix alertmanager's catch all [cookbooks] - 10https://gerrit.wikimedia.org/r/888163 (https://phabricator.wikimedia.org/T327767) [07:56:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1165.eqiad.wmnet with reason: Maintenance [07:56:39] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1165.eqiad.wmnet with reason: Maintenance [07:56:41] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [07:56:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [07:57:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T329203)', diff saved to https://phabricator.wikimedia.org/P44134 and previous config saved to /var/cache/conftool/dbconfig/20230210-075702-marostegui.json [07:57:06] (03CR) 10Elukey: [C: 03+2] sre.k8s.upgrade-cluster: fix alertmanager's catch all [cookbooks] - 10https://gerrit.wikimedia.org/r/888163 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [07:58:00] (03CR) 10Nicolas Fraison: Remove the GPU configuration from an-worker109[6-9] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/887807 (https://phabricator.wikimedia.org/T318696) (owner: 10Btullis) [07:59:03] !log elukey@cumin1001 START - Cookbook sre.k8s.upgrade-cluster Upgrade K8s version: Upgrade ml-staging-codfw cluster to 1.23 [07:59:11] !log marostegui@cumin1001 
dbctl commit (dc=all): 'Repooling after maintenance db1165 (T329203)', diff saved to https://phabricator.wikimedia.org/P44135 and previous config saved to /var/cache/conftool/dbconfig/20230210-075911-marostegui.json [07:59:16] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [07:59:16] !log elukey@cumin1001 END (FAIL) - Cookbook sre.k8s.upgrade-cluster (exit_code=99) Upgrade K8s version: Upgrade ml-staging-codfw cluster to 1.23 [08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230210T0800) [08:00:08] nope still failing, sigh [08:01:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P44136 and previous config saved to /var/cache/conftool/dbconfig/20230210-080106-marostegui.json [08:08:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P44137 and previous config saved to /var/cache/conftool/dbconfig/20230210-080841-root.json [08:12:02] !log installing virglrenderer security updates [08:12:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P44138 and previous config saved to /var/cache/conftool/dbconfig/20230210-081417-marostegui.json [08:16:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P44139 and previous config saved to /var/cache/conftool/dbconfig/20230210-081612-marostegui.json [08:18:49] (03CR) 10Muehlenhoff: admin: add Santiago Faci (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/888045 
(https://phabricator.wikimedia.org/T329296) (owner: 10Filippo Giunchedi) [08:26:02] moritzm: doh! thank you for spotting the typo, fixing [08:26:50] (03PS1) 10Filippo Giunchedi: admin: fix typo for sfaci [puppet] - 10https://gerrit.wikimedia.org/r/888164 [08:26:52] ^ [08:27:00] (03CR) 10CI reject: [V: 04-1] admin: fix typo for sfaci [puppet] - 10https://gerrit.wikimedia.org/r/888164 (owner: 10Filippo Giunchedi) [08:27:20] wat [08:27:48] (03PS2) 10Filippo Giunchedi: admin: fix typo for sfaci [puppet] - 10https://gerrit.wikimedia.org/r/888164 [08:28:10] (03CR) 10Ayounsi: "The original code is already messy so to me your change is making it a bit cleaner :)" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/884908 (https://phabricator.wikimedia.org/T328313) (owner: 10Cathal Mooney) [08:29:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P44140 and previous config saved to /var/cache/conftool/dbconfig/20230210-082923-marostegui.json [08:31:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T328817)', diff saved to https://phabricator.wikimedia.org/P44141 and previous config saved to /var/cache/conftool/dbconfig/20230210-083119-marostegui.json [08:31:21] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2121.codfw.wmnet with reason: Maintenance [08:31:22] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [08:31:27] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/888164 (owner: 10Filippo Giunchedi) [08:31:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2121.codfw.wmnet with reason: Maintenance [08:31:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2121 (T328817)', diff saved to https://phabricator.wikimedia.org/P44142 and previous config 
saved to /var/cache/conftool/dbconfig/20230210-083140-marostegui.json [08:33:52] godog: I didn't spot anything, it was the daily account check :-) [08:35:38] (03CR) 10Filippo Giunchedi: [C: 03+2] admin: fix typo for sfaci [puppet] - 10https://gerrit.wikimedia.org/r/888164 (owner: 10Filippo Giunchedi) [08:35:43] hah! [08:36:59] 10SRE, 10Traffic, 10Patch-For-Review: Update certspotter - https://phabricator.wikimedia.org/T204993 (10MoritzMuehlenhoff) 0.15 has been uploaded to Debian and certspotter is now a proper daemon: https://tracker.debian.org/news/1419591/accepted-certspotter-0150-1-source-into-unstable/ [08:38:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T328817)', diff saved to https://phabricator.wikimedia.org/P44143 and previous config saved to /var/cache/conftool/dbconfig/20230210-083836-marostegui.json [08:38:40] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [08:44:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T329203)', diff saved to https://phabricator.wikimedia.org/P44144 and previous config saved to /var/cache/conftool/dbconfig/20230210-084430-marostegui.json [08:44:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1168.eqiad.wmnet with reason: Maintenance [08:44:34] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [08:44:44] (03PS1) 10Filippo Giunchedi: opensearch_dashboards: bump memory limit [puppet] - 10https://gerrit.wikimedia.org/r/888165 (https://phabricator.wikimedia.org/T327161) [08:44:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1168.eqiad.wmnet with reason: Maintenance [08:44:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1099 (s1, s8) T329181', diff saved to 
https://phabricator.wikimedia.org/P44145 and previous config saved to /var/cache/conftool/dbconfig/20230210-084452-root.json
[08:44:56] T329181: decommission db1099.eqiad.wmnet - https://phabricator.wikimedia.org/T329181
[08:44:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T329203)', diff saved to https://phabricator.wikimedia.org/P44146 and previous config saved to /var/cache/conftool/dbconfig/20230210-084457-marostegui.json
[08:45:08] (CR) Filippo Giunchedi: "As per task" [puppet] - https://gerrit.wikimedia.org/r/888165 (https://phabricator.wikimedia.org/T327161) (owner: Filippo Giunchedi)
[08:45:40] (PS1) Marostegui: db1099: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/888166 (https://phabricator.wikimedia.org/T329181)
[08:46:13] (CR) Marostegui: [C: +2] db1099: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/888166 (https://phabricator.wikimedia.org/T329181) (owner: Marostegui)
[08:47:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T329203)', diff saved to https://phabricator.wikimedia.org/P44147 and previous config saved to /var/cache/conftool/dbconfig/20230210-084706-marostegui.json
[08:48:55] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Aisha Khatun - https://phabricator.wikimedia.org/T328733 (fgiunchedi) Open→Resolved a:fgiunchedi
[08:51:49] SRE, LDAP-Access-Requests: Add Jon Amar WMDE to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T329324 (fgiunchedi) a:KFrancis Thank you for reaching out @jon_amar-WMDE. I'm adding @KFrancis for confirmation on NDA status and then we're good to proceed.
[08:53:42] (PS1) Filippo Giunchedi: admin: add jon-amar-wmde [puppet] - https://gerrit.wikimedia.org/r/888167 (https://phabricator.wikimedia.org/T329324)
[08:53:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P44148 and previous config saved to /var/cache/conftool/dbconfig/20230210-085342-marostegui.json
[08:54:13] (CR) Filippo Giunchedi: [C: -1] "Pending confirmation of NDA" [puppet] - https://gerrit.wikimedia.org/r/888167 (https://phabricator.wikimedia.org/T329324) (owner: Filippo Giunchedi)
[08:54:57] SRE, LDAP-Access-Requests, Patch-For-Review: Add Jon Amar WMDE to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T329324 (fgiunchedi) p:Triage→Medium
[08:56:33] (PS1) Ayounsi: Add tox and fix reported issues [software/homer/deploy] - https://gerrit.wikimedia.org/r/888168 (https://phabricator.wikimedia.org/T277440)
[08:57:52] SRE-tools, Infrastructure-Foundations, homer, Patch-For-Review: Add CI to homer-deploy repo - https://phabricator.wikimedia.org/T277440 (ayounsi) With the patch above, running `tox` in the root of this repo runs basic checks. So we can at least manually run it for now. Not sure how to have it aut...
[09:02:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P44149 and previous config saved to /var/cache/conftool/dbconfig/20230210-090213-marostegui.json
[09:02:38] (CR) Volans: "To have CI run it you can send a patch like this one:" [software/homer/deploy] - https://gerrit.wikimedia.org/r/888168 (https://phabricator.wikimedia.org/T277440) (owner: Ayounsi)
[09:03:44] SRE-tools, Infrastructure-Foundations, homer, Patch-For-Review: Add CI to homer-deploy repo - https://phabricator.wikimedia.org/T277440 (Volans) >>!
In T277440#8603799, @ayounsi wrote: > With the patch above, running `tox` in the root of this repo runs basic checks. So we can at least manually ru...
[09:08:20] (PS1) Slyngshede: P:idm split IDM into staging and prod. [puppet] - https://gerrit.wikimedia.org/r/888169
[09:08:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P44150 and previous config saved to /var/cache/conftool/dbconfig/20230210-090848-marostegui.json
[09:10:25] (PS1) Muehlenhoff: swift::ring_manager: Enable profile::auto_restarts::service for rsyncd [puppet] - https://gerrit.wikimedia.org/r/888170 (https://phabricator.wikimedia.org/T135991)
[09:12:24] SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for Santiago Faci - https://phabricator.wikimedia.org/T329296 (Aklapper) Resolved→Open Reopening as the patch had a typo
[09:13:00] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/888170 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[09:14:14] (CR) Ayounsi: "Thanks" [software/homer/deploy] - https://gerrit.wikimedia.org/r/888168 (https://phabricator.wikimedia.org/T277440) (owner: Ayounsi)
[09:14:43] (PS2) Ayounsi: Add tox and fix reported issues [software/homer/deploy] - https://gerrit.wikimedia.org/r/888168 (https://phabricator.wikimedia.org/T277440)
[09:15:30] SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for Santiago Faci - https://phabricator.wikimedia.org/T329296 (MoritzMuehlenhoff) Open→Resolved Typo got fixed in https://gerrit.wikimedia.org/r/c/operations/puppet/+/888164/
[09:17:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P44151 and previous config saved to /var/cache/conftool/dbconfig/20230210-091719-marostegui.json
[09:18:36] (PS2) Slyngshede: P:idm split IDM into staging and prod.
[puppet] - https://gerrit.wikimedia.org/r/888169
[09:20:24] PROBLEM - BFD status on cr2-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:21:45] (CR) Ayounsi: [C: +2] Add tox and fix reported issues [software/homer/deploy] - https://gerrit.wikimedia.org/r/888168 (https://phabricator.wikimedia.org/T277440) (owner: Ayounsi)
[09:21:49] (CR) Ayounsi: [V: +2 C: +2] Add tox and fix reported issues [software/homer/deploy] - https://gerrit.wikimedia.org/r/888168 (https://phabricator.wikimedia.org/T277440) (owner: Ayounsi)
[09:22:22] (PS3) Slyngshede: P:idm split IDM into staging and prod. [WIP] [puppet] - https://gerrit.wikimedia.org/r/888169
[09:22:49] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 42184
[09:23:28] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 42184
[09:23:38] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:23:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T328817)', diff saved to https://phabricator.wikimedia.org/P44152 and previous config saved to /var/cache/conftool/dbconfig/20230210-092355-marostegui.json
[09:23:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2122.codfw.wmnet with reason: Maintenance
[09:23:59] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[09:24:11] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2122.codfw.wmnet with reason: Maintenance
[09:24:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2122 (T328817)', diff saved to https://phabricator.wikimedia.org/P44153 and previous config saved to
/var/cache/conftool/dbconfig/20230210-092417-marostegui.json
[09:27:23] (PS1) Filippo Giunchedi: clinic-duty: add SGIX [software] - https://gerrit.wikimedia.org/r/888172
[09:29:06] (PS1) Slyngshede: R:idm_test add IDM test/staging [labs/private] - https://gerrit.wikimedia.org/r/888173
[09:30:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T328817)', diff saved to https://phabricator.wikimedia.org/P44154 and previous config saved to /var/cache/conftool/dbconfig/20230210-093020-marostegui.json
[09:30:24] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[09:32:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T329203)', diff saved to https://phabricator.wikimedia.org/P44155 and previous config saved to /var/cache/conftool/dbconfig/20230210-093225-marostegui.json
[09:32:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1173.eqiad.wmnet with reason: Maintenance
[09:32:30] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[09:32:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1173.eqiad.wmnet with reason: Maintenance
[09:32:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1173 (T329203)', diff saved to https://phabricator.wikimedia.org/P44156 and previous config saved to /var/cache/conftool/dbconfig/20230210-093246-marostegui.json
[09:34:16] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 17 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:34:30] RECOVERY - BFD status on cr2-codfw is OK: OK: UP: 19 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:34:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance
db1173 (T329203)', diff saved to https://phabricator.wikimedia.org/P44157 and previous config saved to /var/cache/conftool/dbconfig/20230210-093455-marostegui.json
[09:36:36] (CR) Vgutierrez: network: drop abuse_networks (1 comment) [puppet] - https://gerrit.wikimedia.org/r/884040 (owner: Jbond)
[09:37:56] (PS6) Elukey: services: add the first lift wing stream to change-prop [deployment-charts] - https://gerrit.wikimedia.org/r/886918 (https://phabricator.wikimedia.org/T328576)
[09:39:34] (PS7) Elukey: services: add the first lift wing stream to change-prop [deployment-charts] - https://gerrit.wikimedia.org/r/886918 (https://phabricator.wikimedia.org/T328576)
[09:41:31] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts install6001.wikimedia.org
[09:44:46] (CR) Elukey: services: add the first lift wing stream to change-prop (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/886918 (https://phabricator.wikimedia.org/T328576) (owner: Elukey)
[09:45:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P44158 and previous config saved to /var/cache/conftool/dbconfig/20230210-094526-marostegui.json
[09:45:58] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[09:49:52] (CR) Filippo Giunchedi: [C: +2] clinic-duty: add SGIX [software] - https://gerrit.wikimedia.org/r/888172 (owner: Filippo Giunchedi)
[09:50:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P44159 and previous config saved to /var/cache/conftool/dbconfig/20230210-095001-marostegui.json
[09:51:36] (PS1) DCausse: [cirrus] enable CirrusSearchCompletionSuggesterUseDefaultSort on mnwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/888178 (https://phabricator.wikimedia.org/T327878)
[09:55:03] !log jmm@cumin2002 START - Cookbook
sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: install6001.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[09:56:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: install6001.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[09:56:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:56:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts install6001.wikimedia.org
[09:57:00] SRE, Infrastructure-Foundations: Migrate the install servers to Bullseye - https://phabricator.wikimedia.org/T327867 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `install6001.wikimedia.org` - install6001.wikimedia.org (**PASS**) - Downtimed host on Icinga/A...
[09:57:22] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts install5001.wikimedia.org
[10:00:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P44160 and previous config saved to /var/cache/conftool/dbconfig/20230210-100033-marostegui.json
[10:01:41] (PS1) Zabe: REST: Don't consider prevented edits unexpected [extensions/Wikibase] (wmf/1.40.0-wmf.22) - https://gerrit.wikimedia.org/r/887863 (https://phabricator.wikimedia.org/T329233)
[10:01:45] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[10:03:25] (PS3) Btullis: Remove the GPU configuration from an-worker109[6-9] [puppet] - https://gerrit.wikimedia.org/r/887807 (https://phabricator.wikimedia.org/T318696)
[10:05:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P44161 and previous config saved to
/var/cache/conftool/dbconfig/20230210-100508-marostegui.json
[10:06:30] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: install5001.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[10:07:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: install5001.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[10:07:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:07:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts install5001.wikimedia.org
[10:08:00] SRE, Infrastructure-Foundations: Migrate the install servers to Bullseye - https://phabricator.wikimedia.org/T327867 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `install5001.wikimedia.org` - install5001.wikimedia.org (**PASS**) - Downtimed host on Icinga/A...
[10:08:03] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts install4001.wikimedia.org
[10:09:13] (CR) Btullis: Remove the GPU configuration from an-worker109[6-9] (1 comment) [puppet] - https://gerrit.wikimedia.org/r/887807 (https://phabricator.wikimedia.org/T318696) (owner: Btullis)
[10:12:06] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[10:14:51] (CR) Elukey: Remove the GPU configuration from an-worker109[6-9] (1 comment) [puppet] - https://gerrit.wikimedia.org/r/887807 (https://phabricator.wikimedia.org/T318696) (owner: Btullis)
[10:15:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T328817)', diff saved to https://phabricator.wikimedia.org/P44162 and previous config saved to /var/cache/conftool/dbconfig/20230210-101539-marostegui.json
[10:15:41] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2150.codfw.wmnet with reason: Maintenance
[10:15:43] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[10:15:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2150.codfw.wmnet with reason: Maintenance
[10:16:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2150 (T328817)', diff saved to https://phabricator.wikimedia.org/P44163 and previous config saved to /var/cache/conftool/dbconfig/20230210-101600-marostegui.json
[10:18:18] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: install4001.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[10:20:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T329203)', diff saved to https://phabricator.wikimedia.org/P44164 and previous config saved to /var/cache/conftool/dbconfig/20230210-102014-marostegui.json
[10:20:16] !log
marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[10:20:22] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[10:20:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[10:20:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T329203)', diff saved to https://phabricator.wikimedia.org/P44165 and previous config saved to /var/cache/conftool/dbconfig/20230210-102035-marostegui.json
[10:21:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: install4001.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[10:21:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:21:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts install4001.wikimedia.org
[10:21:29] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts install3001.wikimedia.org
[10:21:32] SRE, Infrastructure-Foundations: Migrate the install servers to Bullseye - https://phabricator.wikimedia.org/T327867 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `install4001.wikimedia.org` - install4001.wikimedia.org (**PASS**) - Downtimed host on Icinga/A...
[10:21:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T328817)', diff saved to https://phabricator.wikimedia.org/P44166 and previous config saved to /var/cache/conftool/dbconfig/20230210-102156-marostegui.json
[10:22:00] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[10:22:22] (PS1) Muehlenhoff: Remove Puppet references for install[3456]001 [puppet] - https://gerrit.wikimedia.org/r/888192 (https://phabricator.wikimedia.org/T327867)
[10:23:20] (CR) Ayounsi: [C: +2] Peering news: move verbose logs [puppet] - https://gerrit.wikimedia.org/r/885738 (owner: Ayounsi)
[10:23:38] (PS1) Jelto: gitlab: use /srv/gitlab-backup in WMCS [puppet] - https://gerrit.wikimedia.org/r/888193 (https://phabricator.wikimedia.org/T318521)
[10:25:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T329203)', diff saved to https://phabricator.wikimedia.org/P44167 and previous config saved to /var/cache/conftool/dbconfig/20230210-102544-marostegui.json
[10:25:48] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[10:26:18] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[10:27:27] (PS1) Jelto: aptrepo: remove gitlab package for buster [puppet] - https://gerrit.wikimedia.org/r/888194 (https://phabricator.wikimedia.org/T318521)
[10:28:00] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 35467
[10:28:12] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 35467
[10:28:16] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 8966
[10:29:00] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 8966
[10:29:11] !log ayounsi@cumin1001 START - Cookbook
sre.network.peering with action 'email' for AS: 138886
[10:29:18] (CR) Jelto: "I'm not sure if thirdparty/gitlab-bullseye should be renamed to thirdparty/gitlab again. What do you think? In our upgrade workflow it's o" [puppet] - https://gerrit.wikimedia.org/r/888194 (https://phabricator.wikimedia.org/T318521) (owner: Jelto)
[10:29:48] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 138886
[10:30:15] (CR) Muehlenhoff: aptrepo: remove gitlab package for buster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/888194 (https://phabricator.wikimedia.org/T318521) (owner: Jelto)
[10:30:38] (PS1) Volans: Fix incorrect usage of NodeSet [software/spicerack] - https://gerrit.wikimedia.org/r/888195
[10:30:55] (CR) Muehlenhoff: aptrepo: remove gitlab package for buster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/888194 (https://phabricator.wikimedia.org/T318521) (owner: Jelto)
[10:31:55] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: install3001.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[10:32:22] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 4764
[10:32:42] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 4764
[10:33:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: install3001.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[10:33:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:33:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts install3001.wikimedia.org
[10:33:23] SRE, Infrastructure-Foundations, Patch-For-Review:
Migrate the install servers to Bullseye - https://phabricator.wikimedia.org/T327867 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `install3001.wikimedia.org` - install3001.wikimedia.org (**PASS**) - Down...
[10:34:16] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 9145
[10:34:25] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 9145
[10:34:40] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 6677
[10:34:50] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 6677
[10:34:53] (CR) Jbond: [C: +1] "lgtm" [software/debmonitor] - https://gerrit.wikimedia.org/r/888095 (owner: Volans)
[10:35:03] SRE, Infrastructure-Foundations, Patch-For-Review: Migrate the install servers to Bullseye - https://phabricator.wikimedia.org/T327867 (MoritzMuehlenhoff)
[10:35:05] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 34177
[10:35:12] (CR) Volans: [C: +2] debmonitorgc: garbage collect also stale Hosts [software/debmonitor] - https://gerrit.wikimedia.org/r/888095 (owner: Volans)
[10:35:16] SRE, Infrastructure-Foundations, Patch-For-Review: Migrate the install servers to Bullseye - https://phabricator.wikimedia.org/T327867 (MoritzMuehlenhoff)
[10:35:39] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 34177
[10:35:43] (CR) Muehlenhoff: [C: +2] Remove Puppet references for install[3456]001 [puppet] - https://gerrit.wikimedia.org/r/888192 (https://phabricator.wikimedia.org/T327867) (owner: Muehlenhoff)
[10:36:00] (CR) Jbond: [C: +1] R:idm_test add IDM test/staging [labs/private] - https://gerrit.wikimedia.org/r/888173 (owner: Slyngshede)
[10:36:24] (CR) Slyngshede: [V: +2] R:idm_test add IDM
test/staging [labs/private] - https://gerrit.wikimedia.org/r/888173 (owner: Slyngshede)
[10:36:27] (CR) Slyngshede: [V: +2 C: +2] R:idm_test add IDM test/staging [labs/private] - https://gerrit.wikimedia.org/r/888173 (owner: Slyngshede)
[10:37:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P44168 and previous config saved to /var/cache/conftool/dbconfig/20230210-103702-marostegui.json
[10:37:11] (PS2) Jelto: aptrepo: remove gitlab package for buster [puppet] - https://gerrit.wikimedia.org/r/888194 (https://phabricator.wikimedia.org/T318521)
[10:38:10] (Merged) jenkins-bot: debmonitorgc: garbage collect also stale Hosts [software/debmonitor] - https://gerrit.wikimedia.org/r/888095 (owner: Volans)
[10:38:14] (CR) Slyngshede: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39498/console" [puppet] - https://gerrit.wikimedia.org/r/888169 (owner: Slyngshede)
[10:38:44] (PS1) Muehlenhoff: Remove obsolete stub keytabs [labs/private] - https://gerrit.wikimedia.org/r/888196 (https://phabricator.wikimedia.org/T327867)
[10:40:07] (CR) Jelto: aptrepo: remove gitlab package for buster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/888194 (https://phabricator.wikimedia.org/T318521) (owner: Jelto)
[10:40:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P44169 and previous config saved to /var/cache/conftool/dbconfig/20230210-104051-marostegui.json
[10:40:52] (PS1) Elukey: Revert "sre.k8s.upgrade-cluster: fix alertmanager's catch all" [cookbooks] - https://gerrit.wikimedia.org/r/887864
[10:42:42] (PS4) Slyngshede: P:idm split IDM into staging and prod.
[puppet] - https://gerrit.wikimedia.org/r/888169
[10:43:37] (PS2) Jbond: cache: drop abuse_networks from the varnish profiles [puppet] - https://gerrit.wikimedia.org/r/884040
[10:43:40] (PS1) Jbond: configmaster: add a cpoy of the ferm requestctl definitions to nda [puppet] - https://gerrit.wikimedia.org/r/888197
[10:43:42] (PS1) Jbond: network: drop parse_abuse_nets function [puppet] - https://gerrit.wikimedia.org/r/888198
[10:43:54] (CR) Muehlenhoff: "Looks good, one question inline" [puppet] - https://gerrit.wikimedia.org/r/888194 (https://phabricator.wikimedia.org/T318521) (owner: Jelto)
[10:44:15] (CR) Slyngshede: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39499/console" [puppet] - https://gerrit.wikimedia.org/r/888169 (owner: Slyngshede)
[10:44:35] (PS2) Jbond: configmaster: add a cpoy of the ferm requestctl definitions to nda [puppet] - https://gerrit.wikimedia.org/r/888197
[10:45:31] (CR) Muehlenhoff: [V: +2 C: +2] Remove obsolete stub keytabs [labs/private] - https://gerrit.wikimedia.org/r/888196 (https://phabricator.wikimedia.org/T327867) (owner: Muehlenhoff)
[10:46:35] (CR) Elukey: [C: +2] Revert "sre.k8s.upgrade-cluster: fix alertmanager's catch all" [cookbooks] - https://gerrit.wikimedia.org/r/887864 (owner: Elukey)
[10:47:32] (CR) Jbond: [C: +2] configmaster: add a cpoy of the ferm requestctl definitions to nda [puppet] - https://gerrit.wikimedia.org/r/888197 (owner: Jbond)
[10:48:23] (PS5) Slyngshede: P:idm split IDM into staging and prod.
[puppet] - https://gerrit.wikimedia.org/r/888169
[10:49:34] (PS3) Jbond: cache: drop abuse_networks from the varnish profiles [puppet] - https://gerrit.wikimedia.org/r/884040
[10:49:44] (PS2) Jbond: network: drop parse_abuse_nets function [puppet] - https://gerrit.wikimedia.org/r/888198
[10:50:52] (CR) Elukey: [C: +1] Fix incorrect usage of NodeSet [software/spicerack] - https://gerrit.wikimedia.org/r/888195 (owner: Volans)
[10:51:22] (PS1) Jon Harald Søby: Rename project namespace in guwwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/888200 (https://phabricator.wikimedia.org/T309054)
[10:51:24] (CR) Jelto: aptrepo: remove gitlab package for buster (2 comments) [puppet] - https://gerrit.wikimedia.org/r/888194 (https://phabricator.wikimedia.org/T318521) (owner: Jelto)
[10:51:44] (PS3) Jelto: aptrepo: remove gitlab package for buster [puppet] - https://gerrit.wikimedia.org/r/888194 (https://phabricator.wikimedia.org/T318521)
[10:51:56] 23
[10:51:58] @3
[10:52:04] Oi, fingers, please work.
[10:52:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P44170 and previous config saved to /var/cache/conftool/dbconfig/20230210-105208-marostegui.json
[10:53:42] (CR) Volans: [C: +2] Fix incorrect usage of NodeSet [software/spicerack] - https://gerrit.wikimedia.org/r/888195 (owner: Volans)
[10:55:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P44171 and previous config saved to /var/cache/conftool/dbconfig/20230210-105557-marostegui.json
[10:57:30] (Merged) jenkins-bot: Fix incorrect usage of NodeSet [software/spicerack] - https://gerrit.wikimedia.org/r/888195 (owner: Volans)
[10:59:54] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/888194 (https://phabricator.wikimedia.org/T318521) (owner: Jelto)
[11:00:26] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:00:50] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:05:20] !log upgrade puppetdb[12]003 to bookworm T321783
[11:05:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:05:24] T321783: Setup an initial bookworm host pair with Puppetdb 7 - https://phabricator.wikimedia.org/T321783
[11:07:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T328817)', diff saved to https://phabricator.wikimedia.org/P44172 and previous config saved to /var/cache/conftool/dbconfig/20230210-110715-marostegui.json
[11:07:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime
for 12:00:00 on db2159.codfw.wmnet with reason: Maintenance
[11:07:19] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[11:07:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2159.codfw.wmnet with reason: Maintenance
[11:07:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance
[11:07:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance
[11:07:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2159 (T328817)', diff saved to https://phabricator.wikimedia.org/P44173 and previous config saved to /var/cache/conftool/dbconfig/20230210-110740-marostegui.json
[11:08:45] (PS1) Elukey: sre.k8s.upgrade-cluster: use verbatim_hosts=True for alert manager [cookbooks] - https://gerrit.wikimedia.org/r/888202 (https://phabricator.wikimedia.org/T327767)
[11:10:00] (PS1) Volans: CHANGELOG: add changelogs for release v6.1.0 [software/spicerack] - https://gerrit.wikimedia.org/r/888203
[11:10:09] (PS1) Jbond: requestctl: add mock requestctl data to be used in cloud [labs/private] - https://gerrit.wikimedia.org/r/888204
[11:10:17] (PS2) Elukey: sre.k8s.upgrade-cluster: use verbatim_hosts=True for alert manager [cookbooks] - https://gerrit.wikimedia.org/r/888202 (https://phabricator.wikimedia.org/T327767)
[11:10:36] (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/888202 (https://phabricator.wikimedia.org/T327767) (owner: Elukey)
[11:10:42] (CR) Volans: [C: +2] CHANGELOG: add changelogs for release v6.1.0 [software/spicerack] - https://gerrit.wikimedia.org/r/888203 (owner: Volans)
[11:11:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T329203)', diff saved to
https://phabricator.wikimedia.org/P44174 and previous config saved to /var/cache/conftool/dbconfig/20230210-111103-marostegui.json [11:11:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1187.eqiad.wmnet with reason: Maintenance [11:11:08] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [11:11:08] (CR) Jbond: "trying to add some minimum of data so that we can get some basic functionality from requestctl" [labs/private] - https://gerrit.wikimedia.org/r/888204 (owner: Jbond) [11:11:18] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1187.eqiad.wmnet with reason: Maintenance [11:11:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1187 (T329203)', diff saved to https://phabricator.wikimedia.org/P44175 and previous config saved to /var/cache/conftool/dbconfig/20230210-111124-marostegui.json [11:12:31] (CR) Majavah: "see also https://phabricator.wikimedia.org/T309281 for the cloud/ ipblocks" [labs/private] - https://gerrit.wikimedia.org/r/888204 (owner: Jbond) [11:12:39] (CR) Elukey: [C: +2] sre.k8s.upgrade-cluster: use verbatim_hosts=True for alert manager [cookbooks] - https://gerrit.wikimedia.org/r/888202 (https://phabricator.wikimedia.org/T327767) (owner: Elukey) [11:13:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T329203)', diff saved to https://phabricator.wikimedia.org/P44176 and previous config saved to /var/cache/conftool/dbconfig/20230210-111333-marostegui.json [11:14:24] (Merged) jenkins-bot: CHANGELOG: add changelogs for release v6.1.0 [software/spicerack] - https://gerrit.wikimedia.org/r/888203 (owner: Volans) [11:14:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T328817)', diff saved to https://phabricator.wikimedia.org/P44177 and previous config 
saved to /var/cache/conftool/dbconfig/20230210-111434-marostegui.json [11:14:38] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [11:15:23] (CR) Jbond: Add support for cloud test env (codfw) (1 comment) [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/887797 (owner: FNegri) [11:18:29] (CR) Jbond: cache: drop abuse_networks from the varnish profiles (1 comment) [puppet] - https://gerrit.wikimedia.org/r/884040 (owner: Jbond) [11:25:41] (PS1) Volans: Upstream release v6.1.0 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/888205 [11:26:47] (CR) Volans: [C: +2] Upstream release v6.1.0 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/888205 (owner: Volans) [11:27:27] SRE, Commons, Traffic: HTTP 500 Error while trying to make large tabular JSON data file - https://phabricator.wikimedia.org/T329339 (Vgutierrez) The error is reported as issued by Varnish but it's not the case. Using the provided JSON file and submitting it using python requests as a preview for a Sa... 
[11:28:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P44178 and previous config saved to /var/cache/conftool/dbconfig/20230210-112840-marostegui.json [11:29:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P44179 and previous config saved to /var/cache/conftool/dbconfig/20230210-112940-marostegui.json [11:30:28] (Merged) jenkins-bot: Upstream release v6.1.0 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/888205 (owner: Volans) [11:34:22] !log uploaded spicerack_6.1.0 to apt.wikimedia.org bullseye-wikimedia [11:34:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:34] (CR) Stang: [tox] Make running `tox` work (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/887830 (https://phabricator.wikimedia.org/T329231) (owner: Urbanecm) [11:35:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures [11:39:12] (CR) Jbond: Add support for cloud test env (codfw) (1 comment) [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/887797 (owner: FNegri) [11:39:14] (CR) Volans: [C: +1] "found an error post-merge" [cookbooks] - https://gerrit.wikimedia.org/r/888202 (https://phabricator.wikimedia.org/T327767) (owner: Elukey) [11:40:55] (LogstashIndexingFailures) resolved: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures [11:43:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff 
saved to https://phabricator.wikimedia.org/P44180 and previous config saved to /var/cache/conftool/dbconfig/20230210-114346-marostegui.json [11:44:36] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/888169 (owner: Slyngshede) [11:44:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P44181 and previous config saved to /var/cache/conftool/dbconfig/20230210-114447-marostegui.json [11:45:55] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/888053 (owner: Muehlenhoff) [11:48:09] (CR) Majavah: Add safe.directory directives for the puppet master (1 comment) [puppet] - https://gerrit.wikimedia.org/r/888053 (owner: Muehlenhoff) [11:48:54] (PS1) Volans: sre.k8s.upgrade-cluster: fix alertmanager param [cookbooks] - https://gerrit.wikimedia.org/r/888207 [11:51:35] (CR) Muehlenhoff: Add safe.directory directives for the puppet master (1 comment) [puppet] - https://gerrit.wikimedia.org/r/888053 (owner: Muehlenhoff) [11:51:44] !log eoghan@cumin2002 START - Cookbook sre.hosts.reimage for host gitlab-runner2002.codfw.wmnet with OS bullseye [11:51:57] (PS1) Clément Goubert: sre.switchdc.services: Exclude wdqs and wdqs-ssl [cookbooks] - https://gerrit.wikimedia.org/r/888208 (https://phabricator.wikimedia.org/T329193) [11:53:00] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/888169 (owner: Slyngshede) [11:53:21] !log jbond@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "testing spicerack 6.1.0 - jbond@cumin2002" [11:53:45] (CR) Muehlenhoff: "Thanks for working on this, I'll review and merge next week." 
[puppet] - https://gerrit.wikimedia.org/r/887943 (owner: Majavah) [11:53:56] (CR) Volans: "Question inline" [cookbooks] - https://gerrit.wikimedia.org/r/888208 (https://phabricator.wikimedia.org/T329193) (owner: Clément Goubert) [11:54:05] (CR) CI reject: [V: -1] sre.switchdc.services: Exclude wdqs and wdqs-ssl [cookbooks] - https://gerrit.wikimedia.org/r/888208 (https://phabricator.wikimedia.org/T329193) (owner: Clément Goubert) [11:54:16] (CR) EoghanGaffney: [C: +2] Insert an empty DOCKER-ISOLATION-STAGE-1 chain into the ferm templates (1 comment) [puppet] - https://gerrit.wikimedia.org/r/888057 (https://phabricator.wikimedia.org/T329035) (owner: EoghanGaffney) [11:54:25] !log jbond@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "testing spicerack 6.1.0 - jbond@cumin2002" [11:54:39] (PS2) Hnowlan: Bump Thumbor minor version [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/888034 (https://phabricator.wikimedia.org/T329290) [11:55:06] (PS3) Hnowlan: Bump Thumbor minor version [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/888034 (https://phabricator.wikimedia.org/T329290) [11:56:26] !log jbond@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts sretest1001.eqiad.wmnet [11:58:53] !log jbond@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "testing spicerack 6.1.0 - jbond@cumin2002" [11:58:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T329203)', diff saved to https://phabricator.wikimedia.org/P44182 and previous config saved to /var/cache/conftool/dbconfig/20230210-115852-marostegui.json [11:58:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1201.eqiad.wmnet with reason: Maintenance [11:58:58] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - 
https://phabricator.wikimedia.org/T329203 [11:59:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1201.eqiad.wmnet with reason: Maintenance [11:59:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1201 (T329203)', diff saved to https://phabricator.wikimedia.org/P44183 and previous config saved to /var/cache/conftool/dbconfig/20230210-115913-marostegui.json [11:59:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T328817)', diff saved to https://phabricator.wikimedia.org/P44184 and previous config saved to /var/cache/conftool/dbconfig/20230210-115953-marostegui.json [11:59:55] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2168.codfw.wmnet with reason: Maintenance [11:59:57] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [12:00:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2168.codfw.wmnet with reason: Maintenance [12:00:14] (PS4) Hokwelum: use lbzip2 instead of bzcat to decompress blocks in parallel [puppet] - https://gerrit.wikimedia.org/r/887776 (https://phabricator.wikimedia.org/T328804) [12:00:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3317 (T328817)', diff saved to https://phabricator.wikimedia.org/P44185 and previous config saved to /var/cache/conftool/dbconfig/20230210-120014-marostegui.json [12:00:25] RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:01:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T329203)', diff saved to https://phabricator.wikimedia.org/P44186 and previous config saved to /var/cache/conftool/dbconfig/20230210-120123-marostegui.json [12:02:05] !log jbond@cumin2002 END (PASS) - Cookbook 
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "testing spicerack 6.1.0 - jbond@cumin2002" [12:02:22] (PS2) Clément Goubert: sre.switchdc.services: Exclude wdqs and wdqs-ssl [cookbooks] - https://gerrit.wikimedia.org/r/888208 (https://phabricator.wikimedia.org/T329193) [12:02:36] !log jbond@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "testing spicerack 6.1.0 - jbond@cumin2002" [12:03:35] PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: planet_sync_tile_generation-gis.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:03:36] (CR) Hnowlan: "Tests pass with new version, changelog looks mostly safe. Only concerns are around the switch to using piexif where we still use pyexiv2. " [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/888034 (https://phabricator.wikimedia.org/T329290) (owner: Hnowlan) [12:03:44] !log jbond@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "testing spicerack 6.1.0 - jbond@cumin2002" [12:04:02] !log jbond@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1001.eqiad.wmnet [12:06:58] !log eoghan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner2002.codfw.wmnet with reason: host reimage [12:07:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T328817)', diff saved to https://phabricator.wikimedia.org/P44187 and previous config saved to /var/cache/conftool/dbconfig/20230210-120721-marostegui.json [12:07:25] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [12:07:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1201 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P44188 and previous config saved to 
/var/cache/conftool/dbconfig/20230210-120747-root.json [12:08:11] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2097.codfw.wmnet with reason: Maintenance [12:08:25] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2097.codfw.wmnet with reason: Maintenance [12:10:03] !log eoghan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner2002.codfw.wmnet with reason: host reimage [12:10:53] (PS3) Clément Goubert: sre.switchdc.services: Exclude wdqs and wdqs-ssl [cookbooks] - https://gerrit.wikimedia.org/r/888208 (https://phabricator.wikimedia.org/T329193) [12:11:26] (CR) Clément Goubert: "As discussed, exclude wdqs and wdqs-ssl from the switchover" [cookbooks] - https://gerrit.wikimedia.org/r/888208 (https://phabricator.wikimedia.org/T329193) (owner: Clément Goubert) [12:12:33] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2104.codfw.wmnet with reason: Maintenance [12:12:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2104.codfw.wmnet with reason: Maintenance [12:12:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2104 (T329203)', diff saved to https://phabricator.wikimedia.org/P44189 and previous config saved to /var/cache/conftool/dbconfig/20230210-121252-marostegui.json [12:12:58] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [12:13:26] !log jbond@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1001.eqiad.wmnet [12:13:27] !log jbond@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts sretest1001.eqiad.wmnet [12:15:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T329203)', diff saved to 
https://phabricator.wikimedia.org/P44190 and previous config saved to /var/cache/conftool/dbconfig/20230210-121530-marostegui.json [12:22:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P44191 and previous config saved to /var/cache/conftool/dbconfig/20230210-122227-marostegui.json [12:22:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1201 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P44192 and previous config saved to /var/cache/conftool/dbconfig/20230210-122252-root.json [12:26:49] !log eoghan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner2002.codfw.wmnet with OS bullseye [12:29:57] (CR) Nicolas Fraison: [C: +1] "LGTM with my current limited knowledge of puppet and our hadoop setup" [puppet] - https://gerrit.wikimedia.org/r/887807 (https://phabricator.wikimedia.org/T318696) (owner: Btullis) [12:30:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P44193 and previous config saved to /var/cache/conftool/dbconfig/20230210-123036-marostegui.json [12:35:08] (PS1) Hnowlan: Remove vendored thumbor-community-core [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/888210 [12:37:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P44194 and previous config saved to /var/cache/conftool/dbconfig/20230210-123733-marostegui.json [12:37:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1201 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P44195 and previous config saved to /var/cache/conftool/dbconfig/20230210-123757-root.json [12:39:52] SRE, DBA, Data-Engineering-Planning, Data-Persistence, and 9 others: eqiad row A switches upgrade - 
https://phabricator.wikimedia.org/T329073 (EChetty) [12:41:41] (CR) Jelto: [C: +2] aptrepo: remove gitlab package for buster (2 comments) [puppet] - https://gerrit.wikimedia.org/r/888194 (https://phabricator.wikimedia.org/T318521) (owner: Jelto) [12:44:04] (CR) Jelto: [C: +2] aptrepo: remove gitlab package for buster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/888194 (https://phabricator.wikimedia.org/T318521) (owner: Jelto) [12:45:22] !log eoghan@cumin2002 START - Cookbook sre.hosts.reimage for host gitlab-runner2003.codfw.wmnet with OS bullseye [12:45:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P44196 and previous config saved to /var/cache/conftool/dbconfig/20230210-124543-marostegui.json [12:46:23] SRE, DBA, Data-Engineering-Planning, Data-Persistence, and 10 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (EChetty) [12:49:25] (CR) Gehel: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/887999 (owner: Muehlenhoff) [12:52:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T328817)', diff saved to https://phabricator.wikimedia.org/P44197 and previous config saved to /var/cache/conftool/dbconfig/20230210-125240-marostegui.json [12:52:42] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2169.codfw.wmnet with reason: Maintenance [12:52:44] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [12:52:55] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2169.codfw.wmnet with reason: Maintenance [12:53:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3317 (T328817)', diff saved to https://phabricator.wikimedia.org/P44198 and previous config saved to 
/var/cache/conftool/dbconfig/20230210-125301-marostegui.json [12:53:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1201 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P44199 and previous config saved to /var/cache/conftool/dbconfig/20230210-125308-root.json [12:55:29] (CR) Jelto: [C: +2] gitlab: use /srv/gitlab-backup in WMCS [puppet] - https://gerrit.wikimedia.org/r/888193 (https://phabricator.wikimedia.org/T318521) (owner: Jelto) [12:57:15] (PS1) Muehlenhoff: raid_handler: Use universal_newlines [puppet] - https://gerrit.wikimedia.org/r/888212 (https://phabricator.wikimedia.org/T325046) [13:00:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T328817)', diff saved to https://phabricator.wikimedia.org/P44200 and previous config saved to /var/cache/conftool/dbconfig/20230210-130002-marostegui.json [13:00:07] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [13:00:43] !log eoghan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner2003.codfw.wmnet with reason: host reimage [13:00:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T329203)', diff saved to https://phabricator.wikimedia.org/P44201 and previous config saved to /var/cache/conftool/dbconfig/20230210-130049-marostegui.json [13:00:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2125.codfw.wmnet with reason: Maintenance [13:00:53] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [13:01:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2125.codfw.wmnet with reason: Maintenance [13:01:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2125 (T329203)', diff saved to 
https://phabricator.wikimedia.org/P44202 and previous config saved to /var/cache/conftool/dbconfig/20230210-130110-marostegui.json [13:03:52] !log eoghan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner2003.codfw.wmnet with reason: host reimage [13:04:20] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetdb1003.eqiad.wmnet [13:04:42] (CR) Klausman: [C: +1] role::ml_k8s::staging: upgrade cluster settings for k8s 1.23 (2 comments) [puppet] - https://gerrit.wikimedia.org/r/884034 (https://phabricator.wikimedia.org/T327767) (owner: Elukey) [13:08:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T329203)', diff saved to https://phabricator.wikimedia.org/P44203 and previous config saved to /var/cache/conftool/dbconfig/20230210-130801-marostegui.json [13:08:05] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [13:08:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1201 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P44204 and previous config saved to /var/cache/conftool/dbconfig/20230210-130813-root.json [13:09:06] (CR) Clément Goubert: "This change is ready for review." 
[cookbooks] - https://gerrit.wikimedia.org/r/888213 (https://phabricator.wikimedia.org/T329193) (owner: Clément Goubert) [13:09:33] (PS2) Clément Goubert: sre.switchdc.services: import sre.discovery.datacenter excludes [cookbooks] - https://gerrit.wikimedia.org/r/888213 (https://phabricator.wikimedia.org/T329193) [13:09:44] (CR) CI reject: [V: -1] sre.switchdc.services: import sre.discovery.datacenter excludes [cookbooks] - https://gerrit.wikimedia.org/r/888213 (https://phabricator.wikimedia.org/T329193) (owner: Clément Goubert) [13:10:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetdb1003.eqiad.wmnet [13:10:09] (PS1) Nicolas Fraison: feat(presto): add gc logs [puppet] - https://gerrit.wikimedia.org/r/888214 [13:12:47] (CR) Clément Goubert: sre.switchdc.services: Exclude wdqs and wdqs-ssl (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/888208 (https://phabricator.wikimedia.org/T329193) (owner: Clément Goubert) [13:15:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P44205 and previous config saved to /var/cache/conftool/dbconfig/20230210-131509-marostegui.json [13:18:07] !log eoghan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner2003.codfw.wmnet with OS bullseye [13:19:16] (CR) Volans: [C: -1] "Looks sane to me. Just one small bug inline." 
[cookbooks] - https://gerrit.wikimedia.org/r/883863 (owner: Jbond) [13:19:22] !log Adjusting evpn route export policy on lsw1-e2-eqiad to include host routes [13:19:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:57] !log upgraded spicerack to 6.1.0 on the cumin hosts [13:19:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:28] !log eoghan@cumin2002 START - Cookbook sre.hosts.reimage for host gitlab-runner2004.codfw.wmnet with OS bullseye [13:22:10] (PS2) Joal: Update analytics data purge for webrequest_actor [puppet] - https://gerrit.wikimedia.org/r/887786 (https://phabricator.wikimedia.org/T324483) [13:23:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P44206 and previous config saved to /var/cache/conftool/dbconfig/20230210-132307-marostegui.json [13:23:18] (PS3) Joal: Update analytics data purge for webrequest_actor [puppet] - https://gerrit.wikimedia.org/r/887786 (https://phabricator.wikimedia.org/T324483) [13:23:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1201 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P44207 and previous config saved to /var/cache/conftool/dbconfig/20230210-132318-root.json [13:23:24] (CR) Joal: Update analytics data purge for webrequest_actor (1 comment) [puppet] - https://gerrit.wikimedia.org/r/887786 (https://phabricator.wikimedia.org/T324483) (owner: Joal) [13:25:32] (CR) Nicolas Fraison: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39501/console" [puppet] - https://gerrit.wikimedia.org/r/888214 (owner: Nicolas Fraison) [13:27:24] (CR) Stevemunene: [C: +2] analytics::refinery::job::druid_load.pp: remove absented jobs [puppet] - https://gerrit.wikimedia.org/r/888082 (https://phabricator.wikimedia.org/T328933) (owner: Mforns) 
[13:27:26] (PS2) Nicolas Fraison: feat(presto): add gc logs [puppet] - https://gerrit.wikimedia.org/r/888214 (https://phabricator.wikimedia.org/T329054) [13:29:00] (CR) Filippo Giunchedi: [C: +1] "Didn't test it but seems sensible" [puppet] - https://gerrit.wikimedia.org/r/888212 (https://phabricator.wikimedia.org/T325046) (owner: Muehlenhoff) [13:30:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P44208 and previous config saved to /var/cache/conftool/dbconfig/20230210-133016-marostegui.json [13:32:53] (PS11) Jbond: sre.hardware.firmware: Add ability to defer or prevent reboots [cookbooks] - https://gerrit.wikimedia.org/r/883863 [13:33:01] (CR) Jbond: "updated thanks" [cookbooks] - https://gerrit.wikimedia.org/r/883863 (owner: Jbond) [13:34:05] (PS1) Muehlenhoff: Reapply puppetdb role [puppet] - https://gerrit.wikimedia.org/r/888218 (https://phabricator.wikimedia.org/T321783) [13:34:34] (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/883863 (owner: Jbond) [13:36:05] (PS2) Muehlenhoff: Reapply puppetdb role [puppet] - https://gerrit.wikimedia.org/r/888218 (https://phabricator.wikimedia.org/T321783) [13:36:49] !log eoghan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner2004.codfw.wmnet with reason: host reimage [13:38:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P44209 and previous config saved to /var/cache/conftool/dbconfig/20230210-133813-marostegui.json [13:38:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1201 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P44210 and previous config saved to /var/cache/conftool/dbconfig/20230210-133823-root.json [13:38:49] (PS12) Jbond: sre.hardware.firmware: Add ability to defer or prevent 
reboots [cookbooks] - https://gerrit.wikimedia.org/r/883863 [13:38:53] (CR) Jbond: [C: +2] sre.hardware.firmware: Add ability to defer or prevent reboots [cookbooks] - https://gerrit.wikimedia.org/r/883863 (owner: Jbond) [13:39:57] !log eoghan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner2004.codfw.wmnet with reason: host reimage [13:40:38] (Merged) jenkins-bot: sre.hardware.firmware: Add ability to defer or prevent reboots [cookbooks] - https://gerrit.wikimedia.org/r/883863 (owner: Jbond) [13:42:59] (CR) Clément Goubert: [C: +1] "Looks pretty noop to me. Good cleanup." [deployment-charts] - https://gerrit.wikimedia.org/r/887991 (https://phabricator.wikimedia.org/T329049) (owner: Hnowlan) [13:43:18] (CR) Elukey: [C: +2] sre.k8s.upgrade-cluster: fix alertmanager param [cookbooks] - https://gerrit.wikimedia.org/r/888207 (owner: Volans) [13:43:23] (PS2) Elukey: sre.k8s.upgrade-cluster: fix alertmanager param [cookbooks] - https://gerrit.wikimedia.org/r/888207 (owner: Volans) [13:45:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T328817)', diff saved to https://phabricator.wikimedia.org/P44211 and previous config saved to /var/cache/conftool/dbconfig/20230210-134523-marostegui.json [13:45:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2182.codfw.wmnet with reason: Maintenance [13:45:27] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [13:45:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2182.codfw.wmnet with reason: Maintenance [13:45:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2182 (T328817)', diff saved to https://phabricator.wikimedia.org/P44212 and previous config saved to /var/cache/conftool/dbconfig/20230210-134544-marostegui.json [13:46:41] 
(CR) Muehlenhoff: [C: +2] Reapply puppetdb role [puppet] - https://gerrit.wikimedia.org/r/888218 (https://phabricator.wikimedia.org/T321783) (owner: Muehlenhoff) [13:46:50] SRE, Infrastructure-Foundations, netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (cmooney) [13:48:02] SRE, Product-Infrastructure-Team-Backlog-Deprecated, WMDE-TechWish-Maintenance, serviceops, and 3 others: New Service Request geoshapes - https://phabricator.wikimedia.org/T274388 (akosiaris) @Msantos, my current understanding is that we are pausing work on this. Should we set to `Stalled` ? [13:48:24] !log elukey@cumin1001 START - Cookbook sre.k8s.upgrade-cluster Upgrade K8s version: Upgrade ml-staging-codfw cluster to 1.23 [13:48:52] (PS1) Cathal Mooney: Adjust EVPN BGP type-5 route creation / export to include host routes [homer/public] - https://gerrit.wikimedia.org/r/888219 (https://phabricator.wikimedia.org/T329369) [13:49:14] !log elukey@cumin1001 END (FAIL) - Cookbook sre.k8s.upgrade-cluster (exit_code=99) Upgrade K8s version: Upgrade ml-staging-codfw cluster to 1.23 [13:51:15] (CR) Jbond: [C: -1] "i think the general premise of this change is fine, however we still need the wrap_with_stunnel parameter in the quickdatacopy resources (" [puppet] - https://gerrit.wikimedia.org/r/888065 (owner: Herron) [13:52:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T328817)', diff saved to https://phabricator.wikimedia.org/P44213 and previous config saved to /var/cache/conftool/dbconfig/20230210-135235-marostegui.json [13:52:40] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [13:53:18] (PS1) Andrew Bogott: Openstack database config: lower timeouts quite a bit [puppet] - https://gerrit.wikimedia.org/r/888220 (https://phabricator.wikimedia.org/T328155) [13:53:20] !log marostegui@cumin1001 dbctl 
commit (dc=all): 'Repooling after maintenance db2125 (T329203)', diff saved to https://phabricator.wikimedia.org/P44214 and previous config saved to /var/cache/conftool/dbconfig/20230210-135319-marostegui.json [13:53:22] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2126.codfw.wmnet with reason: Maintenance [13:53:24] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [13:53:35] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2126.codfw.wmnet with reason: Maintenance [13:53:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [13:53:39] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [13:53:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2126 (T329203)', diff saved to https://phabricator.wikimedia.org/P44215 and previous config saved to /var/cache/conftool/dbconfig/20230210-135345-marostegui.json [13:56:11] !log eoghan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner2004.codfw.wmnet with OS bullseye [13:56:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T329203)', diff saved to https://phabricator.wikimedia.org/P44216 and previous config saved to /var/cache/conftool/dbconfig/20230210-135622-marostegui.json [14:03:01] (CR) Stevemunene: [C: +1] "Looks good to me" [puppet] - https://gerrit.wikimedia.org/r/888214 (https://phabricator.wikimedia.org/T329054) (owner: Nicolas Fraison) [14:05:15] SRE, Product-Infrastructure-Team-Backlog-Deprecated, WMDE-TechWish-Maintenance, serviceops, and 3 others: New Service Request geoshapes - https://phabricator.wikimedia.org/T274388 (awight) FWIW, there has been parallel 
work in {T216826} to containerize the whole kartotherian service, which curren... [14:07:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P44217 and previous config saved to /var/cache/conftool/dbconfig/20230210-140741-marostegui.json [14:11:02] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Export routes generated from ARP/ND in EVPN - https://phabricator.wikimedia.org/T329369 (10cmooney) [14:11:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P44218 and previous config saved to /var/cache/conftool/dbconfig/20230210-141128-marostegui.json [14:14:14] (03CR) 10Nicolas Fraison: [C: 03+2] Update analytics data purge for webrequest_actor [puppet] - 10https://gerrit.wikimedia.org/r/887786 (https://phabricator.wikimedia.org/T324483) (owner: 10Joal) [14:19:44] (03CR) 10Volans: [C: 03+1] "LGTM, optional nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/888212 (https://phabricator.wikimedia.org/T325046) (owner: 10Muehlenhoff) [14:22:39] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.6 point update - https://phabricator.wikimedia.org/T325186 (10MoritzMuehlenhoff) [14:22:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P44219 and previous config saved to /var/cache/conftool/dbconfig/20230210-142247-marostegui.json [14:23:10] (03PS1) 10Elukey: sre.k8s: add isRegex=False to Prometheus matchers [cookbooks] - 10https://gerrit.wikimedia.org/r/888221 (https://phabricator.wikimedia.org/T327767) [14:25:53] (03PS2) 10Muehlenhoff: raid_handler: Use universal_newlines [puppet] - 10https://gerrit.wikimedia.org/r/888212 (https://phabricator.wikimedia.org/T325046) [14:25:58] (03CR) 10Muehlenhoff: raid_handler: Use universal_newlines (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/888212 (https://phabricator.wikimedia.org/T325046) (owner: 10Muehlenhoff) [14:26:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P44220 and previous config saved to /var/cache/conftool/dbconfig/20230210-142636-marostegui.json [14:30:25] (03PS2) 10Elukey: sre.k8s: add isRegex=False to Prometheus matchers [cookbooks] - 10https://gerrit.wikimedia.org/r/888221 (https://phabricator.wikimedia.org/T327767) [14:30:59] (03PS1) 10Arturo Borrero Gonzalez: cloud: openstack: add kolla-ansible evaluation recipe [puppet] - 10https://gerrit.wikimedia.org/r/888222 (https://phabricator.wikimedia.org/T326758) [14:31:47] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/888221 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [14:32:17] (03CR) 10Elukey: [C: 03+2] sre.k8s: add isRegex=False to Prometheus matchers [cookbooks] - 10https://gerrit.wikimedia.org/r/888221 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [14:32:35] (03PS2) 10Arturo Borrero Gonzalez: cloud: openstack: add kolla-ansible evaluation recipe [puppet] - 10https://gerrit.wikimedia.org/r/888222 (https://phabricator.wikimedia.org/T326758) [14:33:44] (03CR) 10Andrew Bogott: [C: 03+2] Openstack database config: lower timeouts quite a bit [puppet] - 10https://gerrit.wikimedia.org/r/888220 (https://phabricator.wikimedia.org/T328155) (owner: 10Andrew Bogott) [14:33:52] !log elukey@cumin1001 START - Cookbook sre.k8s.upgrade-cluster Upgrade K8s version: Upgrade ml-staging-codfw cluster to 1.23 [14:34:25] (03CR) 10CI reject: [V: 04-1] cloud: openstack: add kolla-ansible evaluation recipe [puppet] - 10https://gerrit.wikimedia.org/r/888222 (https://phabricator.wikimedia.org/T326758) (owner: 10Arturo Borrero Gonzalez) [14:35:04] (03CR) 10Elukey: [C: 03+2] role::ml_k8s::staging: upgrade cluster settings for k8s 1.23 [puppet] - 
10https://gerrit.wikimedia.org/r/884034 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [14:36:03] !log elukey@cumin1001 START - Cookbook sre.ganeti.reimage for host ml-staging-etcd2001.codfw.wmnet with OS bullseye [14:36:15] !log elukey@cumin1001 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host ml-staging-etcd2001.codfw.wmnet with OS bullseye [14:37:30] (03CR) 10Herron: rsync: remove rsync::server::wrap_with_stunnel (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/888065 (owner: 10Herron) [14:37:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T328817)', diff saved to https://phabricator.wikimedia.org/P44221 and previous config saved to /var/cache/conftool/dbconfig/20230210-143753-marostegui.json [14:37:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1101.eqiad.wmnet with reason: Maintenance [14:38:02] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [14:38:09] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1101.eqiad.wmnet with reason: Maintenance [14:38:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T328817)', diff saved to https://phabricator.wikimedia.org/P44222 and previous config saved to /var/cache/conftool/dbconfig/20230210-143815-marostegui.json [14:40:46] (JobUnavailable) firing: (4) Reduced availability for job k8s-api in k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:41:41] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations, 10Patch-For-Review: Update makevm to include completion of the installation with the puppet runs - https://phabricator.wikimedia.org/T306661 (10elukey) Hi folks, I tried to call the new reimage 
cookbook from sre.k8s.upgrade-cluster and I got the followin... [14:41:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T329203)', diff saved to https://phabricator.wikimedia.org/P44223 and previous config saved to /var/cache/conftool/dbconfig/20230210-144143-marostegui.json [14:41:45] (03PS2) 10Herron: rsync: remove rsync::server::wrap_with_stunnel [puppet] - 10https://gerrit.wikimedia.org/r/888065 [14:41:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2138.codfw.wmnet with reason: Maintenance [14:41:48] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [14:41:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2138.codfw.wmnet with reason: Maintenance [14:42:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3312 (T329203)', diff saved to https://phabricator.wikimedia.org/P44224 and previous config saved to /var/cache/conftool/dbconfig/20230210-144204-marostegui.json [14:42:19] (03PS3) 10Herron: rsync: remove rsync::server::wrap_with_stunnel [puppet] - 10https://gerrit.wikimedia.org/r/888065 [14:43:01] !log elukey@cumin1001 END (FAIL) - Cookbook sre.k8s.upgrade-cluster (exit_code=99) Upgrade K8s version: Upgrade ml-staging-codfw cluster to 1.23 [14:44:24] (03CR) 10Btullis: [C: 03+1] "Thanks for this, looks good." 
[puppet] - 10https://gerrit.wikimedia.org/r/888214 (https://phabricator.wikimedia.org/T329054) (owner: 10Nicolas Fraison) [14:44:34] (03PS1) 10Elukey: install_server: remove ml-staging nodes to allow their reimage [puppet] - 10https://gerrit.wikimedia.org/r/888225 (https://phabricator.wikimedia.org/T327767) [14:45:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T328817)', diff saved to https://phabricator.wikimedia.org/P44225 and previous config saved to /var/cache/conftool/dbconfig/20230210-144530-marostegui.json [14:45:37] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [14:46:03] (03CR) 10Elukey: [C: 03+2] install_server: remove ml-staging nodes to allow their reimage [puppet] - 10https://gerrit.wikimedia.org/r/888225 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [14:48:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T329203)', diff saved to https://phabricator.wikimedia.org/P44226 and previous config saved to /var/cache/conftool/dbconfig/20230210-144830-marostegui.json [14:48:35] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [14:52:06] !log elukey@cumin1001 START - Cookbook sre.k8s.upgrade-cluster Upgrade K8s version: Upgrade ml-staging-codfw cluster to 1.23 [14:53:22] !log elukey@cumin1001 START - Cookbook sre.ganeti.reimage for host ml-staging-etcd2002.codfw.wmnet with OS bullseye [14:55:46] (JobUnavailable) firing: (4) Reduced availability for job k8s-api in k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:56:28] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - 
ml-staging-ctrl_6443: Servers ml-staging-ctrl2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:56:49] yeah these are not downtimed, expected :( [14:56:56] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - ml-staging-ctrl_6443: Servers ml-staging-ctrl2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:57:02] sorry for the noise [15:00:11] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetdb2003.codfw.wmnet [15:00:20] (03CR) 10Herron: rsync: remove rsync::server::wrap_with_stunnel (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/888065 (owner: 10Herron) [15:00:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P44227 and previous config saved to /var/cache/conftool/dbconfig/20230210-150038-marostegui.json [15:00:46] (JobUnavailable) firing: (4) Reduced availability for job k8s-api in k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:02:12] (03CR) 10FNegri: [C: 03+1] "+1 for cleaning up the legacy domain, I'm not 100% confident that nothing will break but if it does, it's easy to revert this patch. 
I wou" [puppet] - 10https://gerrit.wikimedia.org/r/852836 (owner: 10Majavah) [15:02:51] (03CR) 10Cwhite: [C: 03+1] opensearch_dashboards: bump memory limit [puppet] - 10https://gerrit.wikimedia.org/r/888165 (https://phabricator.wikimedia.org/T327161) (owner: 10Filippo Giunchedi) [15:03:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P44228 and previous config saved to /var/cache/conftool/dbconfig/20230210-150337-marostegui.json [15:05:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetdb2003.codfw.wmnet [15:05:50] (03PS1) 10Joal: Remove previously absent timers from analytics data_purge [puppet] - 10https://gerrit.wikimedia.org/r/888228 (https://phabricator.wikimedia.org/T324483) [15:05:59] (03PS1) 10Slyngshede: site.pp remove reimage test server. [puppet] - 10https://gerrit.wikimedia.org/r/888229 (https://phabricator.wikimedia.org/T324744) [15:06:12] (03CR) 10Filippo Giunchedi: [C: 03+2] opensearch_dashboards: bump memory limit [puppet] - 10https://gerrit.wikimedia.org/r/888165 (https://phabricator.wikimedia.org/T327161) (owner: 10Filippo Giunchedi) [15:08:35] (03CR) 10Andrew Bogott: [C: 03+2] puppet: adapt replica_cnf_api to python3.5 [puppet] - 10https://gerrit.wikimedia.org/r/888112 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [15:09:42] (03CR) 10Clément Goubert: "This change is ready for review." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/888227 (owner: 10Clément Goubert) [15:10:46] (JobUnavailable) firing: (4) Reduced availability for job k8s-api in k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:15:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P44229 and previous config saved to /var/cache/conftool/dbconfig/20230210-151544-marostegui.json [15:15:48] (03CR) 10Nicolas Fraison: [C: 03+1] Remove previously absent timers from analytics data_purge [puppet] - 10https://gerrit.wikimedia.org/r/888228 (https://phabricator.wikimedia.org/T324483) (owner: 10Joal) [15:16:10] (03CR) 10Nicolas Fraison: [C: 03+2] Remove previously absent timers from analytics data_purge [puppet] - 10https://gerrit.wikimedia.org/r/888228 (https://phabricator.wikimedia.org/T324483) (owner: 10Joal) [15:17:26] (03PS1) 10Muehlenhoff: Reapply puppetdb role [puppet] - 10https://gerrit.wikimedia.org/r/888230 [15:18:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P44230 and previous config saved to /var/cache/conftool/dbconfig/20230210-151843-marostegui.json [15:22:38] (03PS5) 10DCausse: [WIP] rdf-streaming-updater: add a test job using the k8s operator... 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/886005 (https://phabricator.wikimedia.org/T328675) [15:22:40] (03PS1) 10DCausse: experimental: add support for custom flink-app config files [deployment-charts] - 10https://gerrit.wikimedia.org/r/888231 [15:22:42] (03PS1) 10DCausse: rdf-streaming-updater: add a config file and use it [deployment-charts] - 10https://gerrit.wikimedia.org/r/888232 [15:24:40] (03PS2) 10DCausse: experimental: add support for custom flink-app config files [deployment-charts] - 10https://gerrit.wikimedia.org/r/888231 [15:24:42] (03PS2) 10DCausse: rdf-streaming-updater: add a config file and use it [deployment-charts] - 10https://gerrit.wikimedia.org/r/888232 [15:25:48] (03CR) 10CI reject: [V: 04-1] [WIP] rdf-streaming-updater: add a test job using the k8s operator... [deployment-charts] - 10https://gerrit.wikimedia.org/r/886005 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse) [15:26:18] (03CR) 10CI reject: [V: 04-1] experimental: add support for custom flink-app config files [deployment-charts] - 10https://gerrit.wikimedia.org/r/888231 (owner: 10DCausse) [15:27:51] (03CR) 10CI reject: [V: 04-1] experimental: add support for custom flink-app config files [deployment-charts] - 10https://gerrit.wikimedia.org/r/888231 (owner: 10DCausse) [15:28:21] (03CR) 10CI reject: [V: 04-1] rdf-streaming-updater: add a config file and use it [deployment-charts] - 10https://gerrit.wikimedia.org/r/888232 (owner: 10DCausse) [15:28:51] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 2 others: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544 (10cmooney) As I understand the work that is being done to distribute load better, allowing rack C8 to be offline f... 
[15:29:31] RECOVERY - Check systemd state on logstash1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:30:01] RECOVERY - Check systemd state on logstash2023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:30:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T328817)', diff saved to https://phabricator.wikimedia.org/P44231 and previous config saved to /var/cache/conftool/dbconfig/20230210-153051-marostegui.json [15:30:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1127.eqiad.wmnet with reason: Maintenance [15:30:55] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [15:31:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1127.eqiad.wmnet with reason: Maintenance [15:31:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T328817)', diff saved to https://phabricator.wikimedia.org/P44232 and previous config saved to /var/cache/conftool/dbconfig/20230210-153112-marostegui.json [15:32:42] (03PS3) 10Arturo Borrero Gonzalez: cloud: openstack: add kolla-ansible evaluation recipe [puppet] - 10https://gerrit.wikimedia.org/r/888222 (https://phabricator.wikimedia.org/T326758) [15:33:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T329203)', diff saved to https://phabricator.wikimedia.org/P44233 and previous config saved to /var/cache/conftool/dbconfig/20230210-153349-marostegui.json [15:33:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2148.codfw.wmnet with reason: Maintenance [15:33:53] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 
[15:34:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2148.codfw.wmnet with reason: Maintenance [15:34:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2148 (T329203)', diff saved to https://phabricator.wikimedia.org/P44234 and previous config saved to /var/cache/conftool/dbconfig/20230210-153411-marostegui.json [15:34:16] (03PS6) 10DCausse: [WIP] rdf-streaming-updater: add a test job using the k8s operator... [deployment-charts] - 10https://gerrit.wikimedia.org/r/886005 (https://phabricator.wikimedia.org/T328675) [15:34:18] (03PS3) 10DCausse: experimental: add support for custom flink-app config files [deployment-charts] - 10https://gerrit.wikimedia.org/r/888231 [15:34:20] (03PS3) 10DCausse: rdf-streaming-updater: add a config file and use it [deployment-charts] - 10https://gerrit.wikimedia.org/r/888232 [15:35:19] (03PS1) 10EoghanGaffney: Set increased thresholds for docker image/volume garbage collection [puppet] - 10https://gerrit.wikimedia.org/r/888234 (https://phabricator.wikimedia.org/T329035) [15:36:51] (03PS2) 10Clément Goubert: Exclude traindev from tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/888227 [15:36:53] (03CR) 10Ebernhardson: [C: 03+1] [cirrus] enable CirrusSearchCompletionSuggesterUseDefaultSort on mnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/888178 (https://phabricator.wikimedia.org/T327878) (owner: 10DCausse) [15:37:16] (03CR) 10CI reject: [V: 04-1] experimental: add support for custom flink-app config files [deployment-charts] - 10https://gerrit.wikimedia.org/r/888231 (owner: 10DCausse) [15:37:36] (03CR) 10CI reject: [V: 04-1] Exclude traindev from tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/888227 (owner: 10Clément Goubert) [15:37:37] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T327373 (10phaultfinder) [15:37:39] (03CR) 10CI reject: [V: 04-1] rdf-streaming-updater: add a config file and use it 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/888232 (owner: 10DCausse) [15:38:03] (03CR) 10CI reject: [V: 04-1] [WIP] rdf-streaming-updater: add a test job using the k8s operator... [deployment-charts] - 10https://gerrit.wikimedia.org/r/886005 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse) [15:40:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T328817)', diff saved to https://phabricator.wikimedia.org/P44235 and previous config saved to /var/cache/conftool/dbconfig/20230210-154001-marostegui.json [15:40:05] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [15:40:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T329203)', diff saved to https://phabricator.wikimedia.org/P44236 and previous config saved to /var/cache/conftool/dbconfig/20230210-154034-marostegui.json [15:40:38] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [15:41:16] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2446.mgmt.codfw.wmnet with reboot policy FORCED [15:41:30] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2447.mgmt.codfw.wmnet with reboot policy FORCED [15:42:20] (03PS3) 10Clément Goubert: Exclude traindev from tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/888227 [15:42:36] (03PS7) 10DCausse: [WIP] rdf-streaming-updater: add a test job using the k8s operator... 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/886005 (https://phabricator.wikimedia.org/T328675) [15:42:38] (03PS4) 10DCausse: experimental: add support for custom flink-app config files [deployment-charts] - 10https://gerrit.wikimedia.org/r/888231 [15:42:40] (03PS4) 10DCausse: rdf-streaming-updater: add a config file and use it [deployment-charts] - 10https://gerrit.wikimedia.org/r/888232 [15:42:44] 10SRE-swift-storage: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253 (10MatthewVernon) The answer is (small sample size) that there are no on-disk records for ghosts. Example with an (non-ghost) object in all 3 containers as deleted: ` mvernon@ms-be1069:~$ sud... [15:45:39] (03CR) 10CI reject: [V: 04-1] rdf-streaming-updater: add a config file and use it [deployment-charts] - 10https://gerrit.wikimedia.org/r/888232 (owner: 10DCausse) [15:45:44] (03CR) 10CI reject: [V: 04-1] [WIP] rdf-streaming-updater: add a test job using the k8s operator... [deployment-charts] - 10https://gerrit.wikimedia.org/r/886005 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse) [15:45:50] (03CR) 10Ahmon Dancy: "Thanks for this and the associated work!" 
[puppet] - 10https://gerrit.wikimedia.org/r/888234 (https://phabricator.wikimedia.org/T329035) (owner: 10EoghanGaffney) [15:45:58] (03CR) 10CI reject: [V: 04-1] experimental: add support for custom flink-app config files [deployment-charts] - 10https://gerrit.wikimedia.org/r/888231 (owner: 10DCausse) [15:49:27] !log elukey@cumin1001 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host ml-staging-etcd2002.codfw.wmnet with OS bullseye [15:49:27] !log elukey@cumin1001 END (FAIL) - Cookbook sre.k8s.upgrade-cluster (exit_code=99) Upgrade K8s version: Upgrade ml-staging-codfw cluster to 1.23 [15:49:43] (03PS2) 10EoghanGaffney: Set increased thresholds for docker image/volume garbage collection [puppet] - 10https://gerrit.wikimedia.org/r/888234 (https://phabricator.wikimedia.org/T329035) [15:55:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P44237 and previous config saved to /var/cache/conftool/dbconfig/20230210-155508-marostegui.json [15:55:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P44238 and previous config saved to /var/cache/conftool/dbconfig/20230210-155541-marostegui.json [15:56:04] (03PS4) 10Arturo Borrero Gonzalez: cloud: openstack: add kolla-ansible evaluation recipe [puppet] - 10https://gerrit.wikimedia.org/r/888222 (https://phabricator.wikimedia.org/T326758) [15:56:05] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on ml-staging[2001-2002].codfw.wmnet,ml-staging-ctrl[2001-2002].codfw.wmnet,ml-staging-etcd2003.codfw.wmnet with reason: Cluster half broken, in the middle of upgrading [15:56:11] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ml-staging[2001-2002].codfw.wmnet,ml-staging-ctrl[2001-2002].codfw.wmnet,ml-staging-etcd2003.codfw.wmnet with reason: Cluster half broken, in the 
middle of upgrading [15:58:05] (03CR) 10Ahmon Dancy: [C: 03+1] Set increased thresholds for docker image/volume garbage collection [puppet] - 10https://gerrit.wikimedia.org/r/888234 (https://phabricator.wikimedia.org/T329035) (owner: 10EoghanGaffney) [15:58:19] (03CR) 10Ahmon Dancy: [C: 03+1] Set increased thresholds for docker image/volume garbage collection (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/888234 (https://phabricator.wikimedia.org/T329035) (owner: 10EoghanGaffney) [15:59:42] (03PS8) 10DCausse: [WIP] rdf-streaming-updater: add a test job using the k8s operator... [deployment-charts] - 10https://gerrit.wikimedia.org/r/886005 (https://phabricator.wikimedia.org/T328675) [15:59:44] (03PS5) 10DCausse: experimental: add support for custom flink-app config files [deployment-charts] - 10https://gerrit.wikimedia.org/r/888231 [15:59:46] (03PS5) 10DCausse: rdf-streaming-updater: add a config file and use it [deployment-charts] - 10https://gerrit.wikimedia.org/r/888232 [16:02:38] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2447.mgmt.codfw.wmnet with reboot policy FORCED [16:02:54] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2446.mgmt.codfw.wmnet with reboot policy FORCED [16:03:34] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T327373 (10phaultfinder) [16:03:50] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2448.mgmt.codfw.wmnet with reboot policy FORCED [16:03:53] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2449.mgmt.codfw.wmnet with reboot policy FORCED [16:05:09] (03CR) 10EoghanGaffney: [C: 03+2] Set increased thresholds for docker image/volume garbage collection [puppet] - 10https://gerrit.wikimedia.org/r/888234 (https://phabricator.wikimedia.org/T329035) (owner: 10EoghanGaffney) [16:05:38] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: sync 
[16:06:35] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: sync [16:07:40] (03PS2) 10Jbond: sre.puppet.sync-netbox-hiera: Use netbox GraphQL endpoint to fetch data [cookbooks] - 10https://gerrit.wikimedia.org/r/888051 [16:08:42] (03CR) 10Herron: slo_dashboards: dynamic slo dashboard panels (031 comment) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/861947 (https://phabricator.wikimedia.org/T320749) (owner: 10Herron) [16:08:47] (03CR) 10FNegri: [V: 03+2 C: 03+2] Add support for cloud test env (codfw) (031 comment) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/887797 (owner: 10FNegri) [16:09:07] (03PS3) 10Jbond: sre.puppet.sync-netbox-hiera: Use netbox GraphQL endpoint to fetch data [cookbooks] - 10https://gerrit.wikimedia.org/r/888051 [16:09:15] (03PS5) 10Arturo Borrero Gonzalez: cloud: openstack: add kolla-ansible evaluation recipe [puppet] - 10https://gerrit.wikimedia.org/r/888222 (https://phabricator.wikimedia.org/T326758) [16:09:51] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2448.mgmt.codfw.wmnet with reboot policy FORCED [16:09:51] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2449.mgmt.codfw.wmnet with reboot policy FORCED [16:10:02] 10SRE, 10DNS, 10Traffic-Icebox, 10Wikimedia-Apache-configuration, 10Patch-For-Review: Remove aliases `minnan` and `zh-cfr` for the Min Nan Wikipedia - https://phabricator.wikimedia.org/T230382 (10BCornwall) Hi, @Sotiale, does @Ladsgroup's answer answer your question? Any progress on the discussion? 
[16:10:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P44239 and previous config saved to /var/cache/conftool/dbconfig/20230210-161014-marostegui.json [16:10:16] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2450.mgmt.codfw.wmnet with reboot policy FORCED [16:10:30] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2451.mgmt.codfw.wmnet with reboot policy FORCED [16:10:43] (03CR) 10CI reject: [V: 04-1] sre.puppet.sync-netbox-hiera: Use netbox GraphQL endpoint to fetch data [cookbooks] - 10https://gerrit.wikimedia.org/r/888051 (owner: 10Jbond) [16:10:47] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: sync [16:10:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P44240 and previous config saved to /var/cache/conftool/dbconfig/20230210-161047-marostegui.json [16:11:14] (03PS6) 10Arturo Borrero Gonzalez: cloud: openstack: add kolla-ansible evaluation recipe [puppet] - 10https://gerrit.wikimedia.org/r/888222 (https://phabricator.wikimedia.org/T326758) [16:11:42] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync [16:12:03] (03PS4) 10Jbond: sre.puppet.sync-netbox-hiera: Use netbox GraphQL endpoint to fetch data [cookbooks] - 10https://gerrit.wikimedia.org/r/888051 [16:13:58] (03CR) 10CI reject: [V: 04-1] sre.puppet.sync-netbox-hiera: Use netbox GraphQL endpoint to fetch data [cookbooks] - 10https://gerrit.wikimedia.org/r/888051 (owner: 10Jbond) [16:14:39] (03PS6) 10DCausse: rdf-streaming-updater: add a config file and use it [deployment-charts] - 10https://gerrit.wikimedia.org/r/888232 [16:14:42] (03PS1) 10Ssingh: dnsrecursor: update template to prepare for bullseye upgrade [puppet] - 10https://gerrit.wikimedia.org/r/888236 (https://phabricator.wikimedia.org/T321309) 
[16:15:42] (03PS5) 10Jbond: sre.puppet.sync-netbox-hiera: Use netbox GraphQL endpoint to fetch data [cookbooks] - 10https://gerrit.wikimedia.org/r/888051 [16:15:47] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39502/console" [puppet] - 10https://gerrit.wikimedia.org/r/888236 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [16:17:21] (03CR) 10CI reject: [V: 04-1] sre.puppet.sync-netbox-hiera: Use netbox GraphQL endpoint to fetch data [cookbooks] - 10https://gerrit.wikimedia.org/r/888051 (owner: 10Jbond) [16:22:47] (03PS2) 10Ssingh: dnsrecursor: update template to prepare for bullseye upgrade [puppet] - 10https://gerrit.wikimedia.org/r/888236 (https://phabricator.wikimedia.org/T321309) [16:23:44] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39503/console" [puppet] - 10https://gerrit.wikimedia.org/r/888236 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [16:24:42] (03CR) 10Ssingh: [V: 03+1] "The diff for cloudservices is expected as it is running pdns-rec 4.6. The config option change is just the name and not what it does." 
[puppet] - 10https://gerrit.wikimedia.org/r/888236 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [16:25:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T328817)', diff saved to https://phabricator.wikimedia.org/P44241 and previous config saved to /var/cache/conftool/dbconfig/20230210-162520-marostegui.json [16:25:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1158.eqiad.wmnet with reason: Maintenance [16:25:26] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817 [16:25:36] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1158.eqiad.wmnet with reason: Maintenance [16:25:38] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [16:25:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [16:25:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T329203)', diff saved to https://phabricator.wikimedia.org/P44242 and previous config saved to /var/cache/conftool/dbconfig/20230210-162553-marostegui.json [16:25:55] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2170.codfw.wmnet with reason: Maintenance [16:25:59] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203 [16:25:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T328817)', diff saved to https://phabricator.wikimedia.org/P44243 and previous config saved to /var/cache/conftool/dbconfig/20230210-162559-marostegui.json [16:26:09] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 12:00:00 on db2170.codfw.wmnet with reason: Maintenance
[16:26:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3312 (T329203)', diff saved to https://phabricator.wikimedia.org/P44244 and previous config saved to /var/cache/conftool/dbconfig/20230210-162615-marostegui.json
[16:26:37] (PS7) Herron: slo_dashboards: dynamic slo dashboard panels [grafana-grizzly] - https://gerrit.wikimedia.org/r/861947 (https://phabricator.wikimedia.org/T320749)
[16:28:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T328817)', diff saved to https://phabricator.wikimedia.org/P44245 and previous config saved to /var/cache/conftool/dbconfig/20230210-162809-marostegui.json
[16:28:44] (CR) Herron: [V: +2 C: +2] "thanks for the reviews!" [grafana-grizzly] - https://gerrit.wikimedia.org/r/861947 (https://phabricator.wikimedia.org/T320749) (owner: Herron)
[16:30:27] (PS7) Arturo Borrero Gonzalez: cloud: openstack: add kolla-ansible evaluation recipe [puppet] - https://gerrit.wikimedia.org/r/888222 (https://phabricator.wikimedia.org/T326758)
[16:32:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T329203)', diff saved to https://phabricator.wikimedia.org/P44246 and previous config saved to /var/cache/conftool/dbconfig/20230210-163238-marostegui.json
[16:32:43] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[16:35:44] SRE, Maps, Observability-Metrics, observability, and 2 others: SLO dashboards with N latency targets - https://phabricator.wikimedia.org/T320749 (herron) Open→Resolved The updated dynamic SLO dashboard template and config structure is now live. I think we're good here! If any followup i...
[16:43:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P44247 and previous config saved to /var/cache/conftool/dbconfig/20230210-164316-marostegui.json
[16:46:55] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-xcollazo-singleuser-conda-analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:47:08] (CR) Jbond: sre.puppet.sync-netbox-hiera: Use netbox GraphQL endpoint to fetch data (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/888051 (owner: Jbond)
[16:47:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P44248 and previous config saved to /var/cache/conftool/dbconfig/20230210-164744-marostegui.json
[16:48:34] !log reprepro -C main include bullseye-wikimedia gdnsd_3.8.0-1~wmf2_amd64.changes: T321309
[16:48:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:48:37] T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309
[16:48:47] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mw2451.mgmt.codfw.wmnet with reboot policy FORCED
[16:48:50] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mw2450.mgmt.codfw.wmnet with reboot policy FORCED
[16:54:03] SRE, ops-codfw, DC-Ops, serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (Papaul)
[16:56:23] (CR) Arturo Borrero Gonzalez: [C: +2] cloud: openstack: add kolla-ansible evaluation recipe [puppet] - https://gerrit.wikimedia.org/r/888222 (https://phabricator.wikimedia.org/T326758) (owner: Arturo Borrero Gonzalez)
[16:57:10] SRE-swift-storage: >=27k objects listed in swift containers but not extant -
https://phabricator.wikimedia.org/T327253 (MatthewVernon) ...and it also produces a tombstone file: ` mvernon@ms-be1061:~$ sudo swift-get-nodes /etc/swift/object.ring.gz AUTH_mw wikipedia-commons-local-public.ad a/ad/0_73aa1_2d9bafe...
[16:58:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P44252 and previous config saved to /var/cache/conftool/dbconfig/20230210-165822-marostegui.json
[17:02:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P44253 and previous config saved to /var/cache/conftool/dbconfig/20230210-170250-marostegui.json
[17:05:36] (PS6) Jbond: sre.puppet.sync-netbox-hiera: Use netbox GraphQL endpoint to fetch data [cookbooks] - https://gerrit.wikimedia.org/r/888051
[17:07:09] (PS7) Jbond: sre.puppet.sync-netbox-hiera: Use netbox GraphQL endpoint to fetch data [cookbooks] - https://gerrit.wikimedia.org/r/888051
[17:09:24] (PS8) Jbond: sre.puppet.sync-netbox-hiera: Use netbox GraphQL endpoint to fetch data [cookbooks] - https://gerrit.wikimedia.org/r/888051
[17:09:58] (PS9) Jbond: sre.puppet.sync-netbox-hiera: Use netbox GraphQL endpoint to fetch data [cookbooks] - https://gerrit.wikimedia.org/r/888051
[17:13:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T328817)', diff saved to https://phabricator.wikimedia.org/P44254 and previous config saved to /var/cache/conftool/dbconfig/20230210-171328-marostegui.json
[17:13:30] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[17:13:33] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[17:13:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1170.eqiad.wmnet with reason:
Maintenance
[17:13:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T328817)', diff saved to https://phabricator.wikimedia.org/P44255 and previous config saved to /var/cache/conftool/dbconfig/20230210-171349-marostegui.json
[17:17:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T329203)', diff saved to https://phabricator.wikimedia.org/P44256 and previous config saved to /var/cache/conftool/dbconfig/20230210-171757-marostegui.json
[17:17:59] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2175.codfw.wmnet with reason: Maintenance
[17:18:01] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[17:18:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2175.codfw.wmnet with reason: Maintenance
[17:18:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2175 (T329203)', diff saved to https://phabricator.wikimedia.org/P44257 and previous config saved to /var/cache/conftool/dbconfig/20230210-171818-marostegui.json
[17:19:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T328817)', diff saved to https://phabricator.wikimedia.org/P44258 and previous config saved to /var/cache/conftool/dbconfig/20230210-171943-marostegui.json
[17:19:47] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[17:20:45] (CR) Volans: [C: +1] "LGTM, in general this script could benefit some modernization for more modern python but out of scope of this fix."
[puppet] - https://gerrit.wikimedia.org/r/888212 (https://phabricator.wikimedia.org/T325046) (owner: Muehlenhoff)
[17:22:20] (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/888229 (https://phabricator.wikimedia.org/T324744) (owner: Slyngshede)
[17:22:52] (PS10) Jbond: sre.puppet.sync-netbox-hiera: Use netbox GraphQL endpoint to fetch data [cookbooks] - https://gerrit.wikimedia.org/r/888051
[17:23:07] (CR) Jbond: "ready for review" [cookbooks] - https://gerrit.wikimedia.org/r/888051 (owner: Jbond)
[17:24:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T329203)', diff saved to https://phabricator.wikimedia.org/P44259 and previous config saved to /var/cache/conftool/dbconfig/20230210-172434-marostegui.json
[17:24:39] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[17:27:09] (PS1) Slyngshede: sre:ganeti:reimage switch tty [cookbooks] - https://gerrit.wikimedia.org/r/888241 (https://phabricator.wikimedia.org/T306661)
[17:28:02] (PS1) Andrew Bogott: Openstack database config: enable 'use_db_reconnect' [puppet] - https://gerrit.wikimedia.org/r/888242 (https://phabricator.wikimedia.org/T328155)
[17:28:36] (CR) Andrew Bogott: [C: +2] Openstack database config: enable 'use_db_reconnect' [puppet] - https://gerrit.wikimedia.org/r/888242 (https://phabricator.wikimedia.org/T328155) (owner: Andrew Bogott)
[17:30:01] (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/888241 (https://phabricator.wikimedia.org/T306661) (owner: Slyngshede)
[17:33:27] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ssingh)
[17:34:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P44260 and previous config
saved to /var/cache/conftool/dbconfig/20230210-173450-marostegui.json
[17:39:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P44261 and previous config saved to /var/cache/conftool/dbconfig/20230210-173941-marostegui.json
[17:47:05] (CR) Elukey: [C: +1] sre:ganeti:reimage switch tty [cookbooks] - https://gerrit.wikimedia.org/r/888241 (https://phabricator.wikimedia.org/T306661) (owner: Slyngshede)
[17:49:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P44262 and previous config saved to /var/cache/conftool/dbconfig/20230210-174956-marostegui.json
[17:52:21] (PS1) Majavah: kubeadm: provision .kube/config in root home directory [puppet] - https://gerrit.wikimedia.org/r/888245 (https://phabricator.wikimedia.org/T329376)
[17:52:38] (PS31) Btullis: Add a spark-operator chart and helmfile configuration [deployment-charts] - https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926)
[17:54:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P44263 and previous config saved to /var/cache/conftool/dbconfig/20230210-175447-marostegui.json
[17:56:17] (CR) Btullis: Add a spark-operator chart and helmfile configuration (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: Btullis)
[17:56:51] (CR) CI reject: [V: -1] Add a spark-operator chart and helmfile configuration [deployment-charts] - https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: Btullis)
[18:03:08] (PS32) Btullis: Add a spark-operator chart and helmfile configuration [deployment-charts] - https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926)
[18:05:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T328817)', diff saved to https://phabricator.wikimedia.org/P44264 and previous config saved to /var/cache/conftool/dbconfig/20230210-180502-marostegui.json
[18:05:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[18:05:12] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[18:05:18] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[18:06:35] (CR) CI reject: [V: -1] Add a spark-operator chart and helmfile configuration [deployment-charts] - https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: Btullis)
[18:07:38] (PS33) Btullis: Add a spark-operator chart and helmfile configuration [deployment-charts] - https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926)
[18:09:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[18:09:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[18:09:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T328817)', diff saved to https://phabricator.wikimedia.org/P44265 and previous config saved to /var/cache/conftool/dbconfig/20230210-180921-marostegui.json
[18:09:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T329203)', diff saved to https://phabricator.wikimedia.org/P44266 and previous config saved to /var/cache/conftool/dbconfig/20230210-180953-marostegui.json
[18:09:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[18:09:58]
T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[18:10:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[18:11:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T328817)', diff saved to https://phabricator.wikimedia.org/P44267 and previous config saved to /var/cache/conftool/dbconfig/20230210-181132-marostegui.json
[18:11:36] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[18:14:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[18:14:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[18:14:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T329203)', diff saved to https://phabricator.wikimedia.org/P44268 and previous config saved to /var/cache/conftool/dbconfig/20230210-181456-marostegui.json
[18:15:01] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[18:15:22] SRE, ops-eqiad, decommission-hardware: decommission db1098.eqiad.wmnet - https://phabricator.wikimedia.org/T329171 (wiki_willy) a:Jclark-ctr
[18:18:19] (CR) Btullis: Add a spark-operator chart and helmfile configuration (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: Btullis)
[18:21:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T329203)', diff saved to https://phabricator.wikimedia.org/P44269 and previous config saved to /var/cache/conftool/dbconfig/20230210-182131-marostegui.json
[18:21:36]
T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[18:26:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P44270 and previous config saved to /var/cache/conftool/dbconfig/20230210-182638-marostegui.json
[18:32:39] ops-codfw, DBA: db2181 crashed - https://phabricator.wikimedia.org/T328623 (Papaul) I create the request below to request for a new main board ` Create Dispatch: Success You have successfully submitted request SR162073970.
[18:36:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P44271 and previous config saved to /var/cache/conftool/dbconfig/20230210-183638-marostegui.json
[18:41:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P44272 and previous config saved to /var/cache/conftool/dbconfig/20230210-184144-marostegui.json
[18:47:49] (CR) Chad: [C: +2] REST: Don't consider prevented edits unexpected [extensions/Wikibase] (wmf/1.40.0-wmf.22) - https://gerrit.wikimedia.org/r/887863 (https://phabricator.wikimedia.org/T329233) (owner: Zabe)
[18:51:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P44273 and previous config saved to /var/cache/conftool/dbconfig/20230210-185144-marostegui.json
[18:56:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T328817)', diff saved to https://phabricator.wikimedia.org/P44274 and previous config saved to /var/cache/conftool/dbconfig/20230210-185651-marostegui.json
[18:56:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[18:56:55] T328817:
Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[18:57:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[18:57:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1181 (T328817)', diff saved to https://phabricator.wikimedia.org/P44275 and previous config saved to /var/cache/conftool/dbconfig/20230210-185712-marostegui.json
[18:59:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T328817)', diff saved to https://phabricator.wikimedia.org/P44276 and previous config saved to /var/cache/conftool/dbconfig/20230210-185923-marostegui.json
[19:03:38] (Merged) jenkins-bot: REST: Don't consider prevented edits unexpected [extensions/Wikibase] (wmf/1.40.0-wmf.22) - https://gerrit.wikimedia.org/r/887863 (https://phabricator.wikimedia.org/T329233) (owner: Zabe)
[19:06:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T329203)', diff saved to https://phabricator.wikimedia.org/P44277 and previous config saved to /var/cache/conftool/dbconfig/20230210-190650-marostegui.json
[19:06:52] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1122.eqiad.wmnet with reason: Maintenance
[19:06:55] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[19:07:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1122.eqiad.wmnet with reason: Maintenance
[19:07:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1122 (T329203)', diff saved to https://phabricator.wikimedia.org/P44278 and previous config saved to /var/cache/conftool/dbconfig/20230210-190711-marostegui.json
[19:13:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122
(T329203)', diff saved to https://phabricator.wikimedia.org/P44279 and previous config saved to /var/cache/conftool/dbconfig/20230210-191322-marostegui.json
[19:13:28] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[19:14:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P44280 and previous config saved to /var/cache/conftool/dbconfig/20230210-191429-marostegui.json
[19:15:59] !log demon@deploy1002 Started scap: Updating wikibase to fix T329233
[19:16:02] T329233: Wikibase\Repo\RestApi\Domain\Services\ItemUpdateFailed: +----------+---------------------------+--------------------------------------+| error | actionthrottledtex - https://phabricator.wikimedia.org/T329233
[19:16:35] (CR) Ottomata: "Nice, this is kinda what I was thinking too! I also like how this doesn't really affect the way the FlinkDeployment is declared.
This ju" [deployment-charts] - https://gerrit.wikimedia.org/r/888231 (owner: DCausse)
[19:17:01] (CR) Dzahn: [C: +1] gitlab: use /srv/gitlab-backup in WMCS [puppet] - https://gerrit.wikimedia.org/r/888193 (https://phabricator.wikimedia.org/T318521) (owner: Jelto)
[19:23:48] !log demon@deploy1002 Finished scap: Updating wikibase to fix T329233 (duration: 07m 49s)
[19:23:52] T329233: Wikibase\Repo\RestApi\Domain\Services\ItemUpdateFailed: +----------+---------------------------+--------------------------------------+| error | actionthrottledtex - https://phabricator.wikimedia.org/T329233
[19:25:22] (PS1) TrainBranchBot: group1 wikis to 1.40.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/888247 (https://phabricator.wikimedia.org/T325585)
[19:25:25] (CR) TrainBranchBot: [C: +2] group1 wikis to 1.40.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/888247 (https://phabricator.wikimedia.org/T325585) (owner: TrainBranchBot)
[19:26:09] (Merged) jenkins-bot: group1 wikis to 1.40.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/888247 (https://phabricator.wikimedia.org/T325585) (owner: TrainBranchBot)
[19:28:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P44281 and previous config saved to /var/cache/conftool/dbconfig/20230210-192828-marostegui.json
[19:29:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P44282 and previous config saved to /var/cache/conftool/dbconfig/20230210-192935-marostegui.json
[19:30:39] PROBLEM - Persistent high iowait on clouddumps1001 is CRITICAL: 13 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005
[19:32:54] (CR) Dzahn: [C: +2] logspam.pl: Filter out some persistent noise [puppet] -
https://gerrit.wikimedia.org/r/888050 (https://phabricator.wikimedia.org/T323254) (owner: Ahmon Dancy)
[19:33:15] !log demon@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.40.0-wmf.22 refs T325585
[19:33:19] T325585: 1.40.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T325585
[19:34:15] PROBLEM - Persistent high iowait on clouddumps1001 is CRITICAL: 13.6 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005
[19:39:37] RECOVERY - Persistent high iowait on clouddumps1001 is OK: (C)10 ge (W)5 ge 2.633 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005
[19:39:50] !log demon@deploy1002 Synchronized php: group1 wikis to 1.40.0-wmf.22 refs T325585 (duration: 06m 34s)
[19:39:54] T325585: 1.40.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T325585
[19:43:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P44283 and previous config saved to /var/cache/conftool/dbconfig/20230210-194335-marostegui.json
[19:44:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T328817)', diff saved to https://phabricator.wikimedia.org/P44284 and previous config saved to /var/cache/conftool/dbconfig/20230210-194443-marostegui.json
[19:44:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1191.eqiad.wmnet with reason: Maintenance
[19:44:47] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[19:44:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1191.eqiad.wmnet with reason: Maintenance
[19:45:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1191 (T328817)', diff saved to
https://phabricator.wikimedia.org/P44285 and previous config saved to /var/cache/conftool/dbconfig/20230210-194504-marostegui.json
[19:46:23] SRE, LDAP-Access-Requests, Patch-For-Review: Add Jon Amar WMDE to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T329324 (KFrancis) @fgiunchedi I have set up and sent out the NDA for signatures. I'll confirm when it's complete. Thanks!
[19:47:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T328817)', diff saved to https://phabricator.wikimedia.org/P44286 and previous config saved to /var/cache/conftool/dbconfig/20230210-194715-marostegui.json
[19:52:35] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: Idle - HE, AS6939/IPv6: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:58:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T329203)', diff saved to https://phabricator.wikimedia.org/P44287 and previous config saved to /var/cache/conftool/dbconfig/20230210-195841-marostegui.json
[19:58:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1129.eqiad.wmnet with reason: Maintenance
[19:58:46] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[19:58:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1129.eqiad.wmnet with reason: Maintenance
[19:59:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T329203)', diff saved to https://phabricator.wikimedia.org/P44288 and previous config saved to /var/cache/conftool/dbconfig/20230210-195902-marostegui.json
[19:59:16] (PS1) Dzahn: phabricator: set phd_service_ensure to 'masked' [puppet] - https://gerrit.wikimedia.org/r/888248 (https://phabricator.wikimedia.org/T329285)
[20:01:13] (CR) CI reject: [V: -1] phabricator:
set phd_service_ensure to 'masked' [puppet] - https://gerrit.wikimedia.org/r/888248 (https://phabricator.wikimedia.org/T329285) (owner: Dzahn)
[20:01:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T329203)', diff saved to https://phabricator.wikimedia.org/P44289 and previous config saved to /var/cache/conftool/dbconfig/20230210-200118-marostegui.json
[20:02:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P44290 and previous config saved to /var/cache/conftool/dbconfig/20230210-200221-marostegui.json
[20:16:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P44291 and previous config saved to /var/cache/conftool/dbconfig/20230210-201625-marostegui.json
[20:17:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P44292 and previous config saved to /var/cache/conftool/dbconfig/20230210-201728-marostegui.json
[20:18:48] (PS2) Dzahn: phabricator: set phd_service_ensure to 'masked' [puppet] - https://gerrit.wikimedia.org/r/888248 (https://phabricator.wikimedia.org/T329285)
[20:26:44] (PS1) Dzahn: wmflib: add data type Ensure::Service that allows 'masked' [puppet] - https://gerrit.wikimedia.org/r/888251
[20:27:04] (PS2) Dzahn: wmflib: add data type Ensure::Service that allows 'masked' [puppet] - https://gerrit.wikimedia.org/r/888251 (https://phabricator.wikimedia.org/T329285)
[20:29:33] (CR) CI reject: [V: -1] wmflib: add data type Ensure::Service that allows 'masked' [puppet] - https://gerrit.wikimedia.org/r/888251 (https://phabricator.wikimedia.org/T329285) (owner: Dzahn)
[20:29:43] SRE, LDAP-Access-Requests: Grant Access to 'cn=nda or cn=wmf' for ekalkst - https://phabricator.wikimedia.org/T328145 (Aklapper) Open→Declined
Unfortunately declining this Phabricator task as no further information has been provided. @Ekalkst: After you have provided the information asked for an...
[20:31:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P44293 and previous config saved to /var/cache/conftool/dbconfig/20230210-203131-marostegui.json
[20:32:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T328817)', diff saved to https://phabricator.wikimedia.org/P44294 and previous config saved to /var/cache/conftool/dbconfig/20230210-203234-marostegui.json
[20:32:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1194.eqiad.wmnet with reason: Maintenance
[20:32:38] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[20:32:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1194.eqiad.wmnet with reason: Maintenance
[20:32:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1194 (T328817)', diff saved to https://phabricator.wikimedia.org/P44295 and previous config saved to /var/cache/conftool/dbconfig/20230210-203255-marostegui.json
[20:33:34] (PS3) Dzahn: wmflib: add data type Ensure::Service that allows 'masked' [puppet] - https://gerrit.wikimedia.org/r/888251 (https://phabricator.wikimedia.org/T329285)
[20:35:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T328817)', diff saved to https://phabricator.wikimedia.org/P44296 and previous config saved to /var/cache/conftool/dbconfig/20230210-203506-marostegui.json
[20:46:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T329203)', diff saved to https://phabricator.wikimedia.org/P44297 and previous config saved to /var/cache/conftool/dbconfig/20230210-204638-marostegui.json
[20:46:39]
!log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[20:46:42] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[20:46:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[20:50:02] (PS3) Dzahn: phabricator: set phd_service_ensure to 'masked' [puppet] - https://gerrit.wikimedia.org/r/888248 (https://phabricator.wikimedia.org/T329285)
[20:50:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P44298 and previous config saved to /var/cache/conftool/dbconfig/20230210-205012-marostegui.json
[20:50:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[20:50:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[20:50:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T329203)', diff saved to https://phabricator.wikimedia.org/P44299 and previous config saved to /var/cache/conftool/dbconfig/20230210-205059-marostegui.json
[20:57:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T329203)', diff saved to https://phabricator.wikimedia.org/P44300 and previous config saved to /var/cache/conftool/dbconfig/20230210-205722-marostegui.json
[20:57:26] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[21:02:19] PROBLEM - Disk space on thanos-be2001 is CRITICAL: DISK CRITICAL - free space: / 1831 MB (3% inode=97%): /srv/swift-storage/sda3 10543 MB (5% inode=99%): /tmp 1831 MB (3% inode=97%): /var/tmp 1831 MB (3% inode=97%):
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2001&var-datasource=codfw+prometheus/ops
[21:04:40] (PS4) Dzahn: phabricator: set phd_service_ensure to 'masked' [puppet] - https://gerrit.wikimedia.org/r/888248 (https://phabricator.wikimedia.org/T329285)
[21:05:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P44301 and previous config saved to /var/cache/conftool/dbconfig/20230210-210519-marostegui.json
[21:08:19] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 106, down: 1, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:11:30] (CR) Dzahn: "I could not reproduce the issue that phd is running. It had not been started by puppet for over a day. Regardless I am setting it to maske" [puppet] - https://gerrit.wikimedia.org/r/888248 (https://phabricator.wikimedia.org/T329285) (owner: Dzahn)
[21:11:39] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/output/888248/39507/" [puppet] - https://gerrit.wikimedia.org/r/888248 (https://phabricator.wikimedia.org/T329285) (owner: Dzahn)
[21:12:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P44302 and previous config saved to /var/cache/conftool/dbconfig/20230210-211228-marostegui.json
[21:19:34] (CR) Dzahn: [C: +2] "doesn't work either because further down the stack it's also not an allowed value. Invalid value "masked".
Valid values are stopped, runn" [puppet] - https://gerrit.wikimedia.org/r/888248 (https://phabricator.wikimedia.org/T329285) (owner: Dzahn)
[21:20:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T328817)', diff saved to https://phabricator.wikimedia.org/P44303 and previous config saved to /var/cache/conftool/dbconfig/20230210-212025-marostegui.json
[21:20:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[21:20:30] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[21:20:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[21:20:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1202 (T328817)', diff saved to https://phabricator.wikimedia.org/P44304 and previous config saved to /var/cache/conftool/dbconfig/20230210-212046-marostegui.json
[21:22:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T328817)', diff saved to https://phabricator.wikimedia.org/P44305 and previous config saved to /var/cache/conftool/dbconfig/20230210-212257-marostegui.json
[21:23:41] (CR) Dzahn: [C: +2] "We had this issue before, this is all familiar, we solved it for zuul.
This is the "enable" parameter of service, not the "ensure" paramet" [puppet] - https://gerrit.wikimedia.org/r/888248 (https://phabricator.wikimedia.org/T329285) (owner: Dzahn)
[21:27:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P44306 and previous config saved to /var/cache/conftool/dbconfig/20230210-212734-marostegui.json
[21:31:43] (PS1) Dzahn: phabricator: add phd_service_enable parameter [puppet] - https://gerrit.wikimedia.org/r/888263 (https://phabricator.wikimedia.org/T329285)
[21:32:04] (CR) Dzahn: [C: +2] "continued at https://gerrit.wikimedia.org/r/c/operations/puppet/+/888263/" [puppet] - https://gerrit.wikimedia.org/r/888248 (https://phabricator.wikimedia.org/T329285) (owner: Dzahn)
[21:34:01] (Abandoned) Dzahn: wmflib: add data type Ensure::Service that allows 'masked' [puppet] - https://gerrit.wikimedia.org/r/888251 (https://phabricator.wikimedia.org/T329285) (owner: Dzahn)
[21:37:53] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:38:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P44307 and previous config saved to /var/cache/conftool/dbconfig/20230210-213803-marostegui.json
[21:38:05] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:39:31] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49566 bytes in 0.112 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:39:43] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.298 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:41:19] SRE, Data-Persistence,
cloud-services-team, serviceops, and 3 others: Wikitech issues for datacentre switchover (March 2023) - https://phabricator.wikimedia.org/T328768 (nskaggs) Sounds like there's a plan in place here. Thank you! I did also want to add my support for {T237773} to avoid this typ...
[21:42:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T329203)', diff saved to https://phabricator.wikimedia.org/P44308 and previous config saved to /var/cache/conftool/dbconfig/20230210-214241-marostegui.json
[21:42:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[21:42:45] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[21:42:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[21:42:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[21:43:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[21:43:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T329203)', diff saved to https://phabricator.wikimedia.org/P44309 and previous config saved to /var/cache/conftool/dbconfig/20230210-214308-marostegui.json
[21:43:27] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/output/888263/39508/" [puppet] - https://gerrit.wikimedia.org/r/888263 (https://phabricator.wikimedia.org/T329285) (owner: Dzahn)
[21:49:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T329203)', diff saved to https://phabricator.wikimedia.org/P44310 and previous config saved to
/var/cache/conftool/dbconfig/20230210-214901-marostegui.json
[21:49:06] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[21:53:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P44311 and previous config saved to /var/cache/conftool/dbconfig/20230210-215310-marostegui.json
[21:53:30] (PS1) Dzahn: phabricator: pass through phd_service_enable parameter [puppet] - https://gerrit.wikimedia.org/r/888271 (https://phabricator.wikimedia.org/T329285)
[22:03:13] (CR) Dzahn: [C: +2] phabricator: pass through phd_service_enable parameter [puppet] - https://gerrit.wikimedia.org/r/888271 (https://phabricator.wikimedia.org/T329285) (owner: Dzahn)
[22:04:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P44312 and previous config saved to /var/cache/conftool/dbconfig/20230210-220408-marostegui.json
[22:04:44] (CR) Dzahn: [C: +2] "Notice: /Stage[main]/Phabricator/Systemd::Service[phd]/Service[phd]/enable: enable changed 'true' to 'false'" [puppet] - https://gerrit.wikimedia.org/r/888271 (https://phabricator.wikimedia.org/T329285) (owner: Dzahn)
[22:08:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T328817)', diff saved to https://phabricator.wikimedia.org/P44313 and previous config saved to /var/cache/conftool/dbconfig/20230210-220816-marostegui.json
[22:08:18] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[22:08:21] T328817: Drop cuc_user and cuc_user_text from cu_changes in wmf wikis - https://phabricator.wikimedia.org/T328817
[22:08:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1003.eqiad.wmnet with
reason: Maintenance
[22:16:23] PROBLEM - Check systemd state on phab2002 is CRITICAL: CRITICAL - degraded: The following units failed: phd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:17:48] (PS1) TrainBranchBot: group2 wikis to 1.40.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/888272 (https://phabricator.wikimedia.org/T325585)
[22:17:50] (CR) TrainBranchBot: [C: +2] group2 wikis to 1.40.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/888272 (https://phabricator.wikimedia.org/T325585) (owner: TrainBranchBot)
[22:18:04] phab2002: that's intended
[22:18:22] but monitoring should not exist unless service is activated
[22:18:25] (Merged) jenkins-bot: group2 wikis to 1.40.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/888272 (https://phabricator.wikimedia.org/T325585) (owner: TrainBranchBot)
[22:18:37] it's one of those "trivial" things that end up taking a whole day
[22:19:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P44314 and previous config saved to /var/cache/conftool/dbconfig/20230210-221914-marostegui.json
[22:19:35] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 0:30:00 on phab2002.codfw.wmnet with reason: debugging
[22:19:51] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab2002.codfw.wmnet with reason: debugging
[22:19:53] "Unable to verify all hosts got downtimed" ..
[22:25:44] !log demon@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.40.0-wmf.22 refs T325585
[22:25:48] T325585: 1.40.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T325585
[22:34:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T329203)', diff saved to https://phabricator.wikimedia.org/P44315 and previous config saved to /var/cache/conftool/dbconfig/20230210-223420-marostegui.json
[22:34:22] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[22:34:25] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[22:34:25] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[22:34:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T329203)', diff saved to https://phabricator.wikimedia.org/P44316 and previous config saved to /var/cache/conftool/dbconfig/20230210-223430-marostegui.json
[22:39:13] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:39:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T329203)', diff saved to https://phabricator.wikimedia.org/P44317 and previous config saved to /var/cache/conftool/dbconfig/20230210-223946-marostegui.json
[22:39:50] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[22:39:59] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:44:49] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 0:30:00 on phab2002.codfw.wmnet with reason: debugging
[22:44:53] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab2002.codfw.wmnet with reason: debugging
[22:54:19] RECOVERY - Check systemd state on phab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:54:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P44318 and previous config saved to /var/cache/conftool/dbconfig/20230210-225452-marostegui.json
[23:09:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P44319 and previous config saved to /var/cache/conftool/dbconfig/20230210-230958-marostegui.json
[23:20:15] (PS1) Dzahn: phabricator: stop/disable/mask phd based on phabricator_server setting [puppet] - https://gerrit.wikimedia.org/r/888274 (https://phabricator.wikimedia.org/T329285)
[23:25:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T329203)', diff saved to https://phabricator.wikimedia.org/P44320 and previous config saved to /var/cache/conftool/dbconfig/20230210-232505-marostegui.json
[23:25:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1182.eqiad.wmnet with reason: Maintenance
[23:25:15] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[23:25:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1182.eqiad.wmnet with reason: Maintenance
[23:25:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T329203)',
diff saved to https://phabricator.wikimedia.org/P44321 and previous config saved to /var/cache/conftool/dbconfig/20230210-232526-marostegui.json
[23:29:47] (PS1) JHathaway: CI runner: skip helm library charts [deployment-charts] - https://gerrit.wikimedia.org/r/888275
[23:29:49] (PS1) JHathaway: Add jaeger chart [deployment-charts] - https://gerrit.wikimedia.org/r/888276
[23:30:09] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2450.mgmt.codfw.wmnet with reboot policy FORCED
[23:31:05] (CR) JHathaway: "kindly review" [deployment-charts] - https://gerrit.wikimedia.org/r/888275 (owner: JHathaway)
[23:31:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T329203)', diff saved to https://phabricator.wikimedia.org/P44322 and previous config saved to /var/cache/conftool/dbconfig/20230210-233118-marostegui.json
[23:31:22] T329203: Add new column cuc_only_for_read_old to cu_changes for migration purposes to wmf wikis - https://phabricator.wikimedia.org/T329203
[23:33:45] (CR) JHathaway: "First stab at adding the jaeger chart. This is definitely missing some pieces, but have had a tough time coming up to speed on how to inte" [deployment-charts] - https://gerrit.wikimedia.org/r/888276 (owner: JHathaway)
[23:34:58] (CR) CI reject: [V: -1] Add jaeger chart [deployment-charts] - https://gerrit.wikimedia.org/r/888276 (owner: JHathaway)
[23:42:48] SRE, DNS, Traffic-Icebox, Wikimedia-Apache-configuration, Patch-For-Review: Remove aliases `minnan` and `zh-cfr` for the Min Nan Wikipedia - https://phabricator.wikimedia.org/T230382 (Sotiale) >>! In T230382#8605191, @BCornwall wrote: > Hi, @Sotiale, does @Ladsgroup's answer answer your quest...
[23:46:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P44323 and previous config saved to /var/cache/conftool/dbconfig/20230210-234624-marostegui.json
[23:52:18] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2450.mgmt.codfw.wmnet with reboot policy FORCED
[23:52:50] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mw2451.mgmt.codfw.wmnet with reboot policy FORCED