[00:06:46] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt2006-dev.codfw.wmnet with reason: host reimage
[00:08:34] ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (phaultfinder)
[00:09:55] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt2006-dev.codfw.wmnet with reason: host reimage
[00:11:12] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt2004-dev.codfw.wmnet with reason: host reimage
[00:14:22] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt2004-dev.codfw.wmnet with reason: host reimage
[00:17:25] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt2005-dev.codfw.wmnet with reason: host reimage
[00:20:36] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt2005-dev.codfw.wmnet with reason: host reimage
[00:25:03] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:29:31] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:29:32] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt2006-dev.codfw.wmnet with OS bullseye
[00:29:34] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:29:39] SRE, ops-codfw, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt200[4-6]-dev - https://phabricator.wikimedia.org/T342459 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirt2006-dev.codfw.wmnet with OS bullseye completed:...
[00:30:28] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:30:29] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt2004-dev.codfw.wmnet with OS bullseye
[00:30:37] SRE, ops-codfw, DC-Ops, User-aborrero, cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol200[6-8]-dev, cloudnet200[7-8]-dev - https://phabricator.wikimedia.org/T342456 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirt2004-...
[00:36:49] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[00:37:48] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:37:58] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[00:38:14] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[00:38:34] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/944341
[00:38:40] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/944341 (owner: TrainBranchBot)
[00:40:11] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: setup switch interfaces and DNS for pc201[5-6] - pt1979@cumin2002"
[00:42:22] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: setup switch interfaces and DNS for pc201[5-6] - pt1979@cumin2002"
[00:42:22] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[00:45:11] SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install pc201[56] - https://phabricator.wikimedia.org/T342163 (Papaul)
[00:52:22] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host pc2015.mgmt.codfw.wmnet with reboot policy FORCED
[00:54:02] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/944341 (owner: TrainBranchBot)
[00:57:10] SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install pc201[56] - https://phabricator.wikimedia.org/T342163 (Papaul)
[00:58:37] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
[00:58:38] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt2005-dev.codfw.wmnet with OS bullseye
[00:58:45] SRE, ops-codfw, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt200[4-6]-dev - https://phabricator.wikimedia.org/T342459 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudvirt2005-dev.codfw.wmnet with OS bullseye completed:...
[00:59:57] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[01:00:18] SRE, ops-codfw, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt200[4-6]-dev - https://phabricator.wikimedia.org/T342459 (Papaul)
[01:00:59] SRE, ops-codfw, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt200[4-6]-dev - https://phabricator.wikimedia.org/T342459 (Papaul) Open→Resolved @Andrew All your's
[01:02:07] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: setup switch port and DNS for pc2016 - pt1979@cumin2002"
[01:02:53] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: setup switch port and DNS for pc2016 - pt1979@cumin2002"
[01:02:53] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[01:04:18] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host pc2016.mgmt.codfw.wmnet with reboot policy FORCED
[01:14:18] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host pc2016.mgmt.codfw.wmnet with reboot policy FORCED
[01:14:20] SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install pc201[56] - https://phabricator.wikimedia.org/T342163 (Papaul) @Jhancock.wm I can not run the provision cookbook on pc2016. I checked the serial number, it is correct. when you are back on site, can you please check the mgmt cable?...
[01:20:33] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host pc2015.mgmt.codfw.wmnet with reboot policy FORCED
[01:24:58] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['pc2015']
[01:27:17] PROBLEM - cinder-scheduler process on cloudcontrol1007 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python.* /usr/bin/cinder-scheduler https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[01:27:39] PROBLEM - cinder-api http on cloudcontrol1007 is CRITICAL: connect to address 208.80.155.104 and port 18776: Connection refused https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[01:27:55] PROBLEM - cinder-volume process on cloudcontrol1007 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python.* /usr/bin/cinder-volume https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[01:28:41] PROBLEM - cinder-volume process on cloudcontrol1006 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python.* /usr/bin/cinder-volume https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[01:29:17] PROBLEM - cinder-scheduler process on cloudcontrol1005 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python.* /usr/bin/cinder-scheduler https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[01:29:27] PROBLEM - cinder-api http on cloudcontrol1005 is CRITICAL: connect to address 10.64.151.3 and port 18776: Connection refused https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[01:36:05] RECOVERY - cinder-volume process on cloudcontrol1006 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python.* /usr/bin/cinder-volume https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[01:37:33] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['pc2015']
[01:45:00] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[01:48:22] SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install pc201[56] - https://phabricator.wikimedia.org/T342163 (Papaul)
[02:06:35] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:11:36] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:14:11] (CR) Cwhite: [C: +2] Revert "logstash remove wikifunctions response field" [puppet] - https://gerrit.wikimedia.org/r/944194 (https://phabricator.wikimedia.org/T343176) (owner: Cwhite)
[02:18:43] PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:19:25] PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:30:11] RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:30:53] RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:31:36] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:48:38] ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (phaultfinder)
[03:55:03] RECOVERY - cinder-api http on cloudcontrol1005 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 663 bytes in 0.015 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[03:55:07] RECOVERY - cinder-scheduler process on cloudcontrol1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python.* /usr/bin/cinder-scheduler https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[03:59:09] RECOVERY - cinder-scheduler process on cloudcontrol1007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python.* /usr/bin/cinder-scheduler https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[03:59:23] RECOVERY - cinder-api http on cloudcontrol1007 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 666 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[04:27:57] Evening. Enwiki may need to enable emergency CAPTCHA in the next hour. Forgetting who I'm supposed to ping for that
[04:28:36] glancing at topic, guess I'll ping godog ?
[04:44:49] (CR) Giuseppe Lavagetto: [V: +1 C: +2] noc: unify methods to fetch the current wiki versions [mediawiki-config] - https://gerrit.wikimedia.org/r/942671 (https://phabricator.wikimedia.org/T341859) (owner: Giuseppe Lavagetto)
[04:45:21] <_joe_> Tamzin: no I doubt at this time of the day you can get anyone to respond this fast
[04:45:29] (Merged) jenkins-bot: noc: unify methods to fetch the current wiki versions [mediawiki-config] - https://gerrit.wikimedia.org/r/942671 (https://phabricator.wikimedia.org/T341859) (owner: Giuseppe Lavagetto)
[04:46:06] <_joe_> Tamzin: you can ask me though - although I've no idea how to enable the emergency CAPTCHA, and I would beed a task to do so
[04:46:10] <_joe_> *need
[04:46:23] I can look up tasks that have done it before
[04:46:33] just, juggling with AbuseFilter Whac-a-Mole atm
[04:46:45] <_joe_> yeah let me know if I can help
[04:49:07] <_joe_> !log running scap pull on mwmaint1002 to pick up the noc.w.o changes
[04:49:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:51:01] _joe_: https://phabricator.wikimedia.org/T343294
[04:51:51] <_joe_> Tamzin: ack, on it, it will take me a few mins to make sure I get the patch right
[04:52:07] no worries, thanks so much :)
[04:58:07] (PS1) Giuseppe Lavagetto: enwiki: temp enable emergencyCaptcha [mediawiki-config] - https://gerrit.wikimedia.org/r/944361 (https://phabricator.wikimedia.org/T343294)
[05:01:59] (CR) Giuseppe Lavagetto: [C: +2] "self-merging given the emergency." [mediawiki-config] - https://gerrit.wikimedia.org/r/944361 (https://phabricator.wikimedia.org/T343294) (owner: Giuseppe Lavagetto)
[05:02:57] (Merged) jenkins-bot: enwiki: temp enable emergencyCaptcha [mediawiki-config] - https://gerrit.wikimedia.org/r/944361 (https://phabricator.wikimedia.org/T343294) (owner: Giuseppe Lavagetto)
[05:02:59] (CR) Marostegui: [C: +1] enwiki: temp enable emergencyCaptcha [mediawiki-config] - https://gerrit.wikimedia.org/r/944361 (https://phabricator.wikimedia.org/T343294) (owner: Giuseppe Lavagetto)
[05:03:11] <_joe_> thanks marostegui <3
[05:05:16] <_joe_> Tamzin: deploying now, will take ~ 5-10 minutes to go into effect
[05:06:23] thanks _joe_ !
[05:06:57] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:07:05] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 142, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:11:14] !log oblivian@deploy1002 Synchronized wmf-config/InitialiseSettings.php: enabling emergency captcha on enwiki - T343294 (duration: 06m 36s)
[05:12:26] <_joe_> Tamzin: ^^ captchas on edit/create should be active now
[05:12:35] yay!
[05:16:26] (PS8) Giuseppe Lavagetto: noc: don't use on-disk files but etcd directly [mediawiki-config] - https://gerrit.wikimedia.org/r/942672 (https://phabricator.wikimedia.org/T341859)
[05:18:34] (CR) Giuseppe Lavagetto: noc: don't use on-disk files but etcd directly (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/942672 (https://phabricator.wikimedia.org/T341859) (owner: Giuseppe Lavagetto)
[05:19:58] (CR) Giuseppe Lavagetto: [V: +1] noc: remove symlinks and also neutralize createTxtFileSymlinks (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/942675 (https://phabricator.wikimedia.org/T341859) (owner: Giuseppe Lavagetto)
[05:20:05] (CR) Giuseppe Lavagetto: [C: +2] noc: don't use on-disk files but etcd directly [mediawiki-config] - https://gerrit.wikimedia.org/r/942672 (https://phabricator.wikimedia.org/T341859) (owner: Giuseppe Lavagetto)
[05:20:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es2025 T343254', diff saved to https://phabricator.wikimedia.org/P49943 and previous config saved to /var/cache/conftool/dbconfig/20230802-052021-root.json
[05:20:25] T343254: codfw: es2025 lost System Board Fan6 - https://phabricator.wikimedia.org/T343254
[05:20:44] (Merged) jenkins-bot: noc: don't use on-disk files but etcd directly [mediawiki-config] - https://gerrit.wikimedia.org/r/942672 (https://phabricator.wikimedia.org/T341859) (owner: Giuseppe Lavagetto)
[05:21:44] (PS1) Marostegui: es2025: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/944362 (https://phabricator.wikimedia.org/T343254)
[05:22:47] (CR) Marostegui: [C: +2] es2025: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/944362 (https://phabricator.wikimedia.org/T343254) (owner: Marostegui)
[05:23:18] !log Stop mariadb on es2025 for onsite maintenance dbmaint codfw T343254
[05:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:24:54] SRE, ops-codfw, DBA, Patch-For-Review: codfw: es2025 lost System Board Fan6 - https://phabricator.wikimedia.org/T343254 (Marostegui) Server depooled and turned off. Proceed as needed.
[05:25:31] <_joe_> uhm I did something wrong it seems
[05:25:46] (PS1) Marostegui: Revert "db1130: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/944203
[05:27:18] <_joe_> Tamzin: sorry, I should always caffeinate before I deploy code :/
[05:27:30] <_joe_> I am now actually deploying the config change
[05:27:51] oh no, what went wrong?
[05:28:09] <_joe_> I forgot to pull the code on the deployment server
[05:28:16] XD
[05:28:25] <_joe_> yeah that's the dumbest thing ever
[05:28:30] (CR) Marostegui: [C: +2] Revert "db1130: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/944203 (owner: Marostegui)
[05:28:37] <_joe_> marostegui: revoke my root rights please
[05:28:49] <_joe_> (jokes aside, caffeine did its trick)
[05:28:56] it was in effect originally though, right? because the stuff stopped right when you said it would
[05:29:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 1%: Repooling after replacing its memory', diff saved to https://phabricator.wikimedia.org/P49944 and previous config saved to /var/cache/conftool/dbconfig/20230802-052906-root.json
[05:29:38] SRE, ops-eqiad, DBA: db1130 crash memory errors - https://phabricator.wikimedia.org/T343076 (Marostegui) Host being repooled
[05:30:42] <_joe_> Tamzin: uh interesting.
[05:31:20] I guess, y'all would be able to see if there was a big spike in failed CAPTCHAs around then, right?
[05:31:47] <_joe_> yeah, probably in the logs
[05:33:12] !log oblivian@deploy1002 Synchronized wmf-config/InitialiseSettings.php: enabling emergency captcha on enwiki - T343294 (take 2) (duration: 06m 40s)
[05:33:42] Well, I'm going to go stargaze on the beach so the vandals don't win. Will check IRC periodically for if something comes up :)
[05:34:17] <_joe_> eheh have a nice day :)
[05:38:01] (PS8) Giuseppe Lavagetto: noc: centralize file list management [mediawiki-config] - https://gerrit.wikimedia.org/r/942673 (https://phabricator.wikimedia.org/T341859)
[05:39:14] (PS1) Marostegui: site.pp: Add pc2015, pc2016 to puppet [puppet] - https://gerrit.wikimedia.org/r/944363 (https://phabricator.wikimedia.org/T342163)
[05:40:45] (CR) Marostegui: [C: +2] site.pp: Add pc2015, pc2016 to puppet [puppet] - https://gerrit.wikimedia.org/r/944363 (https://phabricator.wikimedia.org/T342163) (owner: Marostegui)
[05:41:17] SRE, ops-codfw, DC-Ops, Data-Persistence, Patch-For-Review: Q1:rack/setup/install pc201[56] - https://phabricator.wikimedia.org/T342163 (Marostegui) I have merged the puppet change to get them "insetup" mode.
[05:42:06] (CR) Giuseppe Lavagetto: noc: centralize file list management (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/942673 (https://phabricator.wikimedia.org/T341859) (owner: Giuseppe Lavagetto)
[05:43:06] (PS9) Giuseppe Lavagetto: noc: centralize file list management [mediawiki-config] - https://gerrit.wikimedia.org/r/942673 (https://phabricator.wikimedia.org/T341859)
[05:44:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 3%: Repooling after replacing its memory', diff saved to https://phabricator.wikimedia.org/P49945 and previous config saved to /var/cache/conftool/dbconfig/20230802-054411-root.json
[05:45:02] (ConfdResourceFailed) firing: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[05:46:15] (CR) Giuseppe Lavagetto: [C: +2] noc: centralize file list management [mediawiki-config] - https://gerrit.wikimedia.org/r/942673 (https://phabricator.wikimedia.org/T341859) (owner: Giuseppe Lavagetto)
[05:47:07] (Merged) jenkins-bot: noc: centralize file list management [mediawiki-config] - https://gerrit.wikimedia.org/r/942673 (https://phabricator.wikimedia.org/T341859) (owner: Giuseppe Lavagetto)
[05:52:06] (PS1) Elukey: admin_ng: increase again resource limits for knative-serving [deployment-charts] - https://gerrit.wikimedia.org/r/944738
[05:58:12] (PS1) Elukey: custom_deploy.d: increase isto gateway resources for ml-serve [deployment-charts] - https://gerrit.wikimedia.org/r/944760
[05:59:02] (CR) Giuseppe Lavagetto: [V: +1] noc: add static file server (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/942674 (https://phabricator.wikimedia.org/T341859) (owner: Giuseppe Lavagetto)
[05:59:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 5%: Repooling after replacing its memory', diff saved to https://phabricator.wikimedia.org/P49946 and previous config saved to /var/cache/conftool/dbconfig/20230802-055916-root.json
[06:00:06] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230802T0600)
[06:01:31] (PS10) Giuseppe Lavagetto: noc: add static file server [mediawiki-config] - https://gerrit.wikimedia.org/r/942674 (https://phabricator.wikimedia.org/T341859)
[06:04:23] (CR) Giuseppe Lavagetto: [C: +2] noc: add static file server [mediawiki-config] - https://gerrit.wikimedia.org/r/942674 (https://phabricator.wikimedia.org/T341859) (owner: Giuseppe Lavagetto)
[06:05:03] (Merged) jenkins-bot: noc: add static file server [mediawiki-config] - https://gerrit.wikimedia.org/r/942674 (https://phabricator.wikimedia.org/T341859) (owner: Giuseppe Lavagetto)
[06:05:57] (CR) Elukey: [C: +2] admin_ng: increase again resource limits for knative-serving [deployment-charts] - https://gerrit.wikimedia.org/r/944738 (owner: Elukey)
[06:07:59] (CR) Giuseppe Lavagetto: [C: +2] noc: stop serving static files from symlinks [puppet] - https://gerrit.wikimedia.org/r/942607 (https://phabricator.wikimedia.org/T341859) (owner: Giuseppe Lavagetto)
[06:10:49] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[06:11:13] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[06:12:09] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[06:12:34] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[06:12:53] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[06:13:15] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[06:14:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 10%: Repooling after replacing its memory', diff saved to https://phabricator.wikimedia.org/P49947 and previous config saved to /var/cache/conftool/dbconfig/20230802-061420-root.json
[06:14:26] (CR) Elukey: [C: +2] custom_deploy.d: increase isto gateway resources for ml-serve [deployment-charts] - https://gerrit.wikimedia.org/r/944760 (owner: Elukey)
[06:21:20] (PS10) Giuseppe Lavagetto: noc: remove symlinks and also neutralize createTxtFileSymlinks [mediawiki-config] - https://gerrit.wikimedia.org/r/942675 (https://phabricator.wikimedia.org/T341859)
[06:24:26] (CR) Giuseppe Lavagetto: [C: +2] noc: remove symlinks and also neutralize createTxtFileSymlinks [mediawiki-config] - https://gerrit.wikimedia.org/r/942675 (https://phabricator.wikimedia.org/T341859) (owner: Giuseppe Lavagetto)
[06:25:07] (Merged) jenkins-bot: noc: remove symlinks and also neutralize createTxtFileSymlinks [mediawiki-config] - https://gerrit.wikimedia.org/r/942675 (https://phabricator.wikimedia.org/T341859) (owner: Giuseppe Lavagetto)
[06:29:22] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 61, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:29:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 25%: Repooling after replacing its memory', diff saved to https://phabricator.wikimedia.org/P49948 and previous config saved to /var/cache/conftool/dbconfig/20230802-062925-root.json
[06:30:00] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:31:30] (PS1) Gerrit maintenance bot: mariadb: Promote db2114 to s6 master [puppet] - https://gerrit.wikimedia.org/r/944342 (https://phabricator.wikimedia.org/T343296)
[06:31:32] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:36:06] (PS1) Marostegui: site.pp: Add db21[88-95] [puppet] - https://gerrit.wikimedia.org/r/944765 (https://phabricator.wikimedia.org/T342174)
[06:44:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 50%: Repooling after replacing its memory', diff saved to https://phabricator.wikimedia.org/P49949 and previous config saved to /var/cache/conftool/dbconfig/20230802-064431-root.json
[06:49:16] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/944233 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[06:51:28] (PS1) Muehlenhoff: profile::mirrors::serve: Remove Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/944768
[06:59:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 75%: Repooling after replacing its memory', diff saved to https://phabricator.wikimedia.org/P49950 and previous config saved to /var/cache/conftool/dbconfig/20230802-065936-root.json
[07:00:04] Amir1, Urbanecm, and taavi: May I have your attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230802T0700)
[07:00:04] aanzx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:02:45] _joe_: I see you +2'ing mw-config patches, is production clear for the backport window?
[07:03:02] <_joe_> taavi: absolutely
[07:03:16] <_joe_> those patches only impacted noc.wikimedia.org, which works correctly now
[07:03:29] o/
[07:03:35] <_joe_> I decided to just run scap pull on the host
[07:03:57] ok, thanks
[07:04:09] (PS4) Majavah: idwikisource change wgSiteName, wgMetaNamespace and add project namespace alias [mediawiki-config] - https://gerrit.wikimedia.org/r/944190 (https://phabricator.wikimedia.org/T341173) (owner: Anzx)
[07:04:14] (PS3) Majavah: Change idwikisource logos [mediawiki-config] - https://gerrit.wikimedia.org/r/944221 (https://phabricator.wikimedia.org/T341173) (owner: Anzx)
[07:04:21] (CR) Jelto: [C: +2] aptrepo: update gitlab-ce & gitlab-runner to 16.0 [puppet] - https://gerrit.wikimedia.org/r/941398 (https://phabricator.wikimedia.org/T338460) (owner: Jelto)
[07:05:29] (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/944190 (https://phabricator.wikimedia.org/T341173) (owner: Anzx)
[07:05:31] (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/944221 (https://phabricator.wikimedia.org/T341173) (owner: Anzx)
[07:06:52] (Merged) jenkins-bot: idwikisource change wgSiteName, wgMetaNamespace and add project namespace alias [mediawiki-config] - https://gerrit.wikimedia.org/r/944190 (https://phabricator.wikimedia.org/T341173) (owner: Anzx)
[07:06:55] (Merged) jenkins-bot: Change idwikisource logos [mediawiki-config] - https://gerrit.wikimedia.org/r/944221 (https://phabricator.wikimedia.org/T341173) (owner: Anzx)
[07:07:47] !log taavi@deploy1002 Started scap: Backport for [[gerrit:944190|idwikisource change wgSiteName, wgMetaNamespace and add project namespace alias (T341173)]], [[gerrit:944221|Change idwikisource logos (T341173)]]
[07:07:50] T341173: Change the Indonesian Wikisource's name and project namespace from Wikisource to Wikisumber - https://phabricator.wikimedia.org/T341173
[07:09:26] !log taavi@deploy1002 anzx and taavi: Backport for [[gerrit:944190|idwikisource change wgSiteName, wgMetaNamespace and add project namespace alias (T341173)]], [[gerrit:944221|Change idwikisource logos (T341173)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[07:09:35] aanzx: please test
[07:09:35] Testing
[07:10:05] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/944768 (owner: Muehlenhoff)
[07:13:22] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 62, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:13:25] (PS2) Muehlenhoff: profile::mirrors::serve: Remove Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/944768
[07:13:31] taavi: tested looks good
[07:13:35] !log taavi@deploy1002 anzx and taavi: Continuing with sync
[07:13:40] thanks, syncing
[07:14:23] (CR) Marostegui: [C: +2] site.pp: Add db21[88-95] [puppet] - https://gerrit.wikimedia.org/r/944765 (https://phabricator.wikimedia.org/T342174) (owner: Marostegui)
[07:14:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 100%: Repooling after replacing its memory', diff saved to https://phabricator.wikimedia.org/P49951 and previous config saved to /var/cache/conftool/dbconfig/20230802-071441-root.json
[07:17:48] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 61, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:18:14] !log installing Linux 5.10.179-3 on bullseye hosts
[07:18:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:29] taavi: can you run namespacedupes after sync
[07:19:30] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:944190|idwikisource change wgSiteName, wgMetaNamespace and add project namespace alias (T341173)]], [[gerrit:944221|Change idwikisource logos (T341173)]] (duration: 11m 43s)
[07:19:33] T341173: Change the Indonesian Wikisource's name and project namespace from Wikisource to Wikisumber - https://phabricator.wikimedia.org/T341173
[07:19:34] and done
[07:27:07] (CR) Kaleem Bhatti: "xqt please help to merge this." [mediawiki-config] - https://gerrit.wikimedia.org/r/937922 (https://phabricator.wikimedia.org/T268203) (owner: Kaleem Bhatti)
[07:27:52] (PS1) Jelto: Revert "aptrepo: update gitlab-ce & gitlab-runner to 16.0" [puppet] - https://gerrit.wikimedia.org/r/944846
[07:28:15] (CR) Jelto: "GitLab suggest to update to 15.11.13 first:" [puppet] - https://gerrit.wikimedia.org/r/944846 (owner: Jelto)
[07:28:16] !log mwscript namespaceDupes.php idwikisource --fix --add-prefix "BROKEN " # T341173
[07:28:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:21] T341173: Change the Indonesian Wikisource's name and project namespace from Wikisource to Wikisumber - https://phabricator.wikimedia.org/T341173
[07:29:32] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/944846 (owner: Jelto)
[07:30:06] (CR) Jelto: [C: +2] Revert "aptrepo: update gitlab-ce & gitlab-runner to 16.0" [puppet] - https://gerrit.wikimedia.org/r/944846 (owner: Jelto)
[07:30:33] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/944768 (owner: Muehlenhoff)
[07:31:26] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:34:22] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:35:00] SRE, Traffic, observability: HAProxy metrics go down on config reload - https://phabricator.wikimedia.org/T343000 (Vgutierrez) It looks like it's a matter of how we graph the data, please see: https://grafana.wikimedia.org/goto/7xCydjqVk?orgId=1 {F37159719} first panel is the original one using `rat...
[07:36:28] (03PS1) 10Giuseppe Lavagetto: noc: switch to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/944840 (https://phabricator.wikimedia.org/T341859) [07:50:27] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for roti_WMDE - https://phabricator.wikimedia.org/T342972 (10roti_WMDE) Hi there, thanks for working on this. This is my SSH pub key: ` ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQDLPQJbf/DeX4HbcC2tp2SjzVDKFrpJB2liGGh2OEdbmzCcPmGw2NnzuJAXLLehL... [07:51:14] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for roti_WMDE - https://phabricator.wikimedia.org/T342972 (10roti_WMDE) a:05roti_WMDE→03None [08:00:05] dancy and jnuche: That opportune time is upon us again. Time for a MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230802T0800). [08:05:52] (03CR) 10Volans: [C: 03+2] Revert "validators: temporary support for esams->knams" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/944192 (owner: 10Volans) [08:06:24] (03Merged) 10jenkins-bot: Revert "validators: temporary support for esams->knams" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/944192 (owner: 10Volans) [08:07:50] !log volans@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [08:10:08] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:11:14] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:13:51] (03CR) 10Filippo Giunchedi: [C: 03+1] graphite: Pass port without Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/944254 
(https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [08:15:25] !log volans@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [08:17:25] (03CR) 10Volans: "replies inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/943620 (https://phabricator.wikimedia.org/T342182) (owner: 10BCornwall) [08:23:34] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10phaultfinder) [08:27:46] 10SRE, 10Traffic, 10observability: HAProxy metrics go down on config reload - https://phabricator.wikimedia.org/T343000 (10fgiunchedi) Looking at the raw data https://w.wiki/7AxF there's indeed a counter "reset" e.g. around 16:00 {F37159747} I'm not sure offhand why moving to a smaller period fixes things,... [08:32:28] confdresourcefailed has been firing since yesterday, known? [08:32:43] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/944254 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [08:34:26] godog: John has been working on setting up a separate conftool instance for the Puppet 7 infra, that should be related [08:35:05] ack thanks, cc jbond ^ re: ConfdResourceFailed alert [08:35:12] it's separate Ganeti VMs, config-master[12]001 [08:35:29] John's out for the rest of the week [08:37:16] <_joe_> a separate conftool instance?? [08:37:33] moritzm: I see, ok! 
[08:38:39] _joe_: separate VMs for https://config-master.wikimedia.org/ which currently run on puppetmasters [08:38:54] configmaster I meant, mental typo [08:38:55] the thing you discussed on task, nothing more ;) [08:39:04] don't get a heart attack [08:39:06] no need :D [08:39:40] !log downgrade gitlab-ce package to 15.11.13-ce.0 [08:39:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:03] if someone has time to look into the alert that'd be useful, it looks "scary" because there are 450 alerts firing and maybe it is fine [08:40:37] hi all (cc godog) if the machines are causing an issue they can be shut down. i thought i had silenced everything but perhaps i missed something [08:40:51] the machines .. config-master[12]001 [08:41:39] jbond: cheers! no worries, please do enjoy your vacation [08:41:50] ok thanks :) [08:42:09] <_joe_> jbond: GO AWAY :D [08:42:21] lol ok [08:42:27] * jbond logs off [08:42:47] godog: I was looking at team-sre/confd.yaml that seems to be the source of the alert [08:44:14] indeed [08:44:14] who's generating confd_resource_healthy? [08:44:36] also is this alert without any instance? 
[08:44:50] that might explain why it doesn't match the downtime [08:45:10] or at least I'm failing to see the instance in alerts.w.o [08:45:14] (03CR) 10Vgutierrez: [C: 03+1] noc: switch to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/944840 (https://phabricator.wikimedia.org/T341859) (owner: 10Giuseppe Lavagetto) [08:45:27] that's likely it yeah, the instances show up in the linked dashboard volans [08:47:00] sure but if it's not on the alert it can't be downtimed [08:49:20] (03CR) 10Giuseppe Lavagetto: [C: 03+2] noc: switch to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/944840 (https://phabricator.wikimedia.org/T341859) (owner: 10Giuseppe Lavagetto) [08:50:03] fair enough, I'll tweak the alert [08:50:16] (03CR) 10Kamila Součková: [C: 03+1] noc: switch to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/944840 (https://phabricator.wikimedia.org/T341859) (owner: 10Giuseppe Lavagetto) [08:50:47] there is already a silence for instance=~"config-master[12].*" [08:53:16] !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: GitLab minor version upgrade [08:54:28] (03PS1) 10Filippo Giunchedi: sre: add 'instance' to ConfdResourceFailed [alerts] - 10https://gerrit.wikimedia.org/r/944843 [08:56:03] (03CR) 10CI reject: [V: 04-1] sre: add 'instance' to ConfdResourceFailed [alerts] - 10https://gerrit.wikimedia.org/r/944843 (owner: 10Filippo Giunchedi) [08:57:46] oh yeah of course [08:58:54] (03PS2) 10Filippo Giunchedi: sre: add 'instance' to ConfdResourceFailed [alerts] - 10https://gerrit.wikimedia.org/r/944843 [08:58:59] (03PS1) 10Giuseppe Lavagetto: mw-misc: add fileserving rewrite to noc [deployment-charts] - 10https://gerrit.wikimedia.org/r/944844 (https://phabricator.wikimedia.org/T341859) [08:59:32] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mw-misc: add fileserving rewrite to noc [deployment-charts] - 10https://gerrit.wikimedia.org/r/944844 (https://phabricator.wikimedia.org/T341859) (owner: 
10Giuseppe Lavagetto) [09:00:15] (03Merged) 10jenkins-bot: mw-misc: add fileserving rewrite to noc [deployment-charts] - 10https://gerrit.wikimedia.org/r/944844 (https://phabricator.wikimedia.org/T341859) (owner: 10Giuseppe Lavagetto) [09:01:27] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-misc: apply [09:01:35] (03CR) 10Volans: [C: 03+1] "LGTM, thanks for the patch" [alerts] - 10https://gerrit.wikimedia.org/r/944843 (owner: 10Filippo Giunchedi) [09:01:49] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-misc: apply [09:01:53] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: add 'instance' to ConfdResourceFailed [alerts] - 10https://gerrit.wikimedia.org/r/944843 (owner: 10Filippo Giunchedi) [09:02:02] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-misc: apply [09:02:21] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-misc: apply [09:04:18] (03PS1) 10Elukey: knative-serving: add a variable to tune the controller's replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/944867 [09:08:23] (03CR) 10Elukey: [C: 03+2] knative-serving: add a variable to tune the controller's replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/944867 (owner: 10Elukey) [09:11:13] (03PS4) 10Cathal Mooney: Modify install and apt server config to support Juniper ZTP via HTTP [puppet] - 10https://gerrit.wikimedia.org/r/942682 (https://phabricator.wikimedia.org/T336485) [09:11:38] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review: Spicerack: add distributed locking support - https://phabricator.wikimedia.org/T341973 (10Volans) [09:11:41] (03CR) 10CI reject: [V: 04-1] Modify install and apt server config to support Juniper ZTP via HTTP [puppet] - 10https://gerrit.wikimedia.org/r/942682 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney) [09:12:06] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[09:12:54] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review: Spicerack: add distributed locking support - https://phabricator.wikimedia.org/T341973 (10Volans) @bking please use the task comments to discuss changes instead of modifying the task description [09:13:45] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [09:15:43] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [09:17:23] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [09:18:35] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [09:20:01] (ConfdResourceFailed) resolved: (446) confd resource _srv_config-master_pybal_codfw_apertium.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [09:20:06] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[09:21:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:24:12] !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: GitLab minor version upgrade [09:24:42] 10SRE, 10Infrastructure-Foundations, 10netops: Announce new public IPv6 prefix from Amsterdam for knams migration - https://phabricator.wikimedia.org/T343216 (10cmooney) [09:26:31] (03PS1) 10Muehlenhoff: ferm network defs: Add $CLOUD_NETWORKS alias [puppet] - 10https://gerrit.wikimedia.org/r/944870 [09:26:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:27:16] 10SRE, 10Infrastructure-Foundations, 10netops: Announce new public IPv6 prefix from Amsterdam for knams migration - https://phabricator.wikimedia.org/T343216 (10cmooney) Being announced from all esams/knams routers now, for example: ` cmooney@re0.cr2-esams> show route advertising-protocol bgp 2001:7f8:1:0:a5... 
[09:27:55] (03PS1) 10Cathal Mooney: Add new IP ranges assigned for esams post-migration to geo-maps [dns] - 10https://gerrit.wikimedia.org/r/944871 (https://phabricator.wikimedia.org/T343214) [09:32:21] (03PS2) 10Cathal Mooney: Add new IP ranges assigned for esams post-migration to geo-maps [dns] - 10https://gerrit.wikimedia.org/r/944871 (https://phabricator.wikimedia.org/T343214) [09:33:30] (03CR) 10Volans: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/944871 (https://phabricator.wikimedia.org/T343214) (owner: 10Cathal Mooney) [09:34:50] (03CR) 10Cathal Mooney: [C: 03+2] Add new IP ranges assigned for esams post-migration to geo-maps [dns] - 10https://gerrit.wikimedia.org/r/944871 (https://phabricator.wikimedia.org/T343214) (owner: 10Cathal Mooney) [09:35:10] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:35:28] (03PS1) 10Samtar: Revert "enwiki: temp enable emergencyCaptcha" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944847 [09:35:38] (03PS2) 10Samtar: Revert "enwiki: temp enable emergencyCaptcha" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944847 [09:37:49] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [09:39:06] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:42:57] (03PS5) 10Cathal Mooney: Modify install and apt server config to support Juniper ZTP via HTTP [puppet] - 10https://gerrit.wikimedia.org/r/942682 (https://phabricator.wikimedia.org/T336485) [09:45:38] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:45:56] (03CR) 10CI reject: [V: 04-1] Modify install and apt server config to support Juniper ZTP via HTTP [puppet] - 10https://gerrit.wikimedia.org/r/942682 (https://phabricator.wikimedia.org/T336485) 
(owner: 10Cathal Mooney) [09:52:38] (03CR) 10Samtar: [C: 04-1] "self -1, awaiting confirmation per T343294#9061900" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944847 (owner: 10Samtar) [09:53:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance [09:54:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance [09:54:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: Maintenance [09:54:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: Maintenance [09:54:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T342617)', diff saved to https://phabricator.wikimedia.org/P49954 and previous config saved to /var/cache/conftool/dbconfig/20230802-095428-ladsgroup.json [09:54:32] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230802T1000) [10:01:21] (03PS1) 10Lucas Werkmeister (WMDE): Inject LanguageNameLookupFactory into WikibaseValueFormatterBuilders [extensions/Wikibase] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/944848 (https://phabricator.wikimedia.org/T281726) [10:02:16] !log jelto@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: GitLab minor version upgrade [10:03:20] also I just noticed I missed the puppet window yesterday, sorry jbond [10:03:23] I’ll see if I can find a +1 [10:04:32] PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:04:45] ^ should resolve in 15m [10:09:46] ack [10:11:18] (03PS1) 10AikoChou: ml-services: update readability docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/944874 [10:13:41] (03CR) 10AikoChou: [C: 03+2] ml-services: update readability docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/944874 (owner: 10AikoChou) [10:14:26] (03Merged) 10jenkins-bot: ml-services: update readability docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/944874 (owner: 10AikoChou) [10:15:30] (03Abandoned) 10Volans: Use only active authdns hosts for DNS changes [cookbooks] - 10https://gerrit.wikimedia.org/r/941758 (owner: 10Volans) [10:15:52] (03CR) 10Volans: [C: 03+2] sre.hosts.decommission: fix call to downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/937508 (owner: 10Volans) [10:16:30] RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:16:51] (03CR) 10Volans: [C: 03+2] sre.hosts.decommission: downtime mgmt only in AM [cookbooks] - 10https://gerrit.wikimedia.org/r/937509 (owner: 10Volans) [10:17:16] (03CR) 10Volans: [C: 03+2] sre.dns.netbox: use cumin alias for Netbox hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/941757 (owner: 10Volans) [10:17:34] (03PS1) 10Cathal Mooney: Add IP pre-assignments for new lvs servers in Amsterdam [puppet] - 10https://gerrit.wikimedia.org/r/944875 (https://phabricator.wikimedia.org/T343214) [10:18:31] (03Merged) 10jenkins-bot: sre.hosts.decommission: fix call to downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/937508 (owner: 10Volans) [10:18:47] (03CR) 10Volans: [C: 03+2] sre.hosts.decommission: skip site.pp for matches [cookbooks] - 10https://gerrit.wikimedia.org/r/941759 (https://phabricator.wikimedia.org/T297516) (owner: 10Volans) [10:19:10] (03Merged) 10jenkins-bot: sre.hosts.decommission: 
downtime mgmt only in AM [cookbooks] - 10https://gerrit.wikimedia.org/r/937509 (owner: 10Volans) [10:19:12] (03CR) 10Samtar: "Confirmation at T343294#9062057" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944847 (owner: 10Samtar) [10:20:00] elukey: fiwiki village pump has a thread wondering why RC is showing all edits as 'very likely bad faith', I'm wondering if the ORES liftwing migration might be causing that [10:20:31] (03CR) 10Volans: "Seeking consensus" [cookbooks] - 10https://gerrit.wikimedia.org/r/928560 (owner: 10Volans) [10:20:38] jouncebot: nowandnext [10:20:38] For the next 0 hour(s) and 39 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230802T1000) [10:20:39] In 2 hour(s) and 39 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230802T1300) [10:20:40] (03Merged) 10jenkins-bot: sre.dns.netbox: use cumin alias for Netbox hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/941757 (owner: 10Volans) [10:21:13] Going to deploy 944847: Revert "enwiki: temp enable emergencyCaptcha" | https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/944847 [10:21:17] (03PS6) 10Cathal Mooney: Modify install and apt server config to support Juniper ZTP via HTTP [puppet] - 10https://gerrit.wikimedia.org/r/942682 (https://phabricator.wikimedia.org/T336485) [10:21:34] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944847 (owner: 10Samtar) [10:22:14] (03Merged) 10jenkins-bot: Revert "enwiki: temp enable emergencyCaptcha" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944847 (owner: 10Samtar) [10:22:42] !log samtar@deploy1002 Started scap: Backport for [[gerrit:944847|Revert "enwiki: temp enable emergencyCaptcha"]] [10:22:52] * TheresNoTime never remembers to do the "skip mwdebug" thing [10:24:03] taavi: o/ do they say when it started? 
fiwiki is indeed served by liftwing now, but should be the same models.. now we are super lucky since Ilias, who wrote the change, is on holidays for the next days :D [10:24:18] (03CR) 10CI reject: [V: 04-1] Modify install and apt server config to support Juniper ZTP via HTTP [puppet] - 10https://gerrit.wikimedia.org/r/942682 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney) [10:24:18] !log samtar@deploy1002 samtar: Backport for [[gerrit:944847|Revert "enwiki: temp enable emergencyCaptcha"]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [10:24:21] !log samtar@deploy1002 samtar: Continuing with sync [10:24:28] (^ oooh, yay!) [10:25:15] elukey: the thread was created yesterday evening, so presumably sometime before that [10:25:49] (03PS2) 10Volans: sre.hosts.decommission: skip site.pp for matches [cookbooks] - 10https://gerrit.wikimedia.org/r/941759 (https://phabricator.wikimedia.org/T297516) [10:27:13] taavi: yeah happened on the 31st, time matches - https://sal.toolforge.org/log/lppcq4kBGiVuUzOdMCry [10:27:33] taavi: could you please open a task with the Machine-Learning-team tag? [10:27:40] will do [10:27:46] we'll try to investigate what's happening, thanks a lot! 
[10:29:08] (03PS2) 10Volans: sre.hosts.decommission: search in the DNS repo too [cookbooks] - 10https://gerrit.wikimedia.org/r/941760 [10:29:19] elukey: T343308 [10:29:20] T343308: fiwiki RC filters classify all edits as 'very likely bad faith' - https://phabricator.wikimedia.org/T343308 [10:29:31] taavi: <3 [10:29:38] (03CR) 10Volans: [C: 03+2] "Addressed rename comment" [cookbooks] - 10https://gerrit.wikimedia.org/r/941760 (owner: 10Volans) [10:30:05] (03PS7) 10Cathal Mooney: Modify install and apt server config to support Juniper ZTP via HTTP [puppet] - 10https://gerrit.wikimedia.org/r/942682 (https://phabricator.wikimedia.org/T336485) [10:30:15] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:944847|Revert "enwiki: temp enable emergencyCaptcha"]] (duration: 07m 33s) [10:31:41] (03Abandoned) 10Volans: Revert "setup.py: add temporary upper limit for pylint" [cookbooks] - 10https://gerrit.wikimedia.org/r/862364 (owner: 10Volans) [10:32:47] (03CR) 10Volans: "@cdanis: kind reminder that this is available whenever you want it" [alerts] - 10https://gerrit.wikimedia.org/r/902316 (https://phabricator.wikimedia.org/T328941) (owner: 10Volans) [10:33:09] (03CR) 10CI reject: [V: 04-1] Modify install and apt server config to support Juniper ZTP via HTTP [puppet] - 10https://gerrit.wikimedia.org/r/942682 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney) [10:37:00] !log jelto@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: GitLab minor version upgrade [10:40:33] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/device-analytics: apply [10:40:40] !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: GitLab minor version upgrade [10:41:05] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/device-analytics: apply [10:41:57] (03PS1) 10Muehlenhoff: ntp: Remove Ferm-specific syntax [puppet] - 
10https://gerrit.wikimedia.org/r/944880 [10:46:24] (03CR) 10Fabfur: [C: 03+2] Version 0.4.6-4 [debs/python-logstash] - 10https://gerrit.wikimedia.org/r/944209 (https://phabricator.wikimedia.org/T342154) (owner: 10Fabfur) [10:46:53] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/944880 (owner: 10Muehlenhoff) [10:50:13] (03Abandoned) 10Volans: sre.ganeti.makevm: refactor to simplify expansion [cookbooks] - 10https://gerrit.wikimedia.org/r/860080 (https://phabricator.wikimedia.org/T306661) (owner: 10Volans) [10:50:15] (03CR) 10Giuseppe Lavagetto: [C: 03+1] varnish: add requestctl to X-analytics for static actions too [puppet] - 10https://gerrit.wikimedia.org/r/941448 (https://phabricator.wikimedia.org/T342577) (owner: 10Giuseppe Lavagetto) [10:51:53] (03CR) 10Giuseppe Lavagetto: [C: 03+2] varnish: add requestctl to X-analytics for static actions too [puppet] - 10https://gerrit.wikimedia.org/r/941448 (https://phabricator.wikimedia.org/T342577) (owner: 10Giuseppe Lavagetto) [11:00:52] (03CR) 10Muehlenhoff: [C: 03+2] ferm::service: Fix handling of multiple ports [puppet] - 10https://gerrit.wikimedia.org/r/944233 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [11:09:50] (03PS3) 10Muehlenhoff: profile::mirrors::serve: Remove Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/944768 [11:11:32] (03PS8) 10Cathal Mooney: Modify install and apt server config to support Juniper ZTP via HTTP [puppet] - 10https://gerrit.wikimedia.org/r/942682 (https://phabricator.wikimedia.org/T336485) [11:13:57] (03CR) 10Fabfur: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/944880 (owner: 10Muehlenhoff) [11:14:36] (03CR) 10CI reject: [V: 04-1] Modify install and apt server config to support Juniper ZTP via HTTP [puppet] - 10https://gerrit.wikimedia.org/r/942682 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney) [11:15:02] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Serve 
production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert) [11:15:26] (03PS1) 10Giuseppe Lavagetto: mw-misc: fixup for noc website [deployment-charts] - 10https://gerrit.wikimedia.org/r/944886 [11:16:15] (03CR) 10Clément Goubert: [C: 03+1] mw-misc: fixup for noc website [deployment-charts] - 10https://gerrit.wikimedia.org/r/944886 (owner: 10Giuseppe Lavagetto) [11:17:24] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-misc: apply [11:17:47] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-misc: apply [11:17:48] (03CR) 10Muehlenhoff: [C: 03+2] graphite: Pass port without Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/944254 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [11:18:02] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-misc: apply [11:18:12] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mw-misc: fixup for noc website [deployment-charts] - 10https://gerrit.wikimedia.org/r/944886 (owner: 10Giuseppe Lavagetto) [11:18:24] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-misc: apply [11:18:55] (03Merged) 10jenkins-bot: mw-misc: fixup for noc website [deployment-charts] - 10https://gerrit.wikimedia.org/r/944886 (owner: 10Giuseppe Lavagetto) [11:21:11] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/944768 (owner: 10Muehlenhoff) [11:33:03] (03PS1) 10Hnowlan: service: add discovery for AQS [puppet] - 10https://gerrit.wikimedia.org/r/944888 (https://phabricator.wikimedia.org/T342213) [11:36:52] 10SRE, 10ops-codfw: ganeti2014: broken RAM - https://phabricator.wikimedia.org/T341546 (10MoritzMuehlenhoff) 05Resolved→03Open I can't connect to the server (and when I log over the management I can also not get a serial console), can you please have a look? 
[11:40:07] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: sretest2002.codfw.wmnet [11:40:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: sretest2002.codfw.wmnet [11:40:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: Maintenance [11:40:14] 10SRE, 10ops-codfw: codfw:test new Supermicro server - https://phabricator.wikimedia.org/T322578 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: sretest2002.codfw.wmnet [11:40:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: Maintenance [11:41:32] !log installing libxml2 security updates [11:41:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T342617)', diff saved to https://phabricator.wikimedia.org/P49958 and previous config saved to /var/cache/conftool/dbconfig/20230802-114237-ladsgroup.json [11:42:40] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [11:48:34] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10phaultfinder) [11:51:47] (03PS1) 10Volans: sre.hosts.decomission: run git grep in dry-run too [cookbooks] - 10https://gerrit.wikimedia.org/r/944889 [11:52:16] (03CR) 10Muehlenhoff: [C: 03+2] ntp: Remove Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/944880 (owner: 10Muehlenhoff) [11:56:00] (03PS2) 10Muehlenhoff: orchestrator: Remove ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/932166 [11:57:20] !log jelto@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: GitLab minor version upgrade [11:57:44] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P49959 and previous config saved to /var/cache/conftool/dbconfig/20230802-115743-ladsgroup.json [11:59:35] (03CR) 10Volans: [C: 03+2] "Trivial, self-merging" [cookbooks] - 10https://gerrit.wikimedia.org/r/944889 (owner: 10Volans) [12:01:20] (03PS1) 10Jelto: aptrepo: update gitlab-ce & gitlab-runner to 16.0 [puppet] - 10https://gerrit.wikimedia.org/r/944891 (https://phabricator.wikimedia.org/T338460) [12:01:56] (03Merged) 10jenkins-bot: sre.hosts.decomission: run git grep in dry-run too [cookbooks] - 10https://gerrit.wikimedia.org/r/944889 (owner: 10Volans) [12:03:05] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:03:16] (03CR) 10EoghanGaffney: [C: 03+1] aptrepo: update gitlab-ce & gitlab-runner to 16.0 [puppet] - 10https://gerrit.wikimedia.org/r/944891 (https://phabricator.wikimedia.org/T338460) (owner: 10Jelto) [12:06:04] (03CR) 10Jelto: [C: 03+2] aptrepo: update gitlab-ce & gitlab-runner to 16.0 [puppet] - 10https://gerrit.wikimedia.org/r/944891 (https://phabricator.wikimedia.org/T338460) (owner: 10Jelto) [12:08:05] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:09:09] !log Depool mw1451 and mw1452 for reimage as wikikube nodes - T343306 [12:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:12] T343306: Wikikube CPU capacity issue - https://phabricator.wikimedia.org/T343306 
[12:09:25] (03CR) 10Muehlenhoff: [C: 03+2] orchestrator: Remove ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/932166 (owner: 10Muehlenhoff) [12:09:45] !log cgoubert@cumin1001 conftool action : set/pooled=inactive; selector: dc=eqiad,name=mw1451.eqiad.wmnet [12:09:51] !log cgoubert@cumin1001 conftool action : set/pooled=inactive; selector: dc=eqiad,name=mw1452.eqiad.wmnet [12:11:10] !log update gitlab-ce package to 16.0.8-ce.0 [12:11:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P49960 and previous config saved to /var/cache/conftool/dbconfig/20230802-121249-ladsgroup.json [12:13:27] !log Repool mw1451 and mw1452, more recent servers will be used - T343306 [12:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:50] !log eoghan@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on people2003.codfw.wmnet with reason: Resizing disk [12:18:03] !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on people2003.codfw.wmnet with reason: Resizing disk [12:19:12] !log Depool mw1497 and mw1498 for reimage as wikikube nodes - T343306 [12:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:15] T343306: Wikikube CPU capacity issue - https://phabricator.wikimedia.org/T343306 [12:19:16] !log cgoubert@cumin1001 conftool action : set/pooled=inactive; selector: dc=eqiad,name=mw1497.eqiad.wmnet [12:19:20] !log cgoubert@cumin1001 conftool action : set/pooled=inactive; selector: dc=eqiad,name=mw1498.eqiad.wmnet [12:21:30] !log dcausse@deploy1002 Started deploy [airflow-dags/search@8bba01c]: search: do not use hive partitions to wait for wmf_raw.mediawiki_page [12:21:41] !log dcausse@deploy1002 Finished deploy [airflow-dags/search@8bba01c]: search: do not use hive partitions to 
wait for wmf_raw.mediawiki_page (duration: 00m 11s) [12:23:52] RECOVERY - Disk space on people1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=people1004&var-datasource=eqiad+prometheus/ops [12:24:41] (03PS1) 10Clément Goubert: dsh: Remove mw1497 and mw1498 from appserver [puppet] - 10https://gerrit.wikimedia.org/r/944893 (https://phabricator.wikimedia.org/T343306) [12:25:22] 10sre-alert-triage, 10Data-Platform-SRE: Alert triage: overdue alert [warning] - https://phabricator.wikimedia.org/T343318 (10LSobanski) [12:26:24] 10sre-alert-triage, 10Data-Platform-SRE: Alert triage: overdue alert [warning] - https://phabricator.wikimedia.org/T343319 (10LSobanski) [12:27:40] (03PS1) 10Clément Goubert: site.pp: Rename mw149[7-8] to kubernetes102[5-6] [puppet] - 10https://gerrit.wikimedia.org/r/944895 (https://phabricator.wikimedia.org/T343306) [12:27:44] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/944893 (https://phabricator.wikimedia.org/T343306) (owner: 10Clément Goubert) [12:27:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T342617)', diff saved to https://phabricator.wikimedia.org/P49961 and previous config saved to /var/cache/conftool/dbconfig/20230802-122756-ladsgroup.json [12:27:56] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/944895 (https://phabricator.wikimedia.org/T343306) (owner: 10Clément Goubert) [12:27:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1136.eqiad.wmnet with reason: Maintenance [12:28:00] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [12:28:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1136.eqiad.wmnet with reason: Maintenance [12:28:17] !log ladsgroup@cumin1001 dbctl 
commit (dc=all): 'Depooling db1136 (T342617)', diff saved to https://phabricator.wikimedia.org/P49962 and previous config saved to /var/cache/conftool/dbconfig/20230802-122816-ladsgroup.json [12:29:03] (03PS5) 10Slyngshede: P:backup::host unique id is not available in facter3 [puppet] - 10https://gerrit.wikimedia.org/r/935408 (https://phabricator.wikimedia.org/T221083) [12:29:36] (03PS2) 10Clément Goubert: dsh: Remove mw1497 and mw1498 from appserver [puppet] - 10https://gerrit.wikimedia.org/r/944893 (https://phabricator.wikimedia.org/T343306) [12:29:43] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/944893 (https://phabricator.wikimedia.org/T343306) (owner: 10Clément Goubert) [12:30:34] (03PS2) 10Clément Goubert: site.pp: Rename mw149[7-8] to kubernetes102[5-6] [puppet] - 10https://gerrit.wikimedia.org/r/944895 (https://phabricator.wikimedia.org/T343306) [12:31:10] (03PS3) 10Clément Goubert: site.pp: Rename mw149[7-8] to kubernetes102[5-6] [puppet] - 10https://gerrit.wikimedia.org/r/944895 (https://phabricator.wikimedia.org/T343306) [12:31:40] (03CR) 10CI reject: [V: 04-1] site.pp: Rename mw149[7-8] to kubernetes102[5-6] [puppet] - 10https://gerrit.wikimedia.org/r/944895 (https://phabricator.wikimedia.org/T343306) (owner: 10Clément Goubert) [12:32:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1184 T342284', diff saved to https://phabricator.wikimedia.org/P49963 and previous config saved to /var/cache/conftool/dbconfig/20230802-123228-ladsgroup.json [12:32:32] T342284: db1218 crashed - https://phabricator.wikimedia.org/T342284 [12:32:43] (03CR) 10Clément Goubert: "To merge after the decommission cookbook has been run and rename done in netbox" [puppet] - 10https://gerrit.wikimedia.org/r/944895 (https://phabricator.wikimedia.org/T343306) (owner: 10Clément Goubert) [12:33:25] (03PS4) 10Clément Goubert: site.pp: Rename mw149[7-8] to kubernetes102[5-6] [puppet] - 10https://gerrit.wikimedia.org/r/944895 
(https://phabricator.wikimedia.org/T343306) [12:34:42] (03PS1) 10Muehlenhoff: cassandra: Pass ports in firewall-agnostic format [puppet] - 10https://gerrit.wikimedia.org/r/944896 [12:35:06] (03CR) 10CI reject: [V: 04-1] cassandra: Pass ports in firewall-agnostic format [puppet] - 10https://gerrit.wikimedia.org/r/944896 (owner: 10Muehlenhoff) [12:35:16] PROBLEM - mediawiki-installation DSH group on mw1498 is CRITICAL: Host mw1498 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [12:35:57] (03CR) 10Jcrespo: [C: 04-1] "I don't think this is what we agreed on the comments, nor I am seeing the puppet compiler results for all backed up hosts?" [puppet] - 10https://gerrit.wikimedia.org/r/935408 (https://phabricator.wikimedia.org/T221083) (owner: 10Slyngshede) [12:36:49] (03PS2) 10Muehlenhoff: cassandra: Pass ports in firewall-agnostic format [puppet] - 10https://gerrit.wikimedia.org/r/944896 [12:37:52] 10SRE, 10Infrastructure-Foundations: Integrate Bookworm 12.1 point update - https://phabricator.wikimedia.org/T343121 (10MoritzMuehlenhoff) [12:38:06] (03CR) 10Slyngshede: P:backup::host unique id is not available in facter3 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/935408 (https://phabricator.wikimedia.org/T221083) (owner: 10Slyngshede) [12:38:55] (03PS1) 10Ladsgroup: mariadb: Switch candidate host of s1 [puppet] - 10https://gerrit.wikimedia.org/r/944897 (https://phabricator.wikimedia.org/T342284) [12:39:31] (03CR) 10Jcrespo: P:backup::host unique id is not available in facter3 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/935408 (https://phabricator.wikimedia.org/T221083) (owner: 10Slyngshede) [12:39:51] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42750/console" [puppet] - 10https://gerrit.wikimedia.org/r/935408 (https://phabricator.wikimedia.org/T221083) (owner: 10Slyngshede) [12:40:54] 10SRE, 
10SRE-Access-Requests, 10Product-Analytics, 10CommRel-Specialists-Support (Jul-Sep-2023): Request Access to Superset querying presto_analytics_hive datasets - https://phabricator.wikimedia.org/T343320 (10ARamadan-WMF) [12:42:09] 10SRE, 10SRE-Access-Requests, 10Product-Analytics, 10CommRel-Specialists-Support (Jul-Sep-2023): Request Access to Superset querying presto_analytics_hive datasets - https://phabricator.wikimedia.org/T343320 (10ARamadan-WMF) [12:42:17] !log xcollazo@deploy1002 Started deploy [airflow-dags/analytics_product@8bba01c]: Redeploy of analytics_product Airflow instance [12:42:26] !log xcollazo@deploy1002 Finished deploy [airflow-dags/analytics_product@8bba01c]: Redeploy of analytics_product Airflow instance (duration: 00m 08s) [12:43:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1184 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P49964 and previous config saved to /var/cache/conftool/dbconfig/20230802-124305-ladsgroup.json [12:44:07] 10SRE, 10SRE-Access-Requests, 10Product-Analytics, 10CommRel-Specialists-Support (Jul-Sep-2023): Request Access to Superset querying presto_analytics_hive datasets - https://phabricator.wikimedia.org/T343320 (10Elitre) Approved [12:44:41] (03PS6) 10Slyngshede: P:backup::host unique id is not available in facter3 [puppet] - 10https://gerrit.wikimedia.org/r/935408 (https://phabricator.wikimedia.org/T221083) [12:46:06] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42751/console" [puppet] - 10https://gerrit.wikimedia.org/r/935408 (https://phabricator.wikimedia.org/T221083) (owner: 10Slyngshede) [12:46:30] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/944896 (owner: 10Muehlenhoff) [12:47:48] (03PS1) 10Stang: uzwiki: Install WikiLove [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944898 (https://phabricator.wikimedia.org/T343270) [12:48:50] PROBLEM - 
mediawiki-installation DSH group on mw1497 is CRITICAL: Host mw1497 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [12:50:13] (03CR) 10CDanis: NEL: add alert by country (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/902316 (https://phabricator.wikimedia.org/T328941) (owner: 10Volans) [12:56:06] (03CR) 10Volans: "reply inline" [alerts] - 10https://gerrit.wikimedia.org/r/902316 (https://phabricator.wikimedia.org/T328941) (owner: 10Volans) [12:56:24] 10SRE, 10Content-Transform-Team-WIP, 10Mobile-Content-Service, 10RESTbase Sunsetting, and 2 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (10vadim-kovalenko) Hi there! I'm responsible for Kiwix migration to another API, but given the discussion above I'm curious wheth... [12:57:43] 10SRE, 10Infrastructure-Foundations: Integrate Bookworm 12.1 point update - https://phabricator.wikimedia.org/T343121 (10MoritzMuehlenhoff) [12:58:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1184 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P49965 and previous config saved to /var/cache/conftool/dbconfig/20230802-125810-ladsgroup.json [13:00:10] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: OwO what's this, a deployment window?? UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230802T1300). nyaa~ [13:00:10] Lucas_WMDE and koi: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:15] o/ [13:00:20] o/ [13:00:21] * Lucas_WMDE bonks jouncebot [13:00:47] * taavi assumes Lucas_WMDE will deploy [13:00:53] yup, can do [13:01:52] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "kicking off gate-and-submit" [extensions/Wikibase] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/944848 (https://phabricator.wikimedia.org/T281726) (owner: 10Lucas Werkmeister (WMDE)) [13:02:08] (03PS2) 10Lucas Werkmeister (WMDE): simplewiktionary: Update project logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942806 (https://phabricator.wikimedia.org/T343084) (owner: 10Stang) [13:02:24] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942806 (https://phabricator.wikimedia.org/T343084) (owner: 10Stang) [13:03:05] (03Merged) 10jenkins-bot: simplewiktionary: Update project logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/942806 (https://phabricator.wikimedia.org/T343084) (owner: 10Stang) [13:03:36] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:942806|simplewiktionary: Update project logo (T343084)]] [13:03:39] T343084: Change the logo of Simple English Wiktionary - https://phabricator.wikimedia.org/T343084 [13:05:10] !log lucaswerkmeister-wmde@deploy1002 stang and lucaswerkmeister-wmde: Backport for [[gerrit:942806|simplewiktionary: Update project logo (T343084)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:05:22] koi: please test [13:05:46] Lucas_WMDE, tested and LGTM [13:05:49] (looks good on my end) [13:05:49] ok [13:05:50] !log lucaswerkmeister-wmde@deploy1002 stang and lucaswerkmeister-wmde: Continuing with sync [13:06:15] (03CR) 10Ssingh: sre.hosts.reboot-cluster: simplify Icinga logic (031 comment) [cookbooks] - 
10https://gerrit.wikimedia.org/r/928560 (owner: 10Volans) [13:07:43] (03CR) 10Volans: [C: 03+2] "Thanks for the feedback" [cookbooks] - 10https://gerrit.wikimedia.org/r/928560 (owner: 10Volans) [13:10:12] (03Merged) 10jenkins-bot: sre.hosts.reboot-cluster: simplify Icinga logic [cookbooks] - 10https://gerrit.wikimedia.org/r/928560 (owner: 10Volans) [13:11:49] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:942806|simplewiktionary: Update project logo (T343084)]] (duration: 08m 13s) [13:11:54] T343084: Change the logo of Simple English Wiktionary - https://phabricator.wikimedia.org/T343084 [13:12:16] (03CR) 10Ssingh: [C: 03+1] "Thank you for the patch! And yeah, we will need to update the iface names once we do the first LVS reimage but that's completely fine." [puppet] - 10https://gerrit.wikimedia.org/r/944875 (https://phabricator.wikimedia.org/T343214) (owner: 10Cathal Mooney) [13:12:45] !log lucaswerkmeister-wmde@mwmaint1002:~$ printf 'https://en.wikipedia.org/static/images/project-logos/simplewiktionary.png\n' | mwscript purgeList.php # T343084 [13:12:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1184 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P49967 and previous config saved to /var/cache/conftool/dbconfig/20230802-131314-ladsgroup.json [13:13:18] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript extensions/WikimediaMaintenance/createExtensionTables.php uzwiki wikilove # Create extension tables for Wikilove on uzwiki (T343270) [13:13:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:21] T343270: Request to enable WikiLove on uzwiki - https://phabricator.wikimedia.org/T343270 [13:13:30] (03PS2) 10Lucas Werkmeister (WMDE): uzwiki: Install WikiLove [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944898 (https://phabricator.wikimedia.org/T343270) (owner: 10Stang) 
[13:14:43] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944898 (https://phabricator.wikimedia.org/T343270) (owner: 10Stang) [13:15:53] (03Merged) 10jenkins-bot: uzwiki: Install WikiLove [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944898 (https://phabricator.wikimedia.org/T343270) (owner: 10Stang) [13:16:18] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:944898|uzwiki: Install WikiLove (T343270)]] [13:17:51] !log lucaswerkmeister-wmde@deploy1002 stang and lucaswerkmeister-wmde: Backport for [[gerrit:944898|uzwiki: Install WikiLove (T343270)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:18:05] koi: please test :) the tables should exist [13:18:09] looking [13:18:14] (03Merged) 10jenkins-bot: Inject LanguageNameLookupFactory into WikibaseValueFormatterBuilders [extensions/Wikibase] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/944848 (https://phabricator.wikimedia.org/T281726) (owner: 10Lucas Werkmeister (WMDE)) [13:18:44] I see WikiLove on https://uz.wikipedia.org/wiki/Special:Version?uselang=en, but I don’t think I’ve ever used it, I don’t know how to test it actually works [13:20:05] i see it works, https://imgur.com/znUrdZX [13:20:13] !log lucaswerkmeister-wmde@deploy1002 stang and lucaswerkmeister-wmde: Continuing with sync [13:20:16] cool, thanks [13:21:24] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti2014'] [13:22:58] (03CR) 10Ssingh: "The debian-glue failing for Varnish is expected -- whether these extra messages build failure are a symptom of that is debatable. 
However," [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/941367 (https://phabricator.wikimedia.org/T342154) (owner: 10Fabfur) [13:23:49] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install pc201[56] - https://phabricator.wikimedia.org/T342163 (10Papaul) @Marostegui thanks [13:25:39] (03CR) 10Fabfur: [C: 03+1] "Agreed, we can check with piuparts just in case" [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/941367 (https://phabricator.wikimedia.org/T342154) (owner: 10Fabfur) [13:25:56] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host pc2015.codfw.wmnet with OS bullseye [13:26:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2108.codfw.wmnet with reason: Maintenance [13:26:03] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install pc201[56] - https://phabricator.wikimedia.org/T342163 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host pc2015.codfw.wmnet with OS bullseye [13:26:14] !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: GitLab minor version upgrade [13:26:17] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:944898|uzwiki: Install WikiLove (T343270)]] (duration: 09m 58s) [13:26:19] T343270: Request to enable WikiLove on uzwiki - https://phabricator.wikimedia.org/T343270 [13:26:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2108.codfw.wmnet with reason: Maintenance [13:26:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2108 (T342617)', diff saved to https://phabricator.wikimedia.org/P49968 and previous config saved to /var/cache/conftool/dbconfig/20230802-132632-ladsgroup.json [13:26:35] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [13:26:50] !log 
lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:944848|Inject LanguageNameLookupFactory into WikibaseValueFormatterBuilders (T281726)]] [13:26:52] T281726: Stop injecting LanguageNameLookup into WikibaseValueFormatterBuilders - https://phabricator.wikimedia.org/T281726 [13:26:59] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host pc2016.mgmt.codfw.wmnet with reboot policy FORCED [13:26:59] ty! [13:27:07] (03CR) 10Ssingh: [V: 03+2 C: 03+2] Version 6.0.11-1wm2 for Debian Bookworm [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/941367 (https://phabricator.wikimedia.org/T342154) (owner: 10Fabfur) [13:28:13] np :) [13:28:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1184 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P49969 and previous config saved to /var/cache/conftool/dbconfig/20230802-132819-ladsgroup.json [13:28:26] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Backport for [[gerrit:944848|Inject LanguageNameLookupFactory into WikibaseValueFormatterBuilders (T281726)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:29:21] nothing seems to be broken, yay [13:29:22] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Continuing with sync [13:31:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T342617)', diff saved to https://phabricator.wikimedia.org/P49970 and previous config saved to /var/cache/conftool/dbconfig/20230802-133122-ladsgroup.json [13:35:30] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:944848|Inject LanguageNameLookupFactory into WikibaseValueFormatterBuilders (T281726)]] (duration: 08m 39s) [13:35:32] T281726: Stop injecting LanguageNameLookup into WikibaseValueFormatterBuilders - 
https://phabricator.wikimedia.org/T281726 [13:35:41] anything else to deploy? [13:35:44] (03CR) 10Giuseppe Lavagetto: [C: 03+1] dsh: Remove mw1497 and mw1498 from appserver [puppet] - 10https://gerrit.wikimedia.org/r/944893 (https://phabricator.wikimedia.org/T343306) (owner: 10Clément Goubert) [13:35:45] 10SRE-tools, 10Infrastructure-Foundations, 10Goal, 10cloud-services-team (FY2023/2024-Q1): cloudcumin: decide sudoers rules for users without global root - https://phabricator.wikimedia.org/T325067 (10fnegri) [13:36:03] !log UTC afternoon backport+config window done [13:36:04] * Lucas_WMDE done [13:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:10] (03CR) 10Clément Goubert: [C: 03+2] dsh: Remove mw1497 and mw1498 from appserver [puppet] - 10https://gerrit.wikimedia.org/r/944893 (https://phabricator.wikimedia.org/T343306) (owner: 10Clément Goubert) [13:38:24] (03CR) 10AOkoth: [C: 03+2] vrts: add /var/log/clamav/{clamav,freshclam}.log to rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/943607 (owner: 10AOkoth) [13:38:39] (03PS2) 10AOkoth: vrts: add /var/log/clamav/{clamav,freshclam}.log to rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/943607 [13:39:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, and 2 others: Q1:rack/setup/install cloudservices1006.eqiad.wmnet - https://phabricator.wikimedia.org/T342161 (10Jclark-ctr) @aborrero @Andrew Will this server be using single or dual network connection? [13:40:45] 10SRE, 10ops-codfw, 10DBA: codfw: es2025 lost System Board Fan6 - https://phabricator.wikimedia.org/T343254 (10Jhancock.wm) @Marostegui I have swapped fan5 and fan6. The original error has cleared for now. But if it comes back we should know if it's the fan or the slot that is causing the issues. 
[13:41:09] 10SRE-tools, 10Infrastructure-Foundations, 10Goal, 10cloud-services-team (FY2023/2024-Q1): cloudcumin: decide sudoers rules for users without global root - https://phabricator.wikimedia.org/T325067 (10fnegri) [13:41:17] 10SRE-tools, 10Infrastructure-Foundations, 10Goal, 10cloud-services-team (FY2023/2024-Q1): Improve how we run WMCS cookbooks - https://phabricator.wikimedia.org/T319401 (10fnegri) [13:43:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, and 2 others: Q1:rack/setup/install cloudservices1006.eqiad.wmnet - https://phabricator.wikimedia.org/T342161 (10Jclark-ctr) cloudservices1006 C8 u20 port 19 (cableid - 5321) port 43(cableid 5329) [13:43:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, and 2 others: Q1:rack/setup/install cloudservices1006.eqiad.wmnet - https://phabricator.wikimedia.org/T342161 (10Jclark-ctr) [13:46:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P49971 and previous config saved to /var/cache/conftool/dbconfig/20230802-134628-ladsgroup.json [13:46:35] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on pc2015.codfw.wmnet with reason: host reimage [13:48:14] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install pc201[56] - https://phabricator.wikimedia.org/T342163 (10Jhancock.wm) @Papaul reseated the mgmt cable. 
[13:48:38] (03PS5) 10Clément Goubert: site.pp: Rename mw149[7-8] to kubernetes102[5-6] [puppet] - 10https://gerrit.wikimedia.org/r/944895 (https://phabricator.wikimedia.org/T343306) [13:49:21] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [13:49:40] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc2015.codfw.wmnet with reason: host reimage [13:51:18] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt cloudservices1006 - jclark@cumin1001" [13:52:04] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt cloudservices1006 - jclark@cumin1001" [13:52:04] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:52:44] PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:53:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, and 2 others: Q1:rack/setup/install cloudservices1006.eqiad.wmnet - https://phabricator.wikimedia.org/T342161 (10Jclark-ctr) [13:54:13] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudservices1006 [13:54:33] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudservices1006 [13:54:39] 10SRE, 10ops-codfw, 10DBA: codfw: es2025 lost System Board Fan6 - https://phabricator.wikimedia.org/T343254 (10Papaul) @Jhancock.wm thanks let me update the BIOS and IDRAC before DB put the server back in production [13:55:09] (03PS6) 10Clément Goubert: Rename mw149[7-8] to kubernetes102[5-6] [puppet] - 10https://gerrit.wikimedia.org/r/944895 (https://phabricator.wikimedia.org/T343306) [13:55:14] !log jclark@cumin1001 START - Cookbook 
sre.hosts.provision for host cloudservices1006.mgmt.eqiad.wmnet with reboot policy FORCED [13:56:06] (03PS1) 10Giuseppe Lavagetto: mediawiki: call noc on kubernetes [software/spicerack] - 10https://gerrit.wikimedia.org/r/944910 [13:56:54] !log Decommissioning mw1497 and mw1498 - T343306 [13:56:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:57] T343306: Wikikube CPU capacity issue - https://phabricator.wikimedia.org/T343306 [13:57:38] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['es2025'] [13:57:46] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['es2025'] [13:58:18] claime: please lmk if you have any issue with the decom cookbook, I merged some changes today for it [13:58:30] volans: It just timed out trying to log to irc [13:58:47] ? weird, that's totally unrelated [13:58:55] yeah, maybe a network blip? [13:59:02] pasting the stacktrace [13:59:05] thx [13:59:09] https://phabricator.wikimedia.org/P49972 [14:00:00] (03CR) 10CI reject: [V: 04-1] mediawiki: call noc on kubernetes [software/spicerack] - 10https://gerrit.wikimedia.org/r/944910 (owner: 10Giuseppe Lavagetto) [14:00:02] (03PS1) 10Muehlenhoff: Add base nftables sets which are equivalent to the main Ferm macros [puppet] - 10https://gerrit.wikimedia.org/r/944911 (https://phabricator.wikimedia.org/T336497) [14:00:04] Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230802T1400) [14:00:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, and 2 others: Q1:rack/setup/install cloudservices1006.eqiad.wmnet - https://phabricator.wikimedia.org/T342161 (10Jclark-ctr) @aborrero @Andrew i have connected two network ports to prevent any blockers from remote work but only have entered one into netbox... [14:00:49] I'm trying to abort because of typo but it seems like it's still doing things :? 
[14:01:00] it might be rolling back [14:01:01] (03PS3) 10Hnowlan: wmnet: add discovery record for aqs [dns] - 10https://gerrit.wikimedia.org/r/943616 (https://phabricator.wikimedia.org/T342213) [14:01:06] depending on where you interrupt it [14:01:32] (it finds the right mgmt interface if you typo for instance wment instead of wmnet, but other steps fail because well, wment doesn't exist...) [14:01:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P49973 and previous config saved to /var/cache/conftool/dbconfig/20230802-140134-ladsgroup.json [14:01:42] I should read better with my eyes though [14:01:44] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:01:51] uh [14:02:00] ^ expected? [14:02:42] Not to my knowledge [14:02:45] looking [14:02:48] probably the downtime jbond put yesterday I think expired [14:02:50] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 62, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:02:55] ah [14:03:02] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ganeti2014'] [14:03:26] (03PS2) 10Giuseppe Lavagetto: mediawiki: call noc on kubernetes [software/spicerack] - 10https://gerrit.wikimedia.org/r/944910 [14:03:28] !log cgoubert@cumin1001 START - Cookbook sre.dns.netbox [14:04:02] ok cookbook, live your life [14:05:16] volans: I'm aborting every change btw, I'd rather it start from scratch [14:05:18] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:36] Aug 02 14:02:06 config-master2001 dump-conftool-pools[1356156]: ModuleNotFoundError: No module named 'conftool' [14:05:38] ah ok [14:05:41] !log cgoubert@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [14:05:42] !log cgoubert@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts mw[1497-1498].eqiad.wment [14:06:26] !log cgoubert@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw[1497-1498].eqiad.wmnet [14:06:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:07:10] Hey all - was hoping to deploy a quick security update in /private for T336027. Thanks. [14:07:35] claime: I'd call this a network blip, not sure why, I checked if the bot was restarted on alert1001 but was not the case [14:08:42] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [14:13:45] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [14:13:46] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc2015.codfw.wmnet with OS bullseye [14:13:50] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install pc201[56] - https://phabricator.wikimedia.org/T342163 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host pc2015.codfw.wmnet with OS bullseye completed: - pc2015 (**WARN**) - Removed from... 
[14:15:24] !log importing varnish and libvarnishapi2 in bookworm-wikimedia (T342154) [14:15:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:27] T342154: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 [14:15:46] !log cgoubert@cumin1001 START - Cookbook sre.dns.netbox [14:16:33] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:16:40] RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T342617)', diff saved to https://phabricator.wikimedia.org/P49974 and previous config saved to /var/cache/conftool/dbconfig/20230802-141640-ladsgroup.json [14:16:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance [14:16:44] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [14:16:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance [14:16:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [14:17:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [14:17:13] !log eoghan@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on people1004.eqiad.wmnet with reason: Resizing disk [14:17:19] !log ladsgroup@cumin1001 dbctl 
commit (dc=all): 'Depooling db1158 (T342617)', diff saved to https://phabricator.wikimedia.org/P49975 and previous config saved to /var/cache/conftool/dbconfig/20230802-141719-ladsgroup.json [14:17:27] !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on people1004.eqiad.wmnet with reason: Resizing disk [14:18:53] !log cgoubert@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mw[1497-1498].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - cgoubert@cumin1001" [14:19:00] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['es2025'] [14:19:09] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['es2025'] [14:19:37] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mw[1497-1498].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - cgoubert@cumin1001" [14:19:38] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:19:38] !log cgoubert@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts mw[1497-1498].eqiad.wmnet [14:19:54] !log importing python-logstash in bookworm-wikimedia (T342154) [14:19:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:06] 10SRE, 10ops-codfw, 10DBA: codfw: es2025 lost System Board Fan6 - https://phabricator.wikimedia.org/T343254 (10Marostegui) Thanks, let us know when we can proceed and put it back in production [14:21:35] 10SRE-tools, 10Cloud-VPS, 10Spicerack, 10cloud-services-team: spicerack: sal_logger does not work when running from CloudVPS instances - https://phabricator.wikimedia.org/T343335 (10fnegri) [14:21:58] 10SRE-tools, 10Cloud-VPS, 10Spicerack, 10cloud-services-team: spicerack: 
sal_logger does not work when running from a laptop - https://phabricator.wikimedia.org/T343336 (10fnegri) [14:22:15] 10SRE-tools, 10Cloud-VPS, 10Infrastructure-Foundations, 10Spicerack, 10cloud-services-team: spicerack: sal_logger does not work when running from CloudVPS instances - https://phabricator.wikimedia.org/T343335 (10fnegri) [14:25:47] 10SRE-tools, 10Infrastructure-Foundations, 10Goal, 10cloud-services-team (FY2023/2024-Q1): Improve how we run WMCS cookbooks - https://phabricator.wikimedia.org/T319401 (10fnegri) [14:25:54] 10SRE, 10Traffic: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10Fabfur) [14:26:11] !log Deployed updated mitigation for T336027 [14:26:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:45] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (FY2023/2024-Q1): Allow wmcs cookbooks running on cloudcuminXXXX to write to the SAL - https://phabricator.wikimedia.org/T325756 (10fnegri) 05In progress→03Resolved I'm marking this task as resolved, as the requirement... 
[14:29:35] (03PS1) 10Elukey: ext-ORES: avoid Lift Wing calls for fiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944916 (https://phabricator.wikimedia.org/T343308) [14:31:40] 10SRE, 10Infrastructure-Foundations, 10Observability-Alerting, 10Puppet-Core, 10Patch-For-Review: Duplicate monitoring for systemd::timer::job - https://phabricator.wikimedia.org/T303253 (10lmata) [14:31:50] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host pc2016.mgmt.codfw.wmnet with reboot policy FORCED [14:32:18] (03PS1) 10Volans: sre.hosts.decommission: fail on wrong FQDN [cookbooks] - 10https://gerrit.wikimedia.org/r/944919 [14:33:44] 10SRE, 10serviceops: Nutcracker stats monitoring should only listen on localhost - https://phabricator.wikimedia.org/T111934 (10lmata) Untagging observability, there doesn't seem anything for us to do; please re-tag if you need us to engage. Thanks! [14:33:50] (03CR) 10Clément Goubert: [C: 03+1] sre.hosts.decommission: fail on wrong FQDN [cookbooks] - 10https://gerrit.wikimedia.org/r/944919 (owner: 10Volans) [14:35:17] !log cgoubert@cumin1001 START - Cookbook sre.dns.netbox [14:35:18] !log jelto@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: GitLab minor version upgrade [14:35:32] 10SRE, 10serviceops: Nutcracker stats monitoring should only listen on localhost - https://phabricator.wikimedia.org/T111934 (10Joe) 05Open→03Declined [14:35:38] (03PS6) 10Sohom Datta: Add validator userright for pawikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941928 (https://phabricator.wikimedia.org/T341428) [14:35:40] (03PS2) 10Volans: sre.hosts.decommission: fail on wrong FQDN [cookbooks] - 10https://gerrit.wikimedia.org/r/944919 [14:37:03] 10SRE-Sprint-Week-Sustainability-March2023, 10Sustainability (Incident Followup), 10User-MoritzMuehlenhoff: Alert on ECC warnings in SEL - https://phabricator.wikimedia.org/T253810 (10lmata) 
@joanna_borun grooming phab board this week, we feel this is better suited for I/F please retag if you need our assist... [14:38:19] !log cgoubert@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Rename mw[1497-1498] to kubernetes[1025-1026] - cgoubert@cumin1001" [14:38:21] 10SRE-swift-storage, 10SRE Observability: swift hosts (thanos-fe1001, ms-be2012) with failed prometheus-ipmi-exporter services - https://phabricator.wikimedia.org/T311262 (10lmata) 05Open→03Declined Discussed in the today's team meeting, boldly declining. Please re-open if you feel differently. [14:38:45] (03PS1) 10Giuseppe Lavagetto: noc: remove from role::maintenance [puppet] - 10https://gerrit.wikimedia.org/r/944920 [14:38:47] (03PS1) 10Giuseppe Lavagetto: noc: remove profile, module [puppet] - 10https://gerrit.wikimedia.org/r/944921 [14:38:57] !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: GitLab minor version upgrade [14:39:07] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Rename mw[1497-1498] to kubernetes[1025-1026] - cgoubert@cumin1001" [14:39:07] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:41:03] !log volans@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1025.mgmt.eqiad.wmnet with reboot policy FORCED [14:41:29] (03CR) 10Giuseppe Lavagetto: [C: 03+1] service: add discovery for AQS [puppet] - 10https://gerrit.wikimedia.org/r/944888 (https://phabricator.wikimedia.org/T342213) (owner: 10Hnowlan) [14:41:53] !log volans@cumin1001 END (ERROR) - Cookbook sre.hosts.provision (exit_code=97) for host kubernetes1025.mgmt.eqiad.wmnet with reboot policy FORCED [14:42:25] !log volans@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1025.mgmt.eqiad.wmnet with reboot policy FORCED 
[14:42:29] (03CR) 10Hnowlan: [C: 03+2] service: add discovery for AQS [puppet] - 10https://gerrit.wikimedia.org/r/944888 (https://phabricator.wikimedia.org/T342213) (owner: 10Hnowlan) [14:43:04] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudservices1006.mgmt.eqiad.wmnet with reboot policy FORCED [14:43:18] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudservices1006.eqiad.wmnet'] [14:43:49] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudservices1006.eqiad.wmnet'] [14:43:53] (03CR) 10Eevans: [C: 03+1] cassandra: Pass ports in firewall-agnostic format [puppet] - 10https://gerrit.wikimedia.org/r/944896 (owner: 10Muehlenhoff) [14:43:57] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudservices1006.eqiad.wmnet'] [14:44:39] !log installing iperf3 security updates [14:44:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:56] (03CR) 10AikoChou: [C: 03+1] ext-ORES: avoid Lift Wing calls for fiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944916 (https://phabricator.wikimedia.org/T343308) (owner: 10Elukey) [14:45:19] PROBLEM - SSH on config-master2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:45:36] (03CR) 10Ladsgroup: [C: 03+1] ext-ORES: avoid Lift Wing calls for fiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944916 (https://phabricator.wikimedia.org/T343308) (owner: 10Elukey) [14:45:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:45:52] (03CR) 10Elukey: [C: 03+2] 
ext-ORES: avoid Lift Wing calls for fiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944916 (https://phabricator.wikimedia.org/T343308) (owner: 10Elukey) [14:46:23] RECOVERY - SSH on config-master2001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:46:34] jouncebot: nowandnext [14:46:34] For the next 0 hour(s) and 13 minute(s): Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230802T1400) [14:46:34] In 2 hour(s) and 13 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230802T1700) [14:46:39] jouncebot: next [14:46:39] In 2 hour(s) and 13 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230802T1700) [14:46:41] :) [14:46:42] (03PS1) 10Muehlenhoff: Add library hint for iperf3 [puppet] - 10https://gerrit.wikimedia.org/r/944925 [14:46:45] let's gooo [14:46:53] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['pc2016'] [14:48:09] (03Merged) 10jenkins-bot: ext-ORES: avoid Lift Wing calls for fiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944916 (https://phabricator.wikimedia.org/T343308) (owner: 10Elukey) [14:48:46] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1025.mgmt.eqiad.wmnet with reboot policy FORCED [14:49:38] !log elukey@deploy1002 Started scap: Backport for [[gerrit:944916|ext-ORES: avoid Lift Wing calls for fiwiki (T343308)]] [14:49:45] T343308: fiwiki RC filters classify all edits as 'very likely bad faith' - https://phabricator.wikimedia.org/T343308 [14:50:11] !log volans@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1026.mgmt.eqiad.wmnet with reboot policy FORCED [14:50:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - 
https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:51:17] !log elukey@deploy1002 elukey: Backport for [[gerrit:944916|ext-ORES: avoid Lift Wing calls for fiwiki (T343308)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [14:52:09] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1026.mgmt.eqiad.wmnet with reboot policy FORCED [14:52:43] !log elukey@deploy1002 elukey: Continuing with sync [14:53:17] (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for iperf3 [puppet] - 10https://gerrit.wikimedia.org/r/944925 (owner: 10Muehlenhoff) [14:54:57] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudservices1006.eqiad.wmnet'] [14:57:17] PROBLEM - SSH on config-master2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:57:19] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:58:11] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:58:24] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (FY2023/2024-Q1): Allow wmcs cookbooks running on cloudcuminXXXX to write to the SAL - https://phabricator.wikimedia.org/T325756 (10fnegri) [14:58:46] !log elukey@deploy1002 Finished scap: Backport for [[gerrit:944916|ext-ORES: avoid Lift Wing 
calls for fiwiki (T343308)]] (duration: 09m 08s) [14:58:49] T343308: fiwiki RC filters classify all edits as 'very likely bad faith' - https://phabricator.wikimedia.org/T343308 [14:59:14] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review, 10cloud-services-team (FY2023/2024-Q1): [spicerack] support including {project} in SAL messages - https://phabricator.wikimedia.org/T341793 (10fnegri) 05Resolved→03In progress Reopening, because I realized we need to patch al... [14:59:49] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['pc2016'] [15:00:08] 10SRE, 10Traffic: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10Fabfur) [15:00:17] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:00:19] (03Abandoned) 10FNegri: cloudcumin: don't send logs to prod IRC [puppet] - 10https://gerrit.wikimedia.org/r/935461 (https://phabricator.wikimedia.org/T325756) (owner: 10FNegri) [15:02:57] did the silence on config-master expire ? 
[15:03:07] RECOVERY - SSH on config-master2001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:03:11] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:04:21] !log installing gst-plugins-base1.0 security updates [15:04:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:15] !log kamila@deploy1002 helmfile [staging] START helmfile.d/services/benthos-cache-invalidator: apply [15:07:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T342617)', diff saved to https://phabricator.wikimedia.org/P49976 and previous config saved to /var/cache/conftool/dbconfig/20230802-150739-ladsgroup.json [15:07:43] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [15:07:51] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:09:44] (ThanosQueryInstantLatencyHigh) firing: Thanos Query Frontend has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [15:10:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T342617)', diff saved to https://phabricator.wikimedia.org/P49977 and previous config saved to /var/cache/conftool/dbconfig/20230802-151038-ladsgroup.json [15:10:59] PROBLEM - Check systemd state on config-master2001 is CRITICAL: CRITICAL - degraded: The following units failed: dump-conftool-pools.service,envoyproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:11:09] mmhh probably some heavy queries re: thanos-frontend [15:11:50] (03CR) 10Filippo Giunchedi: "Request has been approved, this is ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/943575 (https://phabricator.wikimedia.org/T343122) (owner: 10Filippo Giunchedi) [15:12:25] PROBLEM - SSH on config-master2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:12:55] can someone look after and/or silence config-master? volans maybe ? [15:12:59] 10SRE, 10Traffic: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10Fabfur) Start working on varnishkafka package [15:13:59] godog: the one on alerts.w.o expires in 8 days, checking icinga [15:14:26] apparently there isn't one there [15:14:29] running the decom [15:14:34] *downtime cookbook [15:14:44] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query Frontend has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [15:15:21] RECOVERY - SSH on config-master2001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:15:31] thank you [15:15:54] !log volans@cumin1001 START - Cookbook sre.hosts.downtime for 8 days, 0:00:00 on config-master2001.codfw.wmnet,config-master1001.eqiad.wmnet with reason: WIP hosts to be setup [15:16:06] 10SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users, deployment_members for Mabualruz - https://phabricator.wikimedia.org/T342535 (10fgiunchedi) @thcipriani ping for approval on this re: deployment group, thank you! [15:16:08] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8 days, 0:00:00 on config-master2001.codfw.wmnet,config-master1001.eqiad.wmnet with reason: WIP hosts to be setup [15:16:10] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host pc2016.codfw.wmnet with OS bullseye [15:16:13] {done} [15:16:17] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install pc201[56] - https://phabricator.wikimedia.org/T342163 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host pc2016.codfw.wmnet with OS bullseye [15:16:44] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install pc201[56] - https://phabricator.wikimedia.org/T342163 (10Papaul) [15:17:54] (03CR) 10Volans: [C: 03+2] sre.hosts.decommission: fail on wrong FQDN [cookbooks] - 10https://gerrit.wikimedia.org/r/944919 (owner: 10Volans) [15:18:08] 10SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users, deployment_members for Mabualruz - https://phabricator.wikimedia.org/T342535 (10fgiunchedi) @Mabualruz we'll also need to verify out of band the ssh public key you provided. 
One way is if you publish the same key in your wiki user... [15:18:55] (03PS1) 10Muehlenhoff: docker_registry_ha: Remove Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/944934 [15:21:20] (03PS1) 10Kamila Součková: benthos: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/944936 (https://phabricator.wikimedia.org/T324200) [15:21:28] (03CR) 10CI reject: [V: 04-1] docker_registry_ha: Remove Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/944934 (owner: 10Muehlenhoff) [15:22:17] (03Merged) 10jenkins-bot: sre.hosts.decommission: fail on wrong FQDN [cookbooks] - 10https://gerrit.wikimedia.org/r/944919 (owner: 10Volans) [15:22:32] (03CR) 10Kamila Součková: [C: 03+2] benthos: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/944936 (https://phabricator.wikimedia.org/T324200) (owner: 10Kamila Součková) [15:22:45] (03PS1) 10Muehlenhoff: cloudcephosd: Remove Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/944937 [15:22:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P49978 and previous config saved to /var/cache/conftool/dbconfig/20230802-152245-ladsgroup.json [15:23:13] (03PS2) 10Muehlenhoff: cloudcephosd: Remove Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/944937 [15:23:17] (03Merged) 10jenkins-bot: benthos: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/944936 (https://phabricator.wikimedia.org/T324200) (owner: 10Kamila Součková) [15:24:10] !log Remove dns3002 from cr2-esams and cr3-esams routes in prep for reboot - T335835 [15:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:11] !log kamila@deploy1002 helmfile [staging] START helmfile.d/services/benthos-cache-invalidator: apply [15:25:07] !log kamila@deploy1002 helmfile [staging] DONE helmfile.d/services/benthos-cache-invalidator: apply [15:25:46] 
!log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P49979 and previous config saved to /var/cache/conftool/dbconfig/20230802-152545-ladsgroup.json [15:26:13] (03PS2) 10Muehlenhoff: docker_registry_ha: Remove Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/944934 [15:26:22] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for darthmon_wmde - https://phabricator.wikimedia.org/T342968 (10fgiunchedi) Thank you @MoritzMuehlenhoff @KFrancis ! @darthmon_wmde the following actions are missing (see task description for details) * Sign L3 * Verification of your ssh k... [15:27:18] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for darthmon_wmde - https://phabricator.wikimedia.org/T342968 (10fgiunchedi) [15:27:33] (03PS1) 10BCornwall: common: Remove dns3002 from authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/944940 (https://phabricator.wikimedia.org/T335835) [15:27:44] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['es2025'] [15:27:52] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['es2025'] [15:28:28] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for darthmon_wmde - https://phabricator.wikimedia.org/T342968 (10fgiunchedi) @WMDE-leszek we're seeking approvals as you are listed as an approval party for `releasers-wikibase` group, thank you ! 
[15:28:55] (03CR) 10Hnowlan: [C: 03+2] wmnet: add discovery record for aqs [dns] - 10https://gerrit.wikimedia.org/r/943616 (https://phabricator.wikimedia.org/T342213) (owner: 10Hnowlan) [15:29:25] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for adee_wmde - https://phabricator.wikimedia.org/T342969 (10fgiunchedi) [15:29:33] (03CR) 10Ssingh: [C: 03+1] common: Remove dns3002 from authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/944940 (https://phabricator.wikimedia.org/T335835) (owner: 10BCornwall) [15:29:50] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for adee_wmde - https://phabricator.wikimedia.org/T342969 (10fgiunchedi) [15:30:20] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['es2025'] [15:30:29] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for adee_wmde - https://phabricator.wikimedia.org/T342969 (10fgiunchedi) @WMDE-leszek as approval party for `releasers-wikibase` we're seeking approvals here, thank you ! [15:30:32] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['es2025'] [15:31:23] (03CR) 10BCornwall: [C: 03+2] common: Remove dns3002 from authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/944940 (https://phabricator.wikimedia.org/T335835) (owner: 10BCornwall) [15:31:33] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/944934 (owner: 10Muehlenhoff) [15:32:51] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for darthmon_wmde - https://phabricator.wikimedia.org/T342968 (10fgiunchedi) @darthmon_wmde hello, you mentioned you'll be managing the wikibase releases, as such I take it you'll be added to `approval` in https://github.com/wikimedia/operat... 
[15:35:12] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/944937 (owner: 10Muehlenhoff) [15:35:34] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for roti_WMDE - https://phabricator.wikimedia.org/T342972 (10fgiunchedi) [15:36:21] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:36:35] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for roti_WMDE - https://phabricator.wikimedia.org/T342972 (10fgiunchedi) Hello @roti_WMDE, thank you for the providing the key. We'll need to verify it out of band too, the easiest method being if you publish the same on your wiki user page... [15:36:43] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on pc2016.codfw.wmnet with reason: host reimage [15:36:56] expected BGP alerts in esams, dns host rebooting [15:37:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P49980 and previous config saved to /var/cache/conftool/dbconfig/20230802-153751-ladsgroup.json [15:37:52] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for lojo_wmde - https://phabricator.wikimedia.org/T342973 (10fgiunchedi) @lojo_wmde in addition to the ssh key @darthmon_wmde mentioned, we'll also need the same key published on your wiki user page for out of band verification. Also please... 
[15:37:58] !log cgoubert@cumin1001 START - Cookbook sre.dns.netbox [15:38:00] (03PS1) 10Muehlenhoff: ganeti: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/944943 [15:38:19] !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host dns3002.wikimedia.org [15:38:47] PROBLEM - BGP status on cr3-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:39:02] (03PS2) 10Muehlenhoff: ganeti: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/944943 [15:39:52] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc2016.codfw.wmnet with reason: host reimage [15:40:48] !log kamila@deploy1002 helmfile [staging] START helmfile.d/services/benthos-cache-invalidator: apply [15:40:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P49981 and previous config saved to /var/cache/conftool/dbconfig/20230802-154051-ladsgroup.json [15:41:09] !log kamila@deploy1002 helmfile [staging] DONE helmfile.d/services/benthos-cache-invalidator: apply [15:41:53] brett: will dns3002 resync after the reboot? I'm having a (normal considering) failure on the dns.netbox cookbook on it, and wondering if it's safe to just skip for this host [15:42:15] claime: we are removing it from the authdns/dnsrec lists [15:42:27] can you try again? just a race condition from the removal run I am guessing [15:42:32] sukhe: ack, so just rerun? 
ok [15:42:34] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dns3002.wikimedia.org [15:42:36] and yes, we will put it back ASAP [15:42:50] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/944943 (owner: 10Muehlenhoff) [15:43:03] !log cgoubert@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Fix kubernetes10[25-26] main interfaces - cgoubert@cumin1001" [15:43:16] 10SRE, 10SRE-Access-Requests: Requesting access to Wiki Replicas end-to-end tiers for dr0ptp4kt - https://phabricator.wikimedia.org/T343039 (10fgiunchedi) Thank you for the context @dr0ptp4kt . From reading https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Access_Groups it seems wiki replicas access is... [15:43:24] sukhe: yep, looks good, cookbook advanced, thanks for the confirmation [15:43:28] claime: <3 [15:43:41] <3 back :) [15:43:42] sukhe: here's the example of why I was suggesting to use the active ones instead of the cumin alias :-P [15:43:47] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Fix kubernetes10[25-26] main interfaces - cgoubert@cumin1001" [15:43:47] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:44:01] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/943575 (https://phabricator.wikimedia.org/T343122) (owner: 10Filippo Giunchedi) [15:44:08] volans: ha [15:44:25] (03PS1) 10BCornwall: Revert "common: Remove dns3002 from authdns_servers" [puppet] - 10https://gerrit.wikimedia.org/r/944851 [15:44:26] volans: this is short-lived pain, because once brandon comes back, we are going to be putting these behind BGP anyway :) [15:45:12] !log cgoubert@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1025 [15:45:16] the
suggested non-alias path is more problematic though I feel (it involves more cookbooks) and I think netbox DNS hiccups are acceptable in that sense [15:45:20] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1025 [15:45:24] !log cgoubert@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1026 [15:45:34] 10SRE, 10SRE-Access-Requests, 10Product-Analytics, 10CommRel-Specialists-Support (Jul-Sep-2023): Request Access to Superset querying presto_analytics_hive datasets - https://phabricator.wikimedia.org/T343320 (10fgiunchedi) Hi @ARamadan-WMF, I'm looking at the access guide at https://wikitech.wikimedia.org/... [15:45:40] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1026 [15:45:53] (03CR) 10Filippo Giunchedi: [C: 03+2] admin: add dbrant to 'restricted' [puppet] - 10https://gerrit.wikimedia.org/r/943575 (https://phabricator.wikimedia.org/T343122) (owner: 10Filippo Giunchedi) [15:45:56] the authdns part here is the main concern, recdns is anycasted so a single host in 10.3.0.1 going down is not a problem [15:46:16] RECOVERY - BGP status on cr3-esams is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:46:40] sukhe: how would the BGP part help? we'll need to deploy to all hosts anyway and if a host is down it will fail there right? [15:47:15] 10SRE, 10ops-codfw, 10DBA: codfw: es2025 lost System Board Fan6 - https://phabricator.wikimedia.org/T343254 (10Papaul) FAN 5 is now the one having issues and the server is out of warranty and we do not have any fan on site. I am still doing the firmware upgrade the idrac version was 2.13 so I need to do all...
[15:47:44] (03CR) 10Ssingh: [C: 03+1] Revert "common: Remove dns3002 from authdns_servers" [puppet] - 10https://gerrit.wikimedia.org/r/944851 (owner: 10BCornwall) [15:47:56] (03CR) 10BCornwall: [C: 03+2] Revert "common: Remove dns3002 from authdns_servers" [puppet] - 10https://gerrit.wikimedia.org/r/944851 (owner: 10BCornwall) [15:48:04] volans: it helps in two ways: 1) right now the work involved in depooling a host is manual and involves puppet changes, so no more of that and a cookbook can simply reboot the host quickly, thus resulting in lower downtimes for the host being down2 [15:48:27] * volans waits for 2 [15:48:34] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted for dbrant - https://phabricator.wikimedia.org/T343122 (10fgiunchedi) 05In progress→03Resolved a:03fgiunchedi @Dbrant access will be live in the next 30 min, I'm optimistically resolving the task. Though please do reopen... [15:49:07] 2) the workaround for a host being down would probably involve some soft-failing and retrying but I will admit that needs more thoughts with you. but as compared to this, a host going down and coming back up is a longer wait and the probability of a host being down for service and some other changes being pushed is higher [15:49:22] 10SRE, 10ops-codfw, 10DBA: codfw: es2025 lost System Board Fan6 - https://phabricator.wikimedia.org/T343254 (10Marostegui) How should we proceed then? Could we order one? [15:50:07] could the reboot cookbook be updated to pull any changes after a reboot? then updates could just ignore hosts that are down [15:50:22] (03CR) 10Marostegui: "confirmed no replicas hanging and different row/rack than the current master?" 
[puppet] - 10https://gerrit.wikimedia.org/r/944897 (https://phabricator.wikimedia.org/T342284) (owner: 10Ladsgroup) [15:51:02] taavi: we could manage pulling in updates on the dns hosts easily, the challenge is the other processes failing in the meantime, such as someone running a decom cookbook and then attempting to push changes to a dns host that is down [15:51:22] right now we resolve that by removing the DNS host from A:cumin or A:dns-rec or A:netbox and it serves us as well [15:51:35] there might be a time where in between the removal happening and cumin running that someone tries to push an update, which is what happened how [15:51:38] *now [15:51:52] !log jelto@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: GitLab minor version upgrade [15:51:53] and I think that's fine and acceptable, we will just see how to soft-fail [15:52:15] Yeah it's no biggie, and it actually gives the option to retry [15:52:26] I just wanted to make sure that everything was all right [15:52:32] 10SRE, 10ops-codfw, 10DBA: codfw: es2025 lost System Board Fan6 - https://phabricator.wikimedia.org/T343254 (10Papaul) I will say yes if the server is still going to be in production for a while. [15:52:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T342617)', diff saved to https://phabricator.wikimedia.org/P49982 and previous config saved to /var/cache/conftool/dbconfig/20230802-155258-ladsgroup.json [15:53:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2120.codfw.wmnet with reason: Maintenance [15:53:01] yeah and we have been doing this reboot work since the past week and yours is the first such case.
I am all up for suggestions on how to improve this process, maybe the cookbooks that touch this can retry themselves in case there are race conditions like this but yeah [15:53:04] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [15:53:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2120.codfw.wmnet with reason: Maintenance [15:53:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2120 (T342617)', diff saved to https://phabricator.wikimedia.org/P49983 and previous config saved to /var/cache/conftool/dbconfig/20230802-155319-ladsgroup.json [15:53:41] (03CR) 10Ladsgroup: mariadb: Switch candidate host of s1 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/944897 (https://phabricator.wikimedia.org/T342284) (owner: 10Ladsgroup) [15:53:48] 10SRE, 10Infrastructure-Foundations: Integrate Bookworm 12.1 point update - https://phabricator.wikimedia.org/T343121 (10MoritzMuehlenhoff) [15:53:52] (03CR) 10Clément Goubert: [C: 03+1] docker_registry_ha: Remove Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/944934 (owner: 10Muehlenhoff) [15:54:44] 10SRE, 10ops-codfw, 10DBA: codfw: es2025 lost System Board Fan6 - https://phabricator.wikimedia.org/T343254 (10Marostegui) @wiki_willy when would this server be refreshed? 
[15:54:51] I will file a task for this and we can discuss it and the different approaches [15:55:19] k [15:55:24] thanks [15:55:49] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Rename mw149[7-8] to kubernetes102[5-6] [puppet] - 10https://gerrit.wikimedia.org/r/944895 (https://phabricator.wikimedia.org/T343306) (owner: 10Clément Goubert) [15:55:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T342617)', diff saved to https://phabricator.wikimedia.org/P49984 and previous config saved to /var/cache/conftool/dbconfig/20230802-155558-ladsgroup.json [15:56:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [15:56:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [15:56:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T342617)', diff saved to https://phabricator.wikimedia.org/P49985 and previous config saved to /var/cache/conftool/dbconfig/20230802-155618-ladsgroup.json [15:58:14] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [15:58:44] (03CR) 10Volans: [C: 03+1] "LGTM, thanks for the patch. I have to say that at first sight this method seems currently unused. But good to have it fixed." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/944910 (owner: 10Giuseppe Lavagetto) [15:59:54] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [15:59:55] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc2016.codfw.wmnet with OS bullseye [16:01:23] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install pc201[56] - https://phabricator.wikimedia.org/T342163 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host pc2016.codfw.wmnet with OS bullseye completed: - pc2016 (**WARN**) - Removed from... [16:01:47] 10SRE, 10ops-codfw, 10DBA: codfw: es2025 lost System Board Fan6 - https://phabricator.wikimedia.org/T343254 (10wiki_willy) It's not on the refresh list for this fiscal year; looks like it'll be due for a refresh in FY24-25. If the firmware upgrade on the iDrac doesn't work, we can try sourcing the fan if yo... [16:02:07] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudservices1006.eqiad.wmnet'] [16:02:50] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudservices1006.eqiad.wmnet'] [16:02:50] claime: All done, sorry for the inconvenience! [16:03:12] brett: Absolutely no need to apologize :) Thanks <3 [16:03:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, and 2 others: Q1:rack/setup/install cloudservices1006.eqiad.wmnet - https://phabricator.wikimedia.org/T342161 (10Jclark-ctr) [16:10:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, and 2 others: Q1:rack/setup/install cloudservices1006.eqiad.wmnet - https://phabricator.wikimedia.org/T342161 (10Jclark-ctr) Raid has not been configured yet on server. 
What raid is needed for this server @aborrero @Andrew [16:12:15] (03PS1) 10Samtar: enwiki: temp enable emergencyCaptcha [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944852 [16:12:22] (03PS2) 10Samtar: enwiki: temp enable emergencyCaptcha [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944852 [16:15:40] 10SRE, 10ops-codfw, 10DBA: codfw: es2025 lost System Board Fan6 - https://phabricator.wikimedia.org/T343254 (10RobH) @papaul: Before I order a new fan, I want to confirm that it is indeed the fan and not the fan slot? It sounds like you swapped >>! In T343254#9062738, @Jhancock.wm wrote: > @Marostegui I h... [16:28:03] (03CR) 10Samtar: [C: 04-1] "self -1, awaiting T343294#9063770" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944852 (owner: 10Samtar) [16:32:48] (03CR) 10Cathal Mooney: [C: 03+2] Do not compare speed of disabled interfaces when validating blocks [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/944240 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney) [16:33:10] PROBLEM - Check systemd state on doc2002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-host-data-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:33:26] (03Merged) 10jenkins-bot: Do not compare speed of disabled interfaces when validating blocks [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/944240 (https://phabricator.wikimedia.org/T303529) (owner: 10Cathal Mooney) [16:38:33] 10SRE, 10ops-codfw, 10DBA: codfw: es2025 lost System Board Fan6 - https://phabricator.wikimedia.org/T343254 (10Papaul) @robh you right it is not the fan that has the issue it is the board i will have @Jhancock.wm check if it is the daughter board or if it is the mainboard the fan is plugged in tomorrow when... 
[16:41:06] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['es2025'] [16:42:52] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install pc201[56] - https://phabricator.wikimedia.org/T342163 (10Papaul) [16:43:15] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install pc201[56] - https://phabricator.wikimedia.org/T342163 (10Papaul) 05Open→03Resolved @Marostegui all yours [16:44:45] (03CR) 10Volans: [C: 03+2] mediawiki: call noc on kubernetes [software/spicerack] - 10https://gerrit.wikimedia.org/r/944910 (owner: 10Giuseppe Lavagetto) [16:45:13] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install pc201[56] - https://phabricator.wikimedia.org/T342163 (10Ladsgroup) {meme, src=itshappening} [16:46:27] !log cmooney@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [16:46:30] !log cmooney@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [16:46:36] !log cmooney@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [16:46:42] !log cmooney@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [16:48:35] (03CR) 10Samtar: "T343294#9063884" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944852 (owner: 10Samtar) [16:48:51] (03PS1) 10Eevans: restbase: Upgrade restbase2013 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/944959 (https://phabricator.wikimedia.org/T339298) [16:48:53] (03PS1) 10Eevans: restbase: Upgrade restbase2014 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/944960 (https://phabricator.wikimedia.org/T339298) [16:48:55] (03PS1) 10Eevans: restbase: Upgrade restbase2019 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/944961 (https://phabricator.wikimedia.org/T339298) [16:48:57] (03Merged) 10jenkins-bot: mediawiki: call noc 
on kubernetes [software/spicerack] - 10https://gerrit.wikimedia.org/r/944910 (owner: 10Giuseppe Lavagetto) [16:48:59] (03PS1) 10Eevans: restbase: Upgrade restbase2021 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/944962 (https://phabricator.wikimedia.org/T339298) [16:49:01] (03PS1) 10Eevans: restbase: Upgrade restbase2024 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/944963 (https://phabricator.wikimedia.org/T339298) [16:49:47] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944852 (owner: 10Samtar) [16:50:34] (03Merged) 10jenkins-bot: enwiki: temp enable emergencyCaptcha [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944852 (owner: 10Samtar) [16:51:09] !log samtar@deploy1002 Started scap: Backport for [[gerrit:944852|enwiki: temp enable emergencyCaptcha]] [16:51:47] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/944959 (https://phabricator.wikimedia.org/T339298) (owner: 10Eevans) [16:52:55] !log samtar@deploy1002 samtar: Backport for [[gerrit:944852|enwiki: temp enable emergencyCaptcha]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [16:52:56] !log samtar@deploy1002 samtar: Continuing with sync [16:58:58] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:944852|enwiki: temp enable emergencyCaptcha]] (duration: 07m 48s) [17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230802T1700) [17:05:18] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10Jhancock.wm) a:03Jhancock.wm [17:05:42] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE: Q1:rack/setup/install 
wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (10Jhancock.wm) a:03Jhancock.wm [17:06:22] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10Jhancock.wm) a:03Jhancock.wm [17:29:33] taavi: got a second for a DM? [17:29:46] GenNotability: definitely [17:31:00] RECOVERY - Check systemd state on doc2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:32:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T342617)', diff saved to https://phabricator.wikimedia.org/P49988 and previous config saved to /var/cache/conftool/dbconfig/20230802-173206-ladsgroup.json [17:32:10] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [17:35:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T342617)', diff saved to https://phabricator.wikimedia.org/P49989 and previous config saved to /var/cache/conftool/dbconfig/20230802-173520-ladsgroup.json [17:47:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P49990 and previous config saved to /var/cache/conftool/dbconfig/20230802-174712-ladsgroup.json [17:48:15] (03PS1) 10Ssingh: tox updates: drop older interpreter and update requirements [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/944968 [17:49:19] (03PS2) 10Ssingh: tox updates: drop older interpreter and update requirements [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/944968 [17:50:06] (03CR) 10Ssingh: [C: 03+2] tox updates: drop older interpreter and update requirements [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/944968 (owner: 10Ssingh) [17:50:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db2120', diff saved to https://phabricator.wikimedia.org/P49991 and previous config saved to /var/cache/conftool/dbconfig/20230802-175026-ladsgroup.json [17:53:38] 10SRE, 10LDAP-Access-Requests: Grant wmf and turnilo/superset access for Rae Adimer - https://phabricator.wikimedia.org/T342591 (10RAdimer-WMF) Everything seems to be working! Thanks :) [18:00:04] dancy and jnuche: #bothumor I � Unicode. All rise for Train log triage with CPT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230802T1800). [18:00:04] dancy and jnuche: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki train - Utc-7+Utc-0 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230802T1800). [18:00:42] o/ [18:01:42] (03PS1) 10TrainBranchBot: group1 wikis to 1.41.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944971 (https://phabricator.wikimedia.org/T340248) [18:01:45] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.41.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944971 (https://phabricator.wikimedia.org/T340248) (owner: 10TrainBranchBot) [18:02:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P49993 and previous config saved to /var/cache/conftool/dbconfig/20230802-180218-ladsgroup.json [18:02:37] (03Merged) 10jenkins-bot: group1 wikis to 1.41.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944971 (https://phabricator.wikimedia.org/T340248) (owner: 10TrainBranchBot) [18:05:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P49994 and previous config saved to /var/cache/conftool/dbconfig/20230802-180532-ladsgroup.json [18:10:00] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.41.0-wmf.20 refs T340248 [18:10:05] T340248: 1.41.0-wmf.20 deployment 
blockers - https://phabricator.wikimedia.org/T340248 [18:16:39] !log dancy@deploy1002 Synchronized php: group1 wikis to 1.41.0-wmf.20 refs T340248 (duration: 06m 38s) [18:16:42] T340248: 1.41.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T340248 [18:17:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T342617)', diff saved to https://phabricator.wikimedia.org/P49995 and previous config saved to /var/cache/conftool/dbconfig/20230802-181724-ladsgroup.json [18:17:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [18:17:28] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [18:17:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [18:20:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T342617)', diff saved to https://phabricator.wikimedia.org/P49996 and previous config saved to /var/cache/conftool/dbconfig/20230802-182038-ladsgroup.json [18:20:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2121.codfw.wmnet with reason: Maintenance [18:20:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2121.codfw.wmnet with reason: Maintenance [18:21:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2121 (T342617)', diff saved to https://phabricator.wikimedia.org/P49997 and previous config saved to /var/cache/conftool/dbconfig/20230802-182059-ladsgroup.json [18:22:53] dancy: T343375 possibly related to the train? [18:22:54] T343375: "Module:" synonym on HeWp canceled and breaks the entire site. - https://phabricator.wikimedia.org/T343375 [18:23:00] arwiki is group1 [18:23:06] hewiki* [18:23:47] I will roll back. 
[18:24:23] (03PS1) 10TrainBranchBot: group0 wikis to 1.41.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944975 (https://phabricator.wikimedia.org/T340248) [18:24:25] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.41.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944975 (https://phabricator.wikimedia.org/T340248) (owner: 10TrainBranchBot) [18:24:28] (03PS1) 10Andrew Bogott: Update codfw1dev horizon build [puppet] - 10https://gerrit.wikimedia.org/r/944976 [18:24:30] (03PS1) 10Andrew Bogott: Horizon: switch eqiad1 to a docker-based 2023.1 Horizon deploy [puppet] - 10https://gerrit.wikimedia.org/r/944977 (https://phabricator.wikimedia.org/T341640) [18:25:05] (03Merged) 10jenkins-bot: group0 wikis to 1.41.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944975 (https://phabricator.wikimedia.org/T340248) (owner: 10TrainBranchBot) [18:25:09] (03CR) 10Andrew Bogott: [C: 03+2] Update codfw1dev horizon build [puppet] - 10https://gerrit.wikimedia.org/r/944976 (owner: 10Andrew Bogott) [18:32:01] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.41.0-wmf.20 refs T340248 [18:32:04] T340248: 1.41.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T340248 [18:32:11] (03PS2) 10Andrew Bogott: Horizon: switch eqiad1 to a docker-based 2023.1 Horizon deploy [puppet] - 10https://gerrit.wikimedia.org/r/944977 (https://phabricator.wikimedia.org/T341640) [18:35:11] (03CR) 10Andrew Bogott: [C: 03+2] Horizon: switch eqiad1 to a docker-based 2023.1 Horizon deploy [puppet] - 10https://gerrit.wikimedia.org/r/944977 (https://phabricator.wikimedia.org/T341640) (owner: 10Andrew Bogott) [18:37:23] (03PS1) 10Sbailey: WIP re-enabling parser migration extension on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944978 (https://phabricator.wikimedia.org/T333179) [18:38:25] taavi: Thanks for the heads-up! 
[18:40:11] (03CR) 10Sbailey: "Please comment on the correct driver and other info that should be in commonSettings" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944978 (https://phabricator.wikimedia.org/T333179) (owner: 10Sbailey) [18:51:03] (03PS1) 10Andrew Bogott: rename role::wmcs::openstack::eqiad1::labweb to ::cloudweb [puppet] - 10https://gerrit.wikimedia.org/r/944980 [18:51:13] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['es2025'] [18:56:22] (03PS1) 10Andrew Bogott: Rename labweb.yaml to cloudweb.yaml [labs/private] - 10https://gerrit.wikimedia.org/r/944981 [18:57:40] 10SRE, 10ops-codfw, 10DBA: codfw: es2025 lost System Board Fan6 - https://phabricator.wikimedia.org/T343254 (10Papaul) @Marostegui firmware upgrade complete [18:59:25] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Rename labweb.yaml to cloudweb.yaml [labs/private] - 10https://gerrit.wikimedia.org/r/944981 (owner: 10Andrew Bogott) [19:12:20] (03CR) 10Andrew Bogott: [C: 03+2] rename role::wmcs::openstack::eqiad1::labweb to ::cloudweb [puppet] - 10https://gerrit.wikimedia.org/r/944980 (owner: 10Andrew Bogott) [19:13:40] (03PS2) 10Majavah: conftool-data: Duplicate labweb service as cloudweb [puppet] - 10https://gerrit.wikimedia.org/r/941458 (https://phabricator.wikimedia.org/T317463) [19:13:42] (03PS2) 10Majavah: service: update labweb/cloudweb conftool pool name [puppet] - 10https://gerrit.wikimedia.org/r/941459 (https://phabricator.wikimedia.org/T317463) [19:13:44] (03PS2) 10Majavah: conftool-data: drop labweb pool [puppet] - 10https://gerrit.wikimedia.org/r/941460 (https://phabricator.wikimedia.org/T317463) [19:34:04] !log xcollazo@deploy1002 Started deploy [analytics/refinery@27def33]: Special refinery deploy to fix mediwiki_history_denormalize [analytics/refinery@27def33] [19:34:13] !log kamila@deploy1002 helmfile [staging] START helmfile.d/services/benthos-cache-invalidator: apply [19:34:15] !log kamila@deploy1002 
helmfile [staging] DONE helmfile.d/services/benthos-cache-invalidator: apply [19:39:33] !log kamila@deploy1002 helmfile [staging] START helmfile.d/services/benthos-cache-invalidator: apply [19:39:59] !log kamila@deploy1002 helmfile [staging] DONE helmfile.d/services/benthos-cache-invalidator: apply [19:41:52] !log xcollazo@deploy1002 Finished deploy [analytics/refinery@27def33]: Special refinery deploy to fix mediwiki_history_denormalize [analytics/refinery@27def33] (duration: 07m 48s) [19:43:52] !log xcollazo@deploy1002 Started deploy [analytics/refinery@27def33] (thin): Special refinery deploy to fix mediwiki_history_denormalize THIN [analytics/refinery@27def33] [19:43:57] !log xcollazo@deploy1002 Finished deploy [analytics/refinery@27def33] (thin): Special refinery deploy to fix mediwiki_history_denormalize THIN [analytics/refinery@27def33] (duration: 00m 04s) [19:44:38] !log xcollazo@deploy1002 Started deploy [analytics/refinery@27def33] (hadoop-test): Special refinery deploy to fix mediwiki_history_denormalize TEST [analytics/refinery@27def33] [19:44:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance [19:45:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance [19:45:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T342617)', diff saved to https://phabricator.wikimedia.org/P49998 and previous config saved to /var/cache/conftool/dbconfig/20230802-194518-ladsgroup.json [19:45:21] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [19:46:37] !log xcollazo@deploy1002 Finished deploy [analytics/refinery@27def33] (hadoop-test): Special refinery deploy to fix mediwiki_history_denormalize TEST [analytics/refinery@27def33] (duration: 01m 59s) [19:55:27] (03PS1) 10Ahmon Dancy: Revert "LocalisationCache: Load only core data if 
possible" [core] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/944854 (https://phabricator.wikimedia.org/T342418) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230802T2000) [20:00:05] Sohom_Datta: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:15] o/ [20:04:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T342617)', diff saved to https://phabricator.wikimedia.org/P49999 and previous config saved to /var/cache/conftool/dbconfig/20230802-200401-ladsgroup.json [20:04:05] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [20:07:01] Sohom_Datta: I can deploy for you. [20:07:28] Sure, thank you :) [20:07:32] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by dancy@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941928 (https://phabricator.wikimedia.org/T341428) (owner: 10Sohom Datta) [20:08:14] (03Merged) 10jenkins-bot: Add validator userright for pawikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941928 (https://phabricator.wikimedia.org/T341428) (owner: 10Sohom Datta) [20:08:42] !log dancy@deploy1002 Started scap: Backport for [[gerrit:941928|Add validator userright for pawikisource (T341428)]] [20:08:46] T341428: Creation of a new user right:Validator on Punjabi Wikisource - https://phabricator.wikimedia.org/T341428 [20:10:20] !log dancy@deploy1002 dancy and soda: Backport for [[gerrit:941928|Add validator userright for pawikisource (T341428)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via 
k8s-experimental XWD option) [20:11:00] Sohom_Datta: Please test using a debug server and lemme know if I should proceed. [20:11:45] On it :0 [20:11:49] *:) [20:11:54] 10SRE, 10SRE-OnFire: Grant slightly broader access to Klaxon - https://phabricator.wikimedia.org/T343377 (10RLazarus) [20:13:57] 10SRE, 10SRE-Access-Requests, 10Product-Analytics, 10CommRel-Specialists-Support (Jul-Sep-2023): Request Access to Superset querying presto_analytics_hive datasets - https://phabricator.wikimedia.org/T343320 (10SNowick_WMF) Hi @fgiunchedi I am the analyst working with Amal to get access to data in Superset... [20:15:50] Looks to be working :), Special:Statistics shows the validator userright [20:18:59] (03PS1) 10Stang: newiki: Fix templateeditor config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944983 (https://phabricator.wikimedia.org/T343257) [20:19:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P50000 and previous config saved to /var/cache/conftool/dbconfig/20230802-201907-ladsgroup.json [20:23:45] ok.. 
moving on [20:23:48] !log dancy@deploy1002 dancy and soda: Continuing with sync [20:29:31] !log dancy@deploy1002 Finished scap: Backport for [[gerrit:941928|Add validator userright for pawikisource (T341428)]] (duration: 20m 49s) [20:29:40] T341428: Creation of a new user right:Validator on Punjabi Wikisource - https://phabricator.wikimedia.org/T341428 [20:30:36] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by dancy@deploy1002 using scap backport" [core] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/944854 (https://phabricator.wikimedia.org/T342418) (owner: 10Ahmon Dancy) [20:34:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P50001 and previous config saved to /var/cache/conftool/dbconfig/20230802-203413-ladsgroup.json [20:38:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T342617)', diff saved to https://phabricator.wikimedia.org/P50002 and previous config saved to /var/cache/conftool/dbconfig/20230802-203833-ladsgroup.json [20:38:37] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [20:41:26] !log xcollazo@deploy1002 Started deploy [airflow-dags/analytics@cbce175]: Deploy latest for Airflow analytics instance. [20:41:47] !log xcollazo@deploy1002 Finished deploy [airflow-dags/analytics@cbce175]: Deploy latest for Airflow analytics instance. 
(duration: 00m 20s) [20:46:12] (03Merged) 10jenkins-bot: Revert "LocalisationCache: Load only core data if possible" [core] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/944854 (https://phabricator.wikimedia.org/T342418) (owner: 10Ahmon Dancy) [20:46:39] !log dancy@deploy1002 Started scap: Backport for [[gerrit:944854|Revert "LocalisationCache: Load only core data if possible" (T342418 T343375)]] [20:46:43] T342418: Speed up Language creation - https://phabricator.wikimedia.org/T342418 [20:46:43] T343375: Translation of some key namespaces ("Module:", others?) missing from 1.41.0-wmf.20, breaking any hewiki page that uses that - https://phabricator.wikimedia.org/T343375 [20:48:13] !log dancy@deploy1002 dancy: Backport for [[gerrit:944854|Revert "LocalisationCache: Load only core data if possible" (T342418 T343375)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:49:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T342617)', diff saved to https://phabricator.wikimedia.org/P50003 and previous config saved to /var/cache/conftool/dbconfig/20230802-204919-ladsgroup.json [20:49:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2122.codfw.wmnet with reason: Maintenance [20:49:23] !log dancy@deploy1002 dancy: Continuing with sync [20:49:23] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [20:49:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2122.codfw.wmnet with reason: Maintenance [20:49:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2122 (T342617)', diff saved to https://phabricator.wikimedia.org/P50004 and previous config saved to /var/cache/conftool/dbconfig/20230802-204941-ladsgroup.json 
[20:52:03] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for darthmon_wmde - https://phabricator.wikimedia.org/T342968 (10WMDE-leszek) approved [20:52:39] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for adee_wmde - https://phabricator.wikimedia.org/T342969 (10WMDE-leszek) approved [20:53:13] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for roti_WMDE - https://phabricator.wikimedia.org/T342972 (10WMDE-leszek) explicitly approving this request too [20:53:31] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for lojo_wmde - https://phabricator.wikimedia.org/T342973 (10WMDE-leszek) explicitly approving this request too [20:53:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P50005 and previous config saved to /var/cache/conftool/dbconfig/20230802-205339-ladsgroup.json [20:55:26] !log dancy@deploy1002 Finished scap: Backport for [[gerrit:944854|Revert "LocalisationCache: Load only core data if possible" (T342418 T343375)]] (duration: 08m 47s) [20:55:30] T342418: Speed up Language creation - https://phabricator.wikimedia.org/T342418 [20:55:31] T343375: Translation of some key namespaces ("Module:", others?) 
missing from 1.41.0-wmf.20, breaking any hewiki page that uses that - https://phabricator.wikimedia.org/T343375 [20:55:49] (03PS1) 10TrainBranchBot: group1 wikis to 1.41.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944988 (https://phabricator.wikimedia.org/T340248) [20:55:51] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.41.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944988 (https://phabricator.wikimedia.org/T340248) (owner: 10TrainBranchBot) [20:56:33] (03Merged) 10jenkins-bot: group1 wikis to 1.41.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944988 (https://phabricator.wikimedia.org/T340248) (owner: 10TrainBranchBot) [21:00:05] Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230802T2100) [21:04:07] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.41.0-wmf.20 refs T340248 [21:04:10] T340248: 1.41.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T340248 [21:08:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P50006 and previous config saved to /var/cache/conftool/dbconfig/20230802-210846-ladsgroup.json [21:10:28] !log dancy@deploy1002 Synchronized php: group1 wikis to 1.41.0-wmf.20 refs T340248 (duration: 06m 21s) [21:10:32] T340248: 1.41.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T340248 [21:15:04] (03PS2) 10Krinkle: Profiler: Sync minor changes with arc-lamp.git package [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939756 (https://phabricator.wikimedia.org/T337873) [21:20:44] dancy: deployment clear? [21:20:49] Yep! [21:21:11] it was a bit later today than usual I guess? 
[21:21:28] glad to see we're on group1 [21:23:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T342617)', diff saved to https://phabricator.wikimedia.org/P50007 and previous config saved to /var/cache/conftool/dbconfig/20230802-212352-ladsgroup.json [21:23:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance [21:23:55] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [21:24:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance [21:24:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1191 (T342617)', diff saved to https://phabricator.wikimedia.org/P50008 and previous config saved to /var/cache/conftool/dbconfig/20230802-212412-ladsgroup.json [21:24:40] Krinkle: Spillover from the train window. [21:25:38] dancy: always nice when stuff is fixed quick enough to appear as a longer train window though! [21:25:50] Agreed! [22:00:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, and 2 others: Q1:rack/setup/install cloudservices1006.eqiad.wmnet - https://phabricator.wikimedia.org/T342161 (10Andrew) I'm sure that this only needs a single nic connected unless @aborrero has something truly ambitious in mind. Seems easy enough to leave... [22:02:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, and 2 others: Q1:rack/setup/install cloudservices1006.eqiad.wmnet - https://phabricator.wikimedia.org/T342161 (10Andrew) Oh, raid-wise: the existing cloudservices use ` echo partman/standard.cfg partman/raid1-2dev.cfg ;; \ ` That, or similar, is fine for... [22:04:28] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Maryana Pinchuk - https://phabricator.wikimedia.org/T342797 (10odimitrijevic) Approved. 
[22:06:25] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by krinkle@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939756 (https://phabricator.wikimedia.org/T337873) (owner: 10Krinkle) [22:07:09] (03Merged) 10jenkins-bot: Profiler: Sync minor changes with arc-lamp.git package [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939756 (https://phabricator.wikimedia.org/T337873) (owner: 10Krinkle) [22:07:36] !log krinkle@deploy1002 Started scap: Backport for [[gerrit:939756|Profiler: Sync minor changes with arc-lamp.git package (T337873)]] [22:07:40] T337873: Consider linking excimer-ui-client and arclamp-client directly from wmf-config - https://phabricator.wikimedia.org/T337873 [22:09:24] !log krinkle@deploy1002 krinkle: Backport for [[gerrit:939756|Profiler: Sync minor changes with arc-lamp.git package (T337873)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [22:09:32] (03PS1) 10Jforrester: Wikifunctions: Add TODO task numbers where appropriate [deployment-charts] - 10https://gerrit.wikimedia.org/r/944992 [22:09:47] * Krinkle is trying scap-backport "the normal way" for a couple of small patches [22:12:44] !log krinkle@deploy1002 krinkle: Continuing with sync [22:13:22] (03PS3) 10Krinkle: highlight.php: Remove ?blame=1 from URLs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/940923 (owner: 10Reedy) [22:13:29] (03PS4) 10Krinkle: noc: Remove ?blame=1 from highlight.php URLs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/940923 (owner: 10Reedy) [22:13:32] (03CR) 10Krinkle: [C: 03+2] noc: Remove ?blame=1 from highlight.php URLs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/940923 (owner: 10Reedy) [22:14:20] (03Merged) 10jenkins-bot: noc: Remove ?blame=1 from highlight.php URLs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/940923 (owner: 
10Reedy) [22:15:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T342617)', diff saved to https://phabricator.wikimedia.org/P50009 and previous config saved to /var/cache/conftool/dbconfig/20230802-221547-ladsgroup.json [22:15:51] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [22:16:07] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install lists2001.codfw.wmnet - https://phabricator.wikimedia.org/T342375 (10Papaul) @Ladsgroup hey can you please put in the description what software raid to use for this server? Thanks. [22:17:04] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [22:18:39] !log krinkle@deploy1002 Finished scap: Backport for [[gerrit:939756|Profiler: Sync minor changes with arc-lamp.git package (T337873)]] (duration: 11m 02s) [22:18:42] T337873: Consider linking excimer-ui-client and arclamp-client directly from wmf-config - https://phabricator.wikimedia.org/T337873 [22:19:13] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add switch interface and DNS for lists2001 - pt1979@cumin2002" [22:19:29] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Krinkle) [22:19:37] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, 10User-jijiki: Create a mwdebug deployment for mediawiki on kubernetes - https://phabricator.wikimedia.org/T283056 (10Krinkle) [22:20:00] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add switch interface and DNS for lists2001 - pt1979@cumin2002" [22:20:00] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:21:49] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host lists2001.mgmt.codfw.wmnet with 
reboot policy FORCED [22:22:19] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install lists2001.codfw.wmnet - https://phabricator.wikimedia.org/T342375 (10Papaul) [22:28:41] 10SRE, 10MW-on-K8s, 10serviceops: MW-on-k8s traffic logs fewer errors than expected (increase in jsonTruncated) - https://phabricator.wikimedia.org/T343390 (10Krinkle) [22:30:13] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [22:30:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P50010 and previous config saved to /var/cache/conftool/dbconfig/20230802-223053-ladsgroup.json [22:31:16] !log krinkle@deploy1002 Started scap: Backport for [[gerrit:940923|noc: Remove ?blame=1 from highlight.php URLs]] [22:32:51] !log krinkle@deploy1002 reedy and krinkle: Backport for [[gerrit:940923|noc: Remove ?blame=1 from highlight.php URLs]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [22:32:57] !log krinkle@deploy1002 reedy and krinkle: Continuing with sync [22:33:14] (03PS2) 10Krinkle: mc: Remove mcrouter-with-onhost-tier from ParserCache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937197 (https://phabricator.wikimedia.org/T264604) [22:33:39] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: setup switch port and DNS for titan200[1-2] - pt1979@cumin2002" [22:34:21] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host lists2001.mgmt.codfw.wmnet with reboot policy FORCED [22:34:25] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: setup switch port and DNS for titan200[1-2] - pt1979@cumin2002" [22:34:25] !log 
pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:35:40] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['lists2001'] [22:36:34] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host titan2001.mgmt.codfw.wmnet with reboot policy FORCED [22:39:24] !log krinkle@deploy1002 Finished scap: Backport for [[gerrit:940923|noc: Remove ?blame=1 from highlight.php URLs]] (duration: 08m 07s) [22:39:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T342617)', diff saved to https://phabricator.wikimedia.org/P50011 and previous config saved to /var/cache/conftool/dbconfig/20230802-223948-ladsgroup.json [22:39:52] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [22:44:56] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['lists2001'] [22:45:12] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['lists2001'] [22:45:37] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['lists2001'] [22:46:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P50012 and previous config saved to /var/cache/conftool/dbconfig/20230802-224559-ladsgroup.json [22:51:44] 10SRE-Unowned, 10noc.wikimedia.org: Investigate using php-fpm for noc - https://phabricator.wikimedia.org/T337302 (10Krinkle) 05Open→03Resolved a:03Joe Presumed fixed by {T341859}. 
[22:54:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P50013 and previous config saved to /var/cache/conftool/dbconfig/20230802-225454-ladsgroup.json [23:01:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T342617)', diff saved to https://phabricator.wikimedia.org/P50014 and previous config saved to /var/cache/conftool/dbconfig/20230802-230106-ladsgroup.json [23:01:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance [23:01:10] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [23:01:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance [23:01:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1194 (T342617)', diff saved to https://phabricator.wikimedia.org/P50015 and previous config saved to /var/cache/conftool/dbconfig/20230802-230127-ladsgroup.json [23:08:36] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10phaultfinder) [23:10:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P50016 and previous config saved to /var/cache/conftool/dbconfig/20230802-231001-ladsgroup.json [23:25:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T342617)', diff saved to https://phabricator.wikimedia.org/P50017 and previous config saved to /var/cache/conftool/dbconfig/20230802-232507-ladsgroup.json [23:25:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2150.codfw.wmnet with reason: Maintenance [23:25:12] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [23:25:22] !log 
ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2150.codfw.wmnet with reason: Maintenance [23:25:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2150 (T342617)', diff saved to https://phabricator.wikimedia.org/P50018 and previous config saved to /var/cache/conftool/dbconfig/20230802-232528-ladsgroup.json [23:53:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T342617)', diff saved to https://phabricator.wikimedia.org/P50019 and previous config saved to /var/cache/conftool/dbconfig/20230802-235358-ladsgroup.json [23:54:02] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617