[00:01:43] RECOVERY - ensure kvm processes are running on cloudvirt1051 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[00:03:05] RECOVERY - ensure kvm processes are running on cloudvirt1052 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[00:07:11] RECOVERY - ensure kvm processes are running on cloudvirt1053 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[00:08:53] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[00:11:21] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[00:11:41] PROBLEM - ensure kvm processes are running on cloudvirt1051 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[00:22:34] vgutierrez: Hi! Can I provide you a new prod key for foks?
[00:26:08] (I see it's 2:30am where he is so I'll try later :) )
[00:59:59] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220719T0100)
[01:31:33] PROBLEM - SSH on db1109.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:37:45] (JobUnavailable) firing: Reduced availability for job workhorse in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:42:45] (JobUnavailable) firing: (4) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:49:00] (PS2) Tks4Fish: brwikimedia: Add logo and wordmark for vector-2022 and minerva [mediawiki-config] - https://gerrit.wikimedia.org/r/814372 (https://phabricator.wikimedia.org/T313194)
[01:52:06] (CR) Tks4Fish: brwikimedia: Add logo and wordmark for vector-2022 and minerva (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/814372 (https://phabricator.wikimedia.org/T313194) (owner: Tks4Fish)
[01:52:23] (PS2) Tks4Fish: brwikimedia: Use logo and wordmark in vector-2022 and minerva [mediawiki-config] - https://gerrit.wikimedia.org/r/814373 (https://phabricator.wikimedia.org/T313194)
[01:52:45] (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:01:21] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:04:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[02:05:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[02:05:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[02:06:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[02:07:37] (PS1) TrainBranchBot: Branch commit for wmf/1.39.0-wmf.21 [core] (wmf/1.39.0-wmf.21) - https://gerrit.wikimedia.org/r/814920
[02:07:41] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.39.0-wmf.21 [core] (wmf/1.39.0-wmf.21) - https://gerrit.wikimedia.org/r/814920 (owner: TrainBranchBot)
[02:17:45] (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:22:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:24:28] (Merged) jenkins-bot: Branch commit for wmf/1.39.0-wmf.21 [core] (wmf/1.39.0-wmf.21) - https://gerrit.wikimedia.org/r/814920 (owner: TrainBranchBot)
[02:25:10] (PS1) Andrew Bogott: new cloudcontrols: stagger fernet key syncing a bit more [puppet] - https://gerrit.wikimedia.org/r/814924
[02:27:08] (CR) Andrew Bogott: [C: +2] new cloudcontrols: stagger fernet key syncing a bit more [puppet] - https://gerrit.wikimedia.org/r/814924 (owner: Andrew Bogott)
[02:31:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[02:32:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[02:32:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[02:32:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[02:32:57] RECOVERY - SSH on db1109.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:57:38] RECOVERY - ensure kvm processes are running on cloudvirt1051 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[03:05:52] (PS1) Dzahn: alertmanager: switch IRC channel for gitlab (serviceops-collab) alerts [puppet] - https://gerrit.wikimedia.org/r/814926
[03:06:34] (PS2) Dzahn: alertmanager: switch IRC channel for gitlab (serviceops-collab) alerts [puppet] - https://gerrit.wikimedia.org/r/814926
[03:09:44] (PS3) Dzahn: alertmanager: switch IRC channel for gitlab (serviceops-collab) alerts [puppet] - https://gerrit.wikimedia.org/r/814926
[03:42:35] (CR) Dzahn: "hey all..so either this.. or I am also totally fine with just commenting out the IRC part completely, while keeping the email and ticket p" [puppet] - https://gerrit.wikimedia.org/r/814926 (owner: Dzahn)
[03:49:20] (CR) Dzahn: "the way this works is these "receivers" in alertmanager can be used in any specific check defined elsewhere in puppet code. in this case i" [puppet] - https://gerrit.wikimedia.org/r/814926 (owner: Dzahn)
[05:15:23] (PS1) Marostegui: db2084: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/815080 (https://phabricator.wikimedia.org/T313121)
[05:16:57] (CR) Marostegui: [C: +2] db2084: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/815080 (https://phabricator.wikimedia.org/T313121) (owner: Marostegui)
[05:17:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db2084 from dbctl T313121', diff saved to https://phabricator.wikimedia.org/P31386 and previous config saved to /var/cache/conftool/dbconfig/20220719-051725-marostegui.json
[05:17:30] T313121: decommission db2084 - https://phabricator.wikimedia.org/T313121
[05:24:33] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 135, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:37:29] PROBLEM - SSH on db1109.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:00:05] kormat, marostegui, and Amir1: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Primary database switchover . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220719T0600).
[06:38:55] RECOVERY - SSH on db1109.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:44:14] (PS4) Sohom Datta: Enable edit-in-sequence on Beta Wikisource for testing [mediawiki-config] - https://gerrit.wikimedia.org/r/810054 (https://phabricator.wikimedia.org/T308098)
[06:47:09] (PS1) Marostegui: mariadb: Decommission db2084 [puppet] - https://gerrit.wikimedia.org/r/815202 (https://phabricator.wikimedia.org/T313121)
[06:47:30] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db2084.codfw.wmnet
[06:51:36] !log marostegui@cumin1001 START - Cookbook sre.dns.netbox
[06:52:41] (CR) Samwilson: [C: +1] Enable edit-in-sequence on Beta Wikisource for testing [mediawiki-config] - https://gerrit.wikimedia.org/r/810054 (https://phabricator.wikimedia.org/T308098) (owner: Sohom Datta)
[06:54:18] (CR) Filippo Giunchedi: [C: +1] "Thank you Brett, this is great!" [puppet] - https://gerrit.wikimedia.org/r/814894 (https://phabricator.wikimedia.org/T300723) (owner: BCornwall)
[06:56:05] (CR) Marostegui: [C: +2] mariadb: Decommission db2084 [puppet] - https://gerrit.wikimedia.org/r/815202 (https://phabricator.wikimedia.org/T313121) (owner: Marostegui)
[06:56:07] !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[06:58:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2084.codfw.wmnet
[07:00:05] Amir1 and Urbanecm: Dear deployers, time to do the UTC morning backport window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220719T0700).
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:07:09] (CR) Filippo Giunchedi: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/814915 (https://phabricator.wikimedia.org/T222826) (owner: Cwhite)
[07:07:45] (PS1) Marostegui: db2167: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/815204 (https://phabricator.wikimedia.org/T311493)
[07:10:42] (CR) Marostegui: [C: +2] db2167: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/815204 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[07:11:56] (PS1) Marostegui: instances.yaml: Add db2167 to dbctl [puppet] - https://gerrit.wikimedia.org/r/815205 (https://phabricator.wikimedia.org/T311493)
[07:12:49] (CR) Marostegui: [C: +2] instances.yaml: Add db2167 to dbctl [puppet] - https://gerrit.wikimedia.org/r/815205 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[07:13:47] (CR) Filippo Giunchedi: "See inline, idea LGTM" [puppet] - https://gerrit.wikimedia.org/r/814848 (https://phabricator.wikimedia.org/T309074) (owner: Andrea Denisse)
[07:14:16] (PS1) Marostegui: instances.yaml: Fix db2167's sections [puppet] - https://gerrit.wikimedia.org/r/815207 (https://phabricator.wikimedia.org/T311493)
[07:15:06] (CR) Marostegui: [C: +2] instances.yaml: Fix db2167's sections [puppet] - https://gerrit.wikimedia.org/r/815207 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[07:16:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db2167:3311 and db2167:3318 to dbctl T311493', diff saved to https://phabricator.wikimedia.org/P31389 and previous config saved to /var/cache/conftool/dbconfig/20220719-071656-marostegui.json
[07:17:03] T311493: Productionize db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T311493
[07:18:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Adjust db2167:3311 and db2167:3318 weight T311493', diff saved to https://phabricator.wikimedia.org/P31390 and previous config saved to /var/cache/conftool/dbconfig/20220719-071836-marostegui.json
[07:20:33] (PS1) Marostegui: db2086: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/815208 (https://phabricator.wikimedia.org/T311493)
[07:21:41] (CR) Muehlenhoff: [C: +2] "Thanks! Merging" [puppet] - https://gerrit.wikimedia.org/r/814884 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:21:52] (PS2) Muehlenhoff: diffscan: Add SPDX headers to diffscan profile [puppet] - https://gerrit.wikimedia.org/r/814884 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:22:58] (CR) Muehlenhoff: [C: +2] cassandra: Add SPDX headers to cassandra profile [puppet] - https://gerrit.wikimedia.org/r/814876 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:23:06] (CR) Muehlenhoff: [C: +2] "Thanks! Merging" [puppet] - https://gerrit.wikimedia.org/r/814876 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:23:09] (CR) Marostegui: [C: +2] db2086: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/815208 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[07:23:55] (CR) Muehlenhoff: [C: +2] "Thanks! Merging." [puppet] - https://gerrit.wikimedia.org/r/814877 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:27:43] (CR) Muehlenhoff: [C: +2] "Thanks! Merging" [puppet] - https://gerrit.wikimedia.org/r/814878 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:28:46] (CR) Muehlenhoff: [C: +2] "Thanks! Merging" [puppet] - https://gerrit.wikimedia.org/r/814879 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:31:05] (CR) Muehlenhoff: [C: +2] "Thanks! Merging" [puppet] - https://gerrit.wikimedia.org/r/814880 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:35:14] (CR) Muehlenhoff: [C: +2] "Thanks! Merging." [puppet] - https://gerrit.wikimedia.org/r/814881 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:37:05] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:39:19] PROBLEM - SSH on mw1321.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:41:57] (CR) Muehlenhoff: [C: +2] "Thanks! Merging" [puppet] - https://gerrit.wikimedia.org/r/814882 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:43:01] (CR) Muehlenhoff: [C: +2] "Thanks! Merging" [puppet] - https://gerrit.wikimedia.org/r/814883 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:50:12] (PS1) Vgutierrez: admin: Add new SSH key for Joe Sutherland [puppet] - https://gerrit.wikimedia.org/r/815210
[07:51:57] (CR) Marostegui: [C: -1] "The following roles would get this set to 1, and they are currently running it with 0:" [puppet] - https://gerrit.wikimedia.org/r/813917 (owner: Marostegui)
[07:55:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 8 hosts with reason: Maintenance
[07:55:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 8 hosts with reason: Maintenance
[07:55:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2123.codfw.wmnet with reason: Maintenance
[07:56:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2123.codfw.wmnet with reason: Maintenance
[07:57:55] (PS1) Marostegui: add_gt_lat_int_T312990.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/815211 (https://phabricator.wikimedia.org/T312990)
[08:05:14] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2018.codfw.wmnet
[08:14:14] (CR) RhinosF1: [C: +1] Enable edit-in-sequence on Beta Wikisource for testing [mediawiki-config] - https://gerrit.wikimedia.org/r/810054 (https://phabricator.wikimedia.org/T308098) (owner: Sohom Datta)
[08:15:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2018.codfw.wmnet
[08:22:04] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2018.codfw.wmnet to cluster codfw and group D
[08:23:13] (CR) Ladsgroup: [C: +1] add_gt_lat_int_T312990.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/815211 (https://phabricator.wikimedia.org/T312990) (owner: Marostegui)
[08:23:27] (CR) Marostegui: [C: +2] add_gt_lat_int_T312990.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/815211 (https://phabricator.wikimedia.org/T312990) (owner: Marostegui)
[08:23:51] (Merged) jenkins-bot: add_gt_lat_int_T312990.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/815211 (https://phabricator.wikimedia.org/T312990) (owner: Marostegui)
[08:33:24] (PS1) Zabe: maintain-views: Add pagetriage-copyvio to allowed logtypes [puppet] - https://gerrit.wikimedia.org/r/815215 (https://phabricator.wikimedia.org/T313281)
[08:39:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2018.codfw.wmnet to cluster codfw and group D
[08:52:52] jouncebot: nowandnext
[08:52:52] No deployments scheduled for the next 4 hour(s) and 7 minute(s)
[08:52:52] In 4 hour(s) and 7 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220719T1300)
[08:52:52] In 4 hour(s) and 7 minute(s): Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220719T1300)
[08:52:55] * urbanecm going to ship a security patch
[08:53:50] (CR) Jbond: beaker: add initial beaker files (3 comments) [puppet] - https://gerrit.wikimedia.org/r/809224 (https://phabricator.wikimedia.org/T253635) (owner: Jbond)
[08:57:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[08:58:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[08:58:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[08:58:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[09:00:37] !log Deployed patch for T313205
[09:00:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:39] (PS30) Jbond: beaker: add initial beaker files [puppet] - https://gerrit.wikimedia.org/r/809224 (https://phabricator.wikimedia.org/T253635)
[09:01:34] (PS1) Marostegui: control-mariadb-10.6-bullseye: Drop libaio1 [software] - https://gerrit.wikimedia.org/r/815218 (https://phabricator.wikimedia.org/T311106)
[09:01:40] (CR) CI reject: [V: -1] beaker: add initial beaker files [puppet] - https://gerrit.wikimedia.org/r/809224 (https://phabricator.wikimedia.org/T253635) (owner: Jbond)
[09:03:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[09:04:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[09:04:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[09:05:02] (CR) Jbond: beaker: add a method to hack fixes specific to beaker (1 comment) [puppet] - https://gerrit.wikimedia.org/r/814866 (owner: Jbond)
[09:05:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[09:05:56] (Abandoned) Jbond: test reverting storconfig change [software/puppet-compiler] - https://gerrit.wikimedia.org/r/814826 (owner: Jbond)
[09:09:09] (PS1) Ayounsi: Remove Netbox 2.10 hosts from Puppet before decom [puppet] - https://gerrit.wikimedia.org/r/815219 (https://phabricator.wikimedia.org/T296452)
[09:11:06] (PS2) Marostegui: control-mariadb-10.6-bullseye: Drop libaio1 [software] - https://gerrit.wikimedia.org/r/815218 (https://phabricator.wikimedia.org/T311106)
[09:16:43] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36288/console" [puppet] - https://gerrit.wikimedia.org/r/814824 (https://phabricator.wikimedia.org/T313217) (owner: Jbond)
[09:17:31] (CR) Ladsgroup: core.pp: Make sync_binlog and trx_commit configurable (1 comment) [puppet] - https://gerrit.wikimedia.org/r/813917 (owner: Marostegui)
[09:18:47] (CR) Jelto: [C: +1] "lgtm for the long term solution." [puppet] - https://gerrit.wikimedia.org/r/814926 (owner: Dzahn)
[09:20:47] (CR) Jbond: "lgtm tjust a couple of extra things that where mised" [puppet] - https://gerrit.wikimedia.org/r/815219 (https://phabricator.wikimedia.org/T296452) (owner: Ayounsi)
[09:22:03] (CR) Ayounsi: "https://puppet-compiler.wmflabs.org/pcc-worker1001/36289/" [puppet] - https://gerrit.wikimedia.org/r/815219 (https://phabricator.wikimedia.org/T296452) (owner: Ayounsi)
[09:22:33] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36290/console" [puppet] - https://gerrit.wikimedia.org/r/815219 (https://phabricator.wikimedia.org/T296452) (owner: Ayounsi)
[09:23:36] (PS2) Ayounsi: Remove Netbox 2.10 hosts from Puppet before decom [puppet] - https://gerrit.wikimedia.org/r/815219 (https://phabricator.wikimedia.org/T296452)
[09:24:25] (CR) Muehlenhoff: control-mariadb-10.6-bullseye: Drop libaio1 (1 comment) [software] - https://gerrit.wikimedia.org/r/815218 (https://phabricator.wikimedia.org/T311106) (owner: Marostegui)
[09:24:59] (CR) Ayounsi: "thanks!" [puppet] - https://gerrit.wikimedia.org/r/815219 (https://phabricator.wikimedia.org/T296452) (owner: Ayounsi)
[09:26:09] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/815219 (https://phabricator.wikimedia.org/T296452) (owner: Ayounsi)
[09:26:19] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36291/console" [puppet] - https://gerrit.wikimedia.org/r/814824 (https://phabricator.wikimedia.org/T313217) (owner: Jbond)
[09:26:53] (CR) Ayounsi: [C: +2] Remove Netbox 2.10 hosts from Puppet before decom [puppet] - https://gerrit.wikimedia.org/r/815219 (https://phabricator.wikimedia.org/T296452) (owner: Ayounsi)
[09:29:51] !log ayounsi@cumin1001 START - Cookbook sre.hosts.decommission for hosts netbox2001.wikimedia.org
[09:31:24] (PS3) Marostegui: control-mariadb-10.6-bullseye: Drop libaio1 [software] - https://gerrit.wikimedia.org/r/815218 (https://phabricator.wikimedia.org/T311106)
[09:31:52] (CR) Ayounsi: [C: +2] Decom cookbook: configure switches using cookbook [cookbooks] - https://gerrit.wikimedia.org/r/803262 (owner: Ayounsi)
[09:32:05] (CR) Muehlenhoff: [C: +1] "Looks good!" [software] - https://gerrit.wikimedia.org/r/815218 (https://phabricator.wikimedia.org/T311106) (owner: Marostegui)
[09:32:08] (CR) Volans: [C: +1] "LGTM, make sure to test it both with physical and VM hosts" [cookbooks] - https://gerrit.wikimedia.org/r/803262 (owner: Ayounsi)
[09:32:25] (CR) Marostegui: [C: +2] control-mariadb-10.6-bullseye: Drop libaio1 [software] - https://gerrit.wikimedia.org/r/815218 (https://phabricator.wikimedia.org/T311106) (owner: Marostegui)
[09:33:52] (CR) Volans: "reply inline" [cookbooks] - https://gerrit.wikimedia.org/r/811730 (owner: Ayounsi)
[09:34:01] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox
[09:35:25] (Merged) jenkins-bot: Decom cookbook: configure switches using cookbook [cookbooks] - https://gerrit.wikimedia.org/r/803262 (owner: Ayounsi)
[09:35:27] (Merged) jenkins-bot: control-mariadb-10.6-bullseye: Drop libaio1 [software] - https://gerrit.wikimedia.org/r/815218 (https://phabricator.wikimedia.org/T311106) (owner: Marostegui)
[09:36:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[09:37:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[09:37:22] (PS1) Ayounsi: sre.hosts.decommission: remove the 3 min wait for Netbox [cookbooks] - https://gerrit.wikimedia.org/r/815222
[09:37:45] (JobUnavailable) firing: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:37:58] t/win 30
[09:38:08] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:38:09] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netbox2001.wikimedia.org
[09:39:26] (CR) Volans: [C: +1] "LGTM, let's try to remove it and see what happens" [cookbooks] - https://gerrit.wikimedia.org/r/815222 (owner: Ayounsi)
[09:40:13] (PS2) Ayounsi: sre.hosts.decommission: remove the 3 min wait for Netbox [cookbooks] - https://gerrit.wikimedia.org/r/815222
[09:40:36] !log ayounsi@cumin1001 START - Cookbook sre.hosts.decommission for hosts netbox1001.wikimedia.org
[09:44:34] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox
[09:44:40] (CR) Ayounsi: [C: +2] sre.hosts.decommission: remove the 3 min wait for Netbox [cookbooks] - https://gerrit.wikimedia.org/r/815222 (owner: Ayounsi)
[09:46:10] !log draining ganeti2029 T310483
[09:46:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:36] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:47:45] (JobUnavailable) firing: (2) Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:48:02] (Merged) jenkins-bot: sre.hosts.decommission: remove the 3 min wait for Netbox [cookbooks] - https://gerrit.wikimedia.org/r/815222 (owner: Ayounsi)
[09:48:43] (PS1) Marostegui: mariadb: Productionize db2168 [puppet] - https://gerrit.wikimedia.org/r/815223 (https://phabricator.wikimedia.org/T311493)
[09:48:50] (PS8) Jbond: P:varnish::common: Add support for passing wikimedia_domains [puppet] - https://gerrit.wikimedia.org/r/768766
[09:48:55] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:48:55] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netbox1001.wikimedia.org
[09:49:26] (CR) Marostegui: [C: +1] core.pp: Make sync_binlog and trx_commit configurable (1 comment) [puppet] - https://gerrit.wikimedia.org/r/813917 (owner: Marostegui)
[09:49:41] (CR) Marostegui: [C: +2] mariadb: Productionize db2168 [puppet] - https://gerrit.wikimedia.org/r/815223 (https://phabricator.wikimedia.org/T311493) (owner: Marostegui)
[09:50:06] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36293/console" [puppet] - https://gerrit.wikimedia.org/r/768766 (owner: Jbond)
[09:50:36] (PS3) Majavah: O:openstack: prepare for dedicated rabbit nodes [puppet] - https://gerrit.wikimedia.org/r/813944
[09:50:49] !log ayounsi@cumin1001 START - Cookbook sre.hosts.decommission for hosts netboxdb2001.codfw.wmnet
[09:51:37] (CR) Jbond: "ready for a second review" [puppet] - https://gerrit.wikimedia.org/r/768766 (owner: Jbond)
[09:51:51] (CR) Jbond: [V: +1] P:varnish::common: Add support for passing wikimedia_domains (1 comment) [puppet] - https://gerrit.wikimedia.org/r/768766 (owner: Jbond)
[09:52:17] (PS10) Jbond: C:varnish: Rate limit hotlinking [puppet] - https://gerrit.wikimedia.org/r/768723
[09:52:45] (JobUnavailable) resolved: (2) Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:53:43] (CR) Jbond: C:varnish: Rate limit hotlinking (1 comment) [puppet] - https://gerrit.wikimedia.org/r/768723 (owner: Jbond)
[09:59:14] (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36294/console" [puppet] - https://gerrit.wikimedia.org/r/813944 (owner: Majavah)
[09:59:47] (PS1) Majavah: P:openstack::{nova,neutron}: use new rabbitmq_hosts hiera var [puppet] - https://gerrit.wikimedia.org/r/815224
[10:00:16] (PS1) Vgutierrez: cloud: Replace ats-tls with HAproxy in traffic-cache-atstext-buster [puppet] - https://gerrit.wikimedia.org/r/815225
[10:00:56] (CR) CI reject: [V: -1] P:openstack::{nova,neutron}: use new rabbitmq_hosts hiera var [puppet] - https://gerrit.wikimedia.org/r/815224 (owner: Majavah)
[10:01:20] (PS1) Urbanecm: Initial configuration for blkwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/815226 (https://phabricator.wikimedia.org/T310777)
[10:01:22] (PS1) Urbanecm: Initial configuration for pcmwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/815227 (https://phabricator.wikimedia.org/T310776)
[10:01:24] (PS1) Urbanecm: Initial configuration for guwwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/815228 (https://phabricator.wikimedia.org/T309054)
[10:01:26] (PS1) Urbanecm: Initial configuration for bjnwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/815229 (https://phabricator.wikimedia.org/T312209)
[10:02:01] (PS2) Majavah: P:openstack::{nova,neutron}: use new rabbitmq_hosts hiera var [puppet] - https://gerrit.wikimedia.org/r/815224
[10:02:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[10:02:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[10:02:23] (PS3) Ayounsi: provision cookbook: configure switches using cookbook [cookbooks] - https://gerrit.wikimedia.org/r/811730
[10:02:25] (PS1) Ayounsi: sre.hosts.decommission: don't wait for VM shutdown timeout [cookbooks] - https://gerrit.wikimedia.org/r/815230
[10:02:46] (CR) Vgutierrez: [C: +2] cloud: Replace ats-tls with HAproxy in traffic-cache-atstext-buster [puppet] - https://gerrit.wikimedia.org/r/815225 (owner: Vgutierrez)
[10:03:44] (PS2) Ayounsi: sre.hosts.decommission: don't wait for VM shutdown timeout [cookbooks] - https://gerrit.wikimedia.org/r/815230
[10:04:56] (CR) CI reject: [V: -1] P:openstack::{nova,neutron}: use new rabbitmq_hosts hiera var [puppet] - https://gerrit.wikimedia.org/r/815224 (owner: Majavah)
[10:05:45] !log reboot an-worker1127 - hdfs datanode caused CPU stalls
[10:05:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:06:14] (PS3) Majavah: P:openstack::{nova,neutron}: use new rabbitmq_hosts hiera var [puppet] - https://gerrit.wikimedia.org/r/815224
[10:07:45] (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36296/console" [puppet] - https://gerrit.wikimedia.org/r/815224 (owner: Majavah)
[10:08:59] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-etcd2002.codfw.wmnet with reason: Switch instance to DRBD, T311686
[10:09:04] T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686
[10:09:08] (CR) CI reject: [V: -1] P:openstack::{nova,neutron}: use new rabbitmq_hosts hiera var [puppet] - https://gerrit.wikimedia.org/r/815224 (owner: Majavah)
[10:09:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-etcd2002.codfw.wmnet with reason: Switch instance to DRBD, T311686
[10:09:52] (CR) Majavah: [V: +1] "recheck" [puppet] - https://gerrit.wikimedia.org/r/815224 (owner: Majavah)
[10:15:02] (PS4) Majavah: P:openstack::{nova,neutron}: use new rabbitmq_hosts hiera var [puppet] - https://gerrit.wikimedia.org/r/815224
[10:15:41] RECOVERY - Check systemd state on an-worker1127 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:16:23] (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36297/console" [puppet] - https://gerrit.wikimedia.org/r/815224 (owner: Majavah)
[10:17:33] RECOVERY - puppet last run on an-worker1127 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [10:24:11] (03CR) 10Vgutierrez: P:varnish::common: Add support for passing wikimedia_domains (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/768766 (owner: 10Jbond) [10:24:21] (03PS3) 10Hashar: Send events to Wikimedia EventGate [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/814807 [10:24:41] (03CR) 10Jbond: admin: Add new SSH key for Joe Sutherland (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/815210 (owner: 10Vgutierrez) [10:25:40] (03CR) 10Vgutierrez: admin: Add new SSH key for Joe Sutherland (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/815210 (owner: 10Vgutierrez) [10:27:40] (03PS9) 10Jbond: P:varnish::common: Add support for passing wikimedia_domains [puppet] - 10https://gerrit.wikimedia.org/r/768766 [10:27:57] (03CR) 10Jbond: "updated thanks" [puppet] - 10https://gerrit.wikimedia.org/r/768766 (owner: 10Jbond) [10:29:18] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36298/console" [puppet] - 10https://gerrit.wikimedia.org/r/768766 (owner: 10Jbond) [10:29:51] (03CR) 10Muehlenhoff: sre.hosts.decommission: don't wait for VM shutdown timeout (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/815230 (owner: 10Ayounsi) [10:30:15] (03PS1) 10Vgutierrez: admin: Add new SSH key for jnuche [puppet] - 10https://gerrit.wikimedia.org/r/815233 (https://phabricator.wikimedia.org/T313293) [10:30:45] (03CR) 10Hashar: Send events to Wikimedia EventGate (032 comments) [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/814807 (owner: 10Hashar) [10:31:25] The https://www.mediawiki.org/wiki/MediaWiki_1.39/wmf.21 page seems to be missing commits that would have been in wmf.20 [10:32:12] (03CR) 10Hashar: 
Send events to Wikimedia EventGate (1 comment) [software/gerrit/plugins/events-wikimedia] - https://gerrit.wikimedia.org/r/814807 (owner: Hashar) [10:33:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1127', diff saved to https://phabricator.wikimedia.org/P31392 and previous config saved to /var/cache/conftool/dbconfig/20220719-103341-root.json [10:34:02] (CR) Volans: "reply inline" [cookbooks] - https://gerrit.wikimedia.org/r/815230 (owner: Ayounsi) [10:35:04] Okay. I think MediaWiki 1.39/wmf.20/Changelog and MediaWiki 1.39/wmf.21/Changelog need merging together. [10:39:11] (CR) Hashar: Send events to Wikimedia EventGate (1 comment) [software/gerrit/plugins/events-wikimedia] - https://gerrit.wikimedia.org/r/814807 (owner: Hashar) [10:39:52] (CR) Vgutierrez: admin: Add new SSH key for Joe Sutherland (1 comment) [puppet] - https://gerrit.wikimedia.org/r/815210 (owner: Vgutierrez) [10:40:42] I say that because wmf.20 never happened and the changes that would have been in wmf.20 are in the wmf.21 train so the list of patches going out is not complete. 
[10:43:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1140.eqiad.wmnet with reason: Maintenance [10:43:38] Dreamy_Jazz: probably best to ask -releng as they in charge and it's quieter there [10:43:40] (03CR) 10David Caro: wmcs.labstore: add some alerts for labstore (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/813926 (owner: 10David Caro) [10:43:42] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1140.eqiad.wmnet with reason: Maintenance [10:43:45] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/815230 (owner: 10Ayounsi) [10:43:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1180.eqiad.wmnet with reason: Maintenance [10:44:10] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1180.eqiad.wmnet with reason: Maintenance [10:44:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T312990)', diff saved to https://phabricator.wikimedia.org/P31393 and previous config saved to /var/cache/conftool/dbconfig/20220719-104414-marostegui.json [10:44:18] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [10:45:32] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/815233 (https://phabricator.wikimedia.org/T313293) (owner: 10Vgutierrez) [10:45:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1179.eqiad.wmnet with reason: Maintenance [10:45:46] (03PS4) 10Hashar: Send events to Wikimedia EventGate [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/814807 [10:45:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1179.eqiad.wmnet with reason: Maintenance [10:45:59] (03CR) 10Vgutierrez: [C: 03+2] admin: Add new SSH key for jnuche [puppet] - 
10https://gerrit.wikimedia.org/r/815233 (https://phabricator.wikimedia.org/T313293) (owner: 10Vgutierrez) [10:46:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T312984)', diff saved to https://phabricator.wikimedia.org/P31394 and previous config saved to /var/cache/conftool/dbconfig/20220719-104559-ladsgroup.json [10:46:04] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [10:46:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T312990)', diff saved to https://phabricator.wikimedia.org/P31395 and previous config saved to /var/cache/conftool/dbconfig/20220719-104622-marostegui.json [10:48:06] (03CR) 10Hashar: "On commenting with a +1 I have managed to get a ref-updated and comment-added json events dumped to stderr:" [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/814807 (owner: 10Hashar) [10:52:30] Thanks RhinosF1. I've posted there. [10:54:42] Noted [10:56:30] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [10:56:36] (03PS1) 10David Caro: tests: Add nice message to runbook check test failure [alerts] - 10https://gerrit.wikimedia.org/r/815238 [10:59:42] !log draining ganeti2020 T310483 [10:59:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:05] Urbanecm and Amir1: Your horoscope predicts another unfortunate New wiki creation deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220719T1100). 
[11:00:10] o/ [11:00:16] o/ [11:00:21] let's start then :) [11:00:27] (PS5) Hashar: Send events to Wikimedia EventGate [software/gerrit/plugins/events-wikimedia] - https://gerrit.wikimedia.org/r/814807 [11:00:39] (CR) Urbanecm: [C: +2] Initial configuration for blkwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/815226 (https://phabricator.wikimedia.org/T310777) (owner: Urbanecm) [11:00:51] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:00:52] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netboxdb2001.codfw.wmnet [11:01:27] (Merged) jenkins-bot: Initial configuration for blkwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/815226 (https://phabricator.wikimedia.org/T310777) (owner: Urbanecm) [11:01:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P31396 and previous config saved to /var/cache/conftool/dbconfig/20220719-110127-marostegui.json [11:01:55] (CR) Hashar: Send events to Wikimedia EventGate (1 comment) [software/gerrit/plugins/events-wikimedia] - https://gerrit.wikimedia.org/r/814807 (owner: Hashar) [11:02:07] pulled to mwmaint, running addWiki [11:02:37] and...i got an error :/ https://www.irccloud.com/pastebin/o7rNigR7/ [11:03:16] * urbanecm doesn't know what "Cannot close DBConnRef instance; it must be shareable" means. [11:03:20] Amir1: do you know please? [11:03:31] let me see [11:04:08] it's probably one of those refactors [11:05:17] I need to look at the code [11:06:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [11:07:06] urbanecm: ah, I think you can ignore this part [11:07:12] just remove the line 112 [11:07:29] but now you need to run that part manually (maybe do skip-cluster=main?) 
[11:07:43] yeah, looks so [11:07:50] I'll make a patch for it later [11:07:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [11:07:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [11:07:57] thanks [11:08:32] line removed. checking how far addWiki.php got [11:08:44] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on ml-etcd2002.codfw.wmnet with reason: Switch instance to plain, T311686 [11:08:48] T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 [11:08:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ml-etcd2002.codfw.wmnet with reason: Switch instance to plain, T311686 [11:08:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [11:09:16] db is there, but empty. [11:10:03] yup because that condition only creates the db, not tables [11:10:10] yeah [11:10:25] urbanecm: you need to move line 124 out of the condition [11:10:31] $this->createMainClusterSchema( $dbw, $dbName, $siteGroup ); [11:10:44] that's better than manually sourcing the sql files :D [11:11:02] thanks [11:12:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T312984)', diff saved to https://phabricator.wikimedia.org/P31397 and previous config saved to /var/cache/conftool/dbconfig/20220719-111203-ladsgroup.json [11:12:07] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [11:12:53] Amir1: looks i also need to remove line 395 (same thing, but externalstore), 139 and 164 (both x1), right? [11:13:35] yup [11:13:39] doing [11:14:37] and running with --skipclusters=main [11:15:07] we're progressing. another error. Could not open "/srv/mediawiki/php-1.39.0-wmf.19/extensions/Math/db/mathoid.mysql.sql". that should be easy to fix... 
[11:15:57] looks like that should be extensions/Math/sql/mysql/mathoid.sql and mathlatexml.sql [11:16:03] yeah [11:16:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P31398 and previous config saved to /var/cache/conftool/dbconfig/20220719-111632-marostegui.json [11:16:59] (CR) Jelto: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36299/console" [puppet] - https://gerrit.wikimedia.org/r/812402 (https://phabricator.wikimedia.org/T311746) (owner: Dduvall) [11:17:09] although mathlatexml.sql wasn't even included previously? [11:17:11] taavi: was fixed with https://github.com/wikimedia/mediawiki-extensions-WikimediaMaintenance/commit/f64bcf3f1e41dd3ac5c6a1565f83f95308bc4d68 [11:17:28] yeah [11:18:01] taavi: that's always been there so unused? [11:18:29] * urbanecm manually sources remaining sql files [11:22:25] moving line 124 back to the condition and running the rest [11:22:56] of course. _another_ error. [11:23:04] https://www.irccloud.com/pastebin/XhHPf6YV/ [11:24:58] looks like this is the last step before the email and MassMessage cache revalidation, and is relevant to uploads. blkwiki doesn't need uploads (now), so probably fine to leave broken and fix later. Amir1: what do you think? [11:25:21] sure [11:25:59] on the plus side, all database tables appear to exist. 
[11:26:36] * urbanecm runs the remaining few lines in shell.php [11:27:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P31399 and previous config saved to /var/cache/conftool/dbconfig/20220719-112708-ladsgroup.json [11:27:11] !log remove ganeti 3.0.1-2+deb11u0 from buster-wikimedia, superseded by ganeti 3.0.2-1~deb11u1 from Bullseye 11.4 point release T312637 [11:27:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:17] T312637: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 [11:27:20] urbanecm: i think that script can just be retried cleanly when fixed [11:27:52] no idea why it didn't set backend [11:28:42] WikimediaMaintenance/filebackend/setZoneAccess.php --backend=local-multiwrite might work [11:30:42] (CR) David Caro: "recheck" [alerts] - https://gerrit.wikimedia.org/r/813926 (owner: David Caro) [11:31:09] blkwiki works at mwdebug1001, syncing [11:31:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T312990)', diff saved to https://phabricator.wikimedia.org/P31400 and previous config saved to /var/cache/conftool/dbconfig/20220719-113137-marostegui.json [11:31:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1168.eqiad.wmnet with reason: Maintenance [11:31:42] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [11:31:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1168.eqiad.wmnet with reason: Maintenance [11:31:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T312990)', diff saved to https://phabricator.wikimedia.org/P31401 and previous config saved to /var/cache/conftool/dbconfig/20220719-113158-marostegui.json [11:34:06] !log marostegui@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db1168 (T312990)', diff saved to https://phabricator.wikimedia.org/P31403 and previous config saved to /var/cache/conftool/dbconfig/20220719-113406-marostegui.json [11:34:09] !log urbanecm@deploy1002 Synchronized wmf-config/db-production.php: Creating blkwiki (T310777) (duration: 02m 47s) [11:34:15] T310777: Create Wikipedia Pa'O - https://phabricator.wikimedia.org/T310777 [11:37:26] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (25) node(s) change every puppet run: aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, cloudservices1003, cloudservices1004, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011, ms-fe2012, thanos-fe1002, thanos-fe1003, thanos-fe2001, thanos-fe2002, thanos-fe20 [11:37:26] ://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [11:37:29] !log urbanecm@deploy1002 Synchronized dblists: Creating blkwiki (T310777) (duration: 02m 52s) [11:40:24] PROBLEM - Check systemd state on parse2017 is CRITICAL: CRITICAL - degraded: The following units failed: php7.4-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:00] !log urbanecm@deploy1002 rebuilt and synchronized wikiversions files: Creating blkwiki (T310777) [11:41:04] T310777: Create Wikipedia Pa'O - https://phabricator.wikimedia.org/T310777 [11:42:06] i made user 3, only behind urbanec.m and maintenance script [11:42:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P31404 and previous config saved to /var/cache/conftool/dbconfig/20220719-114214-ladsgroup.json [11:43:57] !log urbanecm@deploy1002 Synchronized static/images/project-logos/: Creating blkwiki (T310777) (duration: 02m 56s) [11:44:20] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is 
CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [11:44:42] (03PS31) 10Jbond: beaker: add initial beaker files [puppet] - 10https://gerrit.wikimedia.org/r/809224 (https://phabricator.wikimedia.org/T253635) [11:46:18] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on ml-etcd2001.codfw.wmnet with reason: Switch instance to DRBD, T311686 [11:46:23] T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 [11:46:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ml-etcd2001.codfw.wmnet with reason: Switch instance to DRBD, T311686 [11:46:47] !log urbanecm@deploy1002 Synchronized wmf-config/logos.php: Creating blkwiki (T310777) (duration: 02m 49s) [11:46:50] T310777: Create Wikipedia Pa'O - https://phabricator.wikimedia.org/T310777 [11:48:10] Amir1: fyi phabricatorized the errors as T313302. [11:48:10] T313302: addWiki.php is broken (2022-07) - https://phabricator.wikimedia.org/T313302 [11:48:21] Thanks! [11:49:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P31405 and previous config saved to /var/cache/conftool/dbconfig/20220719-114911-marostegui.json [11:49:23] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Creating blkwiki (T310777) (duration: 02m 35s) [11:51:50] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [11:51:50] PROBLEM - k8s requests count to the API on ml-serve-ctrl2002 is CRITICAL: 102.1 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [11:52:05] !log urbanecm@deploy1002 Synchronized langlist: Creating blkwiki (T310777) (duration: 02m 42s) [11:52:09] T310777: Create Wikipedia Pa'O - https://phabricator.wikimedia.org/T310777 [11:52:28] okay, syncs finished [11:52:44] let's update interwiki cache and call it a wiki [11:52:55] other three will need to wait for T313302 to be resolved [11:53:04] PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:53:06] Hello, I would like to ask what's the procedure to change the subdomain from zh-yue back to yue for zh_yuewiki ( yue.wikipedia.org / zh-yue.wikipedia.org ). [11:53:38] Winston_Sung[m]: a very messy one [11:54:02] What do you mean by "messy"? [11:54:21] I think changing the domain name is not so bad [11:54:41] not as bad as changing the DB name [11:54:45] TimStarling: don't db names have to match URLs? [11:55:18] The yue community asked for a long time and would like this issue to be solved as soon as possible. [11:55:29] Winston_Sung[m]: is there a task [11:55:35] (03CR) 10Jelto: [V: 03+1 C: 03+2] "lgtm now. I tested both Exec['gitlab-runner-config-subst-token'] and Exec['gitlab-runner-save-auth-token'] locally." 
[puppet] - 10https://gerrit.wikimedia.org/r/812402 (https://phabricator.wikimedia.org/T311746) (owner: 10Dduvall) [11:56:32] Also for https://phabricator.wikimedia.org/T10217 [11:56:36] https://phabricator.wikimedia.org/T10217 [11:56:39] TimStarling: Winston_Sung[m] The biggest blocker of renaming a wiki is wikidata support [11:56:45] (03PS1) 10Urbanecm: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815247 (https://phabricator.wikimedia.org/T310777) [11:56:47] (03CR) 10Urbanecm: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815247 (https://phabricator.wikimedia.org/T310777) (owner: 10Urbanecm) [11:56:54] we've done a couple of domain name changes recently, but all of those for wikis that aren't connected to wikidata, and AIUI renaming a wikidata wiki has some issues [11:56:54] see $staticMappings in MWMultiVersion::setSiteInfoForWiki() [11:57:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T312984)', diff saved to https://phabricator.wikimedia.org/P31406 and previous config saved to /var/cache/conftool/dbconfig/20220719-115719-ladsgroup.json [11:57:23] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [11:57:29] I looked for ways to get it fixed one or two years ago [11:57:45] does anyone know how does one push to gerrit from deployment host those days? 
the push URL changed to ssh one for some reason [11:57:53] it just needs wikidata team fixing it and then we can move forward [11:58:02] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815247 (https://phabricator.wikimedia.org/T310777) (owner: 10Urbanecm) [11:58:05] you see there that it would join one other wiki, be-tarask.wikipedia.org which is be_x_oldwiki [11:58:13] i added this to my ~/.gitconfig temporarily, so i can push via the old way https://www.irccloud.com/pastebin/XZnMYSWM/ [11:58:16] but it's good to have a friend right? [11:58:23] So the biggest blocker is the site Wikidata itself or the extension Wikibase? [11:58:57] wikibase support of dbname aliases basically [11:59:38] e.g. right now, you can't insert be_tarask or be-tarask in sitelink lookup in wikidata [11:59:57] T111822 [11:59:58] T111822: [Bug] Adding sitelinks at be-tarask Wikipedia doesn't work - https://phabricator.wikimedia.org/T111822 [12:00:16] T112426 [12:00:17] T112426: [Bug] Querying Wikipedia for langlinks doesn't work for be-tarask, but works for be-x-old - https://phabricator.wikimedia.org/T112426 [12:00:25] T114772 [12:00:25] T114772: Allow entering Wikidata sitelinks to wikis that have non-typical wiki ID (not matching the database name) - https://phabricator.wikimedia.org/T114772 [12:00:28] !log upgrading ganeti/ulsfo to 3.0.2 T312637 [12:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:31] T312637: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 [12:00:59] once these are resolved, we can migrate a lot of wikis that are waiting, I suggest you talk to Wikidata's PM (Lydia) Winston_Sung[m] [12:01:26] !log urbanecm@deploy1002 Synchronized wmf-config/interwiki.php: Update interwiki cache (T310777) (duration: 02m 49s) [12:01:30] T310777: Create Wikipedia Pa'O - https://phabricator.wikimedia.org/T310777 [12:02:10] urbanecm: you can do the interwiki cache the 
latest :D [12:02:18] (after all wikis are done) [12:02:52] OK, so the issue is basically https://phabricator.wikimedia.org/T172035 ? [12:03:02] Amir1: yeah, i know. I decided to postpone the other three wikis until T313302 is fixed [12:03:03] T313302: addWiki.php is broken (2022-07) - https://phabricator.wikimedia.org/T313302 [12:03:25] urbanecm: I see, sgtm [12:03:47] Okay. Thanks for the information. [12:03:52] Winston_Sung[m]: not all tbh, I'm not sure CX support is really a blocker (or not even already fixed) [12:04:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [12:04:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P31407 and previous config saved to /var/cache/conftool/dbconfig/20220719-120416-marostegui.json [12:05:11] Another thing is that Wikipedia-yue would like to temporarily disable the zh-related fallback until they decide whether to accept zh-hans translations appearing in the user interface in yue. [12:05:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [12:05:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [12:05:51] that's the other Amir's area of expertise [12:05:54] So what's the procedure to backport https://gerrit.wikimedia.org/r/811417 to Group 2 (1.39.0-wmf.19)? [12:06:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [12:06:21] https://wikitech.wikimedia.org/wiki/Deployments [12:06:30] > not all tbh, I'm not sure CX support is really a blocker (or not even already fixed) [12:06:30] Okay. Thanks for the info. [12:06:31] Winston_Sung[m]: it first needs to be reviewed (and merged) [12:06:33] you might get it by end of the week tho 
[12:06:48] oh it's not merged yet [12:06:50] fun [12:07:18] > It first needs to be reviewed (and merged) [12:07:18] Well, sadly, no one reviewed it yet. [12:07:22] yeah [12:07:24] i added Amir A. as a reviewer, so they can take a look. [12:08:39] Winston_Sung[m]: https://www.mediawiki.org/wiki/Gerrit/Code_review/Getting_reviews covers "how to get a review". basically, once a patch is merged, it will be deployed automatically (usually, it takes about a week) [12:09:03] #translatewiki telegram is the best place to find the other Amir [12:09:15] backports are only needed when a patch that's merged, but not yet deployed, needs to be deployed more quickly. [12:10:05] (PS1) KartikMistry: Enable ContentTranslation out of Beta for sswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/815251 (https://phabricator.wikimedia.org/T309384) [12:11:48] PROBLEM - k8s requests count to the API on ml-serve-ctrl2002 is CRITICAL: 101.9 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [12:12:27] "The Other Am/ir" is a great band name omg [12:12:48] (PS1) David Caro: novafullstack: use the correct metric [alerts] - https://gerrit.wikimedia.org/r/815253 [12:14:56] Thanks. [12:14:56] I've contacted Amire80 in #translatewiki. 
[12:17:42] (03CR) 10David Caro: [C: 03+2] novafullstack: use the correct metric [alerts] - 10https://gerrit.wikimedia.org/r/815253 (owner: 10David Caro) [12:18:46] (03CR) 10Ayounsi: [C: 03+2] sre.hosts.decommission: don't wait for VM shutdown timeout [cookbooks] - 10https://gerrit.wikimedia.org/r/815230 (owner: 10Ayounsi) [12:18:52] (03PS3) 10Ayounsi: sre.hosts.decommission: don't wait for VM shutdown timeout [cookbooks] - 10https://gerrit.wikimedia.org/r/815230 [12:19:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T312990)', diff saved to https://phabricator.wikimedia.org/P31408 and previous config saved to /var/cache/conftool/dbconfig/20220719-121921-marostegui.json [12:19:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1096.eqiad.wmnet with reason: Maintenance [12:19:29] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [12:19:36] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1096.eqiad.wmnet with reason: Maintenance [12:19:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3316 (T312990)', diff saved to https://phabricator.wikimedia.org/P31409 and previous config saved to /var/cache/conftool/dbconfig/20220719-121941-marostegui.json [12:20:38] (03Merged) 10jenkins-bot: novafullstack: use the correct metric [alerts] - 10https://gerrit.wikimedia.org/r/815253 (owner: 10David Caro) [12:21:46] PROBLEM - k8s requests count to the API on ml-serve-ctrl2002 is CRITICAL: 101.7 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [12:22:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T312990)', diff saved to https://phabricator.wikimedia.org/P31411 and previous config saved to 
/var/cache/conftool/dbconfig/20220719-122201-marostegui.json [12:25:51] !log ayounsi@cumin1001 START - Cookbook sre.hosts.decommission for hosts netboxdb1001.eqiad.wmnet [12:26:56] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [12:28:27] (03PS1) 10Jelto: Revert "gitlab_runner: Handle changes to runner config" [puppet] - 10https://gerrit.wikimedia.org/r/815267 [12:30:08] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [12:30:09] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netboxdb1001.eqiad.wmnet [12:33:44] (03CR) 10Filippo Giunchedi: [C: 03+1] "Thank you !" [alerts] - 10https://gerrit.wikimedia.org/r/815238 (owner: 10David Caro) [12:37:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P31412 and previous config saved to /var/cache/conftool/dbconfig/20220719-123706-marostegui.json [12:40:21] (03CR) 10Jelto: "@Dduvall puppet runs fail with" [puppet] - 10https://gerrit.wikimedia.org/r/815267 (owner: 10Jelto) [12:40:55] (03CR) 10David Caro: [C: 03+2] tests: Add nice message to runbook check test failure [alerts] - 10https://gerrit.wikimedia.org/r/815238 (owner: 10David Caro) [12:41:36] (03CR) 10Jelto: [C: 03+2] Revert "gitlab_runner: Handle changes to runner config" [puppet] - 10https://gerrit.wikimedia.org/r/815267 (owner: 10Jelto) [12:43:04] (03PS1) 10Ayounsi: Revert "sre.hosts.decommission: remove the 3 min wait for Netbox" [cookbooks] - 10https://gerrit.wikimedia.org/r/815268 [12:43:20] (03PS12) 10Filippo Giunchedi: WIP irc check via blackbox [puppet] - 10https://gerrit.wikimedia.org/r/805815 [12:43:22] (03PS4) 10Filippo Giunchedi: phabricator: switch to prometheus-only network probes/checks [puppet] - 10https://gerrit.wikimedia.org/r/812846 (https://phabricator.wikimedia.org/T305847) [12:43:24] (03PS1) 10Filippo Giunchedi: icinga: set 'instance' for commons probe [puppet] - 
10https://gerrit.wikimedia.org/r/815258 (https://phabricator.wikimedia.org/T305847) [12:43:32] (03PS2) 10Ayounsi: Revert "sre.hosts.decommission: remove the 3 min wait for Netbox" [cookbooks] - 10https://gerrit.wikimedia.org/r/815268 [12:44:09] After taking a look at the previous issue, I got another question: Why "don't" we change the real DB name for be-taraskwiki? Is there any technical issue or reason? [12:45:11] !log ayounsi@cumin1001 START - Cookbook sre.hosts.decommission for hosts netboxdb1001.eqiad.wmnet [12:45:17] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [12:46:02] because changing a database name and all references to it is rather hard [12:48:23] (03PS1) 10David Caro: openstack.galera: double the max connections [puppet] - 10https://gerrit.wikimedia.org/r/815259 [12:48:25] (03PS2) 10Filippo Giunchedi: icinga: set 'instance' for commons probe [puppet] - 10https://gerrit.wikimedia.org/r/815258 (https://phabricator.wikimedia.org/T305847) [12:49:59] Tech debt.png [12:50:37] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:50:38] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts netboxdb1001.eqiad.wmnet [12:52:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P31413 and previous config saved to /var/cache/conftool/dbconfig/20220719-125211-marostegui.json [12:52:31] (03CR) 10Filippo Giunchedi: [C: 03+2] icinga: set 'instance' for commons probe [puppet] - 10https://gerrit.wikimedia.org/r/815258 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [12:53:49] (03CR) 10David Caro: [C: 03+2] "This is a temporary measure" [puppet] - 10https://gerrit.wikimedia.org/r/815259 (owner: 10David Caro) [12:53:50] Is this the reason why we don't use https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaMaintenance/+/156189 ? 
[12:54:03] RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:56:50] (03CR) 10Ayounsi: [C: 03+2] Revert "sre.hosts.decommission: remove the 3 min wait for Netbox" [cookbooks] - 10https://gerrit.wikimedia.org/r/815268 (owner: 10Ayounsi) [12:57:54] it's complicated [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: #bothumor I ♥ Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220719T1300). [13:00:05] Tks4Fish: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:05] Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220719T1300) [13:00:16] (03Merged) 10jenkins-bot: Revert "sre.hosts.decommission: remove the 3 min wait for Netbox" [cookbooks] - 10https://gerrit.wikimedia.org/r/815268 (owner: 10Ayounsi) [13:03:44] O/ [13:04:48] I can *maybe* deploy later in the hour, 13:45 or so [13:04:56] right now I’m busy and after that my brain will need a bit of a break, I’m afraid [13:05:08] Tks4Fish: feel free to ping me later if no one else shows up to deploy [13:05:15] (forgot to put “UTC” after that time) [13:05:48] Lucas_WMDE: Tks4Fish I can do the deployments ;) [13:05:55] \o/ [13:06:10] Sure, I can wait a bit, it's even better for me heh [13:06:19] (03PS1) 10Ssingh: durum: add IP version (v4/v6) in test result [puppet] - 10https://gerrit.wikimedia.org/r/815264 [13:06:23] Just ping me when you're available :) [13:07:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T312990)', diff saved to https://phabricator.wikimedia.org/P31414 and previous config saved to /var/cache/conftool/dbconfig/20220719-130716-marostegui.json [13:07:18] !log 
marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1098.eqiad.wmnet with reason: Maintenance [13:07:21] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [13:07:21] So if we don't change the actual DB name, why don't we use a DB name converter for it? [13:07:25] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36300/console" [puppet] - 10https://gerrit.wikimedia.org/r/815264 (owner: 10Ssingh) [13:07:31] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1098.eqiad.wmnet with reason: Maintenance [13:07:32] Tks4Fish: I think hashar was volunteering to deploy now? [13:07:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T312990)', diff saved to https://phabricator.wikimedia.org/P31415 and previous config saved to /var/cache/conftool/dbconfig/20220719-130736-marostegui.json [13:07:39] yeah can do it [13:07:43] thanks a lot! [13:07:45] I am looking at the changes [13:07:53] (03CR) 10Hashar: [C: 03+2] brwikimedia: Add logo and wordmark for vector-2022 and minerva [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814372 (https://phabricator.wikimedia.org/T313194) (owner: 10Tks4Fish) [13:07:57] (03PS1) 10MSantos: mobileapps: bump to 2022-07-19-125630-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/815265 [13:08:17] Ohhh I saw the Lucas and missed the hashar before it lol [13:08:40] (03Merged) 10jenkins-bot: brwikimedia: Add logo and wordmark for vector-2022 and minerva [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814372 (https://phabricator.wikimedia.org/T313194) (owner: 10Tks4Fish) [13:08:52] (03PS2) 10Ssingh: durum: add IP version (v4/v6) in test result [puppet] - 10https://gerrit.wikimedia.org/r/815264 [13:08:58] Tks4Fish: do you know how to test the patch on mwdebug hosts? 
[13:09:39] else I will test it ;) [13:09:41] (03PS1) 10MSantos: proton: bump to 2022-07-14-103746-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/815286 [13:09:44] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36301/console" [puppet] - 10https://gerrit.wikimedia.org/r/815264 (owner: 10Ssingh) [13:09:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T312990)', diff saved to https://phabricator.wikimedia.org/P31416 and previous config saved to /var/cache/conftool/dbconfig/20220719-130956-marostegui.json [13:10:25] pulled on mwdebug1001 [13:10:35] If you can test it, as I'm a bit here and there rn so don't want to keep you waiting, otherwise I can test in a moment [13:11:32] well I don't know what this change is changing exactly :] [13:11:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:11:53] https://br.m.wikimedia.org/wiki/P%C3%A1gina_principal doesn't show any diff then I don't even see a logo there [13:11:54] hashar: it adds a wordmark and logo to 2 skins [13:11:59] i don't see the logo [13:12:06] but it might be cache [13:12:08] oh [13:12:26] I guess we need the config change as https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/814373/2/wmf-config/InitialiseSettings.php [13:12:35] (03PS3) 10Hashar: brwikimedia: Use logo and wordmark in vector-2022 and minerva [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814373 (https://phabricator.wikimedia.org/T313194) (owner: 10Tks4Fish) [13:12:35] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:12:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:12:41] (03CR) 10Hashar: [C: 03+2] brwikimedia: Use logo and wordmark in vector-2022 and minerva 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/814373 (https://phabricator.wikimedia.org/T313194) (owner: 10Tks4Fish) [13:12:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:13:10] Yep, one just adds the files, the other uses it :) [13:13:28] hashar: i can test [13:13:32] I guess you could try that the files are available under /static/ [13:13:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:13:41] ok, the other change was already +2ed [13:13:55] (03Merged) 10jenkins-bot: brwikimedia: Use logo and wordmark in vector-2022 and minerva [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814373 (https://phabricator.wikimedia.org/T313194) (owner: 10Tks4Fish) [13:14:12] I will deploy that config change to mwdebug once the svg files have been synced [13:14:41] ok [13:14:54] it restarts php-fpm bah [13:15:09] yup, takes roughly three minutes now [13:16:09] (03CR) 10MSantos: [C: 03+2] mobileapps: bump to 2022-07-19-125630-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/815265 (owner: 10MSantos) [13:16:15] (03CR) 10MSantos: [C: 03+2] proton: bump to 2022-07-14-103746-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/815286 (owner: 10MSantos) [13:16:21] !log hashar@deploy1002 Synchronized static/images/mobile/copyright: Config: [[gerrit:814372|brwikimedia: Add logo and wordmark for vector-2022 and minerva (T313194)]] (duration: 02m 57s) [13:16:25] T313194: Add logo and wordmark for brwikimedia - https://phabricator.wikimedia.org/T313194 [13:17:06] I have pulled the change on mwdebug1001 [13:17:27] hashar: vector fine [13:17:51] hashar: looks to be good now [13:18:02] awesome thank you RhinosF1 ! 
[13:18:27] np [13:18:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:19:03] (03PS1) 10Jbond: raid_fact: add new refactered raid fact [puppet] - 10https://gerrit.wikimedia.org/r/815287 (https://phabricator.wikimedia.org/T313312) [13:19:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:19:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:19:43] (03CR) 10Jbond: Extend custom raid fact to support Perc 750 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/809913 (https://phabricator.wikimedia.org/T297913) (owner: 10Muehlenhoff) [13:19:59] (03Merged) 10jenkins-bot: mobileapps: bump to 2022-07-19-125630-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/815265 (owner: 10MSantos) [13:20:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:20:57] (03CR) 10Filippo Giunchedi: phabricator: switch to prometheus-only network probes/checks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/812846 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [13:21:02] !log hashar@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:814373|brwikimedia: Use logo and wordmark in vector-2022 and minerva (T313194)]] (duration: 02m 48s) [13:21:10] (03Merged) 10jenkins-bot: proton: bump to 2022-07-14-103746-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/815286 (owner: 10MSantos) [13:22:21] !log mbsantos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [13:22:51] !log mbsantos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [13:23:11] Tks4Fish: should be deployed now [13:23:12] hashar: I'm not seeing the change to vector-2022 on prod now, I can see it on debug. [13:23:28] maybe there is some caching involved? 
[13:23:30] !log mbsantos@deploy1002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [13:23:45] Thanks hashar and RhinosF1 :) [13:24:13] RhinosF1, hashar: could be https://phabricator.wikimedia.org/T311788 ? [13:24:22] !log mbsantos@deploy1002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [13:24:26] if it persists I would try a re-sync [13:24:57] (03CR) 10Ayounsi: Netbox _get_circuits: add patch panel support (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/813604 (https://phabricator.wikimedia.org/T304710) (owner: 10Ayounsi) [13:25:00] hashar: working now [13:25:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P31417 and previous config saved to /var/cache/conftool/dbconfig/20220719-132501-marostegui.json [13:25:05] :] [13:25:17] the opcache invalidation is a bit funky [13:25:27] !log mbsantos@deploy1002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [13:25:29] Lucas_WMDE: i think it likely a cache. 
As soon as I logged in, it worked [13:25:41] logged out now picks it up [13:25:43] it also seems to vary by host [13:26:01] last time I had this issue, the config change seemed to be effective on some mw hosts and not on others [13:26:17] !log mbsantos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [13:26:29] Lucas_WMDE: i spammed force refresh and it seems fine now [13:26:33] ok [13:26:45] when I once looked at how php invalidates the opcache, it was based on the files last modification time and I am guessing there might well be some race condition [13:26:56] https://br.wikimedia.org/w/index.php?title=P%C3%A1gina_principal&useskin=vector-2022 should show a picture instead of text if it's fine [13:27:09] !log mbsantos@deploy1002 helmfile [staging] START helmfile.d/services/proton: apply [13:27:22] I think I still get the old logo on some force reloads [13:27:40] Lucas_WMDE: old is just saying Wikimedia Brazil [13:27:47] yes [13:27:54] hmm [13:28:03] can you tell what hosts [13:28:07] there we go, got “Wikimedia Brasil” from mw1385 [13:28:09] !log mbsantos@deploy1002 helmfile [staging] DONE helmfile.d/services/proton: apply [13:28:13] (had to reload with network panel open) [13:28:28] (assuming it’s the index.php request that’s relevant and not any load.php) [13:28:57] !log mbsantos@deploy1002 helmfile [codfw] START helmfile.d/services/proton: apply [13:29:35] hashar: maybe resync [13:29:51] ok [13:29:58] maybe I can touch it [13:30:06] mw1452 too [13:30:18] resync sounds god [13:30:20] *good [13:30:34] !log mbsantos@deploy1002 helmfile [codfw] DONE helmfile.d/services/proton: apply [13:32:06] !log mbsantos@deploy1002 helmfile [eqiad] START helmfile.d/services/proton: apply [13:32:12] oops [13:32:18] something about poolcounter timing out [13:32:32] (03PS1) 10Majavah: microsites: block access to .git folder on various sites [puppet] - 10https://gerrit.wikimedia.org/r/815290 (https://phabricator.wikimedia.org/T294917) [13:32:35] on mwdebug1001 
[13:33:19] !log hashar@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Resync after touching (duration: 02m 38s) [13:33:34] RhinosF1: I did the resync [13:34:02] Lucas_WMDE: [13:34:05] !log mbsantos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/proton: apply [13:34:41] (03CR) 10Lucas Werkmeister (WMDE): "Adding httpbb tests (https://wikitech.wikimedia.org/wiki/Httpbb) for this would be nice, at least the query service UI has some existing o" [puppet] - 10https://gerrit.wikimedia.org/r/815290 (https://phabricator.wikimedia.org/T294917) (owner: 10Majavah) [13:34:51] thanks [13:35:47] now I consistently get the proper logo [13:36:01] :@ [13:36:04] :) [13:36:14] thanks hashar [13:36:16] thank you both for your assistance! [13:36:34] (03PS2) 10Majavah: microsites: block access to .git folder on various sites [puppet] - 10https://gerrit.wikimedia.org/r/815290 (https://phabricator.wikimedia.org/T294917) [13:36:38] Tks4Fish: the logo should have been updated everywhere now [13:37:25] !log Stop mysql on db1132 to upgrade package [13:37:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:14] (03PS3) 10Majavah: microsites: block access to .git folder on various sites [puppet] - 10https://gerrit.wikimedia.org/r/815290 (https://phabricator.wikimedia.org/T294917) [13:39:24] (03CR) 10Ayounsi: provision cookbook: configure switches using cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/811730 (owner: 10Ayounsi) [13:39:52] (03CR) 10Majavah: microsites: block access to .git folder on various sites (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/815290 (https://phabricator.wikimedia.org/T294917) (owner: 10Majavah) [13:40:06] (03CR) 10Ladsgroup: [C: 03+1] "I let Daniel take a look first and merge it if there is no objection by the next couple of days, or someone beats me to it." 
[puppet] - 10https://gerrit.wikimedia.org/r/815290 (https://phabricator.wikimedia.org/T294917) (owner: 10Majavah) [13:40:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P31418 and previous config saved to /var/cache/conftool/dbconfig/20220719-134006-marostegui.json [13:45:46] !log installing cron security updates [13:45:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:53] RECOVERY - Check systemd state on mw1383 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:50:13] (03CR) 10Jbond: [C: 03+2] microsites: block access to .git folder on various sites [puppet] - 10https://gerrit.wikimedia.org/r/815290 (https://phabricator.wikimedia.org/T294917) (owner: 10Majavah) [13:55:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T312990)', diff saved to https://phabricator.wikimedia.org/P31419 and previous config saved to /var/cache/conftool/dbconfig/20220719-135511-marostegui.json [13:55:13] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1113.eqiad.wmnet with reason: Maintenance [13:55:16] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [13:55:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1113.eqiad.wmnet with reason: Maintenance [13:55:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T312990)', diff saved to https://phabricator.wikimedia.org/P31420 and previous config saved to /var/cache/conftool/dbconfig/20220719-135532-marostegui.json [13:55:53] !log ayounsi@cumin1001 START - Cookbook sre.hosts.decommission for hosts sretest1001.eqiad.wmnet [13:56:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db1113:3316 (T312990)', diff saved to https://phabricator.wikimedia.org/P31421 and previous config saved to /var/cache/conftool/dbconfig/20220719-135652-marostegui.json [13:58:33] !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.hosts.decommission (exit_code=97) for hosts sretest1001.eqiad.wmnet [14:01:43] (03PS1) 10Ayounsi: sre.network: fix typo [cookbooks] - 10https://gerrit.wikimedia.org/r/815295 (https://phabricator.wikimedia.org/T306552) [14:05:56] (03CR) 10Ayounsi: [C: 03+2] sre.network: fix typo [cookbooks] - 10https://gerrit.wikimedia.org/r/815295 (https://phabricator.wikimedia.org/T306552) (owner: 10Ayounsi) [14:10:09] (03Merged) 10jenkins-bot: sre.network: fix typo [cookbooks] - 10https://gerrit.wikimedia.org/r/815295 (https://phabricator.wikimedia.org/T306552) (owner: 10Ayounsi) [14:11:17] !log ayounsi@cumin1001 START - Cookbook sre.hosts.decommission for hosts sretest1001.eqiad.wmnet [14:11:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P31422 and previous config saved to /var/cache/conftool/dbconfig/20220719-141158-marostegui.json [14:15:56] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [14:16:40] !log installing glib2.0 security updates [14:16:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:04] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:20:05] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts sretest1001.eqiad.wmnet [14:22:44] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/815298 [14:22:49] !log dancy@deploy1002 Installing scap version "4.11.0" for 557 hosts [14:23:09] !log dancy@deploy1002 Installation of scap version "4.11.0" completed for 557 hosts [14:23:42] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/813917 (owner: 10Marostegui) 
[14:24:02] (03CR) 10Ladsgroup: "just to be sure, let's run this everywhere and see the result." [puppet] - 10https://gerrit.wikimedia.org/r/813917 (owner: 10Marostegui) [14:27:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P31423 and previous config saved to /var/cache/conftool/dbconfig/20220719-142703-marostegui.json [14:37:24] (03PS3) 10Ssingh: durum: add IP version (v4/v6) in test result [puppet] - 10https://gerrit.wikimedia.org/r/815264 [14:38:07] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36302/console" [puppet] - 10https://gerrit.wikimedia.org/r/815264 (owner: 10Ssingh) [14:42:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T312990)', diff saved to https://phabricator.wikimedia.org/P31424 and previous config saved to /var/cache/conftool/dbconfig/20220719-144208-marostegui.json [14:42:08] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1165.eqiad.wmnet with reason: Maintenance [14:42:10] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [14:42:21] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1165.eqiad.wmnet with reason: Maintenance [14:42:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [14:42:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [14:42:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T312990)', diff saved to https://phabricator.wikimedia.org/P31425 and previous config saved to 
/var/cache/conftool/dbconfig/20220719-144245-marostegui.json [14:44:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T312990)', diff saved to https://phabricator.wikimedia.org/P31426 and previous config saved to /var/cache/conftool/dbconfig/20220719-144453-marostegui.json [14:48:06] PROBLEM - SSH on db1109.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:50:36] !log installing python-urlllib3 security updates [14:50:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P31427 and previous config saved to /var/cache/conftool/dbconfig/20220719-145958-marostegui.json [15:03:30] !log installing nghttp2 security updates [15:03:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:01] !log ayounsi@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host sretest1001 [15:12:36] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest1001 [15:13:08] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [15:14:25] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:14:58] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcontrol1007.wikimedia.org [15:15:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P31429 and previous config saved to /var/cache/conftool/dbconfig/20220719-151503-marostegui.json [15:16:29] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:17:18] !log ayounsi@cumin1001 START - Cookbook 
sre.dns.netbox [15:21:27] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:22:50] !log ayounsi@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS bullseye [15:26:56] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cloudcontrol1007.wikimedia.org [15:30:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T312990)', diff saved to https://phabricator.wikimedia.org/P31430 and previous config saved to /var/cache/conftool/dbconfig/20220719-153009-marostegui.json [15:30:11] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1131.eqiad.wmnet with reason: Maintenance [15:30:13] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [15:30:35] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1131.eqiad.wmnet with reason: Maintenance [15:30:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T312990)', diff saved to https://phabricator.wikimedia.org/P31431 and previous config saved to /var/cache/conftool/dbconfig/20220719-153040-marostegui.json [15:32:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T312990)', diff saved to https://phabricator.wikimedia.org/P31432 and previous config saved to /var/cache/conftool/dbconfig/20220719-153248-marostegui.json [15:47:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P31433 and previous config saved to /var/cache/conftool/dbconfig/20220719-154753-marostegui.json [15:49:14] RECOVERY - SSH on db1109.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:55:30] !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage 
(exit_code=97) for host sretest1001.eqiad.wmnet with OS bullseye [15:56:35] !log ayounsi@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS bullseye [15:57:12] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on ganeti2029.codfw.wmnet with reason: Remove node for eventual reimage, T311686 [15:57:16] T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 [15:57:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on ganeti2029.codfw.wmnet with reason: Remove node for eventual reimage, T311686 [15:58:22] !log draining ganeti2020 T310483 [15:58:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:05] jbond and rzl: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Puppet request window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220719T1600). [16:00:05] dancy: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:25] o/ [16:02:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P31434 and previous config saved to /var/cache/conftool/dbconfig/20220719-160258-marostegui.json [16:04:05] !log installing node-minimist security updates [16:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:34] jbond / rzl : anyone around? 
[16:14:56] !log jhuneidi@deploy1002 Started scap: testwikis wikis to 1.39.0-wmf.21 refs T308074 [16:14:56] !log jhuneidi@deploy1002 scap failed: PermissionError [Errno 13] Permission denied: '/srv/mediawiki-staging/php-1.39.0-wmf.19/cache/gitinfo/info-extensions-GrowthExperiments.json' (duration: 00m 00s) [16:14:59] T308074: 1.39.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T308074 [16:17:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:18:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T312990)', diff saved to https://phabricator.wikimedia.org/P31435 and previous config saved to /var/cache/conftool/dbconfig/20220719-161803-marostegui.json [16:18:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2129.codfw.wmnet with reason: Maintenance [16:18:09] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [16:18:16] !log drain traffic away from cr2-eqiad:fpc3 - T312745 [16:18:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:20] T312745: cr2-eqiad:FPC3 partial failure (PIC2/3) - https://phabricator.wikimedia.org/T312745 [16:18:21] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2129.codfw.wmnet with reason: Maintenance [16:18:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on 8 hosts with reason: Maintenance [16:18:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on 8 hosts with reason: Maintenance [16:18:43] !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1001.eqiad.wmnet with OS bullseye [16:19:00] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [16:19:13] !log 
marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [16:20:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:20:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:20:42] dancy: am now, sorry! taking a look [16:21:37] ok [16:21:43] oh man, good catch [16:22:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:23:24] !log jhuneidi@deploy1002 Started scap: testwikis wikis to 1.39.0-wmf.21 refs T308074 [16:23:24] !log jhuneidi@deploy1002 scap failed: PermissionError [Errno 13] Permission denied: '/srv/mediawiki-staging/php-1.39.0-wmf.19/cache/gitinfo/info-extensions-FileImporter.json' (duration: 00m 00s) [16:23:28] dancy: did you have any particular manual testing in mind while deploying this? [16:23:28] T308074: 1.39.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T308074 [16:23:44] not sure if you can intentionally tickle that bug again [16:23:49] rzl: Unfortunately there's no manual test available unless we make the error occur [16:24:00] <_joe_> dancy: sigh I was sure I fixed that bug already [16:24:08] nod [16:24:16] <_joe_> sorry I must have reintroduced it with the last refactor :/ [16:24:45] maybe we can just restart something real quick for overall confidence, but this sure looks right to me [16:25:59] Jeena and I are in the process of doing the initial train for this week so eventually it will execute the restart code. [16:28:22] !log jhuneidi@deploy1002 Started scap: testwikis wikis to 1.39.0-wmf.21 refs T308074 [16:29:08] dancy: seems good -- I'll be nearby in case this needs to be rolled back for whatever reason [16:29:16] ok thanks! 
[16:29:48] note puppet won't have rolled it out everywhere until 16:55 UTC or so, but looking at the deployment calendar that ought to be fine [16:30:02] Agreed [16:30:10] sorry for being late to the window, my calendar got away from me a little today :) [16:31:53] No problem. :-) [16:32:54] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy2002 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [16:33:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:37:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:37:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:38:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:42:02] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudvirt1028.eqiad.wmnet [16:43:28] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy2002 is OK: Files ownership is ok. 
https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [16:43:58] !log cr2-eqiad# run request chassis fpc slot 3 offline [16:44:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:44] PROBLEM - BGP status on cloudsw1-d5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv4: Idle - wmf_public_asn, AS14907/IPv4: Idle - wmf_public_asn, AS14907/IPv6: Idle - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:47:38] PROBLEM - Router interfaces on pfw3-eqiad is CRITICAL: CRITICAL: host 208.80.154.219, interfaces up: 57, down: 1, dormant: 0, excluded: 3, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:48:42] RECOVERY - SSH on mw1321.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:50:55] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1028.eqiad.wmnet [16:52:06] (03PS8) 10Ladsgroup: core.pp: Make sync_binlog and trx_commit configurable [puppet] - 10https://gerrit.wikimedia.org/r/813917 (owner: 10Marostegui) [16:52:50] RECOVERY - Juniper alarms on cr2-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [16:53:20] (03CR) 10JHathaway: beaker: add a method to hack fixes specific to beaker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/814866 (owner: 10Jbond) [16:54:39] (03CR) 10Ladsgroup: "So running it fleet-wide showed it is actually breaking ParserCache. 
I fixed it and ran it on PC only, it's working as expected: https://p" [puppet] - 10https://gerrit.wikimedia.org/r/813917 (owner: 10Marostegui) [16:54:43] (03PS9) 10Ladsgroup: core.pp: Make sync_binlog and trx_commit configurable [puppet] - 10https://gerrit.wikimedia.org/r/813917 (owner: 10Marostegui) [16:54:47] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] core.pp: Make sync_binlog and trx_commit configurable [puppet] - 10https://gerrit.wikimedia.org/r/813917 (owner: 10Marostegui) [16:55:34] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1102.eqiad.wmnet with reason: Maintenance [16:55:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1102.eqiad.wmnet with reason: Maintenance [16:56:02] PROBLEM - BGP status on pfw3-eqiad is CRITICAL: BGP CRITICAL - AS14907/IPv4: Idle - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:56:38] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1139.eqiad.wmnet with reason: Maintenance [16:56:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1139.eqiad.wmnet with reason: Maintenance [16:57:29] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1129.eqiad.wmnet with reason: Maintenance [16:57:43] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1129.eqiad.wmnet with reason: Maintenance [16:57:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T312990)', diff saved to https://phabricator.wikimedia.org/P31436 and previous config saved to /var/cache/conftool/dbconfig/20220719-165747-marostegui.json [16:57:52] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [16:59:00] (03PS1) 10Urbanecm: [beta] Growth: Enable structured mentor list at arwiki [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/815331 (https://phabricator.wikimedia.org/T310905) [16:59:35] (03CR) 10Urbanecm: [C: 03+2] "no-op for prod" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815331 (https://phabricator.wikimedia.org/T310905) (owner: 10Urbanecm) [17:00:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T312990)', diff saved to https://phabricator.wikimedia.org/P31437 and previous config saved to /var/cache/conftool/dbconfig/20220719-170002-marostegui.json [17:00:18] (03Merged) 10jenkins-bot: [beta] Growth: Enable structured mentor list at arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815331 (https://phabricator.wikimedia.org/T310905) (owner: 10Urbanecm) [17:00:26] RECOVERY - Router interfaces on pfw3-eqiad is OK: OK: host 208.80.154.219, interfaces up: 58, down: 0, dormant: 0, excluded: 3, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:00:40] RECOVERY - BGP status on pfw3-eqiad is OK: BGP OK - up: 5, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:01:48] RECOVERY - BGP status on cloudsw1-d5-eqiad.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:01:56] * urbanecm done [17:02:04] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:02:56] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:04:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:04:32] (03PS12) 10Majavah: wmcs: vps: create_instance_with_prefix: unbreak [cookbooks] (wmcs) - 
10https://gerrit.wikimedia.org/r/802170 [17:05:03] (03CR) 10Majavah: wmcs: vps: create_instance_with_prefix: unbreak (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/802170 (owner: 10Majavah) [17:06:16] !log jhuneidi@deploy1002 scap failed: ValueError php_fpm expected targets, 0 given (duration: 37m 54s) [17:06:48] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:06:56] (03CR) 10David Caro: wmcs: vps: create_instance_with_prefix: unbreak (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/802170 (owner: 10Majavah) [17:08:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [17:11:00] ^^ seems there was a temporary outage with memcached which produced a lot of logs [17:11:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:11:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [17:13:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [17:15:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P31438 and previous config saved to /var/cache/conftool/dbconfig/20220719-171507-marostegui.json [17:15:55] (03PS4) 10Ssingh: durum: add IP version (v4/v6) in test result [puppet] - 10https://gerrit.wikimedia.org/r/815264 [17:16:54] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS 
(DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36305/console" [puppet] - 10https://gerrit.wikimedia.org/r/815264 (owner: 10Ssingh) [17:17:13] (03CR) 10DCausse: [C: 03+1] "do we need to update the corresponding template inside elastic after merging it?" [puppet] - 10https://gerrit.wikimedia.org/r/815327 (owner: 10Ebernhardson) [17:17:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:22:02] (03PS38) 10Jbond: sre.hardware.firmware-upgrade: create new cookbook for updating idrac and bios [cookbooks] - 10https://gerrit.wikimedia.org/r/763215 [17:22:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:25:13] (03PS5) 10Ssingh: durum: add IP version (v4/v6) in test result [puppet] - 10https://gerrit.wikimedia.org/r/815264 [17:26:39] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36306/console" [puppet] - 10https://gerrit.wikimedia.org/r/815264 (owner: 10Ssingh) [17:28:23] (03CR) 10Ori: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/815335 (https://phabricator.wikimedia.org/T313334) (owner: 10Ori) [17:29:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:29:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [17:29:52] (03CR) 10Jforrester: [C: 03+1] role::beta::docker_services: prune docker images [puppet] - 10https://gerrit.wikimedia.org/r/815335 (https://phabricator.wikimedia.org/T313334) (owner: 10Ori) [17:30:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P31439 and previous config saved to /var/cache/conftool/dbconfig/20220719-173012-marostegui.json [17:30:31] (03PS1) 10Chad: Update my SSH key [puppet] - 10https://gerrit.wikimedia.org/r/815336 [17:31:31] (03CR) 10Ssingh: [V: 03+1 C: 03+2] durum: add IP version (v4/v6) in test result [puppet] - 10https://gerrit.wikimedia.org/r/815264 (owner: 10Ssingh) [17:33:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:36:38] (03PS1) 10Eigyan: [wmf-config]: Undeploy GDI Survey Wave 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815338 (https://phabricator.wikimedia.org/T310852) [17:40:44] (03PS39) 10Jbond: sre.hardware.firmware-upgrade: create new cookbook for updating idrac and bios [cookbooks] - 10https://gerrit.wikimedia.org/r/763215 [17:44:03] (03PS2) 10Eigyan: [wmf-config]: Undeploy GDI Survey Wave 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815338 (https://phabricator.wikimedia.org/T3128662) [17:45:01] !log jhuneidi@deploy1002 Installing scap version "4.11.1" for 557 hosts [17:45:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T312990)', diff saved to https://phabricator.wikimedia.org/P31440 and previous config saved to /var/cache/conftool/dbconfig/20220719-174517-marostegui.json [17:45:19] !log marostegui@cumin1001 START 
- Cookbook sre.hosts.downtime for 5:00:00 on db1182.eqiad.wmnet with reason: Maintenance [17:45:21] !log jhuneidi@deploy1002 Installation of scap version "4.11.1" completed for 557 hosts [17:45:23] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [17:45:33] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1182.eqiad.wmnet with reason: Maintenance [17:45:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T312990)', diff saved to https://phabricator.wikimedia.org/P31441 and previous config saved to /var/cache/conftool/dbconfig/20220719-174537-marostegui.json [17:46:54] !log jhuneidi@deploy1002 Started scap: testwikis wikis to 1.39.0-wmf.21 refs T308074 [17:46:58] T308074: 1.39.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T308074 [17:48:14] (03PS3) 10Eigyan: [wmf-config]: Undeploy GDI Survey Wave 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815338 (https://phabricator.wikimedia.org/T3128662) [17:48:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T312990)', diff saved to https://phabricator.wikimedia.org/P31442 and previous config saved to /var/cache/conftool/dbconfig/20220719-174815-marostegui.json [17:49:41] (03PS4) 10Eigyan: [wmf-config]: Undeploy GDI Survey Wave 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815338 (https://phabricator.wikimedia.org/T312866) [17:50:23] (03PS6) 10Jdlrobson: Deploy the new grid layout to group 0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814869 (https://phabricator.wikimedia.org/T312241) [17:51:19] !log jhuneidi@deploy1002 Finished scap: testwikis wikis to 1.39.0-wmf.21 refs T308074 (duration: 04m 24s) [17:51:21] (03PS1) 10Ssingh: test_dns: update tests to reflect changes in durum's API [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/815341 [17:52:29] (03CR) 10Ssingh: [C: 03+2] 
test_dns: update tests to reflect changes in durum's API [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/815341 (owner: 10Ssingh) [17:56:22] We're going to start rolling out to group 0 now, a teeny bit early [17:57:20] (03CR) 10Dzahn: "Thank you. I can imagine these get reported but this is exactly like when people find "labs/private" repo. They think it's an issue but it" [puppet] - 10https://gerrit.wikimedia.org/r/815290 (https://phabricator.wikimedia.org/T294917) (owner: 10Majavah) [17:58:51] (03PS13) 10Majavah: wmcs: vps: create_instance_with_prefix: unbreak [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/802170 [18:00:05] jeena and jnuche: That opportune time is upon us again. Time for a MediaWiki train - Utc-7+Utc-0 Version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220719T1800). [18:03:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P31443 and previous config saved to /var/cache/conftool/dbconfig/20220719-180320-marostegui.json [18:03:31] (03PS2) 10Dzahn: admin: update SSH key for demon [puppet] - 10https://gerrit.wikimedia.org/r/815336 (owner: 10Chad) [18:03:34] PROBLEM - k8s requests count to the API on ml-serve-ctrl2002 is CRITICAL: 100.6 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [18:04:35] (03CR) 10Dzahn: "wow that you even added tests to check if these are NOT accessible is impressive 😎" [puppet] - 10https://gerrit.wikimedia.org/r/815290 (https://phabricator.wikimedia.org/T294917) (owner: 10Majavah) [18:05:46] (03CR) 10Dzahn: [C: 03+2] "I called Chad via POTS and confirmed it's him." 
[puppet] - 10https://gerrit.wikimedia.org/r/815336 (owner: 10Chad) [18:06:19] (03CR) 10Ebernhardson: apifeatureusage: Remove disabled _all field (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/815327 (owner: 10Ebernhardson) [18:08:21] (03CR) 10Scardenasmolinar: [C: 03+1] "LGTM!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815338 (https://phabricator.wikimedia.org/T312866) (owner: 10Eigyan) [18:08:27] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.21 refs T308074 [18:08:34] T308074: 1.39.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T308074 [18:08:54] (03CR) 10Dzahn: [C: 03+2] "deployed and confirmed on deploy1002.eqiad.wmnet. on all other hosts it will work within 30 min" [puppet] - 10https://gerrit.wikimedia.org/r/815336 (owner: 10Chad) [18:09:07] (03CR) 10Bking: [C: 03+2] apifeatureusage: Remove disabled _all field [puppet] - 10https://gerrit.wikimedia.org/r/815327 (owner: 10Ebernhardson) [18:10:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [18:11:23] (03PS1) 10Jeena Huneidi: group0 wikis to 1.39.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815344 (https://phabricator.wikimedia.org/T308074) [18:11:26] (03CR) 10JHathaway: [C: 03+2] lists: convert apache template to epp [puppet] - 10https://gerrit.wikimedia.org/r/812938 (https://phabricator.wikimedia.org/T312506) (owner: 10JHathaway) [18:11:28] (03PS3) 10Dzahn: Revert "vrts/prometheus: comment out broken check" [puppet] - 10https://gerrit.wikimedia.org/r/812282 (https://phabricator.wikimedia.org/T312194) [18:12:01] (03CR) 10Jeena Huneidi: [C: 03+2] group0 wikis to 1.39.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815344 
(https://phabricator.wikimedia.org/T308074) (owner: 10Jeena Huneidi) [18:12:55] (03CR) 10Andrew Bogott: "I'm in favor of most or all of these changes. Two concerns are: 1) I'd like a page if a cloudvirt goes down (or at least if it goes down a" [puppet] - 10https://gerrit.wikimedia.org/r/813267 (owner: 10David Caro) [18:13:40] (03Merged) 10jenkins-bot: group0 wikis to 1.39.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815344 (https://phabricator.wikimedia.org/T308074) (owner: 10Jeena Huneidi) [18:14:22] (03PS7) 10Jdlrobson: Deploy the new grid layout to group 0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814869 (https://phabricator.wikimedia.org/T312241) [18:18:21] (03CR) 10JHathaway: [C: 03+2] lists: add apache security configs [puppet] - 10https://gerrit.wikimedia.org/r/812939 (https://phabricator.wikimedia.org/T312506) (owner: 10JHathaway) [18:18:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P31444 and previous config saved to /var/cache/conftool/dbconfig/20220719-181825-marostegui.json [18:18:40] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: (no justification provided) [18:18:55] (03CR) 10Dzahn: [C: 03+2] Revert "vrts/prometheus: comment out broken check" [puppet] - 10https://gerrit.wikimedia.org/r/812282 (https://phabricator.wikimedia.org/T312194) (owner: 10Dzahn) [18:19:05] (03PS6) 10JHathaway: lists: add apache security configs [puppet] - 10https://gerrit.wikimedia.org/r/812939 (https://phabricator.wikimedia.org/T312506) [18:19:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:19:32] (03PS3) 10Andrea Denisse: netmon: Add suppport for multiple backup/passive nodes in Puppet [puppet] - 10https://gerrit.wikimedia.org/r/814848 (https://phabricator.wikimedia.org/T309074) [18:19:37] (03PS3) 10Dzahn: vrts/prometheus: re-activate commented check after fixing path [puppet] 
- 10https://gerrit.wikimedia.org/r/812326 (https://phabricator.wikimedia.org/T312194) [18:19:41] (03CR) 10JHathaway: [V: 03+2] lists: add apache security configs [puppet] - 10https://gerrit.wikimedia.org/r/812939 (https://phabricator.wikimedia.org/T312506) (owner: 10JHathaway) [18:19:52] (03CR) 10Dzahn: [C: 03+2] vrts/prometheus: re-activate commented check after fixing path [puppet] - 10https://gerrit.wikimedia.org/r/812326 (https://phabricator.wikimedia.org/T312194) (owner: 10Dzahn) [18:20:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:20:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:20:27] (03PS4) 10Dzahn: vrts/prometheus: fix path in blackbox http monitoring check [puppet] - 10https://gerrit.wikimedia.org/r/812326 (https://phabricator.wikimedia.org/T312194) [18:20:55] (LogstashIngestSpike) firing: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [18:20:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:21:53] (03CR) 10Andrea Denisse: netmon: Add suppport for multiple backup/passive nodes in Puppet (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/814848 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse) [18:23:26] (03CR) 10Dzahn: "You can define more than one "receiver" in modules/alertmanager/templates/alertmanager.yml.erb that you can then use with all these checks" [puppet] - 10https://gerrit.wikimedia.org/r/813267 (owner: 10David Caro) [18:25:41] (03CR) 10Dzahn: alertmanager: switch IRC channel for gitlab (serviceops-collab) alerts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/814926 (owner: 10Dzahn) [18:26:32] (03PS40) 10Jbond: sre.hardware.firmware-upgrade: create new 
cookbook for updating idrac and bios [cookbooks] - 10https://gerrit.wikimedia.org/r/763215 [18:27:59] (03CR) 10Jbond: "updated" [cookbooks] - 10https://gerrit.wikimedia.org/r/763215 (owner: 10Jbond) [18:29:27] (03CR) 10Jbond: sre.hardware.firmware-upgrade: create new cookbook for updating idrac and bios (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/763215 (owner: 10Jbond) [18:33:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T312990)', diff saved to https://phabricator.wikimedia.org/P31445 and previous config saved to /var/cache/conftool/dbconfig/20220719-183330-marostegui.json [18:33:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1170.eqiad.wmnet with reason: Maintenance [18:33:35] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [18:33:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1170.eqiad.wmnet with reason: Maintenance [18:33:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T312990)', diff saved to https://phabricator.wikimedia.org/P31446 and previous config saved to /var/cache/conftool/dbconfig/20220719-183351-marostegui.json [18:36:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T312990)', diff saved to https://phabricator.wikimedia.org/P31447 and previous config saved to /var/cache/conftool/dbconfig/20220719-183632-marostegui.json [18:41:48] (03CR) 10Andrew Bogott: [C: 03+2] openstack::nova: reduce max amount of open connections [puppet] - 10https://gerrit.wikimedia.org/r/806460 (owner: 10Majavah) [18:41:54] (03PS2) 10Andrew Bogott: openstack::nova: reduce max amount of open connections [puppet] - 10https://gerrit.wikimedia.org/r/806460 (owner: 10Majavah) [18:42:25] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 
nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reimage (bullseye upgrade) - bking@cumin1001 - T289135 [18:42:30] T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 [18:44:25] (03CR) 10Dzahn: "Jelto..just fyi that I was thinking about it. this would be falling back to monitor 80 and not envoy. but it seems to me a bit better to m" [puppet] - 10https://gerrit.wikimedia.org/r/812142 (https://phabricator.wikimedia.org/T312194) (owner: 10Dzahn) [18:44:34] (03PS1) 10Jdrewniak: Revert "Bumping portals to master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815276 [18:44:51] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2056.codfw.wmnet with OS bullseye [18:45:09] (03Abandoned) 10Dzahn: vrts/blackbox: adjust monitoring back to port 80, but fix path [puppet] - 10https://gerrit.wikimedia.org/r/812142 (https://phabricator.wikimedia.org/T312194) (owner: 10Dzahn) [18:46:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:47:09] (03CR) 10Dzahn: "There is actually another option here. 
Both services, gitlab and VRTS, already have their own topic-based (as opposed to team-based) IRC c" [puppet] - 10https://gerrit.wikimedia.org/r/814926 (owner: 10Dzahn) [18:49:09] !log dancy@deploy1002 Pruned MediaWiki: 1.39.0-wmf.17, 1.39.0-wmf.18 (duration: 02m 09s) [18:50:34] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2069.codfw.wmnet with OS bullseye [18:51:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P31448 and previous config saved to /var/cache/conftool/dbconfig/20220719-185137-marostegui.json [18:52:05] (03PS1) 10Andrew Bogott: nova: properly configure the number of api and metadata-api workers [puppet] - 10https://gerrit.wikimedia.org/r/815346 [18:52:42] (03CR) 10CI reject: [V: 04-1] nova: properly configure the number of api and metadata-api workers [puppet] - 10https://gerrit.wikimedia.org/r/815346 (owner: 10Andrew Bogott) [18:53:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:53:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:53:14] PROBLEM - SSH on db1109.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:54:28] (03PS2) 10Andrew Bogott: nova: properly configure the number of api and metadata-api workers [puppet] - 10https://gerrit.wikimedia.org/r/815346 [18:55:08] PROBLEM - k8s requests count to the API on ml-serve-ctrl2002 is CRITICAL: 101.4 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [18:55:32] (03CR) 10CI reject: [V: 04-1] nova: properly configure the number of api and metadata-api workers [puppet] - 10https://gerrit.wikimedia.org/r/815346 (owner: 10Andrew Bogott) [18:56:51] (03PS3) 10Andrew Bogott: nova: properly configure the number of api 
and metadata-api workers [puppet] - 10https://gerrit.wikimedia.org/r/815346 [18:57:45] (03CR) 10CI reject: [V: 04-1] nova: properly configure the number of api and metadata-api workers [puppet] - 10https://gerrit.wikimedia.org/r/815346 (owner: 10Andrew Bogott) [18:59:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:59:39] (03PS4) 10Andrew Bogott: nova: properly configure the number of api and metadata-api workers [puppet] - 10https://gerrit.wikimedia.org/r/815346 [19:00:21] (03CR) 10CI reject: [V: 04-1] nova: properly configure the number of api and metadata-api workers [puppet] - 10https://gerrit.wikimedia.org/r/815346 (owner: 10Andrew Bogott) [19:02:13] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2056.codfw.wmnet with reason: host reimage [19:02:20] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2066.codfw.wmnet with OS bullseye [19:04:11] (03PS5) 10Andrew Bogott: nova: properly configure the number of api and metadata-api workers [puppet] - 10https://gerrit.wikimedia.org/r/815346 [19:04:36] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2069.codfw.wmnet with reason: host reimage [19:04:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:05:01] (03CR) 10CI reject: [V: 04-1] nova: properly configure the number of api and metadata-api workers [puppet] - 10https://gerrit.wikimedia.org/r/815346 (owner: 10Andrew Bogott) [19:05:48] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2056.codfw.wmnet with reason: host reimage [19:06:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P31449 and previous config saved to /var/cache/conftool/dbconfig/20220719-190642-marostegui.json [19:06:57] (03PS6) 10Andrew Bogott: nova: properly configure the 
number of api and metadata-api workers [puppet] - 10https://gerrit.wikimedia.org/r/815346 [19:07:35] (03PS7) 10Andrew Bogott: nova: properly configure the number of api and metadata-api workers [puppet] - 10https://gerrit.wikimedia.org/r/815346 [19:07:37] (03CR) 10CI reject: [V: 04-1] nova: properly configure the number of api and metadata-api workers [puppet] - 10https://gerrit.wikimedia.org/r/815346 (owner: 10Andrew Bogott) [19:08:18] (03CR) 10CI reject: [V: 04-1] nova: properly configure the number of api and metadata-api workers [puppet] - 10https://gerrit.wikimedia.org/r/815346 (owner: 10Andrew Bogott) [19:08:31] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2069.codfw.wmnet with reason: host reimage [19:10:22] (03PS8) 10Andrew Bogott: nova: properly configure the number of api and metadata-api workers [puppet] - 10https://gerrit.wikimedia.org/r/815346 [19:11:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:11:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:11:28] (03CR) 10Andrew Bogott: "This seems to have slightly increased the number of open connections to Galera, so perhaps the overflow setting wasn't getting used much; " [puppet] - 10https://gerrit.wikimedia.org/r/806460 (owner: 10Majavah) [19:12:36] (03PS9) 10Andrew Bogott: nova: properly configure the number of api and metadata-api workers [puppet] - 10https://gerrit.wikimedia.org/r/815346 [19:15:39] (03CR) 10CI reject: [V: 04-1] nova: properly configure the number of api and metadata-api workers [puppet] - 10https://gerrit.wikimedia.org/r/815346 (owner: 10Andrew Bogott) [19:17:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:18:24] (03PS10) 10Andrew Bogott: nova: properly configure the number of api and metadata-api workers [puppet] - 10https://gerrit.wikimedia.org/r/815346 [19:19:21] (03PS1) 10Bartosz 
Dziewoński: Enable DiscussionTools visualenhancements as beta feature on partner wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815359 (https://phabricator.wikimedia.org/T312670) [19:19:38] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815360 (https://phabricator.wikimedia.org/T128546) [19:19:56] (03Abandoned) 10Jdrewniak: Revert "Bumping portals to master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815276 (owner: 10Jdrewniak) [19:21:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T312990)', diff saved to https://phabricator.wikimedia.org/P31450 and previous config saved to /var/cache/conftool/dbconfig/20220719-192147-marostegui.json [19:21:49] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1122.eqiad.wmnet with reason: Maintenance [19:21:52] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [19:22:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1122.eqiad.wmnet with reason: Maintenance [19:22:04] (03CR) 10Andrew Bogott: [C: 03+2] nova: properly configure the number of api and metadata-api workers [puppet] - 10https://gerrit.wikimedia.org/r/815346 (owner: 10Andrew Bogott) [19:22:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1122 (T312990)', diff saved to https://phabricator.wikimedia.org/P31451 and previous config saved to /var/cache/conftool/dbconfig/20220719-192207-marostegui.json [19:23:01] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2069.codfw.wmnet with OS bullseye [19:27:31] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2066.codfw.wmnet with OS bullseye [19:29:53] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
elastic2056.codfw.wmnet with OS bullseye [19:29:58] (03PS1) 10Andrew Bogott: Openstack Nova: install the uwsgi.ini files into /etc/nova [puppet] - 10https://gerrit.wikimedia.org/r/815363 [19:34:10] (03CR) 10Andrew Bogott: [C: 03+2] Openstack Nova: install the uwsgi.ini files into /etc/nova [puppet] - 10https://gerrit.wikimedia.org/r/815363 (owner: 10Andrew Bogott) [19:44:09] (03PS6) 10Hashar: Send events to Wikimedia EventGate [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/814807 [19:47:43] Reverting all wikis back to wmf.19 until the cache poisoning bug is resolved [19:47:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T312990)', diff saved to https://phabricator.wikimedia.org/P31452 and previous config saved to /var/cache/conftool/dbconfig/20220719-194752-marostegui.json [19:47:59] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [19:50:22] (03CR) 10Hashar: "I got the dev server from mediawiki/extensions/EventLogging to accept the events as defined from the parent change and proposed at https:/" [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/814807 (owner: 10Hashar) [19:51:58] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: Revert "group0 wikis to 1.39.0-wmf.19" [19:52:59] (03CR) 10Eigyan: [wmf-config]: Undeploy GDI Survey Wave 2 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815338 (https://phabricator.wikimedia.org/T312866) (owner: 10Eigyan) [19:53:31] (03PS1) 10Jeena Huneidi: Revert "group0 wikis to 1.39.0-wmf.21" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815372 (https://phabricator.wikimedia.org/T308074) [19:53:33] (03CR) 10Jeena Huneidi: [C: 03+2] Revert "group0 wikis to 1.39.0-wmf.21" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815372 (https://phabricator.wikimedia.org/T308074) (owner: 10Jeena Huneidi) [19:54:43] 
(03Merged) 10jenkins-bot: Revert "group0 wikis to 1.39.0-wmf.21" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815372 (https://phabricator.wikimedia.org/T308074) (owner: 10Jeena Huneidi) [19:56:57] (03PS1) 10Stang: uzwiki: Create "eliminator" group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815374 (https://phabricator.wikimedia.org/T302670) [19:58:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:59:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:59:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:59:59] (03PS1) 10Jeena Huneidi: Revert "testwikis wikis to 1.39.0-wmf.21" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815277 [20:00:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:00:04] RoanKattouw, Urbanecm, and cjming: Your horoscope predicts another unfortunate UTC late backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220719T2000). [20:00:04] ebernhardson, Jdlrobson, eigyan, jan_drewniak, and koi: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:13] i can deploy o/ [20:00:40] i'll also be covering for Jon's patch [20:00:48] (03CR) 10Jeena Huneidi: [C: 03+2] Revert "testwikis wikis to 1.39.0-wmf.21" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815277 (owner: 10Jeena Huneidi) [20:01:04] Greetings all! [20:01:25] (03PS2) 10Clare Ming: cirrus: Dont recycle completion suggester indices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814908 (owner: 10Ebernhardson) [20:01:26] o/ [20:02:08] ebernhardson: do you happen to be around? 
[20:02:49] (03Merged) 10jenkins-bot: Revert "testwikis wikis to 1.39.0-wmf.21" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815277 (owner: 10Jeena Huneidi) [20:02:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P31453 and previous config saved to /var/cache/conftool/dbconfig/20220719-200257-marostegui.json [20:03:06] please hold the backport window [20:03:18] I need to finish reverting to wmf.19 [20:03:28] jeena: will do [20:03:31] Thanks [20:04:17] cjming: here [20:05:03] ebernhardson: great -- once i get the green light, i'll move forward with your patch [20:06:22] (03Abandoned) 10Stang: enwiki: Raise wgPageTriageMaxAge to indefinite [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808424 (https://phabricator.wikimedia.org/T310974) (owner: 10Stang) [20:06:41] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2055.codfw.wmnet with OS bullseye [20:09:31] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: Revert "testwikis to 1.39.0-wmf.19" [20:09:53] cjming: all clear, thanks for waiting [20:10:04] jeena: np - thanks! 
[20:10:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:10:24] (03CR) 10Clare Ming: [C: 03+2] cirrus: Dont recycle completion suggester indices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814908 (owner: 10Ebernhardson) [20:11:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:11:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:11:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:12:13] (03Merged) 10jenkins-bot: cirrus: Dont recycle completion suggester indices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814908 (owner: 10Ebernhardson) [20:12:53] (03PS1) 10CDanis: add sretools.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/815376 (https://phabricator.wikimedia.org/T313355) [20:13:08] ebernhardson: is your patch testable? on mwdebug1002 [20:13:33] cjming: yup, running a rebuild on testwiki should take only a minute [20:13:56] cjming: works as expected [20:14:03] great - syncing [20:14:13] (03CR) 10Andrew Bogott: [C: 03+2] labweb: point tlsproxy envoy at %{facts.ipaddress}:8080 for striker [puppet] - 10https://gerrit.wikimedia.org/r/812381 (https://phabricator.wikimedia.org/T306469) (owner: 10BryanDavis) [20:14:35] (03CR) 10Clare Ming: [C: 03+2] Deploy the new grid layout to group 0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814869 (https://phabricator.wikimedia.org/T312241) (owner: 10Jdlrobson) [20:14:50] (03PS8) 10Clare Ming: Deploy the new grid layout to group 0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814869 (https://phabricator.wikimedia.org/T312241) (owner: 10Jdlrobson) [20:16:12] (03CR) 10Clare Ming: [C: 03+2] Deploy the new grid layout to group 0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814869 (https://phabricator.wikimedia.org/T312241) (owner: 10Jdlrobson) [20:16:59] !log 
mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:17:18] (03PS1) 10Jforrester: Hooks: Bump scribunto-stats cache version [extensions/Scribunto] (wmf/1.39.0-wmf.21) - 10https://gerrit.wikimedia.org/r/815281 (https://phabricator.wikimedia.org/T313341) [20:17:27] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:814908|cirrus: Dont recycle completion suggester indices]] (duration: 03m 12s) [20:17:36] (03Merged) 10jenkins-bot: Deploy the new grid layout to group 0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814869 (https://phabricator.wikimedia.org/T312241) (owner: 10Jdlrobson) [20:17:46] ebernhardson: your change should be live [20:17:54] doing Jon's patch now [20:17:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:17:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:18:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P31454 and previous config saved to /var/cache/conftool/dbconfig/20220719-201802-marostegui.json [20:18:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:19:57] (03CR) 10Jforrester: [C: 03+2] Hooks: Bump scribunto-stats cache version [extensions/Scribunto] (wmf/1.39.0-wmf.21) - 10https://gerrit.wikimedia.org/r/815281 (https://phabricator.wikimedia.org/T313341) (owner: 10Jforrester) [20:20:16] (Don't mind me, just getting the train-unblocker merged. Won't get in your way of scapping.) 
[20:21:11] (03PS5) 10Clare Ming: [wmf-config]: Undeploy GDI Survey Wave 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815338 (https://phabricator.wikimedia.org/T312866) (owner: 10Eigyan) [20:21:28] eigyan: doing your patch now [20:21:45] many thanks cjming [20:21:48] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2055.codfw.wmnet with reason: host reimage [20:23:01] (03CR) 10Clare Ming: [C: 03+2] [wmf-config]: Undeploy GDI Survey Wave 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815338 (https://phabricator.wikimedia.org/T312866) (owner: 10Eigyan) [20:23:29] !log cjming@deploy1002 Synchronized wmf-config: Config: [[gerrit:814869|Deploy the new grid layout to group 0 wikis (T312241)]] (duration: 03m 05s) [20:23:33] T312241: Deploy the new grid layout - https://phabricator.wikimedia.org/T312241 [20:23:46] (03Merged) 10jenkins-bot: [wmf-config]: Undeploy GDI Survey Wave 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815338 (https://phabricator.wikimedia.org/T312866) (owner: 10Eigyan) [20:23:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:24:26] eigyan: np! can you test on mwdebug1002? [20:24:27] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2055.codfw.wmnet with reason: host reimage [20:24:41] cjming will do thank you! 
[20:24:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:24:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:25:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:25:52] cjming looks good on my end [20:25:55] (LogstashIngestSpike) resolved: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [20:26:01] great - syncing now [20:26:50] hi jan_drewniak: doing your patch now [20:27:06] (03CR) 10Clare Ming: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815360 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [20:27:45] @cjming I can do my deploy :) [20:27:53] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815360 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [20:27:57] oh - ok ! [20:28:06] go for it -- i just merged -- lmk when you're done [20:28:23] the deploy instructions aren't entirely accurate (not sure how to change them) [20:28:33] shoot - should i not have merged? [20:29:21] cjming: no no, it's all good [20:29:25] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:815338|[wmf-config]: Undeploy GDI Survey Wave 2 (T312866)]] (duration: 03m 12s) [20:29:29] T312866: Undeploy GDI Safety Survey Wave 2 from EN, ES, FA, FR, and PT wikis - https://phabricator.wikimedia.org/T312866 [20:29:40] eigyan: ^^ your change should be live [20:30:45] jan_drewniak: can i continue with the other patches in the window or do i need to wait til you're finished? 
[20:30:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:31:19] cjming: I'm syncing now, not sure if that can be done in parallel [20:31:36] np - just lmk when it's done [20:32:25] (03PS5) 10Clare Ming: Add file mover user group for azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774841 (https://phabricator.wikimedia.org/T304968) (owner: 10NguoiDungKhongDinhDanh) [20:32:41] koi: i think i saw earlier that you're here -- queuing up your patches next [20:32:59] got it [20:33:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T312990)', diff saved to https://phabricator.wikimedia.org/P31455 and previous config saved to /var/cache/conftool/dbconfig/20220719-203307-marostegui.json [20:33:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1105.eqiad.wmnet with reason: Maintenance [20:33:14] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [20:33:23] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1105.eqiad.wmnet with reason: Maintenance [20:33:23] !log jdrewniak@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:815360| Bumping portals to master (T128546)]] (duration: 03m 09s) [20:33:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T312990)', diff saved to https://phabricator.wikimedia.org/P31456 and previous config saved to /var/cache/conftool/dbconfig/20220719-203327-marostegui.json [20:33:28] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [20:35:00] (03Merged) 10jenkins-bot: Hooks: Bump scribunto-stats cache version [extensions/Scribunto] (wmf/1.39.0-wmf.21) - 10https://gerrit.wikimedia.org/r/815281 (https://phabricator.wikimedia.org/T313341) (owner: 10Jforrester) 
[20:35:47] jan_drewniak: am i gtg? looks like your stuff sync'd [20:36:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T312990)', diff saved to https://phabricator.wikimedia.org/P31457 and previous config saved to /var/cache/conftool/dbconfig/20220719-203613-marostegui.json [20:36:16] !log jdrewniak@deploy1002 Synchronized portals: Wikimedia Portals Update: [[gerrit:815360| Bumping portals to master (T128546)]] (duration: 02m 53s) [20:36:25] cjming: yup all done [20:36:31] ty! [20:36:42] (03CR) 10Clare Ming: [C: 03+2] Add file mover user group for azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774841 (https://phabricator.wikimedia.org/T304968) (owner: 10NguoiDungKhongDinhDanh) [20:37:29] (03Merged) 10jenkins-bot: Add file mover user group for azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/774841 (https://phabricator.wikimedia.org/T304968) (owner: 10NguoiDungKhongDinhDanh) [20:37:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:37:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:38:17] koi: your 1st patch is up on mwdebug1002 - can you check? [20:38:25] looking [20:38:43] (03PS1) 10Andrew Bogott: Put cloudweb100[34] into service [puppet] - 10https://gerrit.wikimedia.org/r/815378 (https://phabricator.wikimedia.org/T305414) [20:39:50] cjming: lgtm [20:39:57] great - going live [20:40:14] (03PS4) 10Clare Ming: Add "uploader" user group for kswiki. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/776334 (https://phabricator.wikimedia.org/T305320) (owner: 10NguoiDungKhongDinhDanh) [20:42:06] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2055.codfw.wmnet with OS bullseye [20:43:08] PROBLEM - k8s requests count to the API on ml-serve-ctrl2002 is CRITICAL: 100.8 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [20:43:20] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:774841|Add file mover user group for azwiki (T304968)]] (duration: 03m 15s) [20:43:23] (03CR) 10Clare Ming: [C: 03+2] Add "uploader" user group for kswiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776334 (https://phabricator.wikimedia.org/T305320) (owner: 10NguoiDungKhongDinhDanh) [20:43:24] T304968: Add file mover user group to az.wiki - https://phabricator.wikimedia.org/T304968 [20:44:10] (03Merged) 10jenkins-bot: Add "uploader" user group for kswiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776334 (https://phabricator.wikimedia.org/T305320) (owner: 10NguoiDungKhongDinhDanh) [20:44:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:44:46] koi: 1st patch live, 2nd patch on mwdebug1002 [20:45:12] (03PS2) 10Clare Ming: uzwiki: Create "eliminator" group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815374 (https://phabricator.wikimedia.org/T302670) (owner: 10Stang) [20:45:34] cjming, is the patch for azwiki really live now? 
I failed to find such group w/o setting of WikimediaDebug [20:45:42] you could check at https://az.wikipedia.org/wiki/X%C3%BCsusi:ListGroupRights [20:47:28] koi: i see "movefile" on that page [20:47:48] but only one time, and it should appear twice [20:48:14] oh looks ok now, maybe a cache issue [20:48:29] 2nd patch LGTM, btw [20:48:47] hmm -- oh good [20:49:05] "oh looks ok now, maybe a cache issue" no, I checked again and it's still a problem [20:49:25] it looks nice at mwdebug1002, but not ok on prod [20:49:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:50:59] koi: ya - i'm guessing it's cache -- i'm seeing it on the debug server but not on prod either [20:51:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P31458 and previous config saved to /var/cache/conftool/dbconfig/20220719-205118-marostegui.json [20:51:30] ...but maybe a bug with scap, I'm not sure? [20:52:09] koi: Was this a config change? If so, it's probably T311788 [20:52:09] T311788: MW wmf-config tmp cache stays outdated after Scap deploy (opcache revalidation is off) - https://phabricator.wikimedia.org/T311788 [20:52:21] yes it is [20:52:21] in which case the workaround is to re-run the last sync command. [20:52:47] alrighty - i'll run it one more time [20:53:38] PROBLEM - SSH on mw1321.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:53:58] Sorry about the troubles. I made a commit with a fix / workaround today: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/815317/ Hopefully it will be processed soon. [20:54:44] dancy: no worries! thanks for piping in -- so in the event this happens again, just re-run scap sync? 
[20:54:54] Yes please [20:56:09] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:774841|Add file mover user group for azwiki (T304968)]] (duration: 02m 52s) [20:56:12] T304968: Add file mover user group to az.wiki - https://phabricator.wikimedia.org/T304968 [20:56:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:56:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:56:37] koi: now i see it on prod [20:56:40] looks good now, thanks! [20:56:46] syncing your 2nd patch now [20:58:05] (03PS2) 10Ahmon Dancy: MWConfigCacheGenerator: If opcache.revalidate_freq is 0, use grace period of 10 seconds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815317 (https://phabricator.wikimedia.org/T311788) [21:00:14] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:776334|Add "uploader" user group for kswiki. (T305320)]] (duration: 02m 58s) [21:00:19] T305320: Create 'Uploaders' user group on kswiki. - https://phabricator.wikimedia.org/T305320 [21:00:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:00:52] dancy: we saw issues in afternoon swat too [21:00:55] koi: can you check that 2nd patch is ok on prod? working on your 3rd patch now [21:01:09] it's fine on prod [21:01:13] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2026.codfw.wmnet with OS bullseye [21:01:14] (03CR) 10Clare Ming: [C: 03+2] uzwiki: Create "eliminator" group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815374 (https://phabricator.wikimedia.org/T302670) (owner: 10Stang) [21:01:30] RhinosF1: Ack. 
:-( [21:01:57] (03Merged) 10jenkins-bot: uzwiki: Create "eliminator" group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815374 (https://phabricator.wikimedia.org/T302670) (owner: 10Stang) [21:02:37] dancy: don't worry [21:02:49] Resync worked [21:02:55] koi: 3rd patch on mwdebug1002 [21:03:07] looking [21:03:27] It was nice to do some changes again [21:04:19] cjming: LGTM [21:04:24] yay - syncing [21:05:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:06:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P31459 and previous config saved to /var/cache/conftool/dbconfig/20220719-210623-marostegui.json [21:07:52] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:815374|uzwiki: Create "eliminator" group (T302670)]] (duration: 03m 19s) [21:07:56] T302670: Enable 'eliminator' flag on uzwiki and grant administrators to add/remove this group - https://phabricator.wikimedia.org/T302670 [21:08:00] cjming: All done? [21:08:17] James_F: i think so - let me just confirm [21:08:23] Thanks. [21:08:24] koi: last patch should be live [21:09:36] unfortunately, it is still not alive on prod.. [21:09:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:09:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:09:48] hmm - i'll try syncing again - 1 sec [21:10:20] oh it's fine now, my cache issue [21:10:42] whoops - oh well - double-sync just to be sure [21:10:58] please ignore my post a few seconds ago, it is indeed not synced :( [21:11:00] measure once, sync twice. [21:11:17] I just failed to check my WikimediaDebug setting [21:11:25] *forgot [21:11:28] not in vain after all - syncing again now [21:11:31] That's what the helmfile deploy does after all. 
;-) [21:12:46] James_F: I just added you as a reviewer on https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/815317/ Can you have a look? it deals w/ the sync confidence issue (one of them, at least) [21:13:02] dancy: Oh, yeah, I saw that earlier. Looks sensible. [21:13:20] Is there a risk of a stampede at time+10 though? [21:13:28] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:815374|uzwiki: Create "eliminator" group (T302670)]] (duration: 03m 13s) [21:13:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:13:31] T302670: Enable 'eliminator' flag on uzwiki and grant administrators to add/remove this group - https://phabricator.wikimedia.org/T302670 [21:13:36] No more risk than there was before. [21:13:36] koi: how about now? [21:13:45] dancy: Yeah, fair. Go for it, I say. [21:14:12] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2026.codfw.wmnet with reason: host reimage [21:14:25] it's fine this time [21:14:31] great! [21:14:37] !log end of UTC late backport window [21:14:39] OK, I'm deploying a train unblocker. [21:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:46] James_F: all done, all yours [21:15:05] Nice work cjming! [21:15:11] James_F: I'm in line after you. [21:15:13] (Command staged in my terminal for 20 mins. ;-) ) [21:15:27] dancy: thank you for your help! 
[21:16:25] (03PS3) 10Jdlrobson: Deploy the new grid layout to group 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814906 (https://phabricator.wikimedia.org/T312241) [21:16:38] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2026.codfw.wmnet with reason: host reimage [21:17:55] !log jforrester@deploy1002 Synchronized php-1.39.0-wmf.21/extensions/Scribunto/includes/Hooks.php: Train unblocker: [[gerrit:815281|Hooks: Bump scribunto-stats cache version (T313341)]] (duration: 03m 14s) [21:17:58] Kk, all done from me; dancy, over to you; jeena, train should now be unblocked. [21:17:59] T313341: PHP Notice: Undefined property: Wikimedia\PSquare::$increments - https://phabricator.wikimedia.org/T313341 [21:18:16] thanks James_F [21:19:10] (03CR) 10Ahmon Dancy: [C: 03+2] MWConfigCacheGenerator: If opcache.revalidate_freq is 0, use grace period of 10 seconds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815317 (https://phabricator.wikimedia.org/T311788) (owner: 10Ahmon Dancy) [21:20:27] (03Merged) 10jenkins-bot: MWConfigCacheGenerator: If opcache.revalidate_freq is 0, use grace period of 10 seconds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815317 (https://phabricator.wikimedia.org/T311788) (owner: 10Ahmon Dancy) [21:21:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T312990)', diff saved to https://phabricator.wikimedia.org/P31460 and previous config saved to /var/cache/conftool/dbconfig/20220719-212128-marostegui.json [21:21:30] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1146.eqiad.wmnet with reason: Maintenance [21:21:38] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [21:21:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1146.eqiad.wmnet with reason: Maintenance [21:21:49] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T312990)', diff saved to https://phabricator.wikimedia.org/P31461 and previous config saved to /var/cache/conftool/dbconfig/20220719-212149-marostegui.json [21:23:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:24:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T312990)', diff saved to https://phabricator.wikimedia.org/P31462 and previous config saved to /var/cache/conftool/dbconfig/20220719-212431-marostegui.json [21:24:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:24:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:25:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:26:47] !log dancy@deploy1002 Synchronized multiversion/MWConfigCacheGenerator.php: Config: [[gerrit:815317|MWConfigCacheGenerator: If opcache.revalidate_freq is 0, use grace period of 10 seconds (T311788)]] (duration: 02m 59s) [21:26:51] T311788: MW wmf-config tmp cache stays outdated after Scap deploy (opcache revalidation is off) - https://phabricator.wikimedia.org/T311788 [21:27:11] I'm done: jeena you're up. [21:27:17] thanks dancy! 
[21:30:03] (03PS1) 10TrainBranchBot: testwikis wikis to 1.39.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815383 (https://phabricator.wikimedia.org/T308074) [21:30:05] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.39.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815383 (https://phabricator.wikimedia.org/T308074) (owner: 10TrainBranchBot) [21:31:18] (03Merged) 10jenkins-bot: testwikis wikis to 1.39.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815383 (https://phabricator.wikimedia.org/T308074) (owner: 10TrainBranchBot) [21:31:26] cjming & RhinosF1: The configuration-not-taking-effect problem should be resolved now. Please ping me if not. [21:31:47] will do - thanks dancy! [21:32:12] !log jhuneidi@deploy1002 Started scap: testwikis wikis to 1.39.0-wmf.21 refs T308074 [21:32:16] T308074: 1.39.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T308074 [21:33:12] dancy: ok [21:33:24] (03CR) 10Novem Linguae: "Hi Stang. 
Please see my detailed analysis at https://phabricator.wikimedia.org/T310974#8083456 which should completely resolve Krinkle's c" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808424 (https://phabricator.wikimedia.org/T310974) (owner: 10Stang) [21:33:41] dancy: thanks for getting it fixed [21:35:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:36:14] !log jhuneidi@deploy1002 Finished scap: testwikis wikis to 1.39.0-wmf.21 refs T308074 (duration: 04m 02s) [21:36:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:36:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:37:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:38:31] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2026.codfw.wmnet with OS bullseye [21:39:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P31463 and previous config saved to /var/cache/conftool/dbconfig/20220719-213936-marostegui.json [21:40:22] (03PS1) 10TrainBranchBot: group0 wikis to 1.39.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815384 (https://phabricator.wikimedia.org/T308074) [21:40:24] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.39.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815384 (https://phabricator.wikimedia.org/T308074) (owner: 10TrainBranchBot) [21:41:58] (03Merged) 10jenkins-bot: group0 wikis to 1.39.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815384 (https://phabricator.wikimedia.org/T308074) (owner: 10TrainBranchBot) [21:45:36] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.21 refs T308074 [21:45:45] T308074: 1.39.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T308074 
[21:47:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:47:45] PROBLEM - Check systemd state on elastic2026 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:48:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:48:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:49:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:52:03] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is CRITICAL: 25.85 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [21:54:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P31464 and previous config saved to /var/cache/conftool/dbconfig/20220719-215441-marostegui.json [22:00:48] RECOVERY - Check systemd state on elastic2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:04:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [22:09:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T312990)', diff saved to https://phabricator.wikimedia.org/P31465 and previous config saved to /var/cache/conftool/dbconfig/20220719-220946-marostegui.json [22:09:49] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1156.eqiad.wmnet with reason: Maintenance [22:09:51] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [22:10:03] !log marostegui@cumin1001 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1156.eqiad.wmnet with reason: Maintenance [22:10:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [22:10:12] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2050.codfw.wmnet with OS bullseye [22:10:31] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [22:10:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T312990)', diff saved to https://phabricator.wikimedia.org/P31466 and previous config saved to /var/cache/conftool/dbconfig/20220719-221035-marostegui.json [22:11:10] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [22:11:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [22:11:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [22:13:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T312990)', diff saved to https://phabricator.wikimedia.org/P31467 and previous config saved to /var/cache/conftool/dbconfig/20220719-221312-marostegui.json [22:18:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [22:20:36] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [22:23:44] (03CR) 10Dzahn: [C: 03+2] "rr="Get \"https://[2620:0:861:102:10:64:16:39]:1443/otrs/index.pl\": dial tcp [2620:0:861:102:10:64:16:39]:1443: connect: connection refus" [puppet] - 10https://gerrit.wikimedia.org/r/812326 (https://phabricator.wikimedia.org/T312194) (owner: 10Dzahn) [22:28:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P31468 and previous config saved to /var/cache/conftool/dbconfig/20220719-222818-marostegui.json [22:28:28] (03CR) 10Dzahn: [C: 03+2] "envoy on otrs1001 is only listening on IPv4 but monitoring tries IPv6 by default" [puppet] - 10https://gerrit.wikimedia.org/r/812326 (https://phabricator.wikimedia.org/T312194) (owner: 10Dzahn) [22:30:49] (03Restored) 10Stang: enwiki: Raise wgPageTriageMaxAge to indefinite [mediawiki-config] - 10https://gerrit.wikimedia.org/r/808424 (https://phabricator.wikimedia.org/T310974) (owner: 10Stang) [22:31:26] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2050.codfw.wmnet with reason: host reimage [22:35:00] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2050.codfw.wmnet with reason: host reimage [22:38:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [22:39:16] (03PS1) 10Dzahn: vrts/prometheus: configure monitoring to use only IPv4 [puppet] - 10https://gerrit.wikimedia.org/r/815386 (https://phabricator.wikimedia.org/T312194) [22:40:28] (03CR) 10CI reject: [V: 04-1] vrts/prometheus: configure monitoring to use only IPv4 [puppet] - 10https://gerrit.wikimedia.org/r/815386 (https://phabricator.wikimedia.org/T312194) (owner: 10Dzahn) [22:42:17] (03PS2) 10Dzahn: vrts/prometheus: configure monitoring to use only IPv4 [puppet] - 
https://gerrit.wikimedia.org/r/815386 (https://phabricator.wikimedia.org/T312194) [22:43:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P31469 and previous config saved to /var/cache/conftool/dbconfig/20220719-224323-marostegui.json [22:45:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [22:45:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [22:48:05] (CR) Dzahn: [C: +1] "code change looks good. this would copy other micro sites on miscweb. the name also seems fine with me" [dns] - https://gerrit.wikimedia.org/r/815376 (https://phabricator.wikimedia.org/T313355) (owner: CDanis) [22:51:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [22:53:11] (PS4) Samtar: enwiki: Raise wgPageTriageMaxAge to indefinite [mediawiki-config] - https://gerrit.wikimedia.org/r/808424 (https://phabricator.wikimedia.org/T310974) (owner: Stang) [22:55:20] (PS1) Andrew Bogott: Keystone: reduce number of admin workers from 8 to 4 [puppet] - https://gerrit.wikimedia.org/r/815388 [22:55:22] (PS1) Andrew Bogott: neutron: reduce api and rpc workers from 8/10 to 6/8 [puppet] - https://gerrit.wikimedia.org/r/815389 [22:57:34] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2050.codfw.wmnet with OS bullseye [22:58:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T312990)', diff saved to https://phabricator.wikimedia.org/P31470 and previous config saved to /var/cache/conftool/dbconfig/20220719-225828-marostegui.json [22:58:30] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [22:58:32] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - 
https://phabricator.wikimedia.org/T312990 [22:58:43] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [22:58:58] (CR) Andrew Bogott: [C: +2] Keystone: reduce number of admin workers from 8 to 4 [puppet] - https://gerrit.wikimedia.org/r/815388 (owner: Andrew Bogott) [22:59:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2104.codfw.wmnet with reason: Maintenance [22:59:48] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2104.codfw.wmnet with reason: Maintenance [22:59:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on 8 hosts with reason: Maintenance [22:59:57] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on 8 hosts with reason: Maintenance [23:02:04] (CR) Dzahn: [C: +2] vrts/prometheus: configure monitoring to use only IPv4 [puppet] - https://gerrit.wikimedia.org/r/815386 (https://phabricator.wikimedia.org/T312194) (owner: Dzahn) [23:05:55] (PS1) Dzahn: vrts/prometheus: fix IP family name, ip4 not ipv4 [puppet] - https://gerrit.wikimedia.org/r/815390 (https://phabricator.wikimedia.org/T312194) [23:06:22] (CR) CI reject: [V: -1] vrts/prometheus: fix IP family name, ip4 not ipv4 [puppet] - https://gerrit.wikimedia.org/r/815390 (https://phabricator.wikimedia.org/T312194) (owner: Dzahn) [23:06:40] (PS2) Dzahn: vrts/prometheus: fix IP family name, ip4 not ipv4 [puppet] - https://gerrit.wikimedia.org/r/815390 (https://phabricator.wikimedia.org/T312194) [23:07:08] (CR) Andrew Bogott: [C: +2] neutron: reduce api and rpc workers from 8/10 to 6/8 [puppet] - https://gerrit.wikimedia.org/r/815389 (owner: Andrew Bogott) [23:11:51] PROBLEM - Check systemd state on elastic2050 is CRITICAL: CRITICAL - degraded: The following units failed: 
wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9200.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:15:12] (PS1) Andrew Bogott: Glance: reduce number of api workers per host from 8 to 2. [puppet] - https://gerrit.wikimedia.org/r/815392 [23:16:05] (CR) CI reject: [V: -1] Glance: reduce number of api workers per host from 8 to 2. [puppet] - https://gerrit.wikimedia.org/r/815392 (owner: Andrew Bogott) [23:18:43] (PS2) Andrew Bogott: Glance: reduce number of api workers per host from 8 to 2. [puppet] - https://gerrit.wikimedia.org/r/815392 [23:20:18] (CR) Dzahn: [C: +2] vrts/prometheus: fix IP family name, ip4 not ipv4 [puppet] - https://gerrit.wikimedia.org/r/815390 (https://phabricator.wikimedia.org/T312194) (owner: Dzahn) [23:20:38] (CR) Andrew Bogott: [C: +2] Glance: reduce number of api workers per host from 8 to 2. [puppet] - https://gerrit.wikimedia.org/r/815392 (owner: Andrew Bogott) [23:23:12] (PS4) Dzahn: alertmanager: switch IRC channel for gitlab (serviceops-collab) alerts [puppet] - https://gerrit.wikimedia.org/r/814926 [23:24:30] (CR) Dzahn: [C: +2] "after more follow-ups I am just moving it to #wikimedia-serviceops-test right now to watch it a couple days. 
once stable will make a chang" [puppet] - https://gerrit.wikimedia.org/r/814926 (owner: Dzahn) [23:30:53] PROBLEM - Check systemd state on elastic2051 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:43:27] PROBLEM - k8s requests count to the API on ml-serve-ctrl2002 is CRITICAL: 101.8 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [23:44:12] (CR) Dzahn: phabricator: switch to prometheus-only network probes/checks (1 comment) [puppet] - https://gerrit.wikimedia.org/r/812846 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi) [23:46:25] (CR) Dzahn: "or is there a way to test the actual command it will run before merge?" [puppet] - https://gerrit.wikimedia.org/r/812846 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi)