[00:03:47] !log removing 1 file for legal compliance
[00:03:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:03:53] (last one, I promise)
[00:05:02] (CR) Dzahn: [C: +1] "compiler output looks ok though: https://puppet-compiler.wmflabs.org/pcc-worker1001/38071/" [puppet] - https://gerrit.wikimedia.org/r/715636 (owner: Legoktm)
[01:34:22] PROBLEM - Check systemd state on wcqs2001 is CRITICAL: CRITICAL - degraded: The following units failed: mnt-nfs-dumps\x2dclouddumps1001.wikimedia.org.mount,mnt-nfs-dumps\x2dclouddumps1002.wikimedia.org.mount,mnt-nfs-dumps\x2dlabstore1006.wikimedia.org.mount,mnt-nfs-dumps\x2dlabstore1007.wikimedia.org.mount,nginx.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:38:52] (JobUnavailable) firing: (6) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:43:35] SRE, ops-codfw, Discovery-Search (Current work): Degraded RAID on elastic2052 - https://phabricator.wikimedia.org/T320482 (RKemper) @Papaul Yup per jbond's comment above we're still seeing the RAID issue. Could we try either rebuilding raid with the current disk, or swapping in a new one and rebuildi...
[01:43:52] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:48:52] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:53:52] (JobUnavailable) firing: (9) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:08:52] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:14:24] PROBLEM - SSH on mw1337.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:50:10] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 103 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[02:52:10] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 4 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:17:00] (CR) RLazarus: "Perfect, thanks for looking at this!" [puppet] - https://gerrit.wikimedia.org/r/854521 (owner: Volans)
[03:33:31] (PS31) Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040)
[03:35:58] (CR) CI reject: [V: -1] wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[03:51:10] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[03:55:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T318605)', diff saved to https://phabricator.wikimedia.org/P38858 and previous config saved to /var/cache/conftool/dbconfig/20221110-035509-ladsgroup.json
[03:55:14] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[03:55:39] (PS26) Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040)
[03:56:56] (CR) Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[03:57:09] (PS27) Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040)
[03:59:33] (CR) CI reject: [V: -1] Modify maintain-dbusers.py to call the rest-api service [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[04:10:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P38859 and previous config saved to /var/cache/conftool/dbconfig/20221110-041016-ladsgroup.json
[04:13:12] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:16:18] RECOVERY - SSH on mw1337.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:25:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P38860 and previous config saved to /var/cache/conftool/dbconfig/20221110-042522-ladsgroup.json
[04:40:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T318605)', diff saved to https://phabricator.wikimedia.org/P38861 and previous config saved to /var/cache/conftool/dbconfig/20221110-044028-ladsgroup.json
[04:40:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1143.eqiad.wmnet with reason: Maintenance
[04:40:34] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605
[04:40:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1143.eqiad.wmnet with reason: Maintenance
[04:40:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T318605)', diff saved to https://phabricator.wikimedia.org/P38862 and previous config saved to /var/cache/conftool/dbconfig/20221110-044050-ladsgroup.json
[04:44:04] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 2 (netmon2002, ...), Fresh: 122 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:13:46] PROBLEM - Check systemd state on phab2002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_aphlict.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:41:28] PROBLEM - BGP status on cr2-eqord is CRITICAL: BGP CRITICAL - AS6939/IPv6: Idle - HE, AS13030/IPv4: Idle - Init7, AS13030/IPv6: Idle - Init7, AS6939/IPv4: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:45:02] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 124 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[06:13:01] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1098.eqiad.wmnet
[06:20:55] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: (2) Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[06:21:02] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1098.eqiad.wmnet
[06:22:22] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1097.eqiad.wmnet
[06:30:28] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1097.eqiad.wmnet
[06:32:07] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1096.eqiad.wmnet
[06:35:55] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) resolved: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[06:40:10] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1096.eqiad.wmnet
[06:41:23] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-tool1011.eqiad.wmnet
[06:45:05] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1011.eqiad.wmnet
[07:00:04] kormat, marostegui, and Amir1: How many deployers does it take to do Primary database switchover deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221110T0700).
[07:00:13] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1023.eqiad.wmnet with OS bullseye
[07:07:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2107.codfw.wmnet with reason: Maintenance
[07:08:06] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2127.codfw.wmnet with reason: Maintenance
[07:08:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2107.codfw.wmnet with reason: Maintenance
[07:08:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2127.codfw.wmnet with reason: Maintenance
[07:19:48] (CR) Giuseppe Lavagetto: [C: +2] Fixup development tooling for wider compatibility [deployment-charts] - https://gerrit.wikimedia.org/r/845680 (owner: Stef Dunlap)
[07:30:08] (PS16) Giuseppe Lavagetto: New organization of templates [deployment-charts] - https://gerrit.wikimedia.org/r/837495
[07:30:31] (CR) CI reject: [V: -1] New organization of templates [deployment-charts] - https://gerrit.wikimedia.org/r/837495 (owner: Giuseppe Lavagetto)
[07:31:44] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1122.eqiad.wmnet with reason: Maintenance
[07:31:47] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1123.eqiad.wmnet with reason: Maintenance
[07:31:57] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1122.eqiad.wmnet with reason: Maintenance
[07:32:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1123.eqiad.wmnet with reason: Maintenance
[07:37:37] (PS17) Giuseppe Lavagetto: New organization of templates [deployment-charts] - https://gerrit.wikimedia.org/r/837495
[07:49:51] SRE, API Platform, serviceops: Block non-browser requests that use generic user agent (UA) headers - https://phabricator.wikimedia.org/T319423 (Joe) FWIW we're banning more generic UAs via dynamic requestctl rules; our rule of thumb is to start rate-limiting requests from a specific UA only when it s...
[07:50:44] SRE, API Platform, Traffic: Block non-browser requests that use generic user agent (UA) headers - https://phabricator.wikimedia.org/T319423 (Joe)
[07:52:18] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:52:25] (PS1) Hashar: Merge tag 'v3.4.8' into wmf/stable-3.4 [software/gerrit] (wmf/stable-3.4) - https://gerrit.wikimedia.org/r/855480 (https://phabricator.wikimedia.org/T322724)
[07:55:39] (PS2) Ryan Kemper: elastic: finish decom of elastic2049 [puppet] - https://gerrit.wikimedia.org/r/855004 (https://phabricator.wikimedia.org/T313842) (owner: Bking)
[07:57:08] (CR) JMeybohm: [C: +1] New organization of templates [deployment-charts] - https://gerrit.wikimedia.org/r/837495 (owner: Giuseppe Lavagetto)
[07:58:20] ACKNOWLEDGEMENT - Check systemd state on wcqs2001 is CRITICAL: CRITICAL - degraded: The following units failed: mnt-nfs-dumps\x2dlabstore1006.wikimedia.org.mount,mnt-nfs-dumps\x2dlabstore1007.wikimedia.org.mount Ryan Kemper T316236 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:00:04] Amir1, apergos, and jnuche: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221110T0800).
[08:00:14] morning! there are no trainees signed up for the window and no patches scheduled for deployment either, so that's a wrap :-D
[08:00:15] (CR) Hashar: [C: +2] Merge tag 'v3.4.8' into wmf/stable-3.4 [software/gerrit] (wmf/stable-3.4) - https://gerrit.wikimedia.org/r/855480 (https://phabricator.wikimedia.org/T322724) (owner: Hashar)
[08:04:36] (CR) Ryan Kemper: [C: +2] elastic: finish decom of elastic2049 [puppet] - https://gerrit.wikimedia.org/r/855004 (https://phabricator.wikimedia.org/T313842) (owner: Bking)
[08:06:17] (CR) Hashar: [C: +2] "Will adjust as needed" [software/gerrit] (deploy/wmf/stable-3.4) - https://gerrit.wikimedia.org/r/853306 (https://phabricator.wikimedia.org/T319378) (owner: Hashar)
[08:06:39] (PS1) Hashar: Update plugins for Gerrit 3.4.8 [software/gerrit] (deploy/wmf/stable-3.4) - https://gerrit.wikimedia.org/r/855482 (https://phabricator.wikimedia.org/T322724)
[08:07:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[08:07:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[08:07:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T321123)', diff saved to https://phabricator.wikimedia.org/P38863 and previous config saved to /var/cache/conftool/dbconfig/20221110-080746-marostegui.json
[08:07:50] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123
[08:08:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[08:08:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[08:08:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T321130)', diff saved to https://phabricator.wikimedia.org/P38864 and previous config saved to /var/cache/conftool/dbconfig/20221110-080823-marostegui.json
[08:08:27] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[08:08:38] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 136933
[08:09:14] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 136933
[08:09:34] (CR) CI reject: [V: -1] Update plugins for Gerrit 3.4.8 [software/gerrit] (deploy/wmf/stable-3.4) - https://gerrit.wikimedia.org/r/855482 (https://phabricator.wikimedia.org/T322724) (owner: Hashar)
[08:09:37] (PS2) Hashar: Gerrit 3.4.8 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.4) - https://gerrit.wikimedia.org/r/855482 (https://phabricator.wikimedia.org/T322724)
[08:09:41] (Merged) jenkins-bot: Merge tag 'v3.4.8' into wmf/stable-3.4 [software/gerrit] (wmf/stable-3.4) - https://gerrit.wikimedia.org/r/855480 (https://phabricator.wikimedia.org/T322724) (owner: Hashar)
[08:09:47] (Merged) jenkins-bot: build: add eslint for JavaScript plugins [software/gerrit] (deploy/wmf/stable-3.4) - https://gerrit.wikimedia.org/r/853306 (https://phabricator.wikimedia.org/T319378) (owner: Hashar)
[08:10:08] (CR) CI reject: [V: -1] Gerrit 3.4.8 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.4) - https://gerrit.wikimedia.org/r/855482 (https://phabricator.wikimedia.org/T322724) (owner: Hashar)
[08:12:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T321130)', diff saved to https://phabricator.wikimedia.org/P38865 and previous config saved to /var/cache/conftool/dbconfig/20221110-081216-marostegui.json
[08:16:26] (PS14) Elukey: Add a basic puppetization for Benthos [puppet] - https://gerrit.wikimedia.org/r/854487 (https://phabricator.wikimedia.org/T314981)
[08:16:28] (PS16) Elukey: centrallog: add first prototype of webrequest-live with Benthos [puppet] - https://gerrit.wikimedia.org/r/854499 (https://phabricator.wikimedia.org/T314981)
[08:17:50] !log installing pixman security updates on buster
[08:17:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:26] (CR) Hashar: "recheck" [software/gerrit] (deploy/wmf/stable-3.4) - https://gerrit.wikimedia.org/r/855482 (https://phabricator.wikimedia.org/T322724) (owner: Hashar)
[08:18:39] (CR) CI reject: [V: -1] Add a basic puppetization for Benthos [puppet] - https://gerrit.wikimedia.org/r/854487 (https://phabricator.wikimedia.org/T314981) (owner: Elukey)
[08:19:05] (CR) CI reject: [V: -1] centrallog: add first prototype of webrequest-live with Benthos [puppet] - https://gerrit.wikimedia.org/r/854499 (https://phabricator.wikimedia.org/T314981) (owner: Elukey)
[08:19:42] mmmm
[08:19:44] Error: Error while evaluating a Resource Statement, Could not find declared class openstack::nova::common::victoria::buster
[08:21:58] maybe I missed a rebase
[08:22:08] (PS15) Elukey: Add a basic puppetization for Benthos [puppet] - https://gerrit.wikimedia.org/r/854487 (https://phabricator.wikimedia.org/T314981)
[08:22:10] (PS17) Elukey: centrallog: add first prototype of webrequest-live with Benthos [puppet] - https://gerrit.wikimedia.org/r/854499 (https://phabricator.wikimedia.org/T314981)
[08:23:29] (CR) Hashar: [C: +2] Gerrit 3.4.8 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.4) - https://gerrit.wikimedia.org/r/855482 (https://phabricator.wikimedia.org/T322724) (owner: Hashar)
[08:24:00] (Merged) jenkins-bot: Gerrit 3.4.8 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.4) - https://gerrit.wikimedia.org/r/855482 (https://phabricator.wikimedia.org/T322724) (owner: Hashar)
[08:24:17] (CR) CI reject: [V: -1] Add a basic puppetization for Benthos [puppet] - https://gerrit.wikimedia.org/r/854487 (https://phabricator.wikimedia.org/T314981) (owner: Elukey)
[08:24:49] (CR) CI reject: [V: -1] centrallog: add first prototype of webrequest-live with Benthos [puppet] - https://gerrit.wikimedia.org/r/854499 (https://phabricator.wikimedia.org/T314981) (owner: Elukey)
[08:26:18] ah no it is profile_openstack_base_nova_compute_service_spec.rb, probably outdated?
[08:26:37] !log hashar@deploy1002 Started deploy [gerrit/gerrit@84648b3]: Gerrit to 3.4.8 on gerrit2002 # T322724
[08:26:42] T322724: Upgrade Gerrit to 3.4.8 - https://phabricator.wikimedia.org/T322724
[08:26:48] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@84648b3]: Gerrit to 3.4.8 on gerrit2002 # T322724 (duration: 00m 10s)
[08:27:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P38866 and previous config saved to /var/cache/conftool/dbconfig/20221110-082722-marostegui.json
[08:28:31] (PS16) Elukey: Add a basic puppetization for Benthos [puppet] - https://gerrit.wikimedia.org/r/854487 (https://phabricator.wikimedia.org/T314981)
[08:28:33] (PS18) Elukey: centrallog: add first prototype of webrequest-live with Benthos [puppet] - https://gerrit.wikimedia.org/r/854499 (https://phabricator.wikimedia.org/T314981)
[08:28:35] (PS1) Elukey: Remove old openstack nova spec test [puppet] - https://gerrit.wikimedia.org/r/855484
[08:30:33] !log hashar@deploy1002 Started deploy [gerrit/gerrit@84648b3]: Gerrit to 3.4.8 on gerrit1001 # T322724
[08:30:41] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@84648b3]: Gerrit to 3.4.8 on gerrit1001 # T322724 (duration: 00m 08s)
[08:32:13] I am restarting Gerrit
[08:36:21] elukey: I have completed the Gerrit restart
[08:36:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T321123)', diff saved to https://phabricator.wikimedia.org/P38867 and previous config saved to /var/cache/conftool/dbconfig/20221110-083655-marostegui.json
[08:37:00] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123
[08:37:27] (CR) Arturo Borrero Gonzalez: Remove old openstack nova spec test (2 comments) [puppet] - https://gerrit.wikimedia.org/r/855484 (owner: Elukey)
[08:42:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P38868 and previous config saved to /var/cache/conftool/dbconfig/20221110-084229-marostegui.json
[08:46:00] (CR) David Caro: [C: +2] global: replace labsproject by wmcs_project (1 comment) [puppet] - https://gerrit.wikimedia.org/r/849473 (owner: David Caro)
[08:46:36] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1013.eqiad.wmnet with OS bullseye
[08:46:42] SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1013.eqiad.wmnet with OS bullseye
[08:52:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P38869 and previous config saved to /var/cache/conftool/dbconfig/20221110-085201-marostegui.json
[08:57:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T321130)', diff saved to https://phabricator.wikimedia.org/P38870 and previous config saved to /var/cache/conftool/dbconfig/20221110-085735-marostegui.json
[08:57:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1100.eqiad.wmnet with reason: Maintenance
[08:57:41] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130
[08:57:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1100.eqiad.wmnet with reason: Maintenance
[08:57:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1100 (T321130)', diff saved to https://phabricator.wikimedia.org/P38871 and previous config saved to /var/cache/conftool/dbconfig/20221110-085756-marostegui.json
[08:58:28] !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart rolling restart_daemons on A:maps-replica-codfw
[08:58:57] (PS1) Marostegui: mariadb: Promote es1024 to es5 master [puppet] - https://gerrit.wikimedia.org/r/855488 (https://phabricator.wikimedia.org/T322187)
[09:00:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T321130)', diff saved to https://phabricator.wikimedia.org/P38872 and previous config saved to /var/cache/conftool/dbconfig/20221110-090023-marostegui.json
[09:00:49] (PS1) Marostegui: db-production.php: Disable es5 writes [mediawiki-config] - https://gerrit.wikimedia.org/r/855489 (https://phabricator.wikimedia.org/T322187)
[09:00:54] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1013.eqiad.wmnet with reason: host reimage
[09:02:41] (PS2) Elukey: Update openstack nova spec test [puppet] - https://gerrit.wikimedia.org/r/855484
[09:02:43] (PS17) Elukey: Add a basic puppetization for Benthos [puppet] - https://gerrit.wikimedia.org/r/854487 (https://phabricator.wikimedia.org/T314981)
[09:02:45] (PS19) Elukey: centrallog: add first prototype of webrequest-live with Benthos [puppet] - https://gerrit.wikimedia.org/r/854499 (https://phabricator.wikimedia.org/T314981)
[09:03:07] (CR) Elukey: Update openstack nova spec test (2 comments) [puppet] - https://gerrit.wikimedia.org/r/855484 (owner: Elukey)
[09:04:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1013.eqiad.wmnet with reason: host reimage
[09:04:19] (CR) Alexandros Kosiaris: DNM: utils: Add a role_team_stats.py script (7 comments) [puppet] - https://gerrit.wikimedia.org/r/854992 (owner: Alexandros Kosiaris)
[09:04:23] (PS2) Alexandros Kosiaris: DNM: utils: Add a role_team_stats.py script [puppet] - https://gerrit.wikimedia.org/r/854992
[09:04:51] (CR) CI reject: [V: -1] Update openstack nova spec test [puppet] - https://gerrit.wikimedia.org/r/855484 (owner: Elukey)
[09:05:08] (CR) CI reject: [V: -1] DNM: utils: Add a role_team_stats.py script [puppet] - https://gerrit.wikimedia.org/r/854992 (owner: Alexandros Kosiaris)
[09:05:13] (PS1) Marostegui: wmnet: Update es5 CNAME [dns] - https://gerrit.wikimedia.org/r/855491 (https://phabricator.wikimedia.org/T322187)
[09:05:28] (CR) CI reject: [V: -1] Add a basic puppetization for Benthos [puppet] - https://gerrit.wikimedia.org/r/854487 (https://phabricator.wikimedia.org/T314981) (owner: Elukey)
[09:06:01] (CR) Giuseppe Lavagetto: "We probably need to add network policies for the redis lock servers." [deployment-charts] - https://gerrit.wikimedia.org/r/853975 (https://phabricator.wikimedia.org/T321900) (owner: Clément Goubert)
[09:06:17] (CR) CI reject: [V: -1] centrallog: add first prototype of webrequest-live with Benthos [puppet] - https://gerrit.wikimedia.org/r/854499 (https://phabricator.wikimedia.org/T314981) (owner: Elukey)
[09:07:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P38873 and previous config saved to /var/cache/conftool/dbconfig/20221110-090708-marostegui.json
[09:12:37] SRE, Maps, Product-Infrastructure-Team-Backlog, Sustainability (Incident Followup): Review sizing of maps cluster - https://phabricator.wikimedia.org/T228497 (fgiunchedi) Open→Declined I'm going to be bold and decline the task -- while it is seems something valid in general I don't think...
[09:12:39] still trying to fix the nova test, gimme 5 mins fols :)
[09:12:45] *folks
[09:14:53] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for David.pujol - https://phabricator.wikimedia.org/T322670 (fgiunchedi) a:Jcross Thank you @Dzahn and @Htriedman, I'm assigning to @Jcross for final approval
[09:15:09] (PS3) Elukey: Update openstack nova spec test [puppet] - https://gerrit.wikimedia.org/r/855484
[09:15:11] (PS18) Elukey: Add a basic puppetization for Benthos [puppet] - https://gerrit.wikimedia.org/r/854487 (https://phabricator.wikimedia.org/T314981)
[09:15:13] (PS20) Elukey: centrallog: add first prototype of webrequest-live with Benthos [puppet] - https://gerrit.wikimedia.org/r/854499 (https://phabricator.wikimedia.org/T314981)
[09:15:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P38874 and previous config saved to /var/cache/conftool/dbconfig/20221110-091530-marostegui.json
[09:18:01] (CR) Arturo Borrero Gonzalez: [C: +1] Update openstack nova spec test [puppet] - https://gerrit.wikimedia.org/r/855484 (owner: Elukey)
[09:19:29] (CR) David Caro: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/855484 (owner: Elukey)
[09:19:45] (CR) Muehlenhoff: site: move contint2002 from insetup::unowned to insetup::serviceops (1 comment) [puppet] - https://gerrit.wikimedia.org/r/855147 (https://phabricator.wikimedia.org/T294276) (owner: Dzahn)
[09:20:49] (CR) Elukey: [C: +2] Update openstack nova spec test [puppet] - https://gerrit.wikimedia.org/r/855484 (owner: Elukey)
[09:20:53] (PS5) Clément Goubert: mediawiki: Create new mw-web deployment [deployment-charts] - https://gerrit.wikimedia.org/r/853975 (https://phabricator.wikimedia.org/T321900)
[09:21:10] ok CI unblocked :)
[09:21:50] (PS5) Clément Goubert: mediawiki: Create new mw-api-ext deployment [deployment-charts] - https://gerrit.wikimedia.org/r/853952 (https://phabricator.wikimedia.org/T321896)
[09:21:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1013.eqiad.wmnet with OS bullseye
[09:21:59] SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1013.eqiad.wmnet with OS bullseye completed: - ganeti1013 (**PASS**) - Downtimed on...
[09:22:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T321123)', diff saved to https://phabricator.wikimedia.org/P38875 and previous config saved to /var/cache/conftool/dbconfig/20221110-092215-marostegui.json
[09:22:19] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123
[09:22:27] (PS5) Clément Goubert: mediawiki: Create new mw-api-int deployment [deployment-charts] - https://gerrit.wikimedia.org/r/853933 (https://phabricator.wikimedia.org/T321895)
[09:22:49] (PS1) Filippo Giunchedi: admin: add ryasmeen to analytics-privatedata [puppet] - https://gerrit.wikimedia.org/r/855492 (https://phabricator.wikimedia.org/T322795)
[09:22:59] jouncebot: next
[09:22:59] In 1 hour(s) and 37 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221110T1100)
[09:23:11] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Primary switchover es5 T322187
[09:23:14] T322187: Switchover es5 master (es1023 -> es1024) - https://phabricator.wikimedia.org/T322187
[09:23:17] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for ryasmeen (superset access with no server access) - https://phabricator.wikimedia.org/T322795 (fgiunchedi)
[09:23:22] (PS5) Clément Goubert: mediawiki: Create new mw-jobrunner deployment [deployment-charts] - https://gerrit.wikimedia.org/r/853958 (https://phabricator.wikimedia.org/T321897)
[09:23:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover es5 T322187
[09:23:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set es1024 with weight 0 T322187', diff saved to https://phabricator.wikimedia.org/P38876 and previous config saved to /var/cache/conftool/dbconfig/20221110-092336-root.json
[09:23:45] (CR) CI reject: [V: -1] admin: add ryasmeen to analytics-privatedata [puppet] - https://gerrit.wikimedia.org/r/855492 (https://phabricator.wikimedia.org/T322795) (owner: Filippo Giunchedi)
[09:24:03] (CR) Marostegui: [C: +2] db-production.php: Disable es5 writes [mediawiki-config] - https://gerrit.wikimedia.org/r/855489 (https://phabricator.wikimedia.org/T322187) (owner: Marostegui)
[09:25:06] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for ryasmeen (superset access with no server access) - https://phabricator.wikimedia.org/T322795 (fgiunchedi) Request looks good to me, @Ottomata @odimitrijevic I'm seeking approval for the above! Thank you
[09:25:11] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for ryasmeen (superset access with no server access) - https://phabricator.wikimedia.org/T322795 (fgiunchedi) p:Triage→Medium
[09:25:16] (CR) Marostegui: [C: +2] mariadb: Promote es1024 to es5 master [puppet] - https://gerrit.wikimedia.org/r/855488 (https://phabricator.wikimedia.org/T322187) (owner: Marostegui)
[09:25:29] (PS2) Filippo Giunchedi: admin: add ryasmeen to analytics-privatedata [puppet] - https://gerrit.wikimedia.org/r/855492 (https://phabricator.wikimedia.org/T322795)
[09:26:07] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/855102 (https://phabricator.wikimedia.org/T283838) (owner: Eevans)
[09:26:15] (CR) Filippo Giunchedi: [C: +1] netmon: Put the netmon2002 as passive server [puppet] - https://gerrit.wikimedia.org/r/854625 (https://phabricator.wikimedia.org/T315523) (owner: Andrea Denisse)
[09:26:56] (Merged) jenkins-bot: db-production.php: Disable es5 writes [mediawiki-config] - https://gerrit.wikimedia.org/r/855489 (https://phabricator.wikimedia.org/T322187) (owner: Marostegui)
[09:27:00] (CR) CI reject: [V: -1] admin: add ryasmeen to analytics-privatedata [puppet] - https://gerrit.wikimedia.org/r/855492 (https://phabricator.wikimedia.org/T322795) (owner: Filippo Giunchedi)
[09:27:02] (CR) TrainBranchBot: [C: +2] "Approved by marostegui@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/855489 (https://phabricator.wikimedia.org/T322187) (owner: Marostegui)
[09:27:18] !log marostegui@deploy1002 Started scap: Backport for [[gerrit:855489|db-production.php: Disable es5 writes (T322187)]]
[09:27:25] (PS1) Vgutierrez: prometheus: Rename ats_ metrics to trafficserver_ [puppet] - https://gerrit.wikimedia.org/r/855494 (https://phabricator.wikimedia.org/T292815)
[09:27:40] !log marostegui@deploy1002 marostegui and marostegui: Backport for [[gerrit:855489|db-production.php: Disable es5 writes (T322187)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[09:27:55] (PS6) Clément Goubert: mediawiki: Create new mw-jobrunner deployment [deployment-charts] - https://gerrit.wikimedia.org/r/853958 (https://phabricator.wikimedia.org/T321897)
[09:28:03] (CR) Muehlenhoff: [C: +2] profile::java: Add support for bookworm [puppet] - https://gerrit.wikimedia.org/r/854555 (https://phabricator.wikimedia.org/T321783) (owner: Muehlenhoff)
[09:28:16] (CR) Filippo Giunchedi: dispatch: sync user role and info from LDAP (1 comment) [puppet] - https://gerrit.wikimedia.org/r/852992 (https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi)
[09:28:41] (PS1) Marostegui: Revert "db-production.php: Disable es5 writes" [mediawiki-config] - https://gerrit.wikimedia.org/r/855514
[09:28:46] (CR) Marostegui: [C: -2] "Not yet" [mediawiki-config] - https://gerrit.wikimedia.org/r/855514 (owner: Marostegui)
[09:28:49] (PS6) Clément Goubert: mediawiki: Create new mw-api-int deployment [deployment-charts] - https://gerrit.wikimedia.org/r/853933 (https://phabricator.wikimedia.org/T321895)
[09:29:36] (PS6) Clément Goubert: mediawiki: Create new mw-api-ext deployment [deployment-charts] - https://gerrit.wikimedia.org/r/853952 (https://phabricator.wikimedia.org/T321896)
[09:29:45] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38073/console" [puppet] - https://gerrit.wikimedia.org/r/855494 (https://phabricator.wikimedia.org/T292815) (owner: Vgutierrez)
[09:30:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[09:30:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff
saved to https://phabricator.wikimedia.org/P38877 and previous config saved to /var/cache/conftool/dbconfig/20221110-093036-marostegui.json [09:31:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [09:31:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [09:31:49] (03PS6) 10Clément Goubert: mediawiki: Create new mw-web deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/853975 (https://phabricator.wikimedia.org/T321900) [09:31:53] (03PS3) 10Filippo Giunchedi: admin: add ryasmeen to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/855492 (https://phabricator.wikimedia.org/T322795) [09:31:58] !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:855489|db-production.php: Disable es5 writes (T322187)]] (duration: 04m 39s) [09:32:02] T322187: Switchover es5 master (es1023 -> es1024) - https://phabricator.wikimedia.org/T322187 [09:32:19] !log Starting es5 eqiad failover from es1023 to es1024 T322187 [09:32:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [09:32:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es1024 to es5 primary T322187', diff saved to https://phabricator.wikimedia.org/P38878 and previous config saved to /var/cache/conftool/dbconfig/20221110-093243-root.json [09:33:34] (03PS1) 10Clément Goubert: mw-debug: add redis_lock egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/855495 [09:33:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1023 T322187', diff saved to https://phabricator.wikimedia.org/P38879 and previous config saved to /var/cache/conftool/dbconfig/20221110-093354-root.json [09:34:06] (03CR) 10Marostegui: Revert "db-production.php: Disable es5 writes" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/855514 (owner: 10Marostegui) 
[09:34:08] (03CR) 10Marostegui: [C: 03+2] Revert "db-production.php: Disable es5 writes" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/855514 (owner: 10Marostegui) [09:34:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart (exit_code=0) rolling restart_daemons on A:maps-replica-codfw [09:34:48] (03CR) 10Marostegui: [C: 03+2] wmnet: Update es5 CNAME [dns] - 10https://gerrit.wikimedia.org/r/855491 (https://phabricator.wikimedia.org/T322187) (owner: 10Marostegui) [09:35:15] (03Merged) 10jenkins-bot: Revert "db-production.php: Disable es5 writes" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/855514 (owner: 10Marostegui) [09:35:20] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by marostegui@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/855514 (owner: 10Marostegui) [09:35:32] !log marostegui@deploy1002 Started scap: Backport for [[gerrit:855514|Revert "db-production.php: Disable es5 writes"]] [09:35:51] !log marostegui@deploy1002 marostegui and marostegui: Backport for [[gerrit:855514|Revert "db-production.php: Disable es5 writes"]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [09:36:07] (03PS1) 10Marostegui: es1023: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/855496 [09:36:09] (03CR) 10Giuseppe Lavagetto: [C: 03+1] mw-debug: add redis_lock egress rules (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/855495 (owner: 10Clément Goubert) [09:36:32] !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart rolling restart_daemons on A:maps-replica-eqiad [09:37:35] (03PS2) 10Clément Goubert: mw-debug: add redis_lock egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/855495 [09:38:39] (03CR) 10Vgutierrez: "Hal please see my inline comment regarding subkey rotation" [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 
10Isaac Johnson) [09:38:53] (03CR) 10Clément Goubert: mw-debug: add redis_lock egress rules (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/855495 (owner: 10Clément Goubert) [09:38:59] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1099.eqiad.wmnet with reason: Maintenance [09:39:23] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1099.eqiad.wmnet with reason: Maintenance [09:39:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 (T321123)', diff saved to https://phabricator.wikimedia.org/P38880 and previous config saved to /var/cache/conftool/dbconfig/20221110-093929-marostegui.json [09:39:33] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [09:39:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart (exit_code=0) rolling restart_daemons on A:maps-replica-eqiad [09:39:52] !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:855514|Revert "db-production.php: Disable es5 writes"]] (duration: 04m 20s) [09:40:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T321123)', diff saved to https://phabricator.wikimedia.org/P38881 and previous config saved to /var/cache/conftool/dbconfig/20221110-094037-marostegui.json [09:41:08] (03Abandoned) 10Clément Goubert: P:kubernetes::deployment_server: mw release conf [puppet] - 10https://gerrit.wikimedia.org/r/854982 (https://phabricator.wikimedia.org/T321786) (owner: 10Clément Goubert) [09:41:32] (03CR) 10Clément Goubert: mediawiki: Create new mw-web deployment (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/853975 (https://phabricator.wikimedia.org/T321900) (owner: 10Clément Goubert) [09:41:41] (03CR) 10Giuseppe Lavagetto: [C: 03+1] mediawiki: Create new mw-web deployment (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/853975 
(https://phabricator.wikimedia.org/T321900) (owner: 10Clément Goubert) [09:41:54] (03CR) 10Clément Goubert: [C: 03+2] mediawiki: Create new mw-web deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/853975 (https://phabricator.wikimedia.org/T321900) (owner: 10Clément Goubert) [09:42:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [09:42:39] (03PS3) 10Alexandros Kosiaris: utils: Add a role_team_stats.py script [puppet] - 10https://gerrit.wikimedia.org/r/854992 [09:42:47] 10SRE, 10Analytics, 10Data-Engineering, 10Event-Platform Value Stream: ~1 request/minute to intake-logging.wikimedia.org times out at the traffic/service interface - https://phabricator.wikimedia.org/T264021 (10Vgutierrez) Are we sure that this is a service side issue? this sounds a lot like a FetchError t... [09:43:14] (03CR) 10CI reject: [V: 04-1] utils: Add a role_team_stats.py script [puppet] - 10https://gerrit.wikimedia.org/r/854992 (owner: 10Alexandros Kosiaris) [09:43:23] (03CR) 10Alexandros Kosiaris: utils: Add a role_team_stats.py script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/854992 (owner: 10Alexandros Kosiaris) [09:43:37] (03CR) 10Ayounsi: "One comment then good to merge!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/854110 (https://phabricator.wikimedia.org/T321120) (owner: 10JHathaway) [09:43:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [09:43:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [09:44:07] (03PS4) 10Alexandros Kosiaris: utils: Add a role_team_stats.py script [puppet] - 10https://gerrit.wikimedia.org/r/854992 [09:44:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [09:45:05] (03CR) 10Marostegui: [C: 03+2] es1023: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/855496 (owner: 10Marostegui) [09:45:42] (03CR) 10Alexandros Kosiaris: utils: Add a role_team_stats.py script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/854992 (owner: 10Alexandros Kosiaris) [09:45:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T321130)', diff saved to https://phabricator.wikimedia.org/P38882 and previous config saved to /var/cache/conftool/dbconfig/20221110-094542-marostegui.json [09:45:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1110.eqiad.wmnet with reason: Maintenance [09:45:47] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [09:45:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1110.eqiad.wmnet with reason: Maintenance [09:46:01] (03Merged) 10jenkins-bot: mediawiki: Create new mw-web deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/853975 (https://phabricator.wikimedia.org/T321900) (owner: 10Clément Goubert) [09:46:03] 10SRE, 10Observability-Metrics: Grafana: CVE-2022-39307 CVE-2022-39306 - https://phabricator.wikimedia.org/T322829 (10MoritzMuehlenhoff) [09:46:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T321130)', diff saved to 
https://phabricator.wikimedia.org/P38883 and previous config saved to /var/cache/conftool/dbconfig/20221110-094604-marostegui.json [09:49:25] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [09:49:48] (03CR) 10Alexandros Kosiaris: [C: 04-1] "The unowned section takes into account Datacenter specific hiera files. Those should probably be skipped in most of the cases as the commo" [puppet] - 10https://gerrit.wikimedia.org/r/854992 (owner: 10Alexandros Kosiaris) [09:49:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T321130)', diff saved to https://phabricator.wikimedia.org/P38884 and previous config saved to /var/cache/conftool/dbconfig/20221110-094952-marostegui.json [09:50:38] (03CR) 10Elukey: centrallog: add first prototype of webrequest-live with Benthos (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/854499 (https://phabricator.wikimedia.org/T314981) (owner: 10Elukey) [09:51:05] 10SRE, 10Observability-Metrics: Grafana: CVE-2022-39307 CVE-2022-39306 - https://phabricator.wikimedia.org/T322829 (10MoritzMuehlenhoff) There is also CVE-2022-39328 (https://github.com/grafana/grafana/security/advisories/GHSA-vqc4-mpj8-jxch), but it's specific to 9.x which we don't use yet. 
[09:52:11] (03PS1) 10Marostegui: Revert "es1023: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/855515 [09:52:11] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [09:53:00] (03CR) 10Marostegui: [C: 03+2] Revert "es1023: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/855515 (owner: 10Marostegui) [09:53:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1023 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P38885 and previous config saved to /var/cache/conftool/dbconfig/20221110-095313-root.json [09:53:27] (03CR) 10Muehlenhoff: utils: Add a role_team_stats.py script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/854992 (owner: 10Alexandros Kosiaris) [09:55:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P38886 and previous config saved to /var/cache/conftool/dbconfig/20221110-095543-marostegui.json [09:56:27] (03CR) 10Ayounsi: Add Peering News to Puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/849114 (owner: 10Ayounsi) [09:57:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Reduce es4 master weight', diff saved to https://phabricator.wikimedia.org/P38887 and previous config saved to /var/cache/conftool/dbconfig/20221110-095724-marostegui.json [09:58:12] (03PS1) 10Arturo Borrero Gonzalez: toolforge: k8s: component: account for .git in the git repository URL [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/855497 [09:58:14] (03PS1) 10Arturo Borrero Gonzalez: toolforge: k8s: component: build: typo 'guessed' [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/855498 [10:00:19] (03PS1) 10Clément Goubert: mediawiki: Include release name in configmaps [deployment-charts] - 10https://gerrit.wikimedia.org/r/855499 (https://phabricator.wikimedia.org/T321786) [10:01:22] (03CR) 10CI reject: [V: 04-1] toolforge: k8s: component: account 
for .git in the git repository URL [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/855497 (owner: 10Arturo Borrero Gonzalez) [10:01:42] (03CR) 10CI reject: [V: 04-1] toolforge: k8s: component: build: typo 'guessed' [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/855498 (owner: 10Arturo Borrero Gonzalez) [10:01:46] (03PS2) 10Clément Goubert: mediawiki: Include release name in configmaps [deployment-charts] - 10https://gerrit.wikimedia.org/r/855499 (https://phabricator.wikimedia.org/T321786) [10:03:53] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "LGTM but you need to bump the chart version too" [deployment-charts] - 10https://gerrit.wikimedia.org/r/855499 (https://phabricator.wikimedia.org/T321786) (owner: 10Clément Goubert) [10:04:25] (03PS3) 10Clément Goubert: mediawiki: Include release name in configmaps [deployment-charts] - 10https://gerrit.wikimedia.org/r/855499 (https://phabricator.wikimedia.org/T321786) [10:04:40] (03PS2) 10Arturo Borrero Gonzalez: toolforge: k8s: component: account for .git in the git repository URL [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/855497 [10:04:42] (03PS2) 10Arturo Borrero Gonzalez: toolforge: k8s: component: build: typo 'guessed' [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/855498 [10:04:58] (03CR) 10Giuseppe Lavagetto: [C: 03+1] mediawiki: Include release name in configmaps [deployment-charts] - 10https://gerrit.wikimedia.org/r/855499 (https://phabricator.wikimedia.org/T321786) (owner: 10Clément Goubert) [10:04:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P38888 and previous config saved to /var/cache/conftool/dbconfig/20221110-100459-marostegui.json [10:07:20] (03PS2) 10JMeybohm: calico: Allow different versions, drop pre bullseye support [puppet] - 10https://gerrit.wikimedia.org/r/855012 (https://phabricator.wikimedia.org/T307943) [10:08:18] (03CR) 10CI reject: [V: 04-1] toolforge: k8s: component: account 
for .git in the git repository URL [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/855497 (owner: 10Arturo Borrero Gonzalez) [10:08:20] (03CR) 10CI reject: [V: 04-1] toolforge: k8s: component: build: typo 'guessed' [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/855498 (owner: 10Arturo Borrero Gonzalez) [10:08:22] (03PS1) 10Filippo Giunchedi: clinic-duty: add Telxius [software] - 10https://gerrit.wikimedia.org/r/855501 [10:08:24] (03PS1) 10Filippo Giunchedi: clinic-duty: update Telia/Arelion [software] - 10https://gerrit.wikimedia.org/r/855502 [10:09:19] (03CR) 10Clément Goubert: [C: 03+2] mediawiki: Include release name in configmaps [deployment-charts] - 10https://gerrit.wikimedia.org/r/855499 (https://phabricator.wikimedia.org/T321786) (owner: 10Clément Goubert) [10:10:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P38889 and previous config saved to /var/cache/conftool/dbconfig/20221110-101050-marostegui.json [10:13:24] (03Merged) 10jenkins-bot: mediawiki: Include release name in configmaps [deployment-charts] - 10https://gerrit.wikimedia.org/r/855499 (https://phabricator.wikimedia.org/T321786) (owner: 10Clément Goubert) [10:16:23] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1013.eqiad.wmnet [10:17:14] (03PS5) 10Alexandros Kosiaris: utils: Add a role_team_stats.py script [puppet] - 10https://gerrit.wikimedia.org/r/854992 [10:19:19] (03CR) 10Alexandros Kosiaris: utils: Add a role_team_stats.py script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/854992 (owner: 10Alexandros Kosiaris) [10:20:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P38890 and previous config saved to /var/cache/conftool/dbconfig/20221110-102005-marostegui.json [10:22:55] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: 
apply [10:23:12] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [10:23:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1013.eqiad.wmnet [10:23:28] !log installing libxml2 security updates [10:23:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:09] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1013.eqiad.wmnet to cluster eqiad and group B [10:25:28] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1013.eqiad.wmnet to cluster eqiad and group B [10:25:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T321123)', diff saved to https://phabricator.wikimedia.org/P38891 and previous config saved to /var/cache/conftool/dbconfig/20221110-102556-marostegui.json [10:25:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1105.eqiad.wmnet with reason: Maintenance [10:26:00] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [10:26:11] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1105.eqiad.wmnet with reason: Maintenance [10:26:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T321123)', diff saved to https://phabricator.wikimedia.org/P38892 and previous config saved to /var/cache/conftool/dbconfig/20221110-102617-marostegui.json [10:27:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T321123)', diff saved to https://phabricator.wikimedia.org/P38893 and previous config saved to /var/cache/conftool/dbconfig/20221110-102725-marostegui.json [10:28:32] (03PS1) 10Clément Goubert: mediawiki: Fix volumemounts names [deployment-charts] - 10https://gerrit.wikimedia.org/r/855504 (https://phabricator.wikimedia.org/T321786) [10:29:25] 
(03CR) 10Elukey: [C: 03+1] calico: Allow different versions, drop pre bullseye support [puppet] - 10https://gerrit.wikimedia.org/r/855012 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [10:35:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T321130)', diff saved to https://phabricator.wikimedia.org/P38894 and previous config saved to /var/cache/conftool/dbconfig/20221110-103512-marostegui.json [10:35:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1113.eqiad.wmnet with reason: Maintenance [10:35:16] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [10:35:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1113.eqiad.wmnet with reason: Maintenance [10:35:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T321130)', diff saved to https://phabricator.wikimedia.org/P38895 and previous config saved to /var/cache/conftool/dbconfig/20221110-103533-marostegui.json [10:36:39] (HelmReleaseBadStatus) firing: Helm release mw-debug/pinkunicorn on k8s@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-debug - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:38:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T321130)', diff saved to https://phabricator.wikimedia.org/P38896 and previous config saved to /var/cache/conftool/dbconfig/20221110-103822-marostegui.json [10:38:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1023 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P38897 and previous config saved to /var/cache/conftool/dbconfig/20221110-103827-root.json [10:38:52] (03CR) 10Clément Goubert: [C: 03+2] 
mediawiki: Fix volumemounts names [deployment-charts] - 10https://gerrit.wikimedia.org/r/855504 (https://phabricator.wikimedia.org/T321786) (owner: 10Clément Goubert) [10:41:57] (03PS6) 10Alexandros Kosiaris: utils: Add a role_team_stats.py script [puppet] - 10https://gerrit.wikimedia.org/r/854992 [10:42:25] (03CR) 10Filippo Giunchedi: [C: 03+2] clinic-duty: update Telia/Arelion [software] - 10https://gerrit.wikimedia.org/r/855502 (owner: 10Filippo Giunchedi) [10:42:27] (03CR) 10Filippo Giunchedi: [C: 03+2] clinic-duty: add Telxius [software] - 10https://gerrit.wikimedia.org/r/855501 (owner: 10Filippo Giunchedi) [10:42:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P38898 and previous config saved to /var/cache/conftool/dbconfig/20221110-104231-marostegui.json [10:42:44] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] clinic-duty: add Telxius [software] - 10https://gerrit.wikimedia.org/r/855501 (owner: 10Filippo Giunchedi) [10:42:52] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] clinic-duty: update Telia/Arelion [software] - 10https://gerrit.wikimedia.org/r/855502 (owner: 10Filippo Giunchedi) [10:43:30] (03Merged) 10jenkins-bot: mediawiki: Fix volumemounts names [deployment-charts] - 10https://gerrit.wikimedia.org/r/855504 (https://phabricator.wikimedia.org/T321786) (owner: 10Clément Goubert) [10:43:58] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10MoritzMuehlenhoff) [10:47:17] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [10:48:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [10:49:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [10:49:19] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Depooling db1157 (T322618)', diff saved to https://phabricator.wikimedia.org/P38899 and previous config saved to /var/cache/conftool/dbconfig/20221110-104919-ladsgroup.json [10:49:23] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [10:50:08] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [10:51:39] (HelmReleaseBadStatus) resolved: Helm release mw-debug/pinkunicorn on k8s@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-debug - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:53:17] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [10:53:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P38900 and previous config saved to /var/cache/conftool/dbconfig/20221110-105329-marostegui.json [10:53:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1023 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P38901 and previous config saved to /var/cache/conftool/dbconfig/20221110-105338-root.json [10:54:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T322618)', diff saved to https://phabricator.wikimedia.org/P38902 and previous config saved to /var/cache/conftool/dbconfig/20221110-105428-ladsgroup.json [10:54:33] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [10:55:41] 10SRE, 10ops-codfw: Troubleshoot why latest idrac version is not working on Dell servers - https://phabricator.wikimedia.org/T322419 (10jbond) @Papaul i think we have fixed this issue by ensuring the redfish spicerack module uses the management ip address 
and not the hostname. Can you confirm? [10:56:07] (03PS1) 10Arturo Borrero Gonzalez: cookbooks: py3-mypy fixes [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/855531 [10:56:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [10:56:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [10:56:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P38903 and previous config saved to /var/cache/conftool/dbconfig/20221110-105628-ladsgroup.json [10:57:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P38904 and previous config saved to /var/cache/conftool/dbconfig/20221110-105738-marostegui.json [10:58:00] (03PS2) 10Arturo Borrero Gonzalez: cookbooks: py3-mypy fixes [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/855531 [10:58:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P38905 and previous config saved to /var/cache/conftool/dbconfig/20221110-105841-ladsgroup.json [11:00:04] mvolz: #bothumor I � Unicode. All rise for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221110T1100). [11:00:29] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [11:06:17] (03CR) 10FNegri: [C: 03+1] "LGTM, thanks for fixing this!" 
[cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/855531 (owner: 10Arturo Borrero Gonzalez) [11:06:30] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cookbooks: py3-mypy fixes [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/855531 (owner: 10Arturo Borrero Gonzalez) [11:06:59] (03PS3) 10Arturo Borrero Gonzalez: toolforge: k8s: component: account for .git in the git repository URL [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/855497 [11:08:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P38906 and previous config saved to /var/cache/conftool/dbconfig/20221110-110835-marostegui.json [11:08:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1023 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P38907 and previous config saved to /var/cache/conftool/dbconfig/20221110-110843-root.json [11:09:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P38908 and previous config saved to /var/cache/conftool/dbconfig/20221110-110935-ladsgroup.json [11:12:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T321123)', diff saved to https://phabricator.wikimedia.org/P38909 and previous config saved to /var/cache/conftool/dbconfig/20221110-111244-marostegui.json [11:12:47] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1106.eqiad.wmnet with reason: Maintenance [11:12:49] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [11:13:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1106.eqiad.wmnet with reason: Maintenance [11:13:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with 
reason: Maintenance [11:13:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [11:13:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T321123)', diff saved to https://phabricator.wikimedia.org/P38910 and previous config saved to /var/cache/conftool/dbconfig/20221110-111323-marostegui.json [11:13:24] !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx rolling restart_daemons on A:wcqs-public [11:13:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P38911 and previous config saved to /var/cache/conftool/dbconfig/20221110-111347-ladsgroup.json [11:17:44] (03PS1) 10Giuseppe Lavagetto: mw-web: add deployment: production [deployment-charts] - 10https://gerrit.wikimedia.org/r/855534 [11:17:46] (03PS1) 10Giuseppe Lavagetto: mediawiki: DRY attempt [deployment-charts] - 10https://gerrit.wikimedia.org/r/855535 [11:20:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 10%: Index rebuilt', diff saved to https://phabricator.wikimedia.org/P38912 and previous config saved to /var/cache/conftool/dbconfig/20221110-112022-ladsgroup.json [11:23:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T321130)', diff saved to https://phabricator.wikimedia.org/P38913 and previous config saved to /var/cache/conftool/dbconfig/20221110-112342-marostegui.json [11:23:44] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1144.eqiad.wmnet with reason: Maintenance [11:23:47] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [11:23:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1023 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P38914 and previous config saved 
to /var/cache/conftool/dbconfig/20221110-112348-root.json [11:23:57] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1144.eqiad.wmnet with reason: Maintenance [11:24:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T321130)', diff saved to https://phabricator.wikimedia.org/P38915 and previous config saved to /var/cache/conftool/dbconfig/20221110-112403-marostegui.json [11:24:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P38916 and previous config saved to /var/cache/conftool/dbconfig/20221110-112441-ladsgroup.json [11:26:05] (03PS3) 10Arturo Borrero Gonzalez: toolforge: k8s: component: build: typo 'guessed' [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/855498 [11:27:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T321130)', diff saved to https://phabricator.wikimedia.org/P38917 and previous config saved to /var/cache/conftool/dbconfig/20221110-112753-marostegui.json [11:28:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P38918 and previous config saved to /var/cache/conftool/dbconfig/20221110-112854-ladsgroup.json [11:30:17] (03CR) 10Ssingh: [V: 03+1 C: 03+2] sslcert: refactor update-ocsp.py to Python 3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/854608 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [11:31:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx (exit_code=0) rolling restart_daemons on A:wcqs-public [11:31:58] !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx rolling restart_daemons on A:wdqs-all [11:35:18] PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS13030/IPv4: Connect - Init7, AS13030/IPv6: Connect - Init7 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:35:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 25%: Index rebuilt', diff saved to https://phabricator.wikimedia.org/P38919 and previous config saved to /var/cache/conftool/dbconfig/20221110-113526-ladsgroup.json [11:36:09] (03PS1) 10Muehlenhoff: New aliases covering the various etcd installs [puppet] - 10https://gerrit.wikimedia.org/r/855537 [11:38:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1023 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P38920 and previous config saved to /var/cache/conftool/dbconfig/20221110-113853-root.json [11:39:06] (03CR) 10David Caro: [C: 03+1] toolforge: k8s: component: build: typo 'guessed' [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/855498 (owner: 10Arturo Borrero Gonzalez) [11:39:38] PROBLEM - BGP status on cr1-drmrs is CRITICAL: BGP CRITICAL - AS13030/IPv4: Connect - Init7, AS13030/IPv6: Active - Init7 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:39:41] (03PS3) 10Vgutierrez: varnish: Add sessioncookie bit to X-Analytics [puppet] - 10https://gerrit.wikimedia.org/r/839512 (https://phabricator.wikimedia.org/T319324) [11:39:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T322618)', diff saved to https://phabricator.wikimedia.org/P38921 and previous config saved to /var/cache/conftool/dbconfig/20221110-113948-ladsgroup.json [11:39:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [11:39:52] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [11:39:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [11:39:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T322618)', 
diff saved to https://phabricator.wikimedia.org/P38922 and previous config saved to /var/cache/conftool/dbconfig/20221110-113958-ladsgroup.json [11:41:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx (exit_code=0) rolling restart_daemons on A:wdqs-all [11:41:36] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:41:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T318605)', diff saved to https://phabricator.wikimedia.org/P38923 and previous config saved to /var/cache/conftool/dbconfig/20221110-114149-ladsgroup.json [11:41:53] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [11:41:56] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:42:57] (03CR) 10Jbond: [C: 04-1] "this removes all of the customisations we have added e.g. everything under src/resources. also the all-{cas-}properties files are super u" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/854998 (owner: 10Muehlenhoff) [11:43:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P38924 and previous config saved to /var/cache/conftool/dbconfig/20221110-114300-marostegui.json [11:43:56] (03CR) 10FNegri: [C: 03+1] "Nice one!" 
[cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/855497 (owner: 10Arturo Borrero Gonzalez) [11:44:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P38925 and previous config saved to /var/cache/conftool/dbconfig/20221110-114400-ladsgroup.json [11:44:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [11:44:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [11:44:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P38926 and previous config saved to /var/cache/conftool/dbconfig/20221110-114422-ladsgroup.json [11:44:34] (03CR) 10Vgutierrez: [C: 03+1] varnish::common: set Python version for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/854607 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [11:44:40] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mw-web: add deployment: production [deployment-charts] - 10https://gerrit.wikimedia.org/r/855534 (owner: 10Giuseppe Lavagetto) [11:44:46] (03CR) 10Clément Goubert: [V: 03+1] mw-web: add deployment: production [deployment-charts] - 10https://gerrit.wikimedia.org/r/855534 (owner: 10Giuseppe Lavagetto) [11:44:49] moritzm: Is it possible to create a nodejs18 image for the next task of https://phabricator.wikimedia.org/T308371 ? [11:45:00] (03CR) 10Clément Goubert: [C: 03+1] mediawiki: DRY attempt [deployment-charts] - 10https://gerrit.wikimedia.org/r/855535 (owner: 10Giuseppe Lavagetto) [11:45:05] moritzm: Let me know if task creation is needed, will do it. 
[11:45:19] (03CR) 10Ssingh: [V: 03+1 C: 03+2] varnish::common: set Python version for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/854607 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [11:46:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P38927 and previous config saved to /var/cache/conftool/dbconfig/20221110-114634-ladsgroup.json [11:46:40] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [11:46:43] (03CR) 10Vgutierrez: [C: 03+2] varnish: Add sessioncookie bit to X-Analytics [puppet] - 10https://gerrit.wikimedia.org/r/839512 (https://phabricator.wikimedia.org/T319324) (owner: 10Vgutierrez) [11:49:07] (03Merged) 10jenkins-bot: mw-web: add deployment: production [deployment-charts] - 10https://gerrit.wikimedia.org/r/855534 (owner: 10Giuseppe Lavagetto) [11:50:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T322618)', diff saved to https://phabricator.wikimedia.org/P38928 and previous config saved to /var/cache/conftool/dbconfig/20221110-115007-ladsgroup.json [11:50:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 75%: Index rebuilt', diff saved to https://phabricator.wikimedia.org/P38929 and previous config saved to /var/cache/conftool/dbconfig/20221110-115032-ladsgroup.json [11:51:10] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: k8s: component: account for .git in the git repository URL [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/855497 (owner: 10Arturo Borrero Gonzalez) [11:51:43] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: k8s: component: build: typo 'guessed' [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/855498 (owner: 10Arturo Borrero Gonzalez) [11:52:24] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [11:54:38] (03PS2) 
10Clément Goubert: mw-web: test templating servergroup [deployment-charts] - 10https://gerrit.wikimedia.org/r/855539 [11:55:10] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [11:56:20] (03PS2) 10Volans: json-webrequests-stats: add -t/--time-range [puppet] - 10https://gerrit.wikimedia.org/r/854521 [11:56:27] (03CR) 10Volans: "Addressed/replied comments, asking some follow up questions inline" [puppet] - 10https://gerrit.wikimedia.org/r/854521 (owner: 10Volans) [11:56:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P38930 and previous config saved to /var/cache/conftool/dbconfig/20221110-115655-ladsgroup.json [11:56:58] PROBLEM - Check systemd state on phab1001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_aphlict.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:57:46] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [11:57:47] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [11:57:51] 10SRE, 10Analytics-Radar, 10Traffic, 10Patch-For-Review: Consider adding X-Analytics subfield for 'has a session cookie' - https://phabricator.wikimedia.org/T319324 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez CR merged and https://wikitech.wikimedia.org/wiki/X-Analytics#Keys updated, thanks for c... 
[11:58:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P38931 and previous config saved to /var/cache/conftool/dbconfig/20221110-115807-marostegui.json [11:58:24] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [11:58:26] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [11:59:11] (03PS3) 10Clément Goubert: mw-web: test templating servergroup [deployment-charts] - 10https://gerrit.wikimedia.org/r/855539 [11:59:54] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [11:59:56] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [12:01:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P38932 and previous config saved to /var/cache/conftool/dbconfig/20221110-120140-ladsgroup.json [12:01:53] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [12:01:55] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [12:02:08] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [12:02:09] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [12:02:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110 (T318605)', diff saved to https://phabricator.wikimedia.org/P38933 and previous config saved to /var/cache/conftool/dbconfig/20221110-120224-ladsgroup.json [12:02:28] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [12:03:34] (03PS1) 10Muehlenhoff: phabricator::aphlict: Pass the ensure to the auto restart [puppet] - 10https://gerrit.wikimedia.org/r/855542 [12:05:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to 
https://phabricator.wikimedia.org/P38934 and previous config saved to /var/cache/conftool/dbconfig/20221110-120513-ladsgroup.json [12:05:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 100%: Index rebuilt', diff saved to https://phabricator.wikimedia.org/P38935 and previous config saved to /var/cache/conftool/dbconfig/20221110-120537-ladsgroup.json [12:05:41] (03CR) 10David Caro: [C: 03+1] toolforge: k8s: component: account for .git in the git repository URL (032 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/855497 (owner: 10Arturo Borrero Gonzalez) [12:06:03] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [12:06:05] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [12:06:06] (03PS4) 10Clément Goubert: mw-web: test templating servergroup [deployment-charts] - 10https://gerrit.wikimedia.org/r/855539 [12:06:07] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-docker-registry rolling restart_daemons on A:docker-registry [12:06:32] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: DRY attempt [deployment-charts] - 10https://gerrit.wikimedia.org/r/855535 (owner: 10Giuseppe Lavagetto) [12:08:42] (03CR) 10CI reject: [V: 04-1] mw-web: test templating servergroup [deployment-charts] - 10https://gerrit.wikimedia.org/r/855539 (owner: 10Clément Goubert) [12:08:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-docker-registry (exit_code=0) rolling restart_daemons on A:docker-registry [12:10:09] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:10:40] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:10:59] (03Merged) 10jenkins-bot: mediawiki: DRY attempt [deployment-charts] - 10https://gerrit.wikimedia.org/r/855535 (owner: 10Giuseppe Lavagetto) [12:12:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P38936 and previous config saved to /var/cache/conftool/dbconfig/20221110-121202-ladsgroup.json [12:12:23] (03CR) 10David Caro: Modify maintain-dbusers.py to call the rest-api service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [12:12:58] (03PS1) 10Kosta Harlan: GrowthExperiments: Use job queue for refreshUserImpact script [puppet] - 10https://gerrit.wikimedia.org/r/855546 (https://phabricator.wikimedia.org/T322706) [12:13:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T321130)', diff saved to https://phabricator.wikimedia.org/P38937 and previous config saved to /var/cache/conftool/dbconfig/20221110-121313-marostegui.json [12:13:15] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1150.eqiad.wmnet with reason: Maintenance [12:13:16] (03CR) 10David Caro: Modify maintain-dbusers.py to call the rest-api service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [12:13:18] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [12:13:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1150.eqiad.wmnet with reason: Maintenance [12:13:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T321123)', diff saved to https://phabricator.wikimedia.org/P38938 and previous config saved to 
/var/cache/conftool/dbconfig/20221110-121339-marostegui.json [12:13:44] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [12:13:58] (03CR) 10Kosta Harlan: [WIP] Add GrowthExperiments periodic maintenance scripts for user impact (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/854142 (https://phabricator.wikimedia.org/T322541) (owner: 10Gergő Tisza) [12:15:08] (03PS5) 10Clément Goubert: mw-web: test templating servergroup [deployment-charts] - 10https://gerrit.wikimedia.org/r/855539 [12:15:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1161.eqiad.wmnet with reason: Maintenance [12:15:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1161.eqiad.wmnet with reason: Maintenance [12:15:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [12:15:55] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [12:16:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T321130)', diff saved to https://phabricator.wikimedia.org/P38939 and previous config saved to /var/cache/conftool/dbconfig/20221110-121601-marostegui.json [12:16:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P38940 and previous config saved to /var/cache/conftool/dbconfig/20221110-121647-ladsgroup.json [12:17:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110', diff saved to https://phabricator.wikimedia.org/P38941 and previous config saved to /var/cache/conftool/dbconfig/20221110-121730-ladsgroup.json [12:19:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling 
after maintenance db1161 (T321130)', diff saved to https://phabricator.wikimedia.org/P38942 and previous config saved to /var/cache/conftool/dbconfig/20221110-121946-marostegui.json [12:19:51] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [12:20:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P38943 and previous config saved to /var/cache/conftool/dbconfig/20221110-122020-ladsgroup.json [12:21:14] (03PS7) 10Kosta Harlan: Add GrowthExperiments periodic maintenance scripts for user impact [puppet] - 10https://gerrit.wikimedia.org/r/854142 (https://phabricator.wikimedia.org/T322541) (owner: 10Gergő Tisza) [12:21:20] (03CR) 10Kosta Harlan: [C: 03+1] Add GrowthExperiments periodic maintenance scripts for user impact [puppet] - 10https://gerrit.wikimedia.org/r/854142 (https://phabricator.wikimedia.org/T322541) (owner: 10Gergő Tisza) [12:22:33] (03PS2) 10Kosta Harlan: GrowthExperiments: Use job queue for refreshUserImpact script [puppet] - 10https://gerrit.wikimedia.org/r/855546 (https://phabricator.wikimedia.org/T322706) [12:24:48] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): cloud: decide on general idea for having cloud-dedicated hardware provide service in the cloud realm & the internet - https://phabricator.wikimedia.org/T296411 (10aborrero) [12:25:05] (03CR) 10CI reject: [V: 04-1] GrowthExperiments: Use job queue for refreshUserImpact script [puppet] - 10https://gerrit.wikimedia.org/r/855546 (https://phabricator.wikimedia.org/T322706) (owner: 10Kosta Harlan) [12:25:52] (03CR) 10Kosta Harlan: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/855546 (https://phabricator.wikimedia.org/T322706) (owner: 10Kosta Harlan) [12:25:56] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [12:25:59] !log oblivian@deploy1002 helmfile [codfw] DONE 
helmfile.d/services/mw-web: apply [12:26:07] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [12:26:10] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [12:26:28] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): cloud: decide on general idea for having cloud-dedicated hardware provide service in the cloud realm & the internet - https://phabricator.wikimedia.org/T296411 (10aborrero) [12:26:30] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [12:26:32] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [12:27:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T318605)', diff saved to https://phabricator.wikimedia.org/P38944 and previous config saved to /var/cache/conftool/dbconfig/20221110-122708-ladsgroup.json [12:27:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance [12:27:13] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [12:27:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance [12:27:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T318605)', diff saved to https://phabricator.wikimedia.org/P38945 and previous config saved to /var/cache/conftool/dbconfig/20221110-122720-ladsgroup.json [12:27:22] (03CR) 10Muehlenhoff: [C: 03+2] New aliases covering the various etcd installs [puppet] - 10https://gerrit.wikimedia.org/r/855537 (owner: 10Muehlenhoff) [12:28:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P38946 and previous config saved to /var/cache/conftool/dbconfig/20221110-122845-marostegui.json [12:29:34] (03PS6) 10Clément 
Goubert: mw-web: test templating servergroup [deployment-charts] - 10https://gerrit.wikimedia.org/r/855539 [12:31:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P38947 and previous config saved to /var/cache/conftool/dbconfig/20221110-123153-ladsgroup.json [12:31:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [12:32:00] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [12:32:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [12:32:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T322618)', diff saved to https://phabricator.wikimedia.org/P38948 and previous config saved to /var/cache/conftool/dbconfig/20221110-123215-ladsgroup.json [12:32:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110', diff saved to https://phabricator.wikimedia.org/P38949 and previous config saved to /var/cache/conftool/dbconfig/20221110-123237-ladsgroup.json [12:33:14] (03PS1) 10Muehlenhoff: Add ganeti1034 to Ganeti cluster in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/855556 [12:34:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T322618)', diff saved to https://phabricator.wikimedia.org/P38950 and previous config saved to /var/cache/conftool/dbconfig/20221110-123428-ladsgroup.json [12:34:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P38951 and previous config saved to /var/cache/conftool/dbconfig/20221110-123453-marostegui.json [12:35:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T322618)', diff saved to 
https://phabricator.wikimedia.org/P38952 and previous config saved to /var/cache/conftool/dbconfig/20221110-123527-ladsgroup.json [12:37:13] (03PS7) 10Clément Goubert: mw-web: DRY mw-web [deployment-charts] - 10https://gerrit.wikimedia.org/r/855539 [12:38:08] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops-collab: contint1002 service implementation tracking - https://phabricator.wikimedia.org/T313832 (10LSobanski) [12:42:44] (03CR) 10Muehlenhoff: [C: 03+2] Add ganeti1034 to Ganeti cluster in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/855556 (owner: 10Muehlenhoff) [12:43:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P38953 and previous config saved to /var/cache/conftool/dbconfig/20221110-124352-marostegui.json [12:46:00] (03CR) 10Clément Goubert: [C: 03+2] mw-web: DRY mw-web [deployment-charts] - 10https://gerrit.wikimedia.org/r/855539 (owner: 10Clément Goubert) [12:46:02] (03CR) 10JMeybohm: [C: 03+2] calico: Allow different versions, drop pre bullseye support [puppet] - 10https://gerrit.wikimedia.org/r/855012 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [12:47:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110 (T318605)', diff saved to https://phabricator.wikimedia.org/P38954 and previous config saved to /var/cache/conftool/dbconfig/20221110-124743-ladsgroup.json [12:47:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2119.codfw.wmnet with reason: Maintenance [12:47:48] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [12:47:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2119.codfw.wmnet with reason: Maintenance [12:48:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2119 (T318605)', diff saved to 
https://phabricator.wikimedia.org/P38955 and previous config saved to /var/cache/conftool/dbconfig/20221110-124805-ladsgroup.json [12:49:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P38956 and previous config saved to /var/cache/conftool/dbconfig/20221110-124936-ladsgroup.json [12:50:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P38957 and previous config saved to /var/cache/conftool/dbconfig/20221110-124959-marostegui.json [12:50:31] (03PS2) 10Majavah: P:openstack: explicit rules for haproxy backend traffic POC [puppet] - 10https://gerrit.wikimedia.org/r/854875 [12:50:47] (03Merged) 10jenkins-bot: mw-web: DRY mw-web [deployment-charts] - 10https://gerrit.wikimedia.org/r/855539 (owner: 10Clément Goubert) [12:51:24] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [12:51:26] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [12:51:34] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [12:51:36] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [12:53:56] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1034.eqiad.wmnet [12:58:38] (03CR) 10Ladsgroup: "Let me know when I can merge this, maybe next week?" 
[puppet] - 10https://gerrit.wikimedia.org/r/854142 (https://phabricator.wikimedia.org/T322541) (owner: 10Gergő Tisza) [12:58:44] (03PS7) 10Clément Goubert: mediawiki: Create new mw-api-int deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/853933 (https://phabricator.wikimedia.org/T321895) [12:58:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T321123)', diff saved to https://phabricator.wikimedia.org/P38958 and previous config saved to /var/cache/conftool/dbconfig/20221110-125858-marostegui.json [12:59:00] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1107.eqiad.wmnet with reason: Maintenance [12:59:03] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [12:59:14] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1107.eqiad.wmnet with reason: Maintenance [12:59:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1107 (T321123)', diff saved to https://phabricator.wikimedia.org/P38959 and previous config saved to /var/cache/conftool/dbconfig/20221110-125919-marostegui.json [13:00:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1107 (T321123)', diff saved to https://phabricator.wikimedia.org/P38960 and previous config saved to /var/cache/conftool/dbconfig/20221110-130027-marostegui.json [13:00:47] (03CR) 10Kosta Harlan: [C: 03+1] Add GrowthExperiments periodic maintenance scripts for user impact (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/854142 (https://phabricator.wikimedia.org/T322541) (owner: 10Gergő Tisza) [13:01:33] (03CR) 10Ladsgroup: Add GrowthExperiments periodic maintenance scripts for user impact (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/854142 (https://phabricator.wikimedia.org/T322541) (owner: 10Gergő Tisza) [13:01:39] !log jmm@cumin2002 END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host ganeti1034.eqiad.wmnet [13:02:30] (03PS3) 10Daniel Kinzler: mediawiki.org: set VE to new direct mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/855029 [13:04:16] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1034.eqiad.wmnet to cluster eqiad and group D [13:04:29] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1034.eqiad.wmnet to cluster eqiad and group D [13:04:39] (03PS7) 10Clément Goubert: mediawiki: Create new mw-api-ext deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/853952 (https://phabricator.wikimedia.org/T321896) [13:04:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P38961 and previous config saved to /var/cache/conftool/dbconfig/20221110-130443-ladsgroup.json [13:05:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T321130)', diff saved to https://phabricator.wikimedia.org/P38962 and previous config saved to /var/cache/conftool/dbconfig/20221110-130506-marostegui.json [13:05:08] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1185.eqiad.wmnet with reason: Maintenance [13:05:10] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [13:05:21] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1185.eqiad.wmnet with reason: Maintenance [13:05:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1185 (T321130)', diff saved to https://phabricator.wikimedia.org/P38963 and previous config saved to /var/cache/conftool/dbconfig/20221110-130527-marostegui.json [13:06:50] (03PS8) 10Clément Goubert: mediawiki: Create new mw-api-int deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/853933 (https://phabricator.wikimedia.org/T321895) 
[13:07:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T321130)', diff saved to https://phabricator.wikimedia.org/P38964 and previous config saved to /var/cache/conftool/dbconfig/20221110-130753-marostegui.json [13:11:19] (03PS7) 10Clément Goubert: mediawiki: Create new mw-jobrunner deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/853958 (https://phabricator.wikimedia.org/T321897) [13:15:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1107', diff saved to https://phabricator.wikimedia.org/P38965 and previous config saved to /var/cache/conftool/dbconfig/20221110-131533-marostegui.json [13:15:42] (03PS1) 10Clément Goubert: mw-web: move tls public_port out of global [deployment-charts] - 10https://gerrit.wikimedia.org/r/855566 [13:18:16] (03PS2) 10Clément Goubert: mw-web: move tls public_port out of global [deployment-charts] - 10https://gerrit.wikimedia.org/r/855566 [13:19:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T322618)', diff saved to https://phabricator.wikimedia.org/P38966 and previous config saved to /var/cache/conftool/dbconfig/20221110-131949-ladsgroup.json [13:19:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1136.eqiad.wmnet with reason: Maintenance [13:19:54] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [13:19:57] (03Abandoned) 10JMeybohm: kubernetes: Switch to using systemd cgroupdriver [puppet] - 10https://gerrit.wikimedia.org/r/524186 (https://phabricator.wikimedia.org/T277876) (owner: 10Alexandros Kosiaris) [13:20:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1136.eqiad.wmnet with reason: Maintenance [13:20:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T322618)', diff saved to https://phabricator.wikimedia.org/P38967 and 
previous config saved to /var/cache/conftool/dbconfig/20221110-132010-ladsgroup.json [13:20:37] (03PS8) 10Clément Goubert: mediawiki: Create new mw-api-ext deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/853952 (https://phabricator.wikimedia.org/T321896) [13:23:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P38968 and previous config saved to /var/cache/conftool/dbconfig/20221110-132300-marostegui.json [13:23:04] (03CR) 10Clément Goubert: [C: 03+2] mw-web: move tls public_port out of global [deployment-charts] - 10https://gerrit.wikimedia.org/r/855566 (owner: 10Clément Goubert) [13:26:08] (03CR) 10Effie Mouzeli: "Lovely to see this moving forward!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/531699 (https://phabricator.wikimedia.org/T231006) (owner: 10Mathew.onipe) [13:26:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T322618)', diff saved to https://phabricator.wikimedia.org/P38969 and previous config saved to /var/cache/conftool/dbconfig/20221110-132622-ladsgroup.json [13:26:26] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [13:27:50] (03PS1) 10JMeybohm: k8s: Use systemd cgroup (v2) driver with 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/855567 (https://phabricator.wikimedia.org/T313473) [13:27:53] (03Merged) 10jenkins-bot: mw-web: move tls public_port out of global [deployment-charts] - 10https://gerrit.wikimedia.org/r/855566 (owner: 10Clément Goubert) [13:28:41] (03PS9) 10Vgutierrez: No-op change. 
Replace the idea of stickycounters with actions [puppet] - 10https://gerrit.wikimedia.org/r/855089 (https://phabricator.wikimedia.org/T306580) (owner: 10CDanis) [13:29:11] (03PS2) 10JMeybohm: k8s: Use systemd cgroup (v2) driver with 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/855567 (https://phabricator.wikimedia.org/T313473) [13:29:15] RECOVERY - BGP status on cr1-drmrs is OK: BGP OK - up: 61, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:29:44] (03CR) 10Clément Goubert: [C: 03+2] mediawiki: Create new mw-api-ext deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/853952 (https://phabricator.wikimedia.org/T321896) (owner: 10Clément Goubert) [13:30:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1107', diff saved to https://phabricator.wikimedia.org/P38970 and previous config saved to /var/cache/conftool/dbconfig/20221110-133040-marostegui.json [13:33:06] (03CR) 10Vgutierrez: [C: 03+1] No-op change. 
Replace the idea of stickycounters with actions [puppet] - 10https://gerrit.wikimedia.org/r/855089 (https://phabricator.wikimedia.org/T306580) (owner: 10CDanis) [13:33:41] (03CR) 10Vgutierrez: [C: 03+1] haproxy: concurrency tracking as discussed [puppet] - 10https://gerrit.wikimedia.org/r/855098 (https://phabricator.wikimedia.org/T306580) (owner: 10CDanis) [13:33:53] (03PS3) 10Phuedx: EditAttemptStep sampling rate to 1 everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854570 (https://phabricator.wikimedia.org/T312016) [13:34:20] (03Merged) 10jenkins-bot: mediawiki: Create new mw-api-ext deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/853952 (https://phabricator.wikimedia.org/T321896) (owner: 10Clément Goubert) [13:34:55] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 11): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38078/console" [puppet] - 10https://gerrit.wikimedia.org/r/855567 (https://phabricator.wikimedia.org/T313473) (owner: 10JMeybohm) [13:35:44] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [13:36:07] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [13:36:40] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [13:37:02] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [13:37:25] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [13:38:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P38971 and previous config saved to /var/cache/conftool/dbconfig/20221110-133806-marostegui.json [13:38:58] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM! 
We'll sync up on the timing of the change as I need to disable the second link before adding the Vlans to the first (avoid the brid" [puppet] - 10https://gerrit.wikimedia.org/r/855043 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [13:39:37] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [13:40:19] (03CR) 10Cathal Mooney: [C: 03+1] cloudvirt2002-dev: move to a single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/855042 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [13:40:53] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1034.eqiad.wmnet to cluster eqiad and group D [13:41:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P38972 and previous config saved to /var/cache/conftool/dbconfig/20221110-134128-ladsgroup.json [13:41:33] (03CR) 10Clément Goubert: [C: 03+2] mediawiki: Create new mw-api-int deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/853933 (https://phabricator.wikimedia.org/T321895) (owner: 10Clément Goubert) [13:45:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1107 (T321123)', diff saved to https://phabricator.wikimedia.org/P38973 and previous config saved to /var/cache/conftool/dbconfig/20221110-134546-marostegui.json [13:45:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1118.eqiad.wmnet with reason: Maintenance [13:45:49] (03Merged) 10jenkins-bot: mediawiki: Create new mw-api-int deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/853933 (https://phabricator.wikimedia.org/T321895) (owner: 10Clément Goubert) [13:45:51] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [13:46:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
8:00:00 on db1118.eqiad.wmnet with reason: Maintenance [13:46:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1118 (T321123)', diff saved to https://phabricator.wikimedia.org/P38974 and previous config saved to /var/cache/conftool/dbconfig/20221110-134608-marostegui.json [13:46:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q1:rack/setup/install ganeti103[34] - https://phabricator.wikimedia.org/T314303 (10MoritzMuehlenhoff) >>! In T314303#8384940, @RobH wrote: > @MoritzMuehlenhoff i recall you stating the puppet run fails in the installer but then just re-run after a... [13:46:21] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [13:46:46] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [13:47:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T321123)', diff saved to https://phabricator.wikimedia.org/P38975 and previous config saved to /var/cache/conftool/dbconfig/20221110-134715-marostegui.json [13:48:02] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [13:48:24] (03CR) 10Clément Goubert: [C: 03+2] mediawiki: Create new mw-jobrunner deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/853958 (https://phabricator.wikimedia.org/T321897) (owner: 10Clément Goubert) [13:50:16] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [13:52:12] (03CR) 10Elukey: [C: 03+1] k8s: Use systemd cgroup (v2) driver with 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/855567 (https://phabricator.wikimedia.org/T313473) (owner: 10JMeybohm) [13:52:28] !log installing expat security updates [13:52:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:50] (03Merged) 10jenkins-bot: mediawiki: Create new mw-jobrunner deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/853958 (https://phabricator.wikimedia.org/T321897) 
(owner: 10Clément Goubert) [13:53:12] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply [13:53:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T321130)', diff saved to https://phabricator.wikimedia.org/P38976 and previous config saved to /var/cache/conftool/dbconfig/20221110-135313-marostegui.json [13:53:15] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1200.eqiad.wmnet with reason: Maintenance [13:53:18] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [13:53:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1200.eqiad.wmnet with reason: Maintenance [13:53:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1200 (T321130)', diff saved to https://phabricator.wikimedia.org/P38977 and previous config saved to /var/cache/conftool/dbconfig/20221110-135334-marostegui.json [13:53:52] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1034.eqiad.wmnet to cluster eqiad and group D [13:56:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T321130)', diff saved to https://phabricator.wikimedia.org/P38978 and previous config saved to /var/cache/conftool/dbconfig/20221110-135600-marostegui.json [13:56:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P38979 and previous config saved to /var/cache/conftool/dbconfig/20221110-135635-ladsgroup.json [13:57:22] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38079/console" [puppet] - 10https://gerrit.wikimedia.org/r/854499 (https://phabricator.wikimedia.org/T314981) (owner: 10Elukey) [13:57:43] (03PS1) 10Vgutierrez: cache::haproxy: Support http --> https 
redirection [puppet] - 10https://gerrit.wikimedia.org/r/855570 (https://phabricator.wikimedia.org/T322774) [13:58:18] (03CR) 10CI reject: [V: 04-1] cache::haproxy: Support http --> https redirection [puppet] - 10https://gerrit.wikimedia.org/r/855570 (https://phabricator.wikimedia.org/T322774) (owner: 10Vgutierrez) [13:58:25] (03PS2) 10Vgutierrez: cache::haproxy: Support http --> https redirection [puppet] - 10https://gerrit.wikimedia.org/r/855570 (https://phabricator.wikimedia.org/T322774) [13:58:33] yeah yeah.. I forgot to add one file to the commit.. thanks CI [13:59:05] (03CR) 10CI reject: [V: 04-1] cache::haproxy: Support http --> https redirection [puppet] - 10https://gerrit.wikimedia.org/r/855570 (https://phabricator.wikimedia.org/T322774) (owner: 10Vgutierrez) [13:59:29] (03PS3) 10Vgutierrez: cache::haproxy: Support http --> https redirection [puppet] - 10https://gerrit.wikimedia.org/r/855570 (https://phabricator.wikimedia.org/T322774) [14:00:03] (03CR) 10CI reject: [V: 04-1] cache::haproxy: Support http --> https redirection [puppet] - 10https://gerrit.wikimedia.org/r/855570 (https://phabricator.wikimedia.org/T322774) (owner: 10Vgutierrez) [14:00:04] Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221110T1400) [14:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: Your horoscope predicts another unfortunate UTC afternoon backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221110T1400). [14:00:04] duesen: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[14:00:26] o/ [14:00:55] !log drain ganeti1020 for eventual reimage to bullseye T311687 [14:00:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:59] T311687: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 [14:01:35] (03PS4) 10Vgutierrez: cache::haproxy: Support http --> https redirection [puppet] - 10https://gerrit.wikimedia.org/r/855570 (https://phabricator.wikimedia.org/T322774) [14:02:17] o/ [14:02:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P38980 and previous config saved to /var/cache/conftool/dbconfig/20221110-140222-marostegui.json [14:02:26] duesen: want to self-service? [14:02:34] can do [14:03:20] (03CR) 10Hokwelum: "Thank you Daniel for helping with this! Ariel and I looked at this and there are a few things we are concerned about. Firstly, Stdlib::Hos" [puppet] - 10https://gerrit.wikimedia.org/r/855096 (owner: 10Dzahn) [14:03:58] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] k8s: Use systemd cgroup (v2) driver with 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/855567 (https://phabricator.wikimedia.org/T313473) (owner: 10JMeybohm) [14:04:33] !log rolling restart of FPM and Apache on mw canaries to pick up expat security update [14:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:57] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by daniel@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/855029 (owner: 10Daniel Kinzler) [14:05:34] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for ryasmeen (superset access with no server access) - https://phabricator.wikimedia.org/T322795 (10Ottomata) Approved. 
[14:06:01] (03Merged) 10jenkins-bot: mediawiki.org: set VE to new direct mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/855029 (owner: 10Daniel Kinzler) [14:06:09] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for David.pujol - https://phabricator.wikimedia.org/T322670 (10Ottomata) Approved from DE. [14:06:15] !log daniel@deploy1002 Started scap: Backport for [[gerrit:855029|mediawiki.org: set VE to new direct mode]] [14:06:17] (03PS1) 10Kosta Harlan: refreshUserImpactData: Add option to use job queue [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/855525 (https://phabricator.wikimedia.org/T322706) [14:06:35] !log daniel@deploy1002 daniel and daniel: Backport for [[gerrit:855029|mediawiki.org: set VE to new direct mode]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [14:06:39] (HelmReleaseBadStatus) firing: Helm release mw-jobrunner/main on k8s@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-jobrunner - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:06:41] hi [14:07:20] (03CR) 10ArielGlenn: "I see that a bunch of data types are being added generally, just curious,is this part of a migration plan to a later puppet version, or an" [puppet] - 10https://gerrit.wikimedia.org/r/852260 (owner: 10Dzahn) [14:07:32] I have a late addition I just added to the window [14:07:53] as I just rejoined this channel, I am not sure if someone is deploying right now? duesen perhaps? [14:08:04] Lucas_WMDE: I'm now verifying on the debug host [14:08:17] ok [14:08:21] kostajh: yes, I'm on it. 
should be done in five minutes or so [14:08:21] kostajh: duesen is deploying, yeah [14:10:00] (03CR) 10Effie Mouzeli: "(Sorry for the "This change is ready for review" message, it was added by gerrit)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/531699 (https://phabricator.wikimedia.org/T231006) (owner: 10Mathew.onipe) [14:10:29] ok, looking good. syncing [14:11:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P38981 and previous config saved to /var/cache/conftool/dbconfig/20221110-141106-marostegui.json [14:11:41] (03PS1) 10Giuseppe Lavagetto: mw*: tune down resource usage for now [deployment-charts] - 10https://gerrit.wikimedia.org/r/855573 [14:11:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T322618)', diff saved to https://phabricator.wikimedia.org/P38982 and previous config saved to /var/cache/conftool/dbconfig/20221110-141141-ladsgroup.json [14:11:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [14:11:45] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [14:11:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [14:11:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [14:12:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [14:12:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T322618)', diff saved to https://phabricator.wikimedia.org/P38983 and previous config saved to 
/var/cache/conftool/dbconfig/20221110-141220-ladsgroup.json [14:13:09] Amir1: I'm enabling parsoid cache warming on mediawiki.org now. The number of parsoid keys on the parser cache should start to slowly go up as people edit pages. If it doesn't go up, something is wrong. [14:13:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [14:14:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T322618)', diff saved to https://phabricator.wikimedia.org/P38984 and previous config saved to /var/cache/conftool/dbconfig/20221110-141431-ladsgroup.json [14:14:33] !log daniel@deploy1002 Finished scap: Backport for [[gerrit:855029|mediawiki.org: set VE to new direct mode]] (duration: 08m 17s) [14:16:13] noted [14:17:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P38985 and previous config saved to /var/cache/conftool/dbconfig/20221110-141728-marostegui.json [14:18:32] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mw*: tune down resource usage for now [deployment-charts] - 10https://gerrit.wikimedia.org/r/855573 (owner: 10Giuseppe Lavagetto) [14:19:06] (03CR) 10Filippo Giunchedi: [C: 03+2] admin: add ryasmeen to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/855492 (https://phabricator.wikimedia.org/T322795) (owner: 10Filippo Giunchedi) [14:19:44] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [14:20:25] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review, 10User-Ryasmeen: Requesting access to analytics-privatedata-users for ryasmeen (superset access with no server access) - https://phabricator.wikimedia.org/T322795 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Thank you @Ottomata ! @Ryasmeen this is... 
[14:20:33] (03PS1) 10Kosta Harlan: GrowthExperiments: Set feature-flag for RefreshUserImpactDataMaintenanceScriptEnabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/855576 (https://phabricator.wikimedia.org/T313395) [14:21:00] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [14:21:14] (03CR) 10CI reject: [V: 04-1] GrowthExperiments: Set feature-flag for RefreshUserImpactDataMaintenanceScriptEnabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/855576 (https://phabricator.wikimedia.org/T313395) (owner: 10Kosta Harlan) [14:21:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [14:21:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [14:21:43] (03PS2) 10Kosta Harlan: GrowthExperiments: Set feature-flag for RefreshUserImpactDataMaintenanceScriptEnabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/855576 (https://phabricator.wikimedia.org/T313395) [14:22:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [14:22:26] duesen: are you finished, could I go ahead with syncing my patch? 
[14:22:33] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [14:24:11] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [14:24:23] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply [14:24:44] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/855576 (https://phabricator.wikimedia.org/T313395) (owner: 10Kosta Harlan) [14:25:00] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply [14:26:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P38986 and previous config saved to /var/cache/conftool/dbconfig/20221110-142614-marostegui.json [14:26:27] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply [14:27:18] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply [14:27:36] (03PS3) 10JHathaway: aux-k8s: add BGP config for calico [homer/public] - 10https://gerrit.wikimedia.org/r/854110 (https://phabricator.wikimedia.org/T321120) [14:27:51] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply [14:28:20] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply [14:28:23] kostajh: yea, I'm done. Sorry, got lost in my inbox while scap was running... [14:28:28] (03CR) 10JHathaway: aux-k8s: add BGP config for calico (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/854110 (https://phabricator.wikimedia.org/T321120) (owner: 10JHathaway) [14:28:33] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [14:28:34] no worries. 
starting with my patches now, then [14:29:13] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/855576 (https://phabricator.wikimedia.org/T313395) (owner: 10Kosta Harlan) [14:29:30] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [14:29:36] (03PS1) 10Kosta Harlan: refreshUserImpactData: Add feature flag [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/855587 (https://phabricator.wikimedia.org/T313395) [14:29:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P38987 and previous config saved to /var/cache/conftool/dbconfig/20221110-142938-ladsgroup.json [14:30:19] ack [14:30:43] (03Merged) 10jenkins-bot: GrowthExperiments: Set feature-flag for RefreshUserImpactDataMaintenanceScriptEnabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/855576 (https://phabricator.wikimedia.org/T313395) (owner: 10Kosta Harlan) [14:30:55] !log kharlan@deploy1002 Started scap: Backport for [[gerrit:855576|GrowthExperiments: Set feature-flag for RefreshUserImpactDataMaintenanceScriptEnabled (T313395)]] [14:30:59] T313395: User impact API: Create maintenance script for refreshing data - https://phabricator.wikimedia.org/T313395 [14:31:15] !log kharlan@deploy1002 kharlan and kharlan: Backport for [[gerrit:855576|GrowthExperiments: Set feature-flag for RefreshUserImpactDataMaintenanceScriptEnabled (T313395)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [14:31:39] (HelmReleaseBadStatus) resolved: Helm release mw-jobrunner/main on k8s@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-jobrunner - 
https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:32:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T321123)', diff saved to https://phabricator.wikimedia.org/P38988 and previous config saved to /var/cache/conftool/dbconfig/20221110-143235-marostegui.json [14:32:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1119.eqiad.wmnet with reason: Maintenance [14:32:39] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [14:32:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1119.eqiad.wmnet with reason: Maintenance [14:32:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T321123)', diff saved to https://phabricator.wikimedia.org/P38989 and previous config saved to /var/cache/conftool/dbconfig/20221110-143256-marostegui.json [14:33:37] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [14:34:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T321123)', diff saved to https://phabricator.wikimedia.org/P38990 and previous config saved to /var/cache/conftool/dbconfig/20221110-143404-marostegui.json [14:34:38] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [14:35:53] !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:855576|GrowthExperiments: Set feature-flag for RefreshUserImpactDataMaintenanceScriptEnabled (T313395)]] (duration: 04m 57s) [14:36:03] ok, two more to go [14:36:15] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/855525 (https://phabricator.wikimedia.org/T322706) (owner: 10Kosta Harlan) [14:37:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply 
[14:38:05] Lucas_WMDE urbanecm can I backport two patches together with scap backport? [14:38:13] no idea [14:38:17] kostajh: yes, `scap backport 123 456` will do that [14:38:24] !log kharlan@deploy1002 backport aborted: (duration: 02m 16s) [14:38:26] (where 123 and 456 are the numeric IDs of your backports) [14:38:30] ok, I'll try it [14:38:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [14:38:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [14:38:54] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/855525 (https://phabricator.wikimedia.org/T322706) (owner: 10Kosta Harlan) [14:38:58] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/855587 (https://phabricator.wikimedia.org/T313395) (owner: 10Kosta Harlan) [14:39:14] seems like it worked [14:39:18] rad [14:39:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [14:41:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T321130)', diff saved to https://phabricator.wikimedia.org/P38991 and previous config saved to /var/cache/conftool/dbconfig/20221110-144121-marostegui.json [14:41:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [14:41:26] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [14:41:37] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [14:43:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2101.codfw.wmnet with reason: Maintenance 
[14:43:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2101.codfw.wmnet with reason: Maintenance [14:44:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P38992 and previous config saved to /var/cache/conftool/dbconfig/20221110-144447-ladsgroup.json [14:45:42] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2111.codfw.wmnet with reason: Maintenance [14:45:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2111.codfw.wmnet with reason: Maintenance [14:46:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2111 (T321130)', diff saved to https://phabricator.wikimedia.org/P38993 and previous config saved to /var/cache/conftool/dbconfig/20221110-144602-marostegui.json [14:46:48] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4052.ulsfo.wmnet [14:48:22] (03CR) 10Volans: [C: 03+1] "LGTM, minor nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/854992 (owner: 10Alexandros Kosiaris) [14:49:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P38994 and previous config saved to /var/cache/conftool/dbconfig/20221110-144911-marostegui.json [14:49:49] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp6016.drmrs.wmnet [14:50:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T321130)', diff saved to https://phabricator.wikimedia.org/P38995 and previous config saved to /var/cache/conftool/dbconfig/20221110-145018-marostegui.json [14:50:22] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [14:51:01] !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart rolling restart_daemons on A:maps-replica-codfw 
[14:53:02] (03CR) 10Alexandros Kosiaris: utils: Add a role_team_stats.py script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/854992 (owner: 10Alexandros Kosiaris) [14:53:13] (03PS7) 10Alexandros Kosiaris: utils: Add a role_team_stats.py script [puppet] - 10https://gerrit.wikimedia.org/r/854992 [14:55:07] (03PS1) 10Ssingh: sites.yaml: add lvs4008, replacing lvs4005 (ulsfo hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/855583 (https://phabricator.wikimedia.org/T317247) [14:55:19] (03Merged) 10jenkins-bot: refreshUserImpactData: Add option to use job queue [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/855525 (https://phabricator.wikimedia.org/T322706) (owner: 10Kosta Harlan) [14:55:30] still waiting on the tests to finish [14:57:00] oof [14:57:02] jouncebot: next [14:57:03] In 2 hour(s) and 2 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221110T1700) [14:57:17] it’s probably okay to overrun a bit [14:57:21] nearly there [14:57:48] * Lucas_WMDE notices that gate-and-submit-wmf doesn’t include php81 yet [14:57:48] (03CR) 10Vlad.shapik: [C: 04-1] Decode poolcounter messages, fix 429 error (031 comment) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/855033 (https://phabricator.wikimedia.org/T312104) (owner: 10Hnowlan) [14:57:59] (I guess that makes sense, since the latest PHP 8.1 fixes aren’t part of the current train yet) [14:58:04] (03Merged) 10jenkins-bot: refreshUserImpactData: Add feature flag [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/855587 (https://phabricator.wikimedia.org/T313395) (owner: 10Kosta Harlan) [14:58:10] yay [14:58:23] !log kharlan@deploy1002 Started scap: Backport for [[gerrit:855525|refreshUserImpactData: Add option to use job queue (T322706)]], [[gerrit:855587|refreshUserImpactData: Add feature flag (T313395)]] [14:58:24] stuff is happening [14:58:28] T322706: User 
impact API: Maintenance scripts should defer work to the job queue - https://phabricator.wikimedia.org/T322706 [14:58:28] :D [14:58:28] T313395: User impact API: Create maintenance script for refreshing data - https://phabricator.wikimedia.org/T313395 [14:58:38] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks for the review and the +1. Merging!" [puppet] - 10https://gerrit.wikimedia.org/r/854992 (owner: 10Alexandros Kosiaris) [14:58:42] !log kharlan@deploy1002 kharlan and kharlan: Backport for [[gerrit:855525|refreshUserImpactData: Add option to use job queue (T322706)]], [[gerrit:855587|refreshUserImpactData: Add feature flag (T313395)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [14:59:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T322618)', diff saved to https://phabricator.wikimedia.org/P38996 and previous config saved to /var/cache/conftool/dbconfig/20221110-145953-ladsgroup.json [14:59:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [14:59:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [14:59:58] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [15:00:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [15:00:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P38997 and previous config saved to /var/cache/conftool/dbconfig/20221110-150015-ladsgroup.json [15:01:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [15:01:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [15:02:00] !log 
mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [15:02:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P38998 and previous config saved to /var/cache/conftool/dbconfig/20221110-150226-ladsgroup.json [15:03:10] !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:855525|refreshUserImpactData: Add option to use job queue (T322706)]], [[gerrit:855587|refreshUserImpactData: Add feature flag (T313395)]] (duration: 04m 47s) [15:03:25] done [15:04:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P38999 and previous config saved to /var/cache/conftool/dbconfig/20221110-150417-marostegui.json [15:05:02] \o/ [15:05:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P39000 and previous config saved to /var/cache/conftool/dbconfig/20221110-150524-marostegui.json [15:07:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [15:07:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [15:07:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [15:08:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [15:10:20] (03PS2) 10Ssingh: sites.yaml: add lvs4008 (ulsfo hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/855583 (https://phabricator.wikimedia.org/T317247) [15:13:13] (03PS1) 10Ssingh: lvs4008: commission new LVS host (ulsfo hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/855607 (https://phabricator.wikimedia.org/T317247) [15:13:58] (03PS1) 10Filippo Giunchedi: sre: fix dashboard links for k8s latency [alerts] - 10https://gerrit.wikimedia.org/r/855608 [15:15:13] 
(03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [homer/public] - 10https://gerrit.wikimedia.org/r/855583 (https://phabricator.wikimedia.org/T317247) (owner: 10Ssingh) [15:15:23] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/855607 (https://phabricator.wikimedia.org/T317247) (owner: 10Ssingh) [15:16:09] (03CR) 10Ssingh: [C: 03+2] sites.yaml: add lvs4008 (ulsfo hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/855583 (https://phabricator.wikimedia.org/T317247) (owner: 10Ssingh) [15:17:18] (03CR) 10Giuseppe Lavagetto: [C: 03+2] New organization of templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/837495 (owner: 10Giuseppe Lavagetto) [15:17:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P39001 and previous config saved to /var/cache/conftool/dbconfig/20221110-151733-ladsgroup.json [15:18:22] (03PS2) 10Ottomata: Create platform-eng-deployers group for deploying airflow platform_eng [puppet] - 10https://gerrit.wikimedia.org/r/855059 (https://phabricator.wikimedia.org/T321925) [15:18:54] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/854142 (https://phabricator.wikimedia.org/T322541) (owner: 10Gergő Tisza) [15:19:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q1:rack/setup/install ganeti103[34] - https://phabricator.wikimedia.org/T314303 (10Volans) >>! In T314303#8384910, @ops-monitoring-bot wrote: > Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host ganeti1034.eqiad.wmnet with OS... 
[15:19:24] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/855546 (https://phabricator.wikimedia.org/T322706) (owner: 10Kosta Harlan) [15:19:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T321123)', diff saved to https://phabricator.wikimedia.org/P39002 and previous config saved to /var/cache/conftool/dbconfig/20221110-151924-marostegui.json [15:19:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1128.eqiad.wmnet with reason: Maintenance [15:19:29] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [15:19:39] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1128.eqiad.wmnet with reason: Maintenance [15:19:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1128 (T321123)', diff saved to https://phabricator.wikimedia.org/P39003 and previous config saved to /var/cache/conftool/dbconfig/20221110-151945-marostegui.json [15:19:53] (03PS8) 10Urbanecm: Add GrowthExperiments periodic maintenance scripts for user impact [puppet] - 10https://gerrit.wikimedia.org/r/854142 (https://phabricator.wikimedia.org/T322541) (owner: 10Gergő Tisza) [15:19:59] (03PS3) 10Urbanecm: GrowthExperiments: Use job queue for refreshUserImpact script [puppet] - 10https://gerrit.wikimedia.org/r/855546 (https://phabricator.wikimedia.org/T322706) (owner: 10Kosta Harlan) [15:20:04] (03CR) 10Urbanecm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/854142 (https://phabricator.wikimedia.org/T322541) (owner: 10Gergő Tisza) [15:20:06] (03CR) 10Urbanecm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/855546 (https://phabricator.wikimedia.org/T322706) (owner: 10Kosta Harlan) [15:20:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P39004 and previous 
config saved to /var/cache/conftool/dbconfig/20221110-152031-marostegui.json [15:20:40] (03PS1) 10Volans: netbox: update allowed state transitions [software/spicerack] - 10https://gerrit.wikimedia.org/r/855610 (https://phabricator.wikimedia.org/T320696) [15:21:22] (03PS2) 10Ssingh: lvs4008: commission new LVS host (ulsfo hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/855607 (https://phabricator.wikimedia.org/T317247) [15:22:22] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38080/console" [puppet] - 10https://gerrit.wikimedia.org/r/855607 (https://phabricator.wikimedia.org/T317247) (owner: 10Ssingh) [15:22:27] (03Merged) 10jenkins-bot: New organization of templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/837495 (owner: 10Giuseppe Lavagetto) [15:22:30] (03CR) 10Ayounsi: [C: 03+1] netbox: update allowed state transitions [software/spicerack] - 10https://gerrit.wikimedia.org/r/855610 (https://phabricator.wikimedia.org/T320696) (owner: 10Volans) [15:27:47] (03PS3) 10Ssingh: lvs4008: commission new LVS host (ulsfo hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/855607 (https://phabricator.wikimedia.org/T317247) [15:28:26] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/855034 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [15:28:37] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] cloudgw2002-dev: move to a single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/855034 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [15:28:51] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38081/console" [puppet] - 10https://gerrit.wikimedia.org/r/855607 (https://phabricator.wikimedia.org/T317247) (owner: 10Ssingh) [15:28:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T321123)', diff saved to https://phabricator.wikimedia.org/P39005 and previous config saved to /var/cache/conftool/dbconfig/20221110-152854-marostegui.json [15:28:59] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [15:30:30] (03CR) 10Vgutierrez: [V: 03+1 C: 03+1] lvs4008: commission new LVS host (ulsfo hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/855607 (https://phabricator.wikimedia.org/T317247) (owner: 10Ssingh) [15:31:03] (03CR) 10CI reject: [V: 04-1] netbox: update allowed state transitions [software/spicerack] - 10https://gerrit.wikimedia.org/r/855610 (https://phabricator.wikimedia.org/T320696) (owner: 10Volans) [15:31:12] (03CR) 10Ssingh: [C: 03+2] lvs4008: commission new LVS host (ulsfo hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/855607 (https://phabricator.wikimedia.org/T317247) (owner: 10Ssingh) [15:32:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P39006 and previous config saved to /var/cache/conftool/dbconfig/20221110-153240-ladsgroup.json [15:33:06] !log aborrero@cumin2002 START - Cookbook sre.hosts.reimage for host 
cloudgw2002-dev.codfw.wmnet with OS bullseye [15:33:16] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin2002 for host cloudgw2002-dev.codfw.wmnet with... [15:35:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T321130)', diff saved to https://phabricator.wikimedia.org/P39007 and previous config saved to /var/cache/conftool/dbconfig/20221110-153537-marostegui.json [15:35:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2123.codfw.wmnet with reason: Maintenance [15:35:42] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [15:35:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2123.codfw.wmnet with reason: Maintenance [15:35:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2123 (T321130)', diff saved to https://phabricator.wikimedia.org/P39008 and previous config saved to /var/cache/conftool/dbconfig/20221110-153559-marostegui.json [15:38:21] (03CR) 10Michael Große: "Tested it locally and it works as expected" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/855609 (https://phabricator.wikimedia.org/T318310) (owner: 10Michael Große) [15:38:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T321130)', diff saved to https://phabricator.wikimedia.org/P39009 and previous config saved to /var/cache/conftool/dbconfig/20221110-153853-marostegui.json [15:40:05] (03CR) 10Krinkle: [C: 03+1] "Ack. Thx for heads up. We will continue to get arclamp-related cron failures by email, right?" 
[puppet] - 10https://gerrit.wikimedia.org/r/854985 (owner: 10Alexandros Kosiaris) [15:40:44] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Separate identifiers from other statements for Lexemes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/855609 (https://phabricator.wikimedia.org/T318310) (owner: 10Michael Große) [15:40:46] (03CR) 10JHathaway: [C: 03+2] aux-k8s: add BGP config for calico [homer/public] - 10https://gerrit.wikimedia.org/r/854110 (https://phabricator.wikimedia.org/T321120) (owner: 10JHathaway) [15:41:36] (03CR) 10Filippo Giunchedi: "This change is ready for review." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/853280 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi) [15:42:04] that's a lie ^ [15:42:22] lol [15:44:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P39010 and previous config saved to /var/cache/conftool/dbconfig/20221110-154400-marostegui.json [15:44:03] (03CR) 10Michael Große: "Thanks, I'll schedule it for deployment on Monday. I don't think this is important enough to still deploy it tonight." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/855609 (https://phabricator.wikimedia.org/T318310) (owner: 10Michael Große) [15:44:17] out of habit I commented and then clicked the blue button to start the review instead of "send as wip" :> [15:45:36] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Separate identifiers from other statements for Lexemes (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/855609 (https://phabricator.wikimedia.org/T318310) (owner: 10Michael Große) [15:47:22] !log aborrero@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw2002-dev.codfw.wmnet with reason: host reimage [15:47:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T322618)', diff saved to https://phabricator.wikimedia.org/P39011 and previous config saved to /var/cache/conftool/dbconfig/20221110-154746-ladsgroup.json [15:47:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [15:47:50] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [15:48:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [15:48:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [15:48:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [15:48:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T322618)', diff saved to https://phabricator.wikimedia.org/P39012 and previous config saved to /var/cache/conftool/dbconfig/20221110-154827-ladsgroup.json [15:49:18] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/Citoid [15:50:07] !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw2002-dev.codfw.wmnet with reason: host reimage [15:52:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T322618)', diff saved to https://phabricator.wikimedia.org/P39013 and previous config saved to /var/cache/conftool/dbconfig/20221110-155238-ladsgroup.json [15:53:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart (exit_code=0) rolling restart_daemons on A:maps-replica-codfw [15:54:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P39014 and previous config saved to /var/cache/conftool/dbconfig/20221110-155400-marostegui.json [15:56:09] !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart rolling restart_daemons on A:maps-replica-eqiad [15:56:29] (03Abandoned) 10Clément Goubert: mw-debug: add redis_lock egress rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/855495 (owner: 10Clément Goubert) [15:56:30] (03PS2) 10Volans: netbox: update allowed state transitions [software/spicerack] - 10https://gerrit.wikimedia.org/r/855610 (https://phabricator.wikimedia.org/T320696) [15:56:32] (03PS1) 10Volans: mypy: add quick ignore to unblock release [software/spicerack] - 10https://gerrit.wikimedia.org/r/855618 [15:58:32] 10ops-ulsfo, 10DC-Ops: ulsfo next visit checklist - https://phabricator.wikimedia.org/T322861 (10RobH) p:05Triage→03Medium [15:59:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P39015 and previous config saved to /var/cache/conftool/dbconfig/20221110-155907-marostegui.json [15:59:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart (exit_code=0) rolling restart_daemons on A:maps-replica-eqiad [16:01:02] 10SRE, 10Infrastructure-Foundations, 
10Patch-For-Review, 10cloud-services-team (Kanban): Replace labstore100[67] with clouddumps100[12] - https://phabricator.wikimedia.org/T309346 (10Andrew) [16:01:25] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10aborrero) [16:01:39] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10aborrero) [16:02:11] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T319217 (10Andrew) 05Resolved→03Open a:05Jclark-ctr→03Andrew Apparently cumin still thinks that labstore1007 exists. [16:03:19] (03PS1) 10Ssingh: hiera: set profile::lvs::interface_tweaks for lvs4008 [puppet] - 10https://gerrit.wikimedia.org/r/855620 [16:03:50] !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudgw2002-dev.codfw.wmnet with OS bullseye [16:03:57] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38083/console" [puppet] - 10https://gerrit.wikimedia.org/r/855620 (owner: 10Ssingh) [16:03:58] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin2002 for host cloudgw2002-dev.codfw.wmnet with OS... 
[16:06:15] (03CR) 10Vgutierrez: [C: 03+1] hiera: set profile::lvs::interface_tweaks for lvs4008 [puppet] - 10https://gerrit.wikimedia.org/r/855620 (owner: 10Ssingh) [16:06:46] (03CR) 10Ssingh: [V: 03+1 C: 03+2] hiera: set profile::lvs::interface_tweaks for lvs4008 [puppet] - 10https://gerrit.wikimedia.org/r/855620 (owner: 10Ssingh) [16:06:48] hashar: is there anything going on on jenkins? I see on zuul a lot of jobs but very few in progress [16:07:19] mine have been queued for 10 minutes [16:07:24] yeah there is a large spike :-\ [16:07:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P39016 and previous config saved to /var/cache/conftool/dbconfig/20221110-160745-ladsgroup.json [16:07:56] (03CR) 10Muehlenhoff: [C: 03+1] arclamp: Add role contact information (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/854985 (owner: 10Alexandros Kosiaris) [16:08:09] I see the spike in the queue, but don't see many in progress, so I wonder if there is a problem with the workers [16:08:54] that is due to the `zuul-merger` process which crafts the merge commits of the patches against the tip of the branch [16:09:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P39017 and previous config saved to /var/cache/conftool/dbconfig/20221110-160906-marostegui.json [16:09:17] it will recover [16:14:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T321123)', diff saved to https://phabricator.wikimedia.org/P39018 and previous config saved to /var/cache/conftool/dbconfig/20221110-161413-marostegui.json [16:14:15] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1132.eqiad.wmnet with reason: Maintenance [16:14:18] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [16:14:29] !log 
marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1132.eqiad.wmnet with reason: Maintenance [16:14:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1132 (T321123)', diff saved to https://phabricator.wikimedia.org/P39019 and previous config saved to /var/cache/conftool/dbconfig/20221110-161435-marostegui.json [16:15:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T321123)', diff saved to https://phabricator.wikimedia.org/P39020 and previous config saved to /var/cache/conftool/dbconfig/20221110-161543-marostegui.json [16:16:27] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [16:16:52] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs4008.ulsfo.wmnet with OS buster [16:20:13] (03CR) 10Volans: [V: 03+2 C: 03+2] "just a type ignore, self merging to unblock a release" [software/spicerack] - 10https://gerrit.wikimedia.org/r/855618 (owner: 10Volans) [16:20:25] (03CR) 10Volans: [C: 03+2] netbox: update allowed state transitions [software/spicerack] - 10https://gerrit.wikimedia.org/r/855610 (https://phabricator.wikimedia.org/T320696) (owner: 10Volans) [16:22:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P39021 and previous config saved to /var/cache/conftool/dbconfig/20221110-162251-ladsgroup.json [16:24:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T321130)', diff saved to https://phabricator.wikimedia.org/P39022 and previous config saved to /var/cache/conftool/dbconfig/20221110-162416-marostegui.json [16:24:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2128.codfw.wmnet with reason: Maintenance [16:24:23] T321130: Add column cuc_private to cu_changes on wmf wikis - 
https://phabricator.wikimedia.org/T321130 [16:24:33] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2128.codfw.wmnet with reason: Maintenance [16:24:34] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2094.codfw.wmnet with reason: Maintenance [16:24:47] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2094.codfw.wmnet with reason: Maintenance [16:24:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2128 (T321130)', diff saved to https://phabricator.wikimedia.org/P39023 and previous config saved to /var/cache/conftool/dbconfig/20221110-162453-marostegui.json [16:27:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T321130)', diff saved to https://phabricator.wikimedia.org/P39024 and previous config saved to /var/cache/conftool/dbconfig/20221110-162749-marostegui.json [16:30:01] (03PS2) 10BBlack: Clean up trafficserver::tls and related [puppet] - 10https://gerrit.wikimedia.org/r/849178 [16:30:03] (03PS2) 10BBlack: Remove cache::(text|upload)_envoy remnants [puppet] - 10https://gerrit.wikimedia.org/r/849179 [16:30:17] (03CR) 10BBlack: Clean up trafficserver::tls and related (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/849178 (owner: 10BBlack) [16:30:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P39025 and previous config saved to /var/cache/conftool/dbconfig/20221110-163049-marostegui.json [16:31:11] (03CR) 10JMeybohm: [C: 03+1] sre: fix dashboard links for k8s latency [alerts] - 10https://gerrit.wikimedia.org/r/855608 (owner: 10Filippo Giunchedi) [16:31:32] (03Merged) 10jenkins-bot: netbox: update allowed state transitions [software/spicerack] - 10https://gerrit.wikimedia.org/r/855610 (https://phabricator.wikimedia.org/T320696) (owner: 10Volans) [16:32:29] !log robh@cumin1001 
START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti1033'] [16:33:15] !log robh@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ganeti1033'] [16:33:35] !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti1033'] [16:33:38] (03PS3) 10Volans: json-webrequests-stats: add -t/--time-range [puppet] - 10https://gerrit.wikimedia.org/r/854521 [16:33:40] (03PS1) 10Volans: admin: update permissions for datacenter-ops [puppet] - 10https://gerrit.wikimedia.org/r/855647 [16:34:06] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs4008.ulsfo.wmnet with reason: host reimage [16:35:33] (03CR) 10BBlack: "PCC says NOOP for this and the parent ats_tls patch together, on text@eqiad + upload@ulsfo:" [puppet] - 10https://gerrit.wikimedia.org/r/849179 (owner: 10BBlack) [16:35:36] (03CR) 10Herron: [C: 03+1] netmon: Put the netmon2002 as passive server [puppet] - 10https://gerrit.wikimedia.org/r/854625 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse) [16:37:29] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs4008.ulsfo.wmnet with reason: host reimage [16:37:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T322618)', diff saved to https://phabricator.wikimedia.org/P39026 and previous config saved to /var/cache/conftool/dbconfig/20221110-163758-ladsgroup.json [16:38:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1191.eqiad.wmnet with reason: Maintenance [16:38:03] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [16:38:05] (03CR) 10BBlack: [C: 03+2] Clean up trafficserver::tls and related [puppet] - 10https://gerrit.wikimedia.org/r/849178 (owner: 10BBlack) [16:38:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 6:00:00 on db1191.eqiad.wmnet with reason: Maintenance [16:38:14] (03CR) 10BBlack: [C: 03+2] Remove cache::(text|upload)_envoy remnants [puppet] - 10https://gerrit.wikimedia.org/r/849179 (owner: 10BBlack) [16:38:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1191 (T322618)', diff saved to https://phabricator.wikimedia.org/P39027 and previous config saved to /var/cache/conftool/dbconfig/20221110-163819-ladsgroup.json [16:40:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T322618)', diff saved to https://phabricator.wikimedia.org/P39028 and previous config saved to /var/cache/conftool/dbconfig/20221110-164030-ladsgroup.json [16:41:43] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [16:42:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P39029 and previous config saved to /var/cache/conftool/dbconfig/20221110-164255-marostegui.json [16:44:43] (03PS1) 10Arturo Borrero Gonzalez: wmcs: cleanup SAL log messages [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/855650 [16:44:50] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lvs4008.ulsfo.wmnet with OS buster [16:45:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P39030 and previous config saved to /var/cache/conftool/dbconfig/20221110-164556-marostegui.json [16:52:57] (03PS1) 10Ssingh: hiera: update lvs4008 interface names for buster [puppet] - 10https://gerrit.wikimedia.org/r/855652 [16:53:45] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs4008.ulsfo.wmnet with OS buster [16:54:35] (03CR) 10Vgutierrez: [C: 04-1] "don't forget interfaces.yaml" [puppet] - 
10https://gerrit.wikimedia.org/r/855652 (owner: 10Ssingh) [16:54:52] (03CR) 10Ssingh: [C: 03+2] hiera: update lvs4008 interface names for buster [puppet] - 10https://gerrit.wikimedia.org/r/855652 (owner: 10Ssingh) [16:54:57] sukhe: !! [16:55:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P39031 and previous config saved to /var/cache/conftool/dbconfig/20221110-165536-ladsgroup.json [16:56:06] (03PS2) 10Ssingh: hiera: update lvs4008 interface names for buster [puppet] - 10https://gerrit.wikimedia.org/r/855652 [16:57:33] (03CR) 10Vgutierrez: [C: 03+1] hiera: update lvs4008 interface names for buster [puppet] - 10https://gerrit.wikimedia.org/r/855652 (owner: 10Ssingh) [16:58:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P39032 and previous config saved to /var/cache/conftool/dbconfig/20221110-165802-marostegui.json [16:58:20] vgutierrez: :P [16:58:24] <3 [16:59:47] (03CR) 10Ssingh: [C: 03+2] hiera: update lvs4008 interface names for buster [puppet] - 10https://gerrit.wikimedia.org/r/855652 (owner: 10Ssingh) [17:00:04] jbond and rzl: gettimeofday() says it's time for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221110T1700) [17:00:04] kostajh: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[17:00:18] kostajh: 👋 looking [17:01:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T321123)', diff saved to https://phabricator.wikimedia.org/P39033 and previous config saved to /var/cache/conftool/dbconfig/20221110-170102-marostegui.json [17:01:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1133.eqiad.wmnet with reason: Maintenance [17:01:07] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [17:01:10] kostajh: patch looks pretty straightforward :P will you want me to kick off a manual run for testing, or just merge it and wait for the next daily run? [17:01:18] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1133.eqiad.wmnet with reason: Maintenance [17:01:20] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1134.eqiad.wmnet with reason: Maintenance [17:01:22] rzl: hi! a manual run would be nice [17:01:28] can do [17:01:29] rzl: do you want me to squash the patch into the parent? [17:01:33] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1134.eqiad.wmnet with reason: Maintenance [17:01:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T321123)', diff saved to https://phabricator.wikimedia.org/P39034 and previous config saved to /var/cache/conftool/dbconfig/20221110-170139-marostegui.json [17:01:47] (03PS4) 10Kosta Harlan: GrowthExperiments: Use job queue for refreshUserImpact script [puppet] - 10https://gerrit.wikimedia.org/r/855546 (https://phabricator.wikimedia.org/T322706) [17:02:07] so, squash https://gerrit.wikimedia.org/r/c/operations/puppet/+/855546 -> https://gerrit.wikimedia.org/r/c/operations/puppet/+/854142 ? [17:02:07] oh! 
I didn't even notice the parent, sorry -- happy to merge em both as-is, that's fine [17:02:14] ok cool [17:02:19] I'll leave it for you then [17:02:43] let me do a quick PCC run on 854142 and then I'll get em rolling [17:02:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T321123)', diff saved to https://phabricator.wikimedia.org/P39035 and previous config saved to /var/cache/conftool/dbconfig/20221110-170247-marostegui.json [17:02:58] rzl: it should in theory only run on beta cluster wikis, per https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/855575 [17:03:20] nod [17:04:35] (03CR) 10RLazarus: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38085/console" [puppet] - 10https://gerrit.wikimedia.org/r/854142 (https://phabricator.wikimedia.org/T322541) (owner: 10Gergő Tisza) [17:06:23] pcc looks good, going ahead [17:08:48] (03CR) 10RLazarus: [V: 03+1 C: 03+2] Add GrowthExperiments periodic maintenance scripts for user impact [puppet] - 10https://gerrit.wikimedia.org/r/854142 (https://phabricator.wikimedia.org/T322541) (owner: 10Gergő Tisza) [17:08:52] cool [17:09:18] want to do one test run after merging both, or test them one at a time? 
[17:09:29] rzl: both together, please [17:09:31] I'm guessing one run after merging both but want to check [17:09:33] cool [17:09:39] yeah one run, after merging both [17:09:51] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs4008.ulsfo.wmnet with reason: host reimage [17:09:53] !log robh@cumin1001 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['ganeti1033'] [17:10:01] (03PS5) 10RLazarus: GrowthExperiments: Use job queue for refreshUserImpact script [puppet] - 10https://gerrit.wikimedia.org/r/855546 (https://phabricator.wikimedia.org/T322706) (owner: 10Kosta Harlan) [17:10:06] !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti1033'] [17:10:35] (03CR) 10Btullis: [C: 03+1] "Looks good to me. Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/855059 (https://phabricator.wikimedia.org/T321925) (owner: 10Ottomata) [17:10:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P39036 and previous config saved to /var/cache/conftool/dbconfig/20221110-171043-ladsgroup.json [17:13:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T321130)', diff saved to https://phabricator.wikimedia.org/P39037 and previous config saved to /var/cache/conftool/dbconfig/20221110-171308-marostegui.json [17:13:10] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2137.codfw.wmnet with reason: Maintenance [17:13:13] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [17:13:21] (just waiting for jenkins to finish again after rebasing) [17:13:23] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2137.codfw.wmnet with reason: Maintenance [17:13:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3315 
(T321130)', diff saved to https://phabricator.wikimedia.org/P39038 and previous config saved to /var/cache/conftool/dbconfig/20221110-171329-marostegui.json [17:13:32] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs4008.ulsfo.wmnet with reason: host reimage [17:13:38] there we go [17:13:44] (03CR) 10RLazarus: [C: 03+2] GrowthExperiments: Use job queue for refreshUserImpact script [puppet] - 10https://gerrit.wikimedia.org/r/855546 (https://phabricator.wikimedia.org/T322706) (owner: 10Kosta Harlan) [17:14:17] ack [17:15:32] while I wait for puppet to finish running on mwmaint1002 -- is it okay to test-start the jobs all at once, or should I wait for each to finish before starting the next? any particular order? [17:16:11] let me check [17:16:19] jouncebot: nowandnext [17:16:19] For the next 0 hour(s) and 43 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221110T1700) [17:16:19] In 0 hour(s) and 43 minute(s): Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221110T1800) [17:16:43] rzl: maybe start with growthexperiments-userImpactUpdateRecentlyRegistered ? [17:16:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T321130)', diff saved to https://phabricator.wikimedia.org/P39039 and previous config saved to /var/cache/conftool/dbconfig/20221110-171646-marostegui.json [17:16:55] Amir1: still doing maintenance-jobby stuff but feel free to deploy anything else if you need [17:16:59] if that one looks ok, then do growthexperiments-userImpactUpdateRecentlyEdited [17:17:07] rzl: noted, thanks. 
mine takes a bit [17:17:23] kostajh: will do [17:17:26] rzl: growthexperiments-userImpactDelete should be a no-op, but one never knows :) [17:17:35] that's what makes it fun :D [17:17:44] hehe [17:17:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P39040 and previous config saved to /var/cache/conftool/dbconfig/20221110-171753-marostegui.json [17:18:41] !log rzl@mwmaint1002:~$ sudo systemctl start mediawiki_job_growthexperiments-userImpactUpdateRecentlyRegistered.service # test run for T322706 T322541 [17:18:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:46] T322706: User impact API: Maintenance scripts should defer work to the job queue - https://phabricator.wikimedia.org/T322706 [17:18:47] T322541: UserImpact: Set up maintenance script to run in betalabs and production - https://phabricator.wikimedia.org/T322541 [17:18:57] !log robh@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ganeti1033'] [17:20:01] kostajh: Nov 10 17:19:42 mwmaint1002 systemd[1]: mediawiki_job_growthexperiments-userImpactUpdateRecentlyRegistered.service: Succeeded. [17:20:17] let me know if you want to check logs etc, I'll move on to growthexperiments-userImpactUpdateRecentlyEdited when you're ready [17:20:30] rzl: yeah I'll have a look at the logs [17:20:30] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: fix dashboard links for k8s latency [alerts] - 10https://gerrit.wikimedia.org/r/855608 (owner: 10Filippo Giunchedi) [17:21:31] rzl: hmm, how can I tell if this is running for beta wikis? or is that some separate log? [17:21:40] I'm looking at /var/log/mediawiki/mediawiki_job_growthexperiments-userImpactUpdateRecentlyRegistered/syslog.log [17:21:49] I can tell that the feature flag so this doesn't run in production is working properly :) [17:21:55] but I expect to see output for e.g. 
beta wikis like enwiki [17:22:29] oh, those will run on deployment-mwmaint hosts in WMCS [17:23:02] Ok, then go ahead please [17:23:38] !log rzl@mwmaint1002:~$ sudo systemctl start mediawiki_job_growthexperiments-userImpactUpdateRecentlyEdited.service # test run for T322706 T322541 [17:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:44] kostajh: and done [17:25:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T322618)', diff saved to https://phabricator.wikimedia.org/P39041 and previous config saved to /var/cache/conftool/dbconfig/20221110-172549-ladsgroup.json [17:25:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1194.eqiad.wmnet with reason: Maintenance [17:25:54] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [17:26:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1194.eqiad.wmnet with reason: Maintenance [17:26:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1194 (T322618)', diff saved to https://phabricator.wikimedia.org/P39042 and previous config saved to /var/cache/conftool/dbconfig/20221110-172611-ladsgroup.json [17:26:22] Thanks! [17:26:44] rzl do you know where / when I can view logs for the beta cluster jobs? 
[17:26:49] !log increasing stream throughput to 400mbit, aqs1011-{a,b} & aqs1013-{a,b} -- T307802 [17:26:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:53] T307802: Bootstrap new Cassandra nodes (eqiad) - https://phabricator.wikimedia.org/T307802 [17:27:17] kostajh: I don't, sorry -- https://wikitech.wikimedia.org/wiki/Maintenance_server gives the hostname at least, but you're on your own from there [17:27:48] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission frlog1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T315924 (10Jclark-ctr) Host Removed from rack offline script ran in netbox [17:27:58] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission frlog1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T315924 (10Jclark-ctr) 05Open→03Resolved [17:28:07] rzl thanks! [17:28:19] !log restarting bootstrap of aqs1016-a -- T307802 [17:28:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:51] (03PS1) 10Volans: CHANGELOG: add changelogs for release v5.0.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/855658 [17:31:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P39043 and previous config saved to /var/cache/conftool/dbconfig/20221110-173153-marostegui.json [17:32:08] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission frauth1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T314517 (10Jclark-ctr) 05Open→03Resolved Removed Server from racks and ran offline script [17:32:18] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v5.0.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/855658 (owner: 10Volans) [17:32:25] PROBLEM - cassandra-a service on aqs1016 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:32:35] kostajh: do you want the test 
run for growthexperiments-userImpactDelete in prod, or just call it good from there? [17:33:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P39044 and previous config saved to /var/cache/conftool/dbconfig/20221110-173300-marostegui.json [17:33:55] rzl: test run might be a good idea, sure [17:34:01] PROBLEM - Check systemd state on aqs1016 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:34:27] !log rzl@mwmaint1002:~$ sudo systemctl start mediawiki_job_growthexperiments-userImpactDelete.service # test run for T322706 T322541 [17:34:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:32] T322706: User impact API: Maintenance scripts should defer work to the job queue - https://phabricator.wikimedia.org/T322706 [17:34:32] T322541: UserImpact: Set up maintenance script to run in betalabs and production - https://phabricator.wikimedia.org/T322541 [17:34:37] (03PS1) 10Ladsgroup: wwwportals: Make portal assets also visible in wikivoyage vhost [puppet] - 10https://gerrit.wikimedia.org/r/855659 (https://phabricator.wikimedia.org/T273179) [17:35:19] rzl: lgtm [17:35:25] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs4008.ulsfo.wmnet with OS buster [17:35:25] 👍 [17:35:33] anything else I can do for you? 
[17:35:49] RECOVERY - Check systemd state on aqs1016 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:35:59] RECOVERY - cassandra-a service on aqs1016 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:36:18] !log running sukhe@cumin2002:~$ homer "cr*-ulsfo*" commit "Gerrit 855583: sites.yaml: add lvs4008 (ulsfo hardware refresh)" [17:36:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:14] !log [done] running sukhe@cumin2002:~$ homer "cr*-ulsfo*" commit "Gerrit 855583: sites.yaml: add lvs4008 (ulsfo hardware refresh)" [17:37:15] (MjolnirUpdateFailureRateExceedesThreshold) firing: Data shipping to CirrusSearch in eqiad is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [17:37:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:27] rzl: that is all, thank you! 
[17:37:38] rad, good luck with the beta part [17:38:37] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:39:06] ^ looking [17:39:25] (03CR) 10Ladsgroup: [C: 03+2] wwwportals: Make portal assets also visible in wikivoyage vhost [puppet] - 10https://gerrit.wikimedia.org/r/855659 (https://phabricator.wikimedia.org/T273179) (owner: 10Ladsgroup) [17:41:51] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:42:15] (MjolnirUpdateFailureRateExceedesThreshold) resolved: Data shipping to CirrusSearch in eqiad is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [17:43:23] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/855647 (owner: 10Volans) [17:43:29] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v5.0.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/855658 (owner: 10Volans) [17:44:58] !log dcausse@deploy1002 Started deploy [wikimedia/discovery/analytics@84dd7b5]: T320656: image_suggestions: schedule ad hoc dataset to fix improper suggestions [17:45:03] T320656: [L] List articles appearing in articles with image suggestions - https://phabricator.wikimedia.org/T320656 [17:47:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P39045 and previous config saved to /var/cache/conftool/dbconfig/20221110-174659-marostegui.json [17:47:17] !log dcausse@deploy1002 Finished deploy [wikimedia/discovery/analytics@84dd7b5]: T320656: image_suggestions: schedule ad hoc dataset to fix improper suggestions (duration: 02m 18s) [17:48:07] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T321123)', diff saved to https://phabricator.wikimedia.org/P39046 and previous config saved to /var/cache/conftool/dbconfig/20221110-174806-marostegui.json [17:48:08] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1135.eqiad.wmnet with reason: Maintenance [17:48:11] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [17:48:22] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1135.eqiad.wmnet with reason: Maintenance [17:48:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T321123)', diff saved to https://phabricator.wikimedia.org/P39047 and previous config saved to /var/cache/conftool/dbconfig/20221110-174828-marostegui.json [17:48:30] (03PS1) 10Volans: Upstream release v5.0.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/855663 [17:49:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T321123)', diff saved to https://phabricator.wikimedia.org/P39048 and previous config saved to /var/cache/conftool/dbconfig/20221110-174935-marostegui.json [17:50:27] (03CR) 10Volans: [C: 03+2] Upstream release v5.0.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/855663 (owner: 10Volans) [17:56:56] (03CR) 10Volans: [C: 03+2] admin: update permissions for datacenter-ops [puppet] - 10https://gerrit.wikimedia.org/r/855647 (owner: 10Volans) [17:57:38] (03PS1) 10Ssingh: P:pybal: add lvs4008 to bpg-peer-address [puppet] - 10https://gerrit.wikimedia.org/r/855665 [17:58:04] godog: ok to merge your change in labs/private? 
s/k8_dse/k8s_dse/ [17:58:41] (03PS1) 10Ladsgroup: Bump portals to HEAD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/855666 (https://phabricator.wikimedia.org/T273179) [17:59:18] * volans is bold and assume yes being labs/private [18:00:04] bd808: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221110T1800). [18:00:15] (03Merged) 10jenkins-bot: Upstream release v5.0.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/855663 (owner: 10Volans) [18:00:42] bd808: let me know if you're doing deployments or when it's done (if you're planning to) [18:01:10] (03CR) 10Jforrester: "🎉" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/855666 (https://phabricator.wikimedia.org/T273179) (owner: 10Ladsgroup) [18:01:13] volans: yes, thank you! totally forgot about the extra step [18:01:19] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/855665 (owner: 10Ssingh) [18:01:26] !log uploaded python3-gjson_0.3.0 to apt.wikimedia.org bullseye-wikimedia,unstable-wikimedia [18:01:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:30] no prob [18:01:56] (03PS1) 10Giuseppe Lavagetto: modules/base.kubernetes: add module [deployment-charts] - 10https://gerrit.wikimedia.org/r/855667 [18:01:58] (03PS1) 10Giuseppe Lavagetto: [WIP] Add rake task to perform basic conversions [deployment-charts] - 10https://gerrit.wikimedia.org/r/855668 [18:02:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T321130)', diff saved to https://phabricator.wikimedia.org/P39049 and previous config saved to /var/cache/conftool/dbconfig/20221110-180206-marostegui.json [18:02:08] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2157.codfw.wmnet with reason: Maintenance [18:02:10] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [18:02:22] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2157.codfw.wmnet with reason: Maintenance [18:02:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2157 (T321130)', diff saved to https://phabricator.wikimedia.org/P39050 and previous config saved to /var/cache/conftool/dbconfig/20221110-180228-marostegui.json [18:04:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P39051 and previous config saved to /var/cache/conftool/dbconfig/20221110-180442-marostegui.json [18:05:04] (03PS1) 10Ebernhardson: cirrus: Increase small cluster heap memory from 8G to 10G [puppet] - 10https://gerrit.wikimedia.org/r/855673 [18:05:19] (03CR) 10Herron: dispatch: sync user role and info from LDAP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/852992 
(https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [18:05:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T321130)', diff saved to https://phabricator.wikimedia.org/P39052 and previous config saved to /var/cache/conftool/dbconfig/20221110-180543-marostegui.json [18:05:56] !log uploaded spicerack_5.0.0 to apt.wikimedia.org bullseye-wikimedia [18:05:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q1:rack/setup/install ganeti103[34] - https://phabricator.wikimedia.org/T314303 (10RobH) [18:06:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q1:rack/setup/install ganeti103[34] - https://phabricator.wikimedia.org/T314303 (10RobH) [18:09:41] !log upgrading spicerack to 5.0.0 on cumin hosts [18:09:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:35] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/855666 (https://phabricator.wikimedia.org/T273179) (owner: 10Ladsgroup) [18:12:00] (03CR) 10Ssingh: [C: 03+2] P:pybal: add lvs4008 to bpg-peer-address [puppet] - 10https://gerrit.wikimedia.org/r/855665 (owner: 10Ssingh) [18:12:23] (03CR) 10Dzahn: [C: 03+2] "yea, phab* hosts generally don't have it running anymore. the dedicated VM aphlict1001 does. So far there is only one. 
But we should creat" [puppet] - 10https://gerrit.wikimedia.org/r/855542 (owner: 10Muehlenhoff) [18:12:43] (03Merged) 10jenkins-bot: Bump portals to HEAD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/855666 (https://phabricator.wikimedia.org/T273179) (owner: 10Ladsgroup) [18:12:58] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:855666|Bump portals to HEAD (T273179)]] [18:13:02] T273179: Update the front-page of Wikimedia projects - https://phabricator.wikimedia.org/T273179 [18:13:19] !log ladsgroup@deploy1002 ladsgroup and ladsgroup: Backport for [[gerrit:855666|Bump portals to HEAD (T273179)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [18:13:58] on mwdebug https://usercontent.irccloud-cdn.com/file/HrGNGqqu/image.png [18:14:04] moving forward [18:15:11] !log dcausse@deploy1002 Started deploy [wikimedia/discovery/analytics@a030f5f]: T320656: convert_to_esbulk: fix typo in config [18:15:15] T320656: [L] List articles appearing in articles with image suggestions - https://phabricator.wikimedia.org/T320656 [18:15:53] (03CR) 10Dzahn: [C: 03+1] "@Vgutierrez, wanna deploy this? The comment you left has been fixed." [puppet] - 10https://gerrit.wikimedia.org/r/829122 (owner: 10Muehlenhoff) [18:16:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [18:16:45] (03CR) 10Dzahn: [C: 03+1] "Jelto, should we merge this or not yet?" 
[puppet] - 10https://gerrit.wikimedia.org/r/829747 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [18:17:33] !log dcausse@deploy1002 Finished deploy [wikimedia/discovery/analytics@a030f5f]: T320656: convert_to_esbulk: fix typo in config (duration: 02m 22s) [18:17:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [18:17:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [18:18:13] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:855666|Bump portals to HEAD (T273179)]] (duration: 05m 14s) [18:18:17] T273179: Update the front-page of Wikimedia projects - https://phabricator.wikimedia.org/T273179 [18:18:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [18:19:34] RECOVERY - IPMI Sensor Status on restbase1018 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [18:19:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P39053 and previous config saved to /var/cache/conftool/dbconfig/20221110-181948-marostegui.json [18:20:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P39054 and previous config saved to /var/cache/conftool/dbconfig/20221110-182049-marostegui.json [18:20:52] (03CR) 10Dzahn: "Hi Hannah, thanks for the review and comments. 
So..if wikimedia.wansec.com is no longer resolving (and if it was as well, I guess) then wh" [puppet] - 10https://gerrit.wikimedia.org/r/855096 (owner: 10Dzahn) [18:21:34] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:22:33] !log ladsgroup@deploy1002 Synchronized portals/wikipedia.org/assets: (no justification provided) (duration: 03m 46s) [18:23:02] (03CR) 10BCornwall: [C: 03+2] prometheus: Rename ats_ metrics to trafficserver_ [puppet] - 10https://gerrit.wikimedia.org/r/855494 (https://phabricator.wikimedia.org/T292815) (owner: 10Vgutierrez) [18:23:33] (03PS2) 10BCornwall: prometheus: Rename ats_ metrics to trafficserver_ [puppet] - 10https://gerrit.wikimedia.org/r/855494 (https://phabricator.wikimedia.org/T292815) (owner: 10Vgutierrez) [18:26:11] !log ladsgroup@deploy1002 Synchronized portals: (no justification provided) (duration: 03m 38s) [18:26:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T322618)', diff saved to https://phabricator.wikimedia.org/P39055 and previous config saved to /var/cache/conftool/dbconfig/20221110-182627-ladsgroup.json [18:26:31] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [18:27:41] (03PS4) 10Dzahn: dumps/distribution: fix values that don't fit into data types [puppet] - 10https://gerrit.wikimedia.org/r/855096 [18:30:49] (03CR) 10Dzahn: "Hi again, so let me go through the individual issues you mentioned:" [puppet] - 10https://gerrit.wikimedia.org/r/855096 (owner: 10Dzahn) [18:31:26] James_F: now we are left with wikimedia.org, what to do with this oddball? [18:32:58] (03CR) 10Dzahn: "re: hostname: 'wikimedia.wansec.com'. The third option is to leave it as it is now. 
For that to validate it doesn't have to actually resol" [puppet] - 10https://gerrit.wikimedia.org/r/855096 (owner: 10Dzahn) [18:33:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T318605)', diff saved to https://phabricator.wikimedia.org/P39056 and previous config saved to /var/cache/conftool/dbconfig/20221110-183340-ladsgroup.json [18:33:47] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [18:33:58] (03CR) 10David Caro: [C: 03+1] "ok from me, I'm curious though, is this a standard anywhere or just a pet peeve?" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/855650 (owner: 10Arturo Borrero Gonzalez) [18:34:25] (03CR) 10Dzahn: [C: 03+2] "gotcha! thanks" [puppet] - 10https://gerrit.wikimedia.org/r/855147 (https://phabricator.wikimedia.org/T294276) (owner: 10Dzahn) [18:34:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T321123)', diff saved to https://phabricator.wikimedia.org/P39057 and previous config saved to /var/cache/conftool/dbconfig/20221110-183455-marostegui.json [18:34:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1139.eqiad.wmnet with reason: Maintenance [18:35:00] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [18:35:11] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1139.eqiad.wmnet with reason: Maintenance [18:35:13] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1140.eqiad.wmnet with reason: Maintenance [18:35:25] (03CR) 10David Caro: [C: 03+1] wmcs: cleanup SAL log messages (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/855650 (owner: 10Arturo Borrero Gonzalez) [18:35:26] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1140.eqiad.wmnet with reason: 
Maintenance [18:35:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1169.eqiad.wmnet with reason: Maintenance [18:35:42] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1169.eqiad.wmnet with reason: Maintenance [18:35:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T321123)', diff saved to https://phabricator.wikimedia.org/P39058 and previous config saved to /var/cache/conftool/dbconfig/20221110-183548-marostegui.json [18:35:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P39059 and previous config saved to /var/cache/conftool/dbconfig/20221110-183556-marostegui.json [18:36:01] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on lvs4008.ulsfo.wmnet with reason: downtimed as we are resolving issues with LVS configuration [18:36:05] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs4008.ulsfo.wmnet with reason: downtimed as we are resolving issues with LVS configuration [18:36:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T321123)', diff saved to https://phabricator.wikimedia.org/P39060 and previous config saved to /var/cache/conftool/dbconfig/20221110-183655-marostegui.json [18:39:15] (03PS4) 10David Caro: wmcs: add socks proxy support to wmcs cookbooks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852960 (https://phabricator.wikimedia.org/T319426) [18:39:17] (03PS1) 10David Caro: wmcs.ceph.set_cluster_in_maintenance: fix bad parameter [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/855679 [18:41:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P39061 and previous config saved to /var/cache/conftool/dbconfig/20221110-184133-ladsgroup.json [18:41:49] 
(CR) David Caro: [C: -1] wmcs: add socks proxy support to wmcs cookbooks (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/852960 (https://phabricator.wikimedia.org/T319426) (owner: David Caro) [18:41:51] (CR) Dzahn: dumps/distribution: add more data types to parameters (1 comment) [puppet] - https://gerrit.wikimedia.org/r/852260 (owner: Dzahn) [18:42:53] (CR) CI reject: [V: -1] wmcs.ceph.set_cluster_in_maintenance: fix bad parameter [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/855679 (owner: David Caro) [18:42:57] (PS3) Ottomata: Create platform-eng-deployers group for deploying airflow platform_eng [puppet] - https://gerrit.wikimedia.org/r/855059 (https://phabricator.wikimedia.org/T321925) [18:43:14] (CR) CI reject: [V: -1] wmcs: add socks proxy support to wmcs cookbooks [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/852960 (https://phabricator.wikimedia.org/T319426) (owner: David Caro) [18:43:37] (PS1) BBlack: [WIP] Arrays for lvs all_class_hosts [puppet] - https://gerrit.wikimedia.org/r/855682 [18:46:15] (CR) Ottomata: [C: +2] Create platform-eng-deployers group for deploying airflow platform_eng [puppet] - https://gerrit.wikimedia.org/r/855059 (https://phabricator.wikimedia.org/T321925) (owner: Ottomata) [18:48:15] (PS2) Ssingh: [WIP] Arrays for lvs all_class_hosts [puppet] - https://gerrit.wikimedia.org/r/855682 (owner: BBlack) [18:48:25] (PS1) Ottomata: Declare platform-eng-deployers in profile::admin::groups in kubernetes.yaml [puppet] - https://gerrit.wikimedia.org/r/855686 (https://phabricator.wikimedia.org/T321925) [18:48:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P39062 and previous config saved to /var/cache/conftool/dbconfig/20221110-184847-ladsgroup.json [18:48:59] (CR) Ssingh: [V: +1] "PCC SUCCESS (DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38090/console" [puppet] - https://gerrit.wikimedia.org/r/855682 (owner: BBlack) [18:49:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T318605)', diff saved to https://phabricator.wikimedia.org/P39063 and previous config saved to /var/cache/conftool/dbconfig/20221110-184917-ladsgroup.json [18:49:22] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [18:51:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T321130)', diff saved to https://phabricator.wikimedia.org/P39064 and previous config saved to /var/cache/conftool/dbconfig/20221110-185103-marostegui.json [18:51:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2171.codfw.wmnet with reason: Maintenance [18:51:08] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [18:51:16] (CR) CI reject: [V: -1] [WIP] Arrays for lvs all_class_hosts [puppet] - https://gerrit.wikimedia.org/r/855682 (owner: BBlack) [18:51:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2171.codfw.wmnet with reason: Maintenance [18:51:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3315 (T321130)', diff saved to https://phabricator.wikimedia.org/P39065 and previous config saved to /var/cache/conftool/dbconfig/20221110-185135-marostegui.json [18:52:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P39066 and previous config saved to /var/cache/conftool/dbconfig/20221110-185202-marostegui.json [18:52:17] (CR) Ottomata: [C: +2] Declare platform-eng-deployers in profile::admin::groups in kubernetes.yaml [puppet] - https://gerrit.wikimedia.org/r/855686 
(https://phabricator.wikimedia.org/T321925) (owner: Ottomata) [18:52:51] (CR) Dzahn: "https://gerrit.wikimedia.org/r/q/topic:data-types (I only did the custom data type "dumps::mirror" though after John suggested it)" [puppet] - https://gerrit.wikimedia.org/r/852260 (owner: Dzahn) [18:54:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T321130)', diff saved to https://phabricator.wikimedia.org/P39067 and previous config saved to /var/cache/conftool/dbconfig/20221110-185450-marostegui.json [18:55:52] !log dzahn@cumin2002 START - Cookbook sre.hosts.decommission for hosts phab2001.codfw.wmnet [18:56:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P39068 and previous config saved to /var/cache/conftool/dbconfig/20221110-185640-ladsgroup.json [18:57:46] (PS1) Dzahn: phabricator: rm hierdata/hosts/phab2001.yaml [puppet] - https://gerrit.wikimedia.org/r/855688 (https://phabricator.wikimedia.org/T322250) [18:58:22] !log phabricator - running decom cookbook on phab2001 - T322250 [18:58:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:26] T322250: decom phab2001 - https://phabricator.wikimedia.org/T322250 [19:02:33] (KeyholderUnarmed) firing: 18 unarmed Keyholder key(s) on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [19:03:02] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [19:03:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P39069 and previous config saved to /var/cache/conftool/dbconfig/20221110-190353-ladsgroup.json [19:04:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P39070 and previous config saved to 
/var/cache/conftool/dbconfig/20221110-190424-ladsgroup.json [19:05:25] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:05:26] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts phab2001.codfw.wmnet [19:07:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P39071 and previous config saved to /var/cache/conftool/dbconfig/20221110-190708-marostegui.json [19:08:52] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [19:09:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P39072 and previous config saved to /var/cache/conftool/dbconfig/20221110-190957-marostegui.json [19:11:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T322618)', diff saved to https://phabricator.wikimedia.org/P39073 and previous config saved to /var/cache/conftool/dbconfig/20221110-191146-ladsgroup.json [19:11:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1202.eqiad.wmnet with reason: Maintenance [19:11:53] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [19:12:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1202.eqiad.wmnet with reason: Maintenance [19:12:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1202 (T322618)', diff saved to https://phabricator.wikimedia.org/P39074 and previous config saved to /var/cache/conftool/dbconfig/20221110-191208-ladsgroup.json [19:12:38] (CR) Htriedman: Varnish analytics: support differential privacy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/824769 
(https://phabricator.wikimedia.org/T315676) (owner: Isaac Johnson) [19:14:06] (CR) RLazarus: json-webrequests-stats: add -t/--time-range (4 comments) [puppet] - https://gerrit.wikimedia.org/r/854521 (owner: Volans) [19:14:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T322618)', diff saved to https://phabricator.wikimedia.org/P39075 and previous config saved to /var/cache/conftool/dbconfig/20221110-191418-ladsgroup.json [19:14:36] (PS1) BBlack: Revert "P:cumin::master: drop low-traffic from PoP sites" [puppet] - https://gerrit.wikimedia.org/r/855691 (https://phabricator.wikimedia.org/T264132) [19:14:38] (PS1) BBlack: Revert "P:cumin::master: Add aliases for lvs traffic classes" [puppet] - https://gerrit.wikimedia.org/r/855692 (https://phabricator.wikimedia.org/T264132) [19:14:40] (PS1) BBlack: Revert "P:lvs::configuration: Store all site data in an accessible structure" [puppet] - https://gerrit.wikimedia.org/r/855693 (https://phabricator.wikimedia.org/T264132) [19:14:42] (PS1) BBlack: Revert "profile::lvs::configuration: Fix typo" [puppet] - https://gerrit.wikimedia.org/r/855694 (https://phabricator.wikimedia.org/T264132) [19:14:44] (PS1) BBlack: Revert "P:lvs::configuration: Dont alert for missing lvs definitions" [puppet] - https://gerrit.wikimedia.org/r/855695 (https://phabricator.wikimedia.org/T264132) [19:14:50] (PS1) BBlack: Revert "P:lvs::configueration: move classification to hiera and add error checks" [puppet] - https://gerrit.wikimedia.org/r/855696 (https://phabricator.wikimedia.org/T264132) [19:15:11] win 14 [19:15:47] (PS3) Andrea Denisse: netmon: Open LibreNMS port for netmon2002. 
[puppet] - https://gerrit.wikimedia.org/r/854951 (https://phabricator.wikimedia.org/T315523) [19:16:02] (PS5) David Caro: wmcs: add socks proxy support to wmcs cookbooks [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/852960 (https://phabricator.wikimedia.org/T319426) [19:16:04] (PS2) David Caro: wmcs.ceph.set_cluster_in_maintenance: fix bad parameter [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/855679 [19:17:46] (CR) Dzahn: [C: +2] "cookbook finished" [puppet] - https://gerrit.wikimedia.org/r/855688 (https://phabricator.wikimedia.org/T322250) (owner: Dzahn) [19:18:32] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti1033.eqiad.wmnet with OS bullseye [19:18:38] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q1:rack/setup/install ganeti103[34] - https://phabricator.wikimedia.org/T314303 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host ganeti1033.eqiad.wmnet with OS bullseye [19:19:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T318605)', diff saved to https://phabricator.wikimedia.org/P39076 and previous config saved to /var/cache/conftool/dbconfig/20221110-191900-ladsgroup.json [19:19:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance [19:19:07] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [19:19:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance [19:19:30] (CR) CI reject: [V: -1] netmon: Open LibreNMS port for netmon2002. 
[puppet] - https://gerrit.wikimedia.org/r/854951 (https://phabricator.wikimedia.org/T315523) (owner: Andrea Denisse) [19:19:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P39077 and previous config saved to /var/cache/conftool/dbconfig/20221110-191930-ladsgroup.json [19:20:38] (PS6) David Caro: wmcs: add socks proxy support to wmcs cookbooks [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/852960 (https://phabricator.wikimedia.org/T319426) [19:20:40] (CR) David Caro: wmcs: add socks proxy support to wmcs cookbooks (2 comments) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/852960 (https://phabricator.wikimedia.org/T319426) (owner: David Caro) [19:20:42] (PS3) David Caro: wmcs.ceph.set_cluster_in_maintenance: fix bad parameter [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/855679 [19:22:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T321123)', diff saved to https://phabricator.wikimedia.org/P39078 and previous config saved to /var/cache/conftool/dbconfig/20221110-192215-marostegui.json [19:22:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1184.eqiad.wmnet with reason: Maintenance [19:22:20] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [19:22:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1184.eqiad.wmnet with reason: Maintenance [19:22:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T321123)', diff saved to https://phabricator.wikimedia.org/P39079 and previous config saved to /var/cache/conftool/dbconfig/20221110-192236-marostegui.json [19:22:42] (PS1) Dzahn: site: remove phab2001 [puppet] - https://gerrit.wikimedia.org/r/855697 (https://phabricator.wikimedia.org/T322250) [19:23:26] (CR) Dzahn: [C: +2] 
site: remove phab2001 [puppet] - https://gerrit.wikimedia.org/r/855697 (https://phabricator.wikimedia.org/T322250) (owner: Dzahn) [19:23:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T321123)', diff saved to https://phabricator.wikimedia.org/P39080 and previous config saved to /var/cache/conftool/dbconfig/20221110-192343-marostegui.json [19:24:10] (CR) David Caro: wmcs: add socks proxy support to wmcs cookbooks (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/852960 (https://phabricator.wikimedia.org/T319426) (owner: David Caro) [19:25:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P39081 and previous config saved to /var/cache/conftool/dbconfig/20221110-192503-marostegui.json [19:25:06] (CR) BBlack: [C: +2] Revert "P:cumin::master: drop low-traffic from PoP sites" [puppet] - https://gerrit.wikimedia.org/r/855691 (https://phabricator.wikimedia.org/T264132) (owner: BBlack) [19:25:11] (CR) BBlack: [C: +2] Revert "P:cumin::master: Add aliases for lvs traffic classes" [puppet] - https://gerrit.wikimedia.org/r/855692 (https://phabricator.wikimedia.org/T264132) (owner: BBlack) [19:25:14] (CR) BBlack: [C: +2] Revert "P:lvs::configuration: Store all site data in an accessible structure" [puppet] - https://gerrit.wikimedia.org/r/855693 (https://phabricator.wikimedia.org/T264132) (owner: BBlack) [19:25:17] (CR) BBlack: [C: +2] Revert "profile::lvs::configuration: Fix typo" [puppet] - https://gerrit.wikimedia.org/r/855694 (https://phabricator.wikimedia.org/T264132) (owner: BBlack) [19:25:20] (CR) BBlack: [C: +2] Revert "P:lvs::configuration: Dont alert for missing lvs definitions" [puppet] - https://gerrit.wikimedia.org/r/855695 (https://phabricator.wikimedia.org/T264132) (owner: BBlack) [19:25:24] (CR) BBlack: [C: +2] Revert "P:lvs::configueration: move 
classification to hiera and add error checks" [puppet] - https://gerrit.wikimedia.org/r/855696 (https://phabricator.wikimedia.org/T264132) (owner: BBlack) [19:27:52] (CR) Volans: "Thanks for the replies, I will make the changes. Some immediate reply inline." [puppet] - https://gerrit.wikimedia.org/r/854521 (owner: Volans) [19:29:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P39082 and previous config saved to /var/cache/conftool/dbconfig/20221110-192925-ladsgroup.json [19:29:41] (CR) CI reject: [V: -1] wmcs: add socks proxy support to wmcs cookbooks [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/852960 (https://phabricator.wikimedia.org/T319426) (owner: David Caro) [19:29:43] (CR) CI reject: [V: -1] wmcs.ceph.set_cluster_in_maintenance: fix bad parameter [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/855679 (owner: David Caro) [19:31:28] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1033.eqiad.wmnet with reason: host reimage [19:34:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T318605)', diff saved to https://phabricator.wikimedia.org/P39083 and previous config saved to /var/cache/conftool/dbconfig/20221110-193437-ladsgroup.json [19:34:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2136.codfw.wmnet with reason: Maintenance [19:34:43] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [19:34:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2136.codfw.wmnet with reason: Maintenance [19:34:54] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1033.eqiad.wmnet with reason: host reimage [19:35:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2136 (T318605)', 
diff saved to https://phabricator.wikimedia.org/P39084 and previous config saved to /var/cache/conftool/dbconfig/20221110-193459-ladsgroup.json [19:35:50] (PS1) Ssingh: P::lvs::configuration: update lvs4008 config for recent reverts [puppet] - https://gerrit.wikimedia.org/r/855700 [19:37:31] (PS5) CDanis: haproxy: concurrency tracking as discussed [puppet] - https://gerrit.wikimedia.org/r/855098 (https://phabricator.wikimedia.org/T306580) [19:38:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P39085 and previous config saved to /var/cache/conftool/dbconfig/20221110-193850-marostegui.json [19:40:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T321130)', diff saved to https://phabricator.wikimedia.org/P39086 and previous config saved to /var/cache/conftool/dbconfig/20221110-194009-marostegui.json [19:40:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2178.codfw.wmnet with reason: Maintenance [19:40:14] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [19:40:25] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2178.codfw.wmnet with reason: Maintenance [19:40:30] (PS6) CDanis: haproxy: concurrency tracking as discussed [puppet] - https://gerrit.wikimedia.org/r/855098 (https://phabricator.wikimedia.org/T306580) [19:40:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2178 (T321130)', diff saved to https://phabricator.wikimedia.org/P39087 and previous config saved to /var/cache/conftool/dbconfig/20221110-194031-marostegui.json [19:41:07] (CR) BBlack: [C: +1] P::lvs::configuration: update lvs4008 config for recent reverts [puppet] - https://gerrit.wikimedia.org/r/855700 (owner: Ssingh) [19:41:15] (CR) Ssingh: [C: +2] P::lvs::configuration: update 
lvs4008 config for recent reverts [puppet] - https://gerrit.wikimedia.org/r/855700 (owner: Ssingh) [19:42:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T321130)', diff saved to https://phabricator.wikimedia.org/P39088 and previous config saved to /var/cache/conftool/dbconfig/20221110-194224-marostegui.json [19:44:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P39089 and previous config saved to /var/cache/conftool/dbconfig/20221110-194431-ladsgroup.json [19:45:02] (PS2) Ryan Kemper: [elastic,open]search: rip out unnecessary jvm options [puppet] - https://gerrit.wikimedia.org/r/838253 [19:47:21] ops-codfw, decommission-hardware: decommission phab2001.codfw.wmnet - https://phabricator.wikimedia.org/T322880 (Dzahn) [19:48:11] ops-codfw, decommission-hardware: decommission phab2001.codfw.wmnet - https://phabricator.wikimedia.org/T322880 (Dzahn) decom cookbook has finished. this is https://netbox.wikimedia.org/dcim/devices/1543/ [19:49:32] ops-codfw, decommission-hardware: decommission phab2001.codfw.wmnet - https://phabricator.wikimedia.org/T322880 (Dzahn) [19:51:09] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [19:51:51] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1033.eqiad.wmnet with OS bullseye [19:51:56] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q1:rack/setup/install ganeti103[34] - https://phabricator.wikimedia.org/T314303 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host ganeti1033.eqiad.wmnet with OS bullseye completed: - ganeti1033 (**PASS*... 
[19:52:19] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:52:22] (PS7) CDanis: haproxy: concurrency tracking as discussed [puppet] - https://gerrit.wikimedia.org/r/855098 (https://phabricator.wikimedia.org/T306580) [19:52:32] (CR) RLazarus: json-webrequests-stats: add -t/--time-range (1 comment) [puppet] - https://gerrit.wikimedia.org/r/854521 (owner: Volans) [19:53:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P39090 and previous config saved to /var/cache/conftool/dbconfig/20221110-195357-marostegui.json [19:54:44] !log netbox - deleting special case phab2001-vcs.codfw.wmnet IPv4 (10.192.32.149) and IPv6 (2620:0:860:103:10:192:32:149) - T296022 - T322250 [19:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:49] T322250: decom phab2001 - https://phabricator.wikimedia.org/T322250 [19:54:50] T296022: Deprecate git-ssh service on phabricator.wikimedia.org - https://phabricator.wikimedia.org/T296022 [19:55:09] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [19:55:22] (PS3) Ryan Kemper: [elastic,open]search: rip out unnecessary jvm options [puppet] - https://gerrit.wikimedia.org/r/838253 [19:55:44] (PS7) David Caro: wmcs: add socks proxy support to wmcs cookbooks [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/852960 (https://phabricator.wikimedia.org/T319426) [19:55:46] (PS4) David Caro: wmcs.ceph.set_cluster_in_maintenance: fix bad parameter [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/855679 [19:57:25] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:57:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P39091 and previous config saved to /var/cache/conftool/dbconfig/20221110-195731-marostegui.json [19:58:56] ops-codfw, 
decommission-hardware: decommission phab2001.codfw.wmnet - https://phabricator.wikimedia.org/T322880 (Dzahn) purchase date 2016-03-24 [19:59:16] ops-codfw, decommission-hardware, serviceops-collab: decommission phab2001.codfw.wmnet - https://phabricator.wikimedia.org/T322880 (Dzahn) [19:59:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T322618)', diff saved to https://phabricator.wikimedia.org/P39092 and previous config saved to /var/cache/conftool/dbconfig/20221110-195938-ladsgroup.json [19:59:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [19:59:44] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [19:59:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [20:00:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2098.codfw.wmnet with reason: Maintenance [20:00:18] SRE, Znuny, serviceops-collab: Convert glam@wikimedia.org OTRS into a Google Group - https://phabricator.wikimedia.org/T233843 (Dzahn) a:Astinson [20:00:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2098.codfw.wmnet with reason: Maintenance [20:00:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2100.codfw.wmnet with reason: Maintenance [20:00:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2100.codfw.wmnet with reason: Maintenance [20:00:57] SRE, ops-eqiad, decommission-hardware, serviceops-radar: Decommission wtp10[25-48].eqiad.wmnet - https://phabricator.wikimedia.org/T317025 (Jclark-ctr) [20:01:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2108.codfw.wmnet with reason: Maintenance 
[20:01:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2108.codfw.wmnet with reason: Maintenance [20:01:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2108 (T322618)', diff saved to https://phabricator.wikimedia.org/P39093 and previous config saved to /var/cache/conftool/dbconfig/20221110-200125-ladsgroup.json [20:02:04] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 90, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:03:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T322618)', diff saved to https://phabricator.wikimedia.org/P39094 and previous config saved to /var/cache/conftool/dbconfig/20221110-200344-ladsgroup.json [20:04:49] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q1:rack/setup/install ganeti103[34] - https://phabricator.wikimedia.org/T314303 (RobH) [20:05:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [20:05:45] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q1:rack/setup/install ganeti103[34] - https://phabricator.wikimedia.org/T314303 (RobH) Open→Resolved @MoritzMuehlenhoff : ganeti1033 is all ready for you, resolving this setup task. 
[20:06:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [20:06:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [20:07:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [20:09:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T321123)', diff saved to https://phabricator.wikimedia.org/P39095 and previous config saved to /var/cache/conftool/dbconfig/20221110-200903-marostegui.json [20:09:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1186.eqiad.wmnet with reason: Maintenance [20:09:08] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [20:09:18] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1186.eqiad.wmnet with reason: Maintenance [20:09:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1186 (T321123)', diff saved to https://phabricator.wikimedia.org/P39096 and previous config saved to /var/cache/conftool/dbconfig/20221110-200924-marostegui.json [20:10:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T321123)', diff saved to https://phabricator.wikimedia.org/P39097 and previous config saved to /var/cache/conftool/dbconfig/20221110-201032-marostegui.json [20:11:18] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 201 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:11:33] (PS4) Ryan Kemper: [elastic,open]search: rip out unnecessary jvm options [puppet] - https://gerrit.wikimedia.org/r/838253 [20:11:35] (PS7) Ryan Kemper: elastic: change java GC options to default for ES7 [puppet] - 
https://gerrit.wikimedia.org/r/838248 (https://phabricator.wikimedia.org/T319021) (owner: Bking) [20:12:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P39098 and previous config saved to /var/cache/conftool/dbconfig/20221110-201237-marostegui.json [20:12:42] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 20 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:13:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:18:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:18:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P39099 and previous config saved to /var/cache/conftool/dbconfig/20221110-201851-ladsgroup.json [20:25:19] (PS10) CDanis: No-op change. 
Replace the idea of stickycounters with actions [puppet] - https://gerrit.wikimedia.org/r/855089 (https://phabricator.wikimedia.org/T306580) [20:25:21] (PS8) CDanis: haproxy: concurrency tracking as discussed [puppet] - https://gerrit.wikimedia.org/r/855098 (https://phabricator.wikimedia.org/T306580) [20:25:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P39100 and previous config saved to /var/cache/conftool/dbconfig/20221110-202539-marostegui.json [20:27:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T321130)', diff saved to https://phabricator.wikimedia.org/P39101 and previous config saved to /var/cache/conftool/dbconfig/20221110-202744-marostegui.json [20:27:50] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [20:28:40] (PS9) CDanis: haproxy: concurrency tracking as discussed [puppet] - https://gerrit.wikimedia.org/r/855098 (https://phabricator.wikimedia.org/T306580) [20:32:08] (CR) CDanis: [C: +2] haproxy: concurrency tracking as discussed [puppet] - https://gerrit.wikimedia.org/r/855098 (https://phabricator.wikimedia.org/T306580) (owner: CDanis) [20:32:15] (CR) CDanis: [C: +2] No-op change. 
Replace the idea of stickycounters with actions [puppet] - https://gerrit.wikimedia.org/r/855089 (https://phabricator.wikimedia.org/T306580) (owner: CDanis) [20:32:41] !log dancy@deploy1002 Installing scap version "4.28.0" for 559 hosts [20:32:50] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:33:10] !log ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕞🍵 sudo cumin A:cp 'disable-puppet T306580' [20:33:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P39102 and previous config saved to /var/cache/conftool/dbconfig/20221110-203357-ladsgroup.json [20:34:08] (PS12) Xcollazo: Modify jupyterhub config to point to conda-analytics instead of anaconda-wmf. 
[puppet] - 10https://gerrit.wikimedia.org/r/843959 (https://phabricator.wikimedia.org/T321088) [20:35:01] !log ✔️ cdanis@cp2027.codfw.wmnet ~ 🕞🍵 sudo run-puppet-agent --enable T306580 [20:35:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:16] !log ✔️ cdanis@cp3052.esams.wmnet ~ 🕞🍵 sudo run-puppet-agent --enable T306580 [20:36:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:54] !log ✔️ cdanis@cp3053.esams.wmnet ~ 🕞🍵 sudo run-puppet-agent --enable T306580 [20:36:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:17] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [20:40:25] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:40:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P39103 and previous config saved to /var/cache/conftool/dbconfig/20221110-204045-marostegui.json [20:41:31] !log dancy@deploy1002 Installing scap version "4.28.0" for 559 hosts [20:42:06] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [20:42:42] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:43:13] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:47:10] (03PS5) 10Ryan Kemper: [elastic,open]search: rip out unnecessary jvm options [puppet] - 10https://gerrit.wikimedia.org/r/838253 [20:47:12] (03PS8) 10Ryan Kemper: elastic: change java GC options to default for ES7 [puppet] - 10https://gerrit.wikimedia.org/r/838248 (https://phabricator.wikimedia.org/T319021) (owner: 10Bking) [20:49:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T322618)', diff saved to https://phabricator.wikimedia.org/P39104 and previous config saved to 
/var/cache/conftool/dbconfig/20221110-204904-ladsgroup.json [20:49:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2120.codfw.wmnet with reason: Maintenance [20:49:09] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [20:49:16] (03CR) 10Ryan Kemper: [elastic,open]search: rip out unnecessary jvm options (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/838253 (owner: 10Ryan Kemper) [20:49:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2120.codfw.wmnet with reason: Maintenance [20:49:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2120 (T322618)', diff saved to https://phabricator.wikimedia.org/P39105 and previous config saved to /var/cache/conftool/dbconfig/20221110-204925-ladsgroup.json [20:51:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T322618)', diff saved to https://phabricator.wikimedia.org/P39106 and previous config saved to /var/cache/conftool/dbconfig/20221110-205145-ladsgroup.json [20:52:04] (03PS1) 10Ebernhardson: DNM: Test gerrit auto-replies [puppet] - 10https://gerrit.wikimedia.org/r/855710 [20:53:25] (03PS2) 10Ebernhardson: DNM: Test gerrit auto-replies [puppet] - 10https://gerrit.wikimedia.org/r/855710 [20:53:27] (03CR) 10Ebernhardson: DNM: Test gerrit auto-replies (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/855710 (owner: 10Ebernhardson) [20:54:16] (03Abandoned) 10Ebernhardson: DNM: Test gerrit auto-replies [puppet] - 10https://gerrit.wikimedia.org/r/855710 (owner: 10Ebernhardson) [20:55:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T321123)', diff saved to https://phabricator.wikimedia.org/P39107 and previous config saved to /var/cache/conftool/dbconfig/20221110-205552-marostegui.json [20:55:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 
on db1196.eqiad.wmnet with reason: Maintenance [20:55:56] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [20:56:07] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1196.eqiad.wmnet with reason: Maintenance [20:56:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1196 (T321123)', diff saved to https://phabricator.wikimedia.org/P39108 and previous config saved to /var/cache/conftool/dbconfig/20221110-205613-marostegui.json [20:57:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T321123)', diff saved to https://phabricator.wikimedia.org/P39109 and previous config saved to /var/cache/conftool/dbconfig/20221110-205720-marostegui.json [21:00:04] brennen and TheresNoTime: Dear deployers, time to do the UTC late backport and config training deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221110T2100). [21:03:25] !log ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕞🍵 sudo cumin -b 8 A:cp 'run-puppet-agent --enable T306580' [21:03:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:33] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [21:04:41] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:06:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P39110 and previous config saved to /var/cache/conftool/dbconfig/20221110-210651-ladsgroup.json [21:10:21] !log deploy1002 - armed the keyholder with deployment keys - 2 hours ago alerts started that it was not armed (does it notify people?) 
- got pinged that deployers got scap problems - unknown why it was disarmed - now it is armed again [21:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:43] (03CR) 10Htriedman: Varnish analytics: support differential privacy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [21:12:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P39111 and previous config saved to /var/cache/conftool/dbconfig/20221110-211227-marostegui.json [21:12:33] (KeyholderUnarmed) resolved: 18 unarmed Keyholder key(s) on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [21:17:16] (03CR) 10Dzahn: [C: 03+2] "ah, I see the Icinga alerts about this now. I know where the issues come from.. because once aphlict was part of phab but then it wasn't.." [puppet] - 10https://gerrit.wikimedia.org/r/855542 (owner: 10Muehlenhoff) [21:20:29] Amir1: Re-use the visual design but make it at the project level? [21:20:44] jouncebot: now [21:20:44] For the next 0 hour(s) and 39 minute(s): UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221110T2100) [21:21:33] deployers, if you had problems deploying within the last 2 hours..the deployment keys were not loaded on the deployment server [21:21:36] but now they are again [21:21:56] Yeah. 
I probably gonna copy paste the html to a file [21:21:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P39112 and previous config saved to /var/cache/conftool/dbconfig/20221110-212158-ladsgroup.json [21:22:36] I don't think it even had any css or js to begin with [21:26:26] !log dancy@deploy1002 Installing scap version "4.28.0" for 559 hosts [21:27:01] !log dancy@deploy1002 Installation of scap version "4.28.0" completed for 559 hosts [21:27:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P39113 and previous config saved to /var/cache/conftool/dbconfig/20221110-212734-marostegui.json [21:29:20] oh, it's scap deploying scap and ..fast [21:37:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T322618)', diff saved to https://phabricator.wikimedia.org/P39114 and previous config saved to /var/cache/conftool/dbconfig/20221110-213704-ladsgroup.json [21:37:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [21:37:09] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [21:37:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [21:37:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2121 (T322618)', diff saved to https://phabricator.wikimedia.org/P39115 and previous config saved to /var/cache/conftool/dbconfig/20221110-213726-ladsgroup.json [21:39:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T322618)', diff saved to https://phabricator.wikimedia.org/P39116 and previous config saved to /var/cache/conftool/dbconfig/20221110-213945-ladsgroup.json [21:40:58] (03PS1) 10QChris: Add 
.gitreview [software/liberica] - 10https://gerrit.wikimedia.org/r/855717 [21:41:00] (03CR) 10QChris: [V: 03+2 C: 03+2] Add .gitreview [software/liberica] - 10https://gerrit.wikimedia.org/r/855717 (owner: 10QChris) [21:42:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T321123)', diff saved to https://phabricator.wikimedia.org/P39117 and previous config saved to /var/cache/conftool/dbconfig/20221110-214240-marostegui.json [21:42:42] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [21:42:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [21:42:45] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [21:42:47] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2097.codfw.wmnet with reason: Maintenance [21:43:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2097.codfw.wmnet with reason: Maintenance [21:43:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2102.codfw.wmnet with reason: Maintenance [21:43:18] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2102.codfw.wmnet with reason: Maintenance [21:47:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2103.codfw.wmnet with reason: Maintenance [21:47:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2103.codfw.wmnet with reason: Maintenance [21:47:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2103 (T321123)', diff saved to https://phabricator.wikimedia.org/P39118 and previous config saved to /var/cache/conftool/dbconfig/20221110-214746-marostegui.json [21:47:51] T321123: Drop old 
index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [21:49:23] (03CR) 10Bking: [C: 03+1] elastic: change java GC options to default for ES7 [puppet] - 10https://gerrit.wikimedia.org/r/838248 (https://phabricator.wikimedia.org/T319021) (owner: 10Bking) [21:49:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103 (T321123)', diff saved to https://phabricator.wikimedia.org/P39119 and previous config saved to /var/cache/conftool/dbconfig/20221110-214956-marostegui.json [21:52:00] (03PS6) 10Ryan Kemper: [elastic,open]search: rip out unnecessary jvm options [puppet] - 10https://gerrit.wikimedia.org/r/838253 [21:52:02] (03PS9) 10Ryan Kemper: elastic: change java GC options to default for ES7 [puppet] - 10https://gerrit.wikimedia.org/r/838248 (https://phabricator.wikimedia.org/T319020) (owner: 10Bking) [21:52:04] (03PS1) 10Ryan Kemper: opensearch/logstash: make default gc options same as ES 7 [puppet] - 10https://gerrit.wikimedia.org/r/855719 (https://phabricator.wikimedia.org/T319020) [21:54:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P39120 and previous config saved to /var/cache/conftool/dbconfig/20221110-215452-ladsgroup.json [21:55:32] (03CR) 10Ryan Kemper: "These changes were previously in https://gerrit.wikimedia.org/r/c/operations/puppet/+/838248, but we split off the opensearch / logstash s" [puppet] - 10https://gerrit.wikimedia.org/r/855719 (https://phabricator.wikimedia.org/T319020) (owner: 10Ryan Kemper) [21:56:38] (03PS1) 10Dzahn: phabricator/aphlict: pass through ensure parameter [puppet] - 10https://gerrit.wikimedia.org/r/855720 (https://phabricator.wikimedia.org/T135991) [21:58:04] (03CR) 10Dzahn: [C: 03+2] "https://gerrit.wikimedia.org/r/c/operations/puppet/+/855720/" [puppet] - 10https://gerrit.wikimedia.org/r/855542 (owner: 10Muehlenhoff) [22:00:57] (03PS2) 10Dzahn: 
phabricator/aphlict: pass through ensure parameter [puppet] - 10https://gerrit.wikimedia.org/r/855720 (https://phabricator.wikimedia.org/T135991) [22:05:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103', diff saved to https://phabricator.wikimedia.org/P39121 and previous config saved to /var/cache/conftool/dbconfig/20221110-220502-marostegui.json [22:09:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P39122 and previous config saved to /var/cache/conftool/dbconfig/20221110-220958-ladsgroup.json [22:17:04] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/pcc-worker1002/38101/ I expected a change on the phab hosts though. code is a mess. I need to clean i" [puppet] - 10https://gerrit.wikimedia.org/r/855720 (https://phabricator.wikimedia.org/T135991) (owner: 10Dzahn) [22:20:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103', diff saved to https://phabricator.wikimedia.org/P39123 and previous config saved to /var/cache/conftool/dbconfig/20221110-222009-marostegui.json [22:21:04] ACKNOWLEDGEMENT - Check systemd state on phab1001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_aphlict.service daniel_zahn false positives. debugging in progress. alerts will be removed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:21:04] ACKNOWLEDGEMENT - Check systemd state on phab1004 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_aphlict.service daniel_zahn false positives. debugging in progress. alerts will be removed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:21:04] ACKNOWLEDGEMENT - Check systemd state on phab2002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_aphlict.service daniel_zahn false positives. debugging in progress. alerts will be removed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:25:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T322618)', diff saved to https://phabricator.wikimedia.org/P39124 and previous config saved to /var/cache/conftool/dbconfig/20221110-222505-ladsgroup.json [22:25:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2122.codfw.wmnet with reason: Maintenance [22:25:09] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [22:25:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2122.codfw.wmnet with reason: Maintenance [22:25:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2122 (T322618)', diff saved to https://phabricator.wikimedia.org/P39125 and previous config saved to /var/cache/conftool/dbconfig/20221110-222526-ladsgroup.json [22:27:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T322618)', diff saved to https://phabricator.wikimedia.org/P39126 and previous config saved to /var/cache/conftool/dbconfig/20221110-222746-ladsgroup.json [22:35:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103 (T321123)', diff saved to https://phabricator.wikimedia.org/P39127 and previous config saved to /var/cache/conftool/dbconfig/20221110-223515-marostegui.json [22:35:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2116.codfw.wmnet with reason: Maintenance [22:35:20] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [22:35:31] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2116.codfw.wmnet with reason: Maintenance [22:35:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2116 (T321123)', diff saved to 
https://phabricator.wikimedia.org/P39128 and previous config saved to /var/cache/conftool/dbconfig/20221110-223537-marostegui.json [22:37:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T321123)', diff saved to https://phabricator.wikimedia.org/P39129 and previous config saved to /var/cache/conftool/dbconfig/20221110-223746-marostegui.json [22:42:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P39130 and previous config saved to /var/cache/conftool/dbconfig/20221110-224253-ladsgroup.json [22:52:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P39131 and previous config saved to /var/cache/conftool/dbconfig/20221110-225253-marostegui.json [22:58:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P39132 and previous config saved to /var/cache/conftool/dbconfig/20221110-225759-ladsgroup.json [23:08:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P39133 and previous config saved to /var/cache/conftool/dbconfig/20221110-230759-marostegui.json [23:13:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T322618)', diff saved to https://phabricator.wikimedia.org/P39134 and previous config saved to /var/cache/conftool/dbconfig/20221110-231306-ladsgroup.json [23:13:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2150.codfw.wmnet with reason: Maintenance [23:13:11] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [23:13:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2150.codfw.wmnet with reason: 
Maintenance [23:13:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2150 (T322618)', diff saved to https://phabricator.wikimedia.org/P39135 and previous config saved to /var/cache/conftool/dbconfig/20221110-231339-ladsgroup.json [23:15:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T322618)', diff saved to https://phabricator.wikimedia.org/P39136 and previous config saved to /var/cache/conftool/dbconfig/20221110-231558-ladsgroup.json [23:23:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T321123)', diff saved to https://phabricator.wikimedia.org/P39137 and previous config saved to /var/cache/conftool/dbconfig/20221110-232305-marostegui.json [23:23:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2130.codfw.wmnet with reason: Maintenance [23:23:10] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [23:23:21] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2130.codfw.wmnet with reason: Maintenance [23:23:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2130 (T321123)', diff saved to https://phabricator.wikimedia.org/P39138 and previous config saved to /var/cache/conftool/dbconfig/20221110-232327-marostegui.json [23:25:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T321123)', diff saved to https://phabricator.wikimedia.org/P39139 and previous config saved to /var/cache/conftool/dbconfig/20221110-232536-marostegui.json [23:31:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P39140 and previous config saved to /var/cache/conftool/dbconfig/20221110-233105-ladsgroup.json [23:40:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to 
https://phabricator.wikimedia.org/P39141 and previous config saved to /var/cache/conftool/dbconfig/20221110-234043-marostegui.json [23:46:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P39142 and previous config saved to /var/cache/conftool/dbconfig/20221110-234612-ladsgroup.json [23:55:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P39143 and previous config saved to /var/cache/conftool/dbconfig/20221110-235549-marostegui.json