[00:00:05] tgr: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for growthexperiments_user_impact DB table creation (T317534) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221108T0000). [00:00:05] T317534: Create growthexperiments_user_impact table in Wikimedia production - https://phabricator.wikimedia.org/T317534 [00:02:40] !log running foreachwikiindblist growthexperiments.dblist extensions/WikimediaMaintenance/createExtensionTables.php growthexperiments [00:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:05:20] quiddity: ought to be all good, let me know if you still see errors anywhere [00:07:44] yup, seems good! Thanks for checking :> [00:07:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T318605)', diff saved to https://phabricator.wikimedia.org/P38516 and previous config saved to /var/cache/conftool/dbconfig/20221108-000757-ladsgroup.json [00:08:02] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [00:08:19] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dasm - https://phabricator.wikimedia.org/T322591 (KFrancis) The contractor is working for Tumult Labs, Inc, correct? If so, they are covered until the parent contract. 
[00:09:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P38517 and previous config saved to /var/cache/conftool/dbconfig/20221108-000922-marostegui.json [00:10:18] PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS5511/IPv4: Connect - Orange https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:12:59] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q1:rack/setup/install ganeti103[34] - https://phabricator.wikimedia.org/T314303 (wiki_willy) Touched base with @Cmjohnson, who will work on this later today. Thanks, Willy [00:17:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T318605)', diff saved to https://phabricator.wikimedia.org/P38518 and previous config saved to /var/cache/conftool/dbconfig/20221108-001704-ladsgroup.json [00:17:09] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [00:23:00] SRE, ops-codfw: codfw:test new Supermicro server - https://phabricator.wikimedia.org/T322578 (Papaul) p:Triage→Medium [00:23:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P38519 and previous config saved to /var/cache/conftool/dbconfig/20221108-002304-ladsgroup.json [00:23:52] SRE, ops-codfw, DC-Ops, Data-Persistence-Backup: Q1:rack/setup/install dbprov2004 - https://phabricator.wikimedia.org/T321128 (Papaul) [00:24:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P38520 and previous config saved to /var/cache/conftool/dbconfig/20221108-002428-marostegui.json [00:24:31] SRE, ops-codfw, DC-Ops: Q2:rack/setup/install puppetdb2003 - https://phabricator.wikimedia.org/T317894 (Papaul) [00:26:13] (CR) Cwhite: [C: +1] alertmanager: use 'site' label 
to route tasks for dcops [puppet] - https://gerrit.wikimedia.org/r/854040 (https://phabricator.wikimedia.org/T225140) (owner: Filippo Giunchedi) [00:27:22] (PS1) Cwhite: beta-logs: transition jobs host assignment to bullseye host [puppet] - https://gerrit.wikimedia.org/r/854111 (https://phabricator.wikimedia.org/T321410) [00:29:36] (PS1) Gergő Tisza: createExtensionTables.php: Remove closeConnection() [extensions/WikimediaMaintenance] (wmf/1.40.0-wmf.8) - https://gerrit.wikimedia.org/r/854069 [00:32:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P38521 and previous config saved to /var/cache/conftool/dbconfig/20221108-003210-ladsgroup.json [00:33:21] (CR) TrainBranchBot: [C: +2] "Approved by tgr@deploy1002 using scap backport" [extensions/WikimediaMaintenance] (wmf/1.40.0-wmf.8) - https://gerrit.wikimedia.org/r/854069 (owner: Gergő Tisza) [00:33:32] PROBLEM - SSH on mw1337.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:33:43] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host dbprov2004.mgmt.codfw.wmnet with reboot policy FORCED [00:33:58] RECOVERY - SSH on an-coord1002.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:34:28] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [00:34:57] (Merged) jenkins-bot: createExtensionTables.php: Remove closeConnection() [extensions/WikimediaMaintenance] (wmf/1.40.0-wmf.8) - https://gerrit.wikimedia.org/r/854069 (owner: Gergő Tisza) [00:35:09] !log tgr@deploy1002 Started scap: Backport for [[gerrit:854069|createExtensionTables.php: Remove closeConnection()]] [00:35:31] !log tgr@deploy1002 tgr and tgr: Backport for [[gerrit:854069|createExtensionTables.php: Remove closeConnection()]] synced to the 
testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [00:36:34] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [00:37:28] (CR) Dzahn: [C: +2] "@Cwhite I made https://wikitech.wikimedia.org/wiki/Logstash#Getting_logs_from_misc_systems_into_logstash as an example for other users to " [puppet] - https://gerrit.wikimedia.org/r/849169 (https://phabricator.wikimedia.org/T216090) (owner: Dzahn) [00:38:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P38522 and previous config saved to /var/cache/conftool/dbconfig/20221108-003810-ladsgroup.json [00:38:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [00:39:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [00:39:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [00:39:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T321123)', diff saved to https://phabricator.wikimedia.org/P38523 and previous config saved to /var/cache/conftool/dbconfig/20221108-003934-marostegui.json [00:39:39] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [00:39:53] !log tgr@deploy1002 Finished scap: Backport for [[gerrit:854069|createExtensionTables.php: Remove closeConnection()]] (duration: 04m 43s) [00:40:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [00:41:38] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dasm - https://phabricator.wikimedia.org/T322591 (Dzahn) @KFrancis Alright, thank you. Yes, looks like that's the case, based on email address and the comment on this ticket. @Haltriedman Hi, so we need your approval for this... 
[00:42:41] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dasm - https://phabricator.wikimedia.org/T322591 (Dzahn) a:Htriedman [00:47:16] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host ganeti1033.mgmt.eqiad.wmnet with reboot policy FORCED [00:47:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P38524 and previous config saved to /var/cache/conftool/dbconfig/20221108-004717-ladsgroup.json [00:47:30] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:53:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T318605)', diff saved to https://phabricator.wikimedia.org/P38525 and previous config saved to /var/cache/conftool/dbconfig/20221108-005317-ladsgroup.json [00:53:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance [00:53:21] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [00:53:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance [00:53:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1194 (T318605)', diff saved to https://phabricator.wikimedia.org/P38526 and previous config saved to /var/cache/conftool/dbconfig/20221108-005338-ladsgroup.json [00:55:40] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1033.mgmt.eqiad.wmnet with reboot policy FORCED [00:55:40] I'll extend the deploy window a bit to do a backport I forgot, there isn't anything interesting happening in the next two hours anyway. 
[00:56:18] (PS1) Gergő Tisza: Add UserRegistrationLookupHelper [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - https://gerrit.wikimedia.org/r/854070 (https://phabricator.wikimedia.org/T313395) [00:59:14] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q1:rack/setup/install ganeti103[34] - https://phabricator.wikimedia.org/T314303 (Cmjohnson) @Jclark-ctr I did the netbox provisioning script, I am not able to ping the mgmt IP for either server. Can you verify that the mgmt cables are connected? [00:59:18] RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:00:08] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 61, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:00:35] SRE, SRE-swift-storage, Performance-Team, Traffic, Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (Ladsgroup) Trying to get some numbers on thumbnails and how much of them actually reach swift vs. how much need thumbnailing got... 
[01:02:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T318605)', diff saved to https://phabricator.wikimedia.org/P38527 and previous config saved to /var/cache/conftool/dbconfig/20221108-010224-ladsgroup.json [01:02:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance [01:02:28] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [01:02:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance [01:02:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3317 (T318605)', diff saved to https://phabricator.wikimedia.org/P38528 and previous config saved to /var/cache/conftool/dbconfig/20221108-010245-ladsgroup.json [01:22:59] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [01:23:55] (CR) TrainBranchBot: [C: +2] "Approved by tgr@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - https://gerrit.wikimedia.org/r/854070 (https://phabricator.wikimedia.org/T313395) (owner: Gergő Tisza) [01:25:12] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [01:37:03] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [01:38:20] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dbprov2004.mgmt.codfw.wmnet with reboot policy FORCED [01:38:30] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [01:38:46] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:39:00] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [01:39:50] SRE, ops-codfw, DC-Ops, Data-Persistence-Backup: Q1:rack/setup/install dbprov2004 - https://phabricator.wikimedia.org/T321128 (Papaul) [01:40:34] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:40:34] SRE, ops-codfw, DC-Ops: Q2:rack/setup/install puppetdb2003 - https://phabricator.wikimedia.org/T317894 (Papaul) [01:42:36] (Merged) jenkins-bot: Add UserRegistrationLookupHelper [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - https://gerrit.wikimedia.org/r/854070 (https://phabricator.wikimedia.org/T313395) (owner: Gergő Tisza) [01:42:52] !log tgr@deploy1002 Started scap: Backport for [[gerrit:854070|Add UserRegistrationLookupHelper]] [01:43:11] !log tgr@deploy1002 tgr and tgr: Backport for [[gerrit:854070|Add UserRegistrationLookupHelper]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [01:46:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [01:46:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [01:46:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [01:47:28] !log tgr@deploy1002 Finished scap: Backport for [[gerrit:854070|Add UserRegistrationLookupHelper]] (duration: 04m 36s) [01:47:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [01:48:46] (JobUnavailable) firing: (9) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:53:46] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:56:40] PROBLEM - SSH on mw1307.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:00:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T318605)', diff saved to https://phabricator.wikimedia.org/P38529 and previous config saved to /var/cache/conftool/dbconfig/20221108-020019-ladsgroup.json [02:00:23] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [02:05:04] SRE, SRE-swift-storage, Performance-Team, Traffic, Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (Quiddity) @Ladsgroup Quick reply to a few details, with Enwiki examples: * 23px = flags (e.g. [[ https://en.wikipedia.org/w/inde... 
[02:08:46] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:15:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P38530 and previous config saved to /var/cache/conftool/dbconfig/20221108-021525-ladsgroup.json [02:18:46] (JobUnavailable) resolved: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:25:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T318605)', diff saved to https://phabricator.wikimedia.org/P38531 and previous config saved to /var/cache/conftool/dbconfig/20221108-022520-ladsgroup.json [02:25:25] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [02:30:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P38532 and previous config saved to /var/cache/conftool/dbconfig/20221108-023032-ladsgroup.json [02:37:10] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:40:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P38533 and previous config saved to /var/cache/conftool/dbconfig/20221108-024027-ladsgroup.json [02:44:10] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [02:44:37] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host contint1002.mgmt.eqiad.wmnet with reboot policy FORCED [02:45:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T318605)', diff saved to https://phabricator.wikimedia.org/P38534 and previous config saved to /var/cache/conftool/dbconfig/20221108-024539-ladsgroup.json [02:45:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1202.eqiad.wmnet with reason: Maintenance [02:45:42] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [02:45:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1202.eqiad.wmnet with reason: Maintenance [02:46:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1202 (T318605)', diff saved to https://phabricator.wikimedia.org/P38535 and previous config saved to /var/cache/conftool/dbconfig/20221108-024600-ladsgroup.json [02:47:42] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host contint1002.mgmt.eqiad.wmnet with reboot policy FORCED [02:55:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P38536 and previous config saved to /var/cache/conftool/dbconfig/20221108-025533-ladsgroup.json [02:57:28] RECOVERY - SSH on mw1307.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221108T0300) [03:00:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T318605)', diff 
saved to https://phabricator.wikimedia.org/P38537 and previous config saved to /var/cache/conftool/dbconfig/20221108-030043-ladsgroup.json [03:00:48] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [03:03:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [03:04:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [03:04:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [03:05:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [03:07:35] (PS1) TrainBranchBot: Branch commit for wmf/1.40.0-wmf.9 [core] (wmf/1.40.0-wmf.9) - https://gerrit.wikimedia.org/r/854112 (https://phabricator.wikimedia.org/T320514) [03:07:39] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.40.0-wmf.9 [core] (wmf/1.40.0-wmf.9) - https://gerrit.wikimedia.org/r/854112 (https://phabricator.wikimedia.org/T320514) (owner: TrainBranchBot) [03:10:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T318605)', diff saved to https://phabricator.wikimedia.org/P38538 and previous config saved to /var/cache/conftool/dbconfig/20221108-031041-ladsgroup.json [03:10:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance [03:10:46] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [03:10:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance [03:11:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3317 (T318605)', diff saved to https://phabricator.wikimedia.org/P38539 and previous config saved to /var/cache/conftool/dbconfig/20221108-031102-ladsgroup.json [03:11:49] (CR) Bartosz Dziewoński: [C: 
-1] Keep DiscussionTools "Share feedback..." links on WMF wikis for now (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/853991 (https://phabricator.wikimedia.org/T322494) (owner: Esanders) [03:15:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20221108-031550-ladsgroup.json [03:16:10] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [03:16:18] (ProbeDown) firing: (13) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:17:19] (ProbeDown) firing: (13) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:17:35] (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [03:17:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [03:18:00] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [03:18:06] PROBLEM - 
PyBal backends health check on lvs5001 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp5009.eqsin.wmnet, cp5012.eqsin.wmnet, cp5007.eqsin.wmnet are marked down but pooled: textlb_443: Servers cp5009.eqsin.wmnet, cp5012.eqsin.wmnet, cp5007.eqsin.wmnet are marked down but pooled: testlb6_443: Servers cp5009.eqsin.wmnet, cp5012.eqsin.wmnet, cp5007.eqsin.wmnet are marked down but pooled: textlb6_443: Servers cp5012.eq [03:18:06] t, cp5010.eqsin.wmnet, cp5007.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:18:32] PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [03:19:53] (CR) CI reject: [V: -1] Branch commit for wmf/1.40.0-wmf.9 [core] (wmf/1.40.0-wmf.9) - https://gerrit.wikimedia.org/r/854112 (https://phabricator.wikimedia.org/T320514) (owner: TrainBranchBot) [03:20:00] PROBLEM - Check systemd state on cp4037 is CRITICAL: CRITICAL - degraded: The following units failed: varnishmtail@default.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:20:06] RECOVERY - PyBal backends health check on lvs5001 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [03:20:24] RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [03:21:18] (ProbeDown) resolved: (15) Service commons.wikimedia.org:443 has failed probes (http_commons_wikimedia_org_ip4) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:22:18] (ProbeDown) resolved: (13) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:22:35] (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [03:22:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [03:23:56] RECOVERY - Check systemd state on cp4037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:31:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P38540 and previous config saved to /var/cache/conftool/dbconfig/20221108-033101-ladsgroup.json [03:36:16] RECOVERY - SSH on mw1337.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:37:40] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:46:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T318605)', diff saved to https://phabricator.wikimedia.org/P38541 and previous config saved to /var/cache/conftool/dbconfig/20221108-034607-ladsgroup.json [03:46:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [03:46:12] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [03:46:23] !log ladsgroup@cumin1001 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [03:48:04] (PS1) Gergő Tisza: Add GrowthExperiments periodic maintenance scripts for user impact [puppet] - https://gerrit.wikimedia.org/r/854142 (https://phabricator.wikimedia.org/T322541) [03:48:43] (CR) CI reject: [V: -1] Add GrowthExperiments periodic maintenance scripts for user impact [puppet] - https://gerrit.wikimedia.org/r/854142 (https://phabricator.wikimedia.org/T322541) (owner: Gergő Tisza) [03:50:58] (PS2) Gergő Tisza: Add GrowthExperiments periodic maintenance scripts for user impact [puppet] - https://gerrit.wikimedia.org/r/854142 (https://phabricator.wikimedia.org/T322541) [03:51:32] (CR) CI reject: [V: -1] Add GrowthExperiments periodic maintenance scripts for user impact [puppet] - https://gerrit.wikimedia.org/r/854142 (https://phabricator.wikimedia.org/T322541) (owner: Gergő Tisza) [03:52:51] (PS1) KartikMistry: Enable Content and Section translation in Bambara and Goan Konkani Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/854143 (https://phabricator.wikimedia.org/T314557) [04:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221108T0400) [04:00:34] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:05:39] (PS3) Gergő Tisza: Add GrowthExperiments periodic maintenance scripts for user impact [puppet] - https://gerrit.wikimedia.org/r/854142 (https://phabricator.wikimedia.org/T322541) [04:07:44] (CR) CI reject: [V: -1] Add GrowthExperiments periodic maintenance scripts for user impact [puppet] - https://gerrit.wikimedia.org/r/854142 (https://phabricator.wikimedia.org/T322541) (owner: 
Gergő Tisza) [04:15:39] (PS1) KartikMistry: Enable Content and Section translation on 6 Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/854144 (https://phabricator.wikimedia.org/T319175) [04:34:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T318605)', diff saved to https://phabricator.wikimedia.org/P38542 and previous config saved to /var/cache/conftool/dbconfig/20221108-043444-ladsgroup.json [04:34:49] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [04:49:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P38543 and previous config saved to /var/cache/conftool/dbconfig/20221108-044951-ladsgroup.json [05:04:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P38544 and previous config saved to /var/cache/conftool/dbconfig/20221108-050457-ladsgroup.json [05:16:14] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:16:56] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 142, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:20:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T318605)', diff saved to https://phabricator.wikimedia.org/P38545 and previous config saved to /var/cache/conftool/dbconfig/20221108-052004-ladsgroup.json [05:20:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance [05:20:08] T318605: Deploy new externallinks fields to 
production - https://phabricator.wikimedia.org/T318605 [05:20:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance [05:20:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2182 (T318605)', diff saved to https://phabricator.wikimedia.org/P38546 and previous config saved to /var/cache/conftool/dbconfig/20221108-052025-ladsgroup.json [05:38:07] (PS4) Gergő Tisza: Add GrowthExperiments periodic maintenance scripts for user impact [puppet] - https://gerrit.wikimedia.org/r/854142 (https://phabricator.wikimedia.org/T322541) [05:47:03] (CR) Vgutierrez: [C: +1] Release 1.5.3-3 [software/varnish/libvmod-re2] (debian-6.0) - https://gerrit.wikimedia.org/r/854063 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh) [06:07:54] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:09:02] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:27:29] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 46416 [06:27:32] !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
[06:27:58] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 46416 [06:29:11] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 30058 [06:30:30] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 30058 [06:31:50] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 13150 [06:32:42] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 13150 [06:33:22] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1138.eqiad.wmnet with reason: Maintenance [06:33:36] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1138.eqiad.wmnet with reason: Maintenance [06:33:38] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2140.codfw.wmnet with reason: Maintenance [06:33:43] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 4817 [06:33:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2140.codfw.wmnet with reason: Maintenance [06:33:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1173.eqiad.wmnet with reason: Maintenance [06:34:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1173.eqiad.wmnet with reason: Maintenance [06:34:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2114.codfw.wmnet with reason: Maintenance [06:34:22] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2114.codfw.wmnet with reason: Maintenance [06:34:34] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 4817 [06:35:43] !log 
ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 22381 [06:35:56] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 22381 [06:36:19] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 63199 [06:38:09] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 [06:39:31] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1102.eqiad.wmnet with reason: Maintenance [06:39:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1102.eqiad.wmnet with reason: Maintenance [06:39:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1102.eqiad.wmnet with reason: Maintenance [06:40:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1102.eqiad.wmnet with reason: Maintenance [06:40:12] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:41:04] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 142, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:41:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T318605)', diff saved to https://phabricator.wikimedia.org/P38547 and previous config saved to /var/cache/conftool/dbconfig/20221108-064129-ladsgroup.json [06:41:33] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [06:42:03] (03PS2) 10Phuedx: EditAttemptStep sampling rate to 1 for group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854005 
(https://phabricator.wikimedia.org/T312016) [06:43:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1112.eqiad.wmnet with reason: Maintenance [06:43:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1112.eqiad.wmnet with reason: Maintenance [06:44:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [06:44:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [06:44:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T321123)', diff saved to https://phabricator.wikimedia.org/P38548 and previous config saved to /var/cache/conftool/dbconfig/20221108-064422-marostegui.json [06:44:26] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [06:44:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1105.eqiad.wmnet with reason: Maintenance [06:44:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1105.eqiad.wmnet with reason: Maintenance [06:44:46] (03PS1) 10Vgutierrez: deployment-prep: Move swift expirer duties to ms-be07 [puppet] - 10https://gerrit.wikimedia.org/r/854416 (https://phabricator.wikimedia.org/T322231) [06:44:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T321130)', diff saved to https://phabricator.wikimedia.org/P38549 and previous config saved to /var/cache/conftool/dbconfig/20221108-064447-marostegui.json [06:44:51] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [06:47:34] (03CR) 10Vgutierrez: [C: 03+2] deployment-prep: Move swift expirer duties to ms-be07 [puppet] - 
10https://gerrit.wikimedia.org/r/854416 (https://phabricator.wikimedia.org/T322231) (owner: 10Vgutierrez) [06:50:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T321123)', diff saved to https://phabricator.wikimedia.org/P38550 and previous config saved to /var/cache/conftool/dbconfig/20221108-065029-marostegui.json [06:50:34] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [06:51:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T321130)', diff saved to https://phabricator.wikimedia.org/P38551 and previous config saved to /var/cache/conftool/dbconfig/20221108-065130-marostegui.json [06:51:33] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [06:56:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P38552 and previous config saved to /var/cache/conftool/dbconfig/20221108-065635-ladsgroup.json [07:00:04] kormat, marostegui, and Amir1: That opportune time is upon us again. Time for a Primary database switchover deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221108T0700). 
[07:04:18] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:05:20] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:05:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P38553 and previous config saved to /var/cache/conftool/dbconfig/20221108-070536-marostegui.json [07:06:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P38554 and previous config saved to /var/cache/conftool/dbconfig/20221108-070636-marostegui.json [07:11:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P38555 and previous config saved to /var/cache/conftool/dbconfig/20221108-071142-ladsgroup.json [07:20:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P38556 and previous config saved to /var/cache/conftool/dbconfig/20221108-072042-marostegui.json [07:21:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P38557 and previous config saved to /var/cache/conftool/dbconfig/20221108-072143-marostegui.json [07:22:32] !log push pfw policies - T322613 [07:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T318605)', diff saved to https://phabricator.wikimedia.org/P38558 and previous config saved to 
/var/cache/conftool/dbconfig/20221108-072648-ladsgroup.json [07:26:53] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [07:35:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T321123)', diff saved to https://phabricator.wikimedia.org/P38559 and previous config saved to /var/cache/conftool/dbconfig/20221108-073549-marostegui.json [07:35:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1145.eqiad.wmnet with reason: Maintenance [07:35:53] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [07:36:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1145.eqiad.wmnet with reason: Maintenance [07:36:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T321130)', diff saved to https://phabricator.wikimedia.org/P38560 and previous config saved to /var/cache/conftool/dbconfig/20221108-073649-marostegui.json [07:36:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1129.eqiad.wmnet with reason: Maintenance [07:36:53] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [07:37:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1129.eqiad.wmnet with reason: Maintenance [07:37:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T321130)', diff saved to https://phabricator.wikimedia.org/P38561 and previous config saved to /var/cache/conftool/dbconfig/20221108-073711-marostegui.json [07:40:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1157.eqiad.wmnet with reason: Maintenance [07:40:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1157.eqiad.wmnet with reason: 
Maintenance [07:40:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1157 (T321123)', diff saved to https://phabricator.wikimedia.org/P38562 and previous config saved to /var/cache/conftool/dbconfig/20221108-074022-marostegui.json [07:40:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T321130)', diff saved to https://phabricator.wikimedia.org/P38563 and previous config saved to /var/cache/conftool/dbconfig/20221108-074027-marostegui.json [07:46:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T321123)', diff saved to https://phabricator.wikimedia.org/P38564 and previous config saved to /var/cache/conftool/dbconfig/20221108-074628-marostegui.json [07:46:33] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [07:55:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P38565 and previous config saved to /var/cache/conftool/dbconfig/20221108-075533-marostegui.json [08:00:04] Amir1 and Urbanecm: How many deployers does it take to do UTC morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221108T0800). [08:00:04] kart_ and phuedx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:21] * kart_ is here [08:01:15] I'll go ahead with self-deploy of my patches.. 
[08:01:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P38566 and previous config saved to /var/cache/conftool/dbconfig/20221108-080135-marostegui.json [08:01:57] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854143 (https://phabricator.wikimedia.org/T314557) (owner: 10KartikMistry) [08:02:43] (03Merged) 10jenkins-bot: Enable Content and Section translation in Bambara and Goan Konkani Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854143 (https://phabricator.wikimedia.org/T314557) (owner: 10KartikMistry) [08:03:03] !log kartik@deploy1002 Started scap: Backport for [[gerrit:854143|Enable Content and Section translation in Bambara and Goan Konkani Wikipedias (T314557)]] [08:03:07] T314557: Enable Content and Section translation on wikipedias with new MT support from Google for languages once it is working - https://phabricator.wikimedia.org/T314557 [08:03:25] !log kartik@deploy1002 kartik and kartik: Backport for [[gerrit:854143|Enable Content and Section translation in Bambara and Goan Konkani Wikipedias (T314557)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [08:07:08] o/ [08:07:31] o/ kart_ [08:07:53] Hello phuedx [08:09:47] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:854143|Enable Content and Section translation in Bambara and Goan Konkani Wikipedias (T314557)]] (duration: 06m 44s) [08:09:51] T314557: Enable Content and Section translation on wikipedias with new MT support from Google for languages once it is working - https://phabricator.wikimedia.org/T314557 [08:10:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [08:10:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved 
to https://phabricator.wikimedia.org/P38567 and previous config saved to /var/cache/conftool/dbconfig/20221108-081040-marostegui.json [08:10:55] (03PS2) 10KartikMistry: Enable Content and Section translation on 6 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854144 (https://phabricator.wikimedia.org/T319175) [08:10:57] First patch done. [08:11:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [08:11:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [08:12:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [08:14:38] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubetcd1006.eqiad.wmnet to drbd [08:15:25] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854144 (https://phabricator.wikimedia.org/T319175) (owner: 10KartikMistry) [08:16:11] (03Merged) 10jenkins-bot: Enable Content and Section translation on 6 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854144 (https://phabricator.wikimedia.org/T319175) (owner: 10KartikMistry) [08:16:26] !log kartik@deploy1002 Started scap: Backport for [[gerrit:854144|Enable Content and Section translation on 6 Wikipedias (T319175)]] [08:16:29] T319175: Enable Content and Section translation on 6 more Wikipedias - https://phabricator.wikimedia.org/T319175 [08:16:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P38568 and previous config saved to /var/cache/conftool/dbconfig/20221108-081641-marostegui.json [08:16:45] !log kartik@deploy1002 kartik and kartik: Backport for [[gerrit:854144|Enable Content and Section translation on 6 Wikipedias (T319175)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, 
mwdebug1002.eqiad.wmnet [08:21:18] Deploying 2nd patch.. [08:22:07] phuedx: You want to self deploy or anyone else available to deploy it? [08:22:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [08:22:38] kart_: I haven't refreshed my production keys since getting a new laptop so I can't deploy right now [08:22:50] Could you deploy or is there anyone else? [08:23:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [08:23:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [08:23:41] phuedx: Can you rebase it, while my deployment is going on..? [08:23:47] Can do [08:23:49] I can deploy. [08:24:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [08:24:09] (03PS4) 10Phuedx: Update Metrics Platform streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/852838 (https://phabricator.wikimedia.org/T322277) [08:24:27] (03PS3) 10Phuedx: EditAttemptStep sampling rate to 1 for group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854005 (https://phabricator.wikimedia.org/T312016) [08:24:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of kubetcd1006.eqiad.wmnet to drbd [08:24:44] PROBLEM - Host kubetcd1006 is DOWN: PING CRITICAL - Packet loss = 100% [08:24:46] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:854144|Enable Content and Section translation on 6 Wikipedias (T319175)]] (duration: 08m 20s) [08:24:49] T319175: Enable Content and Section translation on 6 more Wikipedias - https://phabricator.wikimedia.org/T319175 [08:25:00] RECOVERY - Host kubetcd1006 is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms [08:25:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T321130)', diff saved to https://phabricator.wikimedia.org/P38569 and previous config saved to 
/var/cache/conftool/dbconfig/20221108-082546-marostegui.json [08:25:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1139.eqiad.wmnet with reason: Maintenance [08:25:50] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [08:26:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1139.eqiad.wmnet with reason: Maintenance [08:28:19] phuedx: Thanks. Would you able to test the patch using usual mwdebug? [08:28:35] give me a minute.. [08:29:16] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854005 (https://phabricator.wikimedia.org/T312016) (owner: 10Phuedx) [08:30:01] (03Merged) 10jenkins-bot: EditAttemptStep sampling rate to 1 for group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854005 (https://phabricator.wikimedia.org/T312016) (owner: 10Phuedx) [08:30:13] !log kartik@deploy1002 Started scap: Backport for [[gerrit:854005|EditAttemptStep sampling rate to 1 for group1 wikis (T312016)]] [08:30:17] T312016: Increase EditAttemptStep sampling rate(s) to 100% - https://phabricator.wikimedia.org/T312016 [08:30:18] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1146.eqiad.wmnet with reason: Maintenance [08:30:31] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1146.eqiad.wmnet with reason: Maintenance [08:30:32] !log kartik@deploy1002 kartik and phuedx: Backport for [[gerrit:854005|EditAttemptStep sampling rate to 1 for group1 wikis (T312016)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [08:30:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T321130)', diff saved to https://phabricator.wikimedia.org/P38570 and previous config saved to 
/var/cache/conftool/dbconfig/20221108-083037-marostegui.json [08:31:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T321123)', diff saved to https://phabricator.wikimedia.org/P38571 and previous config saved to /var/cache/conftool/dbconfig/20221108-083148-marostegui.json [08:31:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1166.eqiad.wmnet with reason: Maintenance [08:31:52] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [08:31:55] phuedx: can you test it? [08:31:59] kart_: Yes :) [08:32:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1166.eqiad.wmnet with reason: Maintenance [08:32:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T321123)', diff saved to https://phabricator.wikimedia.org/P38572 and previous config saved to /var/cache/conftool/dbconfig/20221108-083210-marostegui.json [08:32:25] phuedx: mwdebug1001/1002/2001/2002 [08:33:26] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubetcd1006.eqiad.wmnet to plain [08:34:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of kubetcd1006.eqiad.wmnet to plain [08:34:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [08:35:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [08:35:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [08:36:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [08:37:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T321130)', diff saved to https://phabricator.wikimedia.org/P38573 and previous config saved to 
/var/cache/conftool/dbconfig/20221108-083709-marostegui.json [08:37:13] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [08:37:16] kart_: I've missed a config variable in that patch so it isn't having the full effect. To be clear, it's not breaking anything, there are no errors, but the change is a NOP [08:37:25] Happy to revert and try again later [08:37:35] 10SRE, 10Analytics-Radar, 10Domains, 10Traffic-Icebox, and 2 others: Don't set cookies in traffic layer for non-user facing domains (avoid false third-party cookie warning) - https://phabricator.wikimedia.org/T262996 (10hashar) [08:37:45] 10SRE, 10Analytics-Radar, 10Domains, 10Traffic-Icebox, and 2 others: Don't set cookies in traffic layer for non-user facing domains (avoid false third-party cookie warning) - https://phabricator.wikimedia.org/T262996 (10hashar) Gerrit kept reporting `org.apache.http.client.protocol.ResponseProcessCookies :... [08:37:46] OK. I'm aborting sync.. [08:37:55] !log kartik@deploy1002 Sync cancelled. [08:38:37] (03PS1) 10TrainBranchBot: Revert "EditAttemptStep sampling rate to 1 for group1 wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854472 [08:38:39] (03CR) 10TrainBranchBot: "kartik@deploy1002 created a revert of this change as Iaced4b89d71123de84e06217721b5ca44489f6d2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854005 (https://phabricator.wikimedia.org/T312016) (owner: 10Phuedx) [08:38:57] phuedx: I'm reverting it. 
[08:39:02] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854472 (owner: 10TrainBranchBot) [08:39:31] Thanks [08:39:46] (03Merged) 10jenkins-bot: Revert "EditAttemptStep sampling rate to 1 for group1 wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854472 (owner: 10TrainBranchBot) [08:39:58] !log kartik@deploy1002 Started scap: Backport for [[gerrit:854472|Revert "EditAttemptStep sampling rate to 1 for group1 wikis"]] [08:40:18] !log kartik@deploy1002 kartik and trainbranchbot: Backport for [[gerrit:854472|Revert "EditAttemptStep sampling rate to 1 for group1 wikis"]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [08:40:38] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [08:42:36] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [08:44:25] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:854472|Revert "EditAttemptStep sampling rate to 1 for group1 wikis"]] (duration: 04m 26s) [08:44:31] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubetcd1004.eqiad.wmnet to drbd [08:46:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [08:47:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [08:47:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [08:48:11] phuedx: revert is done. [08:48:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [08:50:06] Thanks kart_ [08:51:25] kart_: Do you have bandwidth to do https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/852838? [08:51:45] phuedx: sadly, I've to go :/ [08:52:00] phuedx: is it time sensitive? Or can wait few hours? [08:52:08] No worries. Thanks for your help with the other one. I'll queue it up for the next window :) [08:52:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P38574 and previous config saved to /var/cache/conftool/dbconfig/20221108-085216-marostegui.json [08:52:18] phuedx: cool. [08:54:17] (03Abandoned) 10Nikerabbit: Localisation updates from https://translatewiki.net. 
[software/mailman-templates] - 10https://gerrit.wikimedia.org/r/852815 (owner: 10L10n-bot) [08:54:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of kubetcd1004.eqiad.wmnet to drbd [08:54:50] PROBLEM - Host kubetcd1004 is DOWN: PING CRITICAL - Packet loss = 100% [08:55:18] RECOVERY - Host kubetcd1004 is UP: PING OK - Packet loss = 0%, RTA = 0.61 ms [08:56:20] (03CR) 10Giuseppe Lavagetto: New organization of templates (039 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/837495 (owner: 10Giuseppe Lavagetto) [08:57:24] (03PS14) 10Giuseppe Lavagetto: New organization of templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/837495 [08:58:31] (03CR) 10Nikerabbit: [V: 03+2] Localisation updates from https://translatewiki.net. [software/mailman-templates] - 10https://gerrit.wikimedia.org/r/853973 (owner: 10L10n-bot) [09:07:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P38575 and previous config saved to /var/cache/conftool/dbconfig/20221108-090722-marostegui.json [09:08:10] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubetcd1004.eqiad.wmnet to plain [09:09:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of kubetcd1004.eqiad.wmnet to plain [09:15:51] 10SRE, 10ops-codfw: Troubleshoot why latest idrac version is not working on Dell servers - https://phabricator.wikimedia.org/T322419 (10Aklapper) [09:17:26] !log drain ganeti1018 for eventual reimage to bullseye T311687 [09:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:29] (03Abandoned) 10Vgutierrez: ATS: Add timing request information to ats-tls log [puppet] - 10https://gerrit.wikimedia.org/r/542317 (https://phabricator.wikimedia.org/T234887) (owner: 10Vgutierrez) [09:17:30] T311687: Upgrade ganeti/eqiad to 
Bullseye - https://phabricator.wikimedia.org/T311687 [09:19:42] (03PS4) 10Filippo Giunchedi: access_new_install: use install-console and compat symlink [puppet] - 10https://gerrit.wikimedia.org/r/854058 [09:19:50] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37999/console" [puppet] - 10https://gerrit.wikimedia.org/r/854040 (https://phabricator.wikimedia.org/T225140) (owner: 10Filippo Giunchedi) [09:19:56] (03CR) 10Filippo Giunchedi: access_new_install: use install-console and compat symlink (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/854058 (owner: 10Filippo Giunchedi) [09:20:03] (03CR) 10Vgutierrez: prometheus: Handle inactive trafficserver service (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/851669 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [09:22:06] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/854058 (owner: 10Filippo Giunchedi) [09:22:25] !log installing ffmpeg security updates on buster [09:22:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T321130)', diff saved to https://phabricator.wikimedia.org/P38576 and previous config saved to /var/cache/conftool/dbconfig/20221108-092229-marostegui.json [09:22:31] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1156.eqiad.wmnet with reason: Maintenance [09:22:33] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [09:22:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1156.eqiad.wmnet with reason: Maintenance [09:22:45] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: codfw VM %request for dispatch-be2001 - https://phabricator.wikimedia.org/T322556 (10fgiunchedi) 
05Open→03Resolved a:03fgiunchedi VM is up and running, for reference I've created it with: ` root@cumin1001:~# cookbook sre.ganeti.makevm --vcpus 2... [09:22:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [09:22:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [09:22:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T321130)', diff saved to https://phabricator.wikimedia.org/P38577 and previous config saved to /var/cache/conftool/dbconfig/20221108-092256-marostegui.json [09:23:42] (03CR) 10Filippo Giunchedi: [C: 03+2] access_new_install: use install-console and compat symlink [puppet] - 10https://gerrit.wikimedia.org/r/854058 (owner: 10Filippo Giunchedi) [09:23:47] (03PS5) 10Filippo Giunchedi: access_new_install: use install-console and compat symlink [puppet] - 10https://gerrit.wikimedia.org/r/854058 [09:23:58] PROBLEM - Ganeti memory on ganeti1013 is CRITICAL: CRIT Memory 95% used. Largest process: qemu-system-x86 (14505) = 25.6% https://wikitech.wikimedia.org/wiki/Ganeti%23Memory_pressure [09:25:54] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for KMorgan - https://phabricator.wikimedia.org/T322154 (10Aklapper) @jbond: Hi, this task isn't resolved yet; see the Phab part in https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests#LDAP_access - thanks. 
[09:25:59] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dasm - https://phabricator.wikimedia.org/T322591 (fgiunchedi) p:Triage→Medium [09:27:33] (CR) Vgutierrez: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/845688 (https://phabricator.wikimedia.org/T284304) (owner: BCornwall) [09:28:25] (CR) Filippo Giunchedi: [V: +1 C: +2] alertmanager: use 'site' label to route tasks for dcops [puppet] - https://gerrit.wikimedia.org/r/854040 (https://phabricator.wikimedia.org/T225140) (owner: Filippo Giunchedi) [09:29:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T321130)', diff saved to https://phabricator.wikimedia.org/P38578 and previous config saved to /var/cache/conftool/dbconfig/20221108-092915-marostegui.json [09:29:19] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [09:32:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T321123)', diff saved to https://phabricator.wikimedia.org/P38579 and previous config saved to /var/cache/conftool/dbconfig/20221108-093226-marostegui.json [09:32:30] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [09:33:29] (PS2) Clément Goubert: P:kubernetes::deployment_server: fix absenting [puppet] - https://gerrit.wikimedia.org/r/854059 (https://phabricator.wikimedia.org/T322298) [09:34:06] SRE, Dumps-Generation, serviceops, MW-1.39-notes, and 2 others: conf* hosts ran out of disk space due to log spam - https://phabricator.wikimedia.org/T322360 (ArielGlenn) [09:34:28] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:35:13] (CR) Clément Goubert: [V: +1] "PCC SUCCESS (DIFF 1):
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38000/console" [puppet] - https://gerrit.wikimedia.org/r/854059 (https://phabricator.wikimedia.org/T322298) (owner: Clément Goubert) [09:39:40] (CR) Clément Goubert: [V: +1] "I think the approach in PS2 is better than the one in PS1, but I'd like to know what y'all think" [puppet] - https://gerrit.wikimedia.org/r/854059 (https://phabricator.wikimedia.org/T322298) (owner: Clément Goubert) [09:40:51] (PS1) Phuedx: EditAttemptStep sampling rate to 1 for group1 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/854475 (https://phabricator.wikimedia.org/T312016) [09:44:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P38580 and previous config saved to /var/cache/conftool/dbconfig/20221108-094422-marostegui.json [09:46:17] (PS2) Phuedx: EditAttemptStep sampling rate to 1 for group1 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/854475 (https://phabricator.wikimedia.org/T312016) [09:46:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2117.codfw.wmnet with reason: Maintenance [09:46:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2117.codfw.wmnet with reason: Maintenance [09:46:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2117 (T322618)', diff saved to https://phabricator.wikimedia.org/P38581 and previous config saved to /var/cache/conftool/dbconfig/20221108-094655-ladsgroup.json [09:46:59] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [09:47:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P38582 and previous config saved to /var/cache/conftool/dbconfig/20221108-094732-marostegui.json [09:49:22] (PS1)
Vgutierrez: prometheus: Fix ATS TTFB by backend/crc rate expression [puppet] - https://gerrit.wikimedia.org/r/854476 (https://phabricator.wikimedia.org/T321484) [09:49:48] (PS10) Jbond: dumps/distribution: add more data types to parameters [puppet] - https://gerrit.wikimedia.org/r/852260 (owner: Dzahn) [09:49:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Partially done', diff saved to https://phabricator.wikimedia.org/P38583 and previous config saved to /var/cache/conftool/dbconfig/20221108-094950-ladsgroup.json [09:50:01] (CR) Jbond: dumps/distribution: add more data types to parameters (2 comments) [puppet] - https://gerrit.wikimedia.org/r/852260 (owner: Dzahn) [09:50:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2117.codfw.wmnet with reason: Maintenance [09:50:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2117.codfw.wmnet with reason: Maintenance [09:50:26] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:50:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2117 (T322618)', diff saved to https://phabricator.wikimedia.org/P38584 and previous config saved to /var/cache/conftool/dbconfig/20221108-095026-ladsgroup.json [09:50:30] (CR) CI reject: [V: -1] dumps/distribution: add more data types to parameters [puppet] - https://gerrit.wikimedia.org/r/852260 (owner: Dzahn) [09:50:34] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:52:04] (CR) Vgutierrez: [C: +2] prometheus: Fix ATS TTFB by backend/crc rate expression [puppet] - https://gerrit.wikimedia.org/r/854476 (https://phabricator.wikimedia.org/T321484) (owner: Vgutierrez) [09:52:16] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200
OK - 48976 bytes in 1.820 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:52:24] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 2.974 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:52:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T322618)', diff saved to https://phabricator.wikimedia.org/P38585 and previous config saved to /var/cache/conftool/dbconfig/20221108-095236-ladsgroup.json [09:52:41] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [09:59:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P38586 and previous config saved to /var/cache/conftool/dbconfig/20221108-095928-marostegui.json [10:02:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P38587 and previous config saved to /var/cache/conftool/dbconfig/20221108-100239-marostegui.json [10:04:18] (CR) JMeybohm: [C: +2] Add --service-account* flags for TokenRequest [puppet] - https://gerrit.wikimedia.org/r/854011 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm) [10:04:21] (CR) JMeybohm: [C: +2] k8s: Add version switching where needed [puppet] - https://gerrit.wikimedia.org/r/853995 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm) [10:04:57] (CR) JMeybohm: [V: +1 C: +2] k8s: Use the K8s::Core::V1Taint type [puppet] - https://gerrit.wikimedia.org/r/853996 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm) [10:05:41] SRE, MW-on-K8s, Observability-Logging, serviceops: Keep calculating latencies for MediaWiki requests that happen k8s - https://phabricator.wikimedia.org/T276095 (Joe) After a discussion with @fgiunchedi - given we're going to stream
apache logs in json format to kafka, we can just use benthos to... [10:07:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P38588 and previous config saved to /var/cache/conftool/dbconfig/20221108-100743-ladsgroup.json [10:08:49] (PS2) Clément Goubert: mediawiki: Create new mw-web deployment [deployment-charts] - https://gerrit.wikimedia.org/r/853975 (https://phabricator.wikimedia.org/T321900) [10:09:34] (PS2) Clément Goubert: mediawiki: Create new mw-jobrunner deployment [deployment-charts] - https://gerrit.wikimedia.org/r/853958 (https://phabricator.wikimedia.org/T321897) [10:10:01] !log installing glibc security updates on buster [10:10:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:26] (PS2) Clément Goubert: mediawiki: Create new mw-api-ext deployment [deployment-charts] - https://gerrit.wikimedia.org/r/853952 (https://phabricator.wikimedia.org/T321896) [10:11:23] (PS2) Muehlenhoff: Retire generic insetup role [puppet] - https://gerrit.wikimedia.org/r/852223 [10:14:33] moritzm: ^^ do we have a task for that CR?
[10:14:36] * vgutierrez curious about it [10:14:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T321130)', diff saved to https://phabricator.wikimedia.org/P38589 and previous config saved to /var/cache/conftool/dbconfig/20221108-101435-marostegui.json [10:14:38] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1162.eqiad.wmnet with reason: Maintenance [10:14:40] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [10:14:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1162.eqiad.wmnet with reason: Maintenance [10:14:55] (CR) Hnowlan: [C: +2] Encode before using hashlib [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/854029 (https://phabricator.wikimedia.org/T233196) (owner: Hnowlan) [10:14:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1162 (T321130)', diff saved to https://phabricator.wikimedia.org/P38590 and previous config saved to /var/cache/conftool/dbconfig/20221108-101457-marostegui.json [10:16:14] (CR) JMeybohm: [C: +1] Upgrade to 1.15.3 [debs/istio] - https://gerrit.wikimedia.org/r/853936 (https://phabricator.wikimedia.org/T322193) (owner: Elukey) [10:17:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T321130)', diff saved to https://phabricator.wikimedia.org/P38591 and previous config saved to /var/cache/conftool/dbconfig/20221108-101713-marostegui.json [10:17:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T321123)', diff saved to https://phabricator.wikimedia.org/P38592 and previous config saved to /var/cache/conftool/dbconfig/20221108-101745-marostegui.json [10:17:47] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1175.eqiad.wmnet with reason: Maintenance [10:17:49] T321123: Drop old index cuc_user_time on
cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [10:18:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1175.eqiad.wmnet with reason: Maintenance [10:18:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T321123)', diff saved to https://phabricator.wikimedia.org/P38593 and previous config saved to /var/cache/conftool/dbconfig/20221108-101806-marostegui.json [10:18:55] (PS1) Majavah: Drop old wikilabels database roles [puppet] - https://gerrit.wikimedia.org/r/854478 (https://phabricator.wikimedia.org/T307389) [10:18:57] (PS1) Majavah: openstack: drop wikilabels global dns entries [puppet] - https://gerrit.wikimedia.org/r/854479 (https://phabricator.wikimedia.org/T307389) [10:20:03] (CR) ArielGlenn: "What would be the testing procedure for db config reloads, if this feature is not enabled in e.g. deployment-prep? Just making sure we hav" [mediawiki-config] - https://gerrit.wikimedia.org/r/854090 (https://phabricator.wikimedia.org/T298485) (owner: Ahmon Dancy) [10:22:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P38594 and previous config saved to /var/cache/conftool/dbconfig/20221108-102249-ladsgroup.json [10:23:28] SRE, SRE-Access-Requests: Requesting access to ops and analytics for stevemunene - https://phabricator.wikimedia.org/T322339 (BTullis) Resolved→Open @fgiunchedi - Are you happy for me to add Steve to the `wmf` and `ops` LDAP groups? I realise that we didn't specify them above, but I think they ar...
[10:24:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T321123)', diff saved to https://phabricator.wikimedia.org/P38595 and previous config saved to /var/cache/conftool/dbconfig/20221108-102415-marostegui.json [10:24:19] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [10:25:30] (CR) Giuseppe Lavagetto: [C: -1] "Overall LGTM; some minor fixes" [deployment-charts] - https://gerrit.wikimedia.org/r/853975 (https://phabricator.wikimedia.org/T321900) (owner: Clément Goubert) [10:27:17] (Merged) jenkins-bot: Encode before using hashlib [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/854029 (https://phabricator.wikimedia.org/T233196) (owner: Hnowlan) [10:30:47] vgutierrez: there's no task, the idea is to make it easier to map the servers being setup to a specific team, the hardware request template will be extended to list the team in question. this will help DC ops in creating the puppet-the-server-in-prod task after it has been racked [10:32:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P38596 and previous config saved to /var/cache/conftool/dbconfig/20221108-103219-marostegui.json [10:37:05] (CR) Clément Goubert: "Thanks, will update the rest of the CRs for mw-*" [deployment-charts] - https://gerrit.wikimedia.org/r/853975 (https://phabricator.wikimedia.org/T321900) (owner: Clément Goubert) [10:37:25] (PS3) Clément Goubert: mediawiki: Create new mw-web deployment [deployment-charts] - https://gerrit.wikimedia.org/r/853975 (https://phabricator.wikimedia.org/T321900) [10:37:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T322618)', diff saved to https://phabricator.wikimedia.org/P38597 and previous config saved to /var/cache/conftool/dbconfig/20221108-103756-ladsgroup.json [10:37:58] !log
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2124.codfw.wmnet with reason: Maintenance [10:38:00] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [10:38:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2124.codfw.wmnet with reason: Maintenance [10:38:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2124 (T322618)', diff saved to https://phabricator.wikimedia.org/P38598 and previous config saved to /var/cache/conftool/dbconfig/20221108-103817-ladsgroup.json [10:39:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P38599 and previous config saved to /var/cache/conftool/dbconfig/20221108-103921-marostegui.json [10:39:34] (PS5) Phuedx: Update Metrics Platform streams [mediawiki-config] - https://gerrit.wikimedia.org/r/852838 (https://phabricator.wikimedia.org/T322277) [10:40:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T322618)', diff saved to https://phabricator.wikimedia.org/P38600 and previous config saved to /var/cache/conftool/dbconfig/20221108-104031-ladsgroup.json [10:42:40] !log installing ntfs-3g security updates [10:42:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:17] (PS3) Clément Goubert: mediawiki: Create new mw-api-ext deployment [deployment-charts] - https://gerrit.wikimedia.org/r/853952 (https://phabricator.wikimedia.org/T321896) [10:44:28] (PS3) Clément Goubert: mediawiki: Create new mw-api-int deployment [deployment-charts] - https://gerrit.wikimedia.org/r/853933 (https://phabricator.wikimedia.org/T321895) [10:45:00] SRE, SRE-Access-Requests: Requesting access to ops and analytics for stevemunene - https://phabricator.wikimedia.org/T322339 (fgiunchedi) >>!
In T322339#8378750, @BTullis wrote: > @fgiunchedi - Are you happy for me to add Steve to the `wmf` and `ops` LDAP groups? I realise that we didn't specify them abo... [10:46:13] (PS3) Clément Goubert: mediawiki: Create new mw-jobrunner deployment [deployment-charts] - https://gerrit.wikimedia.org/r/853958 (https://phabricator.wikimedia.org/T321897) [10:47:07] SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for KMorgan - https://phabricator.wikimedia.org/T322154 (fgiunchedi) Resolved→Open Thank you @Aklapper, I'll reopen the task and finish the steps [10:47:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P38601 and previous config saved to /var/cache/conftool/dbconfig/20221108-104726-marostegui.json [10:50:23] SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for KMorgan - https://phabricator.wikimedia.org/T322154 (fgiunchedi) If I'm not mistaken wmf-nda phab group membership was the only missing bit, does that seem correct @Aklapper ?
[10:51:13] !log installing batik security updates [10:51:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:29] SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for KMorgan - https://phabricator.wikimedia.org/T322154 (fgiunchedi) p:Triage→Medium [10:52:37] !log added stevemunene to wmf and ops LDAP groups T322339 [10:52:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:29] (PS1) JMeybohm: calico: Allow alternative dnsConfig and disabling of IPv6 [deployment-charts] - https://gerrit.wikimedia.org/r/854481 (https://phabricator.wikimedia.org/T307943) [10:54:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:54:21] (CR) CI reject: [V: -1] calico: Allow alternative dnsConfig and disabling of IPv6 [deployment-charts] - https://gerrit.wikimedia.org/r/854481 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm) [10:54:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P38602 and previous config saved to /var/cache/conftool/dbconfig/20221108-105428-marostegui.json [10:55:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P38603 and previous config saved to /var/cache/conftool/dbconfig/20221108-105538-ladsgroup.json [10:55:57] !log filippo@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "sync-mgmt - filippo@cumin1001" [10:56:47] (PS2) JMeybohm: calico: Allow alternative dnsConfig and disabling of IPv6 [deployment-charts] - https://gerrit.wikimedia.org/r/854481
(https://phabricator.wikimedia.org/T307943) [10:57:00] !log filippo@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "sync-mgmt - filippo@cumin1001" [10:57:14] SRE, SRE-Access-Requests: Requesting access to ops and analytics for stevemunene - https://phabricator.wikimedia.org/T322339 (BTullis) Open→Resolved Great! Thanks. I've added `uid=stevemunene ` to the `ops` and `wmf` LDAP groups, as per https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_r... [10:58:44] !log filippo@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "sync-mgmt - filippo@cumin1001" [10:59:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:00:00] (CR) Jbond: [C: +1] "lgtm and i also prefer ps2" [puppet] - https://gerrit.wikimedia.org/r/854059 (https://phabricator.wikimedia.org/T322298) (owner: Clément Goubert) [11:00:05] !log filippo@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "sync-mgmt - filippo@cumin1001" [11:00:36] !log drain ganeti1024 for eventual reimage to bullseye T311687 [11:00:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:39] T311687: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 [11:01:10] (PS1) Muehlenhoff: puppetdb::database: Add support for bookworm [puppet] - https://gerrit.wikimedia.org/r/854485 (https://phabricator.wikimedia.org/T321783) [11:02:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T321130)', diff saved to https://phabricator.wikimedia.org/P38604 and previous config saved to
/var/cache/conftool/dbconfig/20221108-110232-marostegui.json [11:02:34] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1170.eqiad.wmnet with reason: Maintenance [11:02:37] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1170.eqiad.wmnet with reason: Maintenance [11:02:37] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [11:02:41] (PS4) Filippo Giunchedi: dispatch: sync user role and info from LDAP [puppet] - https://gerrit.wikimedia.org/r/852992 (https://phabricator.wikimedia.org/T313229) [11:02:43] (PS1) Filippo Giunchedi: hieradata: set dispatch-be2001 as replica [puppet] - https://gerrit.wikimedia.org/r/854486 (https://phabricator.wikimedia.org/T313229) [11:02:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T321130)', diff saved to https://phabricator.wikimedia.org/P38605 and previous config saved to /var/cache/conftool/dbconfig/20221108-110243-marostegui.json [11:03:31] (PS1) Elukey: [WIP] - Add a basic puppetization for Benthos [puppet] - https://gerrit.wikimedia.org/r/854487 [11:04:49] (CR) Filippo Giunchedi: "Thank you for the reviews -- I'm for merging this and swap the org/project once that's a thing, thoughts ?"
[puppet] - https://gerrit.wikimedia.org/r/852992 (https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi) [11:04:59] (CR) Filippo Giunchedi: [C: +2] hieradata: set dispatch-be2001 as replica [puppet] - https://gerrit.wikimedia.org/r/854486 (https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi) [11:05:05] (PS2) Filippo Giunchedi: hieradata: set dispatch-be2001 as replica [puppet] - https://gerrit.wikimedia.org/r/854486 (https://phabricator.wikimedia.org/T313229) [11:06:06] (Abandoned) Filippo Giunchedi: timer::job: remove monitoring_enabled [puppet] - https://gerrit.wikimedia.org/r/849088 (https://phabricator.wikimedia.org/T303253) (owner: Filippo Giunchedi) [11:06:14] (Abandoned) Filippo Giunchedi: systemd: drop timer-specific alert in favor of generic alert [puppet] - https://gerrit.wikimedia.org/r/841924 (https://phabricator.wikimedia.org/T303253) (owner: Filippo Giunchedi) [11:09:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T321130)', diff saved to https://phabricator.wikimedia.org/P38606 and previous config saved to /var/cache/conftool/dbconfig/20221108-110911-marostegui.json [11:09:15] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [11:09:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T321123)', diff saved to https://phabricator.wikimedia.org/P38607 and previous config saved to /var/cache/conftool/dbconfig/20221108-110934-marostegui.json [11:09:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1179.eqiad.wmnet with reason: Maintenance [11:09:39] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [11:09:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1179.eqiad.wmnet with reason: Maintenance
[11:09:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T321123)', diff saved to https://phabricator.wikimedia.org/P38608 and previous config saved to /var/cache/conftool/dbconfig/20221108-110956-marostegui.json [11:10:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P38609 and previous config saved to /var/cache/conftool/dbconfig/20221108-111044-ladsgroup.json [11:14:40] (CR) Nikerabbit: [C: +1] Enable logging for UpdateMessageBundleJob (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/853357 (https://phabricator.wikimedia.org/T322430) (owner: Abijeet Patro) [11:14:41] SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for KMorgan - https://phabricator.wikimedia.org/T322154 (Aklapper) Yes. (On a meta level, I wonder how to make folks rely less on their memory but establish checking docs as docs may have changed over the years.) [11:14:45] (CR) Muehlenhoff: [C: +1] "Looks good, one nit."
[debs/istio] - https://gerrit.wikimedia.org/r/853936 (https://phabricator.wikimedia.org/T322193) (owner: Elukey) [11:16:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T321123)', diff saved to https://phabricator.wikimedia.org/P38610 and previous config saved to /var/cache/conftool/dbconfig/20221108-111602-marostegui.json [11:16:06] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [11:17:06] (PS3) Hnowlan: thumbor: enable setting log level, set staging to debug [deployment-charts] - https://gerrit.wikimedia.org/r/854026 (https://phabricator.wikimedia.org/T233196) [11:17:26] (CR) Majavah: [C: +1] add wmcs-securitygroup-backfill [puppet] - https://gerrit.wikimedia.org/r/850592 (https://phabricator.wikimedia.org/T288108) (owner: Andrew Bogott) [11:18:11] SRE, Commons, Data-Persistence (work done), MediaWiki-extensions-WikibaseClient, and 6 others: Enable statement usage tracking on Commons and Co - https://phabricator.wikimedia.org/T188730 (Manuel) [11:18:22] SRE-swift-storage: pristine-tar handles complex filenames badly - https://phabricator.wikimedia.org/T322549 (MatthewVernon) Fixed with this upload: ` Format: 1.8 Date: Fri, 04 Nov 2022 11:23:44 +0000 Source: pristine-tar Binary: pristine-tar Architecture: source Version: 1.50 Distribution: unstable Urgency:... [11:18:24] SRE-swift-storage: pristine-tar handles complex filenames badly - https://phabricator.wikimedia.org/T322549 (MatthewVernon) Open→Resolved [11:18:26] SRE-swift-storage: Update Debian rclone package to 1.60.0 - https://phabricator.wikimedia.org/T322547 (MatthewVernon) [11:18:48] SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for KMorgan - https://phabricator.wikimedia.org/T322154 (Urbanecm) >>! In T322154#8378965, @Aklapper wrote: > Yes.
(On a meta level, I wonder how to make folks rely less on their memory but establish checking docs as docs may have changed over the years... [11:19:46] SRE, Thumbor, serviceops, Security: Filter potentially harmful PostScript commands in Commons upload/thumbor - https://phabricator.wikimedia.org/T210833 (jijiki) [11:22:40] (CR) Hnowlan: [C: +2] thumbor: enable setting log level, set staging to debug [deployment-charts] - https://gerrit.wikimedia.org/r/854026 (https://phabricator.wikimedia.org/T233196) (owner: Hnowlan) [11:24:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P38611 and previous config saved to /var/cache/conftool/dbconfig/20221108-112417-marostegui.json [11:25:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T322618)', diff saved to https://phabricator.wikimedia.org/P38612 and previous config saved to /var/cache/conftool/dbconfig/20221108-112551-ladsgroup.json [11:25:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance [11:25:57] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [11:26:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance [11:26:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2129 (T322618)', diff saved to https://phabricator.wikimedia.org/P38613 and previous config saved to /var/cache/conftool/dbconfig/20221108-112612-ladsgroup.json [11:26:24] (Merged) jenkins-bot: thumbor: enable setting log level, set staging to debug [deployment-charts] - https://gerrit.wikimedia.org/r/854026 (https://phabricator.wikimedia.org/T233196) (owner: Hnowlan) [11:27:15] (PS1) Filippo Giunchedi: postgresql: move CLI to pg- namespace and dashes
[puppet] - https://gerrit.wikimedia.org/r/854493 [11:28:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T322618)', diff saved to https://phabricator.wikimedia.org/P38614 and previous config saved to /var/cache/conftool/dbconfig/20221108-112825-ladsgroup.json [11:28:29] (PS1) Majavah: P:metricsinfra::alertmanager: update default target [puppet] - https://gerrit.wikimedia.org/r/854494 [11:31:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P38615 and previous config saved to /var/cache/conftool/dbconfig/20221108-113109-marostegui.json [11:31:22] (CR) Filippo Giunchedi: "Adding a few folks as reviewers based on git history, feel free to add/remove as needed" [puppet] - https://gerrit.wikimedia.org/r/854493 (owner: Filippo Giunchedi) [11:32:36] (PS1) Stevemunene: Add Stevemunene to icinga [puppet] - https://gerrit.wikimedia.org/r/854495 (https://phabricator.wikimedia.org/T322339) [11:36:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST nodes) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:38:25] (CR) Giuseppe Lavagetto: New organization of templates (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/837495 (owner: Giuseppe Lavagetto) [11:38:31] (PS15) Giuseppe Lavagetto: New organization of templates [deployment-charts] - https://gerrit.wikimedia.org/r/837495 [11:38:54] (CR) CI reject: [V: -1] New organization of templates [deployment-charts] - https://gerrit.wikimedia.org/r/837495 (owner: Giuseppe Lavagetto) [11:39:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P38616 and previous config
saved to /var/cache/conftool/dbconfig/20221108-113924-marostegui.json [11:41:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST nodes) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:43:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P38617 and previous config saved to /var/cache/conftool/dbconfig/20221108-114333-ladsgroup.json [11:44:39] (03PS1) 10Btullis: Add namespaces for spark and spark-operator [deployment-charts] - 10https://gerrit.wikimedia.org/r/854498 (https://phabricator.wikimedia.org/T321686) [11:46:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P38618 and previous config saved to /var/cache/conftool/dbconfig/20221108-114615-marostegui.json [11:50:01] (03PS1) 10Elukey: [WIP] First prototype of webrequest-live [puppet] - 10https://gerrit.wikimedia.org/r/854499 [11:50:52] (03CR) 10Btullis: [C: 03+2] "Looks great. Thanks Steve. I'll +2 and merge this, since you don't have +2 rights on this repository." 
[puppet] - 10https://gerrit.wikimedia.org/r/854495 (https://phabricator.wikimedia.org/T322339) (owner: 10Stevemunene) [11:51:24] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38002/console" [puppet] - 10https://gerrit.wikimedia.org/r/854499 (owner: 10Elukey) [11:52:15] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [11:53:14] (03PS2) 10Hokwelum: Add poincare.acc.umu.se to ipv4 and ipv6 config [puppet] - 10https://gerrit.wikimedia.org/r/853965 [11:54:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T321130)', diff saved to https://phabricator.wikimedia.org/P38619 and previous config saved to /var/cache/conftool/dbconfig/20221108-115430-marostegui.json [11:54:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1182.eqiad.wmnet with reason: Maintenance [11:54:34] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [11:54:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1182.eqiad.wmnet with reason: Maintenance [11:54:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T321130)', diff saved to https://phabricator.wikimedia.org/P38620 and previous config saved to /var/cache/conftool/dbconfig/20221108-115452-marostegui.json [11:55:17] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/varnish/libvmod-re2] (debian-6.0) - 10https://gerrit.wikimedia.org/r/854063 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [11:55:48] (03CR) 10ArielGlenn: [C: 03+2] Add poincare.acc.umu.se to ipv4 and ipv6 config [puppet] - 10https://gerrit.wikimedia.org/r/853965 (owner: 10Hokwelum) [11:56:32] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 
handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [11:57:52] (03PS2) 10Elukey: [WIP] First prototype of webrequest-live [puppet] - 10https://gerrit.wikimedia.org/r/854499 [11:58:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P38621 and previous config saved to /var/cache/conftool/dbconfig/20221108-115840-ladsgroup.json [11:58:45] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38003/console" [puppet] - 10https://gerrit.wikimedia.org/r/854499 (owner: 10Elukey) [12:00:33] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/852223 (owner: 10Muehlenhoff) [12:00:57] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [12:01:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T321130)', diff saved to https://phabricator.wikimedia.org/P38622 and previous config saved to /var/cache/conftool/dbconfig/20221108-120117-marostegui.json [12:01:21] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [12:01:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T321123)', diff saved to https://phabricator.wikimedia.org/P38623 and 
previous config saved to /var/cache/conftool/dbconfig/20221108-120122-marostegui.json [12:01:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1189.eqiad.wmnet with reason: Maintenance [12:01:26] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [12:01:37] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1189.eqiad.wmnet with reason: Maintenance [12:01:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1189 (T321123)', diff saved to https://phabricator.wikimedia.org/P38624 and previous config saved to /var/cache/conftool/dbconfig/20221108-120143-marostegui.json [12:02:19] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [12:03:01] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [12:07:07] (03PS2) 10Esanders: Keep DiscussionTools "Share feedback..." links on WMF wikis for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853991 (https://phabricator.wikimedia.org/T322494) [12:07:22] (03PS3) 10Esanders: Keep DiscussionTools "Share feedback..." 
links on WMF wikis for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853991 (https://phabricator.wikimedia.org/T322494) [12:08:17] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [12:09:15] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [12:09:49] (03CR) 10Esanders: [C: 03+1] Enable history page visual diffs on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833831 (owner: 10Esanders) [12:09:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T321123)', diff saved to https://phabricator.wikimedia.org/P38625 and previous config saved to /var/cache/conftool/dbconfig/20221108-120949-marostegui.json [12:09:52] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [12:10:17] (03PS1) 10Muehlenhoff: apereo_cas: Install libmemcached-tools [puppet] - 10https://gerrit.wikimedia.org/r/854504 [12:10:45] (03CR) 10Jbond: [C: 03+1] "lgtm there may be some other config tweaks needed if parames have been renamed/deprecated etc but shld be able to catch them as we progres" [puppet] - 10https://gerrit.wikimedia.org/r/854485 (https://phabricator.wikimedia.org/T321783) (owner: 10Muehlenhoff) [12:13:05] (03PS1) 10Btullis: Configure the kube_env file for the spark-operator namespace [puppet] - 10https://gerrit.wikimedia.org/r/854505 (https://phabricator.wikimedia.org/T321686) [12:13:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T322618)', diff saved to https://phabricator.wikimedia.org/P38626 and previous config saved to /var/cache/conftool/dbconfig/20221108-121347-ladsgroup.json [12:13:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2141.codfw.wmnet with reason: Maintenance [12:13:51] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [12:14:02] 
!log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2141.codfw.wmnet with reason: Maintenance [12:14:10] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [12:14:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance [12:14:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance [12:14:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2095.codfw.wmnet with reason: Maintenance [12:14:25] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [12:14:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2095.codfw.wmnet with reason: Maintenance [12:14:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2158 (T322618)', diff saved to https://phabricator.wikimedia.org/P38627 and previous config saved to /var/cache/conftool/dbconfig/20221108-121433-ladsgroup.json [12:14:39] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/854493 (owner: 10Filippo Giunchedi) [12:14:50] jouncebot: nowandnext [12:14:50] No deployments scheduled for the next 1 hour(s) and 45 minute(s) [12:14:50] In 1 hour(s) and 45 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221108T1400) [12:14:50] In 1 hour(s) and 45 minute(s): Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221108T1400) [12:14:54] coool [12:15:00] (03PS1) 10Hnowlan: Preserve message when setting HTTP status [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/854506 [12:15:03] (03CR) 10Ladsgroup: [C: 03+2] "This change is ready for review." 
[core] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/854077 (https://phabricator.wikimedia.org/T274041) (owner: 10Ladsgroup) [12:15:06] (03PS1) 10Clément Goubert: eventgate: Fix canary release routing [deployment-charts] - 10https://gerrit.wikimedia.org/r/854507 [12:15:21] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38004/console" [puppet] - 10https://gerrit.wikimedia.org/r/854505 (https://phabricator.wikimedia.org/T321686) (owner: 10Btullis) [12:16:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P38628 and previous config saved to /var/cache/conftool/dbconfig/20221108-121623-marostegui.json [12:16:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T322618)', diff saved to https://phabricator.wikimedia.org/P38629 and previous config saved to /var/cache/conftool/dbconfig/20221108-121646-ladsgroup.json [12:18:09] (03CR) 10Muehlenhoff: puppetdb::database: Add support for bookworm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/854485 (https://phabricator.wikimedia.org/T321783) (owner: 10Muehlenhoff) [12:18:11] (03CR) 10Muehlenhoff: [C: 03+2] puppetdb::database: Add support for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/854485 (https://phabricator.wikimedia.org/T321783) (owner: 10Muehlenhoff) [12:19:40] (03CR) 10Jbond: [C: 03+1] "lgtm, sorry must have missed this one" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/841864 (owner: 10Muehlenhoff) [12:19:51] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/854504 (owner: 10Muehlenhoff) [12:21:00] (03PS4) 10Clément Goubert: mediawiki: Create new mw-jobrunner deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/853958 (https://phabricator.wikimedia.org/T321897) [12:21:21] (03CR) 10Muehlenhoff: [C: 03+2] apereo_cas: Install 
libmemcached-tools [puppet] - 10https://gerrit.wikimedia.org/r/854504 (owner: 10Muehlenhoff) [12:21:23] 10SRE, 10SRE-swift-storage, 10Performance-Team, 10Traffic, 10Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10Ladsgroup) Some ideas on to move forward here: - {T322621} - {T322622} - Remove 1024px and 1920px from pre-generated thumbs, t... [12:22:28] (03CR) 10Ssingh: [C: 03+2] Release 1.5.3-3 [software/varnish/libvmod-re2] (debian-6.0) - 10https://gerrit.wikimedia.org/r/854063 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [12:23:40] (03PS4) 10Clément Goubert: mediawiki: Create new mw-api-int deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/853933 (https://phabricator.wikimedia.org/T321895) [12:23:58] (03PS1) 10Ladsgroup: Re-add s11 in db config reload callback [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854509 (https://phabricator.wikimedia.org/T322598) [12:24:11] (03PS4) 10Clément Goubert: mediawiki: Create new mw-api-ext deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/853952 (https://phabricator.wikimedia.org/T321896) [12:24:35] (03PS4) 10Clément Goubert: mediawiki: Create new mw-web deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/853975 (https://phabricator.wikimedia.org/T321900) [12:24:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P38630 and previous config saved to /var/cache/conftool/dbconfig/20221108-122455-marostegui.json [12:25:29] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp1002.wikimedia.org [12:27:09] (03PS4) 10Jbond: nodegen: only analyse manifets files for auto selector [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/854002 [12:27:11] (03PS1) 10Jbond: tests: fix pip backtracking [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/854510 [12:27:19] !log reprepro -C 
main include bullseye-wikimedia libvmod-re2_1.5.3-3_amd64.changes: T321309 [12:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:23] T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 [12:27:49] (03CR) 10Ladsgroup: Only Enable LBFactory config callback in CLI in production (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854090 (https://phabricator.wikimedia.org/T298485) (owner: 10Ahmon Dancy) [12:28:23] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [12:28:40] (03CR) 10CI reject: [V: 04-1] nodegen: only analyse manifets files for auto selector [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/854002 (owner: 10Jbond) [12:29:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [12:29:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [12:29:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3316 (T322618)', diff saved to https://phabricator.wikimedia.org/P38631 and previous config saved to /var/cache/conftool/dbconfig/20221108-122923-ladsgroup.json [12:29:26] (03CR) 10CI reject: [V: 04-1] tests: fix pip backtracking [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/854510 (owner: 10Jbond) [12:29:27] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [12:29:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp1002.wikimedia.org [12:31:24] (03Merged) 10jenkins-bot: Include core PSR-4 classes in the generated classmap [core] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/854077 (https://phabricator.wikimedia.org/T274041) (owner: 10Ladsgroup) [12:31:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db1182', diff saved to https://phabricator.wikimedia.org/P38632 and previous config saved to /var/cache/conftool/dbconfig/20221108-123130-marostegui.json [12:31:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P38633 and previous config saved to /var/cache/conftool/dbconfig/20221108-123152-ladsgroup.json [12:32:01] (03PS5) 10Jbond: nodegen: only analyse manifets files for auto selector [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/854002 [12:33:01] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [12:33:30] (03CR) 10CI reject: [V: 04-1] nodegen: only analyse manifets files for auto selector [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/854002 (owner: 10Jbond) [12:34:27] (03PS2) 10Jbond: tests: fix pip backtracking [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/854510 [12:34:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T322618)', diff saved to https://phabricator.wikimedia.org/P38634 and previous config saved to /var/cache/conftool/dbconfig/20221108-123433-ladsgroup.json [12:34:37] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [12:34:39] (03PS6) 10Jbond: nodegen: only analyse manifets files for auto selector [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/854002 [12:35:48] (03CR) 10CI reject: [V: 04-1] tests: fix pip backtracking [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/854510 (owner: 10Jbond) [12:36:23] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host pki2002.codfw.wmnet [12:36:27] (03CR) 10JMeybohm: [C: 03+2] calico: Allow alternative dnsConfig and disabling of IPv6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/854481 (https://phabricator.wikimedia.org/T307943) 
(owner: 10JMeybohm) [12:36:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [12:37:09] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/854077 (https://phabricator.wikimedia.org/T274041) (owner: 10Ladsgroup) [12:37:22] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:854077|Include core PSR-4 classes in the generated classmap (T274041)]] [12:37:25] T274041: Reduce performance impact of HookRunner.php loading 500+ interfaces - https://phabricator.wikimedia.org/T274041 [12:37:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [12:37:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [12:37:42] !log ladsgroup@deploy1002 ladsgroup and ladsgroup: Backport for [[gerrit:854077|Include core PSR-4 classes in the generated classmap (T274041)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [12:38:00] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Update README.Debian to reflect latest changes for U2F/6.6/OIDC [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/841864 (owner: 10Muehlenhoff) [12:38:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [12:38:27] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [12:38:33] (03CR) 10Muehlenhoff: [C: 03+2] Retire generic insetup role [puppet] - 10https://gerrit.wikimedia.org/r/852223 (owner: 10Muehlenhoff) [12:38:45] (03PS3) 10Jbond: tests: fix pip backtracking [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/854510 [12:39:22] (03PS7) 10Jbond: nodegen: only analyse manifets files for auto selector [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/854002 [12:39:30] (03CR) 10Hnowlan: [C: 03+1] "lgtm, nice 
change. Don't forget to update the postgres-init cookbook" [puppet] - 10https://gerrit.wikimedia.org/r/854493 (owner: 10Filippo Giunchedi) [12:39:57] (03Merged) 10jenkins-bot: calico: Allow alternative dnsConfig and disabling of IPv6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/854481 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [12:40:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P38635 and previous config saved to /var/cache/conftool/dbconfig/20221108-124001-marostegui.json [12:40:17] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [12:40:30] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [12:40:45] (03CR) 10Btullis: [C: 03+2] Add namespaces for spark and spark-operator [deployment-charts] - 10https://gerrit.wikimedia.org/r/854498 (https://phabricator.wikimedia.org/T321686) (owner: 10Btullis) [12:41:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host pki2002.codfw.wmnet [12:42:51] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:854077|Include core PSR-4 classes in the generated classmap (T274041)]] (duration: 05m 29s) [12:42:54] T274041: Reduce performance impact of HookRunner.php loading 500+ interfaces - https://phabricator.wikimedia.org/T274041 [12:43:22] (03CR) 10Jbond: [C: 03+2] tests: fix pip backtracking [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/854510 (owner: 10Jbond) [12:43:26] (03CR) 10Jbond: [C: 03+2] nodegen: only analyse manifets files for auto selector [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/854002 (owner: 10Jbond) [12:43:55] (03CR) 10Filippo Giunchedi: [C: 03+2] "Thank you for the quick reviews! 
Good point re: the cookbook, I'll followup" [puppet] - 10https://gerrit.wikimedia.org/r/854493 (owner: 10Filippo Giunchedi) [12:44:38] (03Merged) 10jenkins-bot: Add namespaces for spark and spark-operator [deployment-charts] - 10https://gerrit.wikimedia.org/r/854498 (https://phabricator.wikimedia.org/T321686) (owner: 10Btullis) [12:45:54] (03PS1) 10Filippo Giunchedi: sre: use pg-resync-replica in postgres-init [cookbooks] - 10https://gerrit.wikimedia.org/r/854511 [12:46:09] (03CR) 10CI reject: [V: 04-1] tests: fix pip backtracking [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/854510 (owner: 10Jbond) [12:46:11] (03CR) 10CI reject: [V: 04-1] nodegen: only analyse manifets files for auto selector [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/854002 (owner: 10Jbond) [12:46:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T321130)', diff saved to https://phabricator.wikimedia.org/P38636 and previous config saved to /var/cache/conftool/dbconfig/20221108-124636-marostegui.json [12:46:38] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1188.eqiad.wmnet with reason: Maintenance [12:46:41] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [12:46:52] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1188.eqiad.wmnet with reason: Maintenance [12:46:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1188 (T321130)', diff saved to https://phabricator.wikimedia.org/P38637 and previous config saved to /var/cache/conftool/dbconfig/20221108-124658-marostegui.json [12:46:59] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [12:46:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P38638 and previous config saved to 
/var/cache/conftool/dbconfig/20221108-124659-ladsgroup.json [12:47:16] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [12:48:26] https://usercontent.irccloud-cdn.com/file/2dDLrxtb/image.png [12:48:40] _joe_: ^ context T274041 [12:48:41] T274041: Reduce performance impact of HookRunner.php loading 500+ interfaces - https://phabricator.wikimedia.org/T274041 [12:48:42] (03PS4) 10Jbond: tests: fix pip backtracking [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/854510 [12:49:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T321130)', diff saved to https://phabricator.wikimedia.org/P38639 and previous config saved to /var/cache/conftool/dbconfig/20221108-124914-marostegui.json [12:49:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P38640 and previous config saved to /var/cache/conftool/dbconfig/20221108-124939-ladsgroup.json [12:49:55] (03PS5) 10Jbond: tests: fix pip backtracking [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/854510 [12:51:02] (03PS1) 10Hnowlan: thumbor: Use environment variables for config [deployment-charts] - 10https://gerrit.wikimedia.org/r/854512 (https://phabricator.wikimedia.org/T233196) [12:51:57] (03CR) 10CI reject: [V: 04-1] tests: fix pip backtracking [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/854510 (owner: 10Jbond) [12:52:03] PROBLEM - Varnish HTTP text-frontend - port 80 on cp1075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [12:54:03] (03PS6) 10Jbond: tests: fix pip backtracking [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/854510 [12:55:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T321123)', diff saved to https://phabricator.wikimedia.org/P38641 and previous config saved to 
/var/cache/conftool/dbconfig/20221108-125508-marostegui.json [12:55:10] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1198.eqiad.wmnet with reason: Maintenance [12:55:12] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [12:55:23] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1198.eqiad.wmnet with reason: Maintenance [12:55:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1198 (T321123)', diff saved to https://phabricator.wikimedia.org/P38642 and previous config saved to /var/cache/conftool/dbconfig/20221108-125529-marostegui.json [12:55:55] RECOVERY - Varnish HTTP text-frontend - port 80 on cp1075 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 3.174 second response time https://wikitech.wikimedia.org/wiki/Varnish [12:56:25] (03CR) 10CI reject: [V: 04-1] tests: fix pip backtracking [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/854510 (owner: 10Jbond) [12:58:07] (03PS7) 10Jbond: tests: fix pip backtracking [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/854510 [12:58:24] <_joe_> Amir1: oh that was already deployed? 
[12:58:43] I just did :D [12:59:49] (PS2) Filippo Giunchedi: sre: use pg-resync-replica in postgres-init [cookbooks] - https://gerrit.wikimedia.org/r/854511 [13:01:11] (CR) Jbond: [C: +2] tests: fix pip backtracking [software/puppet-compiler] - https://gerrit.wikimedia.org/r/854510 (owner: Jbond) [13:01:19] (PS8) Jbond: nodegen: only analyse manifets files for auto selector [software/puppet-compiler] - https://gerrit.wikimedia.org/r/854002 [13:01:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T321123)', diff saved to https://phabricator.wikimedia.org/P38643 and previous config saved to /var/cache/conftool/dbconfig/20221108-130136-marostegui.json [13:01:41] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [13:02:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T322618)', diff saved to https://phabricator.wikimedia.org/P38644 and previous config saved to /var/cache/conftool/dbconfig/20221108-130205-ladsgroup.json [13:02:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance [13:02:09] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [13:02:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance [13:02:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3316 (T322618)', diff saved to https://phabricator.wikimedia.org/P38645 and previous config saved to /var/cache/conftool/dbconfig/20221108-130216-ladsgroup.json [13:03:22] (CR) Jbond: [C: +1] "lgtm" [cookbooks] - https://gerrit.wikimedia.org/r/854511 (owner: Filippo Giunchedi) [13:03:35] (Merged) jenkins-bot: tests: fix pip backtracking [software/puppet-compiler] - 
https://gerrit.wikimedia.org/r/854510 (owner: Jbond) [13:04:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P38646 and previous config saved to /var/cache/conftool/dbconfig/20221108-130420-marostegui.json [13:04:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T322618)', diff saved to https://phabricator.wikimedia.org/P38647 and previous config saved to /var/cache/conftool/dbconfig/20221108-130429-ladsgroup.json [13:04:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P38648 and previous config saved to /var/cache/conftool/dbconfig/20221108-130446-ladsgroup.json [13:05:26] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 15557 [13:07:15] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 15557 [13:10:06] (CR) Svantje Lilienthal: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/854513 (https://phabricator.wikimedia.org/T321548) (owner: Svantje Lilienthal) [13:11:25] SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for KMorgan - https://phabricator.wikimedia.org/T322154 (fgiunchedi) Open→Resolved >>! In T322154#8378981, @Urbanecm wrote: >>>! In T322154#8378965, @Aklapper wrote: >> Yes. (On a meta level, I wonder how to make folks rely less on their memory... 
[13:11:40] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow4002.ulsfo.wmnet [13:15:43] (PS5) Jbond: directories: add change id to the output dir [software/puppet-compiler] - https://gerrit.wikimedia.org/r/851118 [13:15:45] (PS13) Jbond: puppet_compiler.differ: add support to filter by core type [software/puppet-compiler] - https://gerrit.wikimedia.org/r/746947 (https://phabricator.wikimedia.org/T245828) [13:15:47] (PS7) Jbond: worker: store catalogs as gziped file [software/puppet-compiler] - https://gerrit.wikimedia.org/r/852280 (https://phabricator.wikimedia.org/T222075) [13:15:49] (PS9) Jbond: controller: fix get_states to avoid list reordering [software/puppet-compiler] - https://gerrit.wikimedia.org/r/852284 (https://phabricator.wikimedia.org/T224977) [13:15:51] (PS4) Jbond: differ: add support for concat_fragment [software/puppet-compiler] - https://gerrit.wikimedia.org/r/852984 (https://phabricator.wikimedia.org/T286255) [13:15:53] (PS7) Jbond: prepare: Allow specify a private repo change [software/puppet-compiler] - https://gerrit.wikimedia.org/r/852993 (https://phabricator.wikimedia.org/T265633) [13:15:55] (PS3) Jbond: controller: Add option for basic pcc run [software/puppet-compiler] - https://gerrit.wikimedia.org/r/853382 (https://phabricator.wikimedia.org/T289666) [13:15:57] (PS6) Jbond: 2.5.0: prepare release [software/puppet-compiler] - https://gerrit.wikimedia.org/r/852837 [13:16:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P38649 and previous config saved to /var/cache/conftool/dbconfig/20221108-131643-marostegui.json [13:16:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow4002.ulsfo.wmnet [13:17:06] (PS1) Hashar: gerrit: remove git gc aggressive [puppet] - https://gerrit.wikimedia.org/r/854514 
(https://phabricator.wikimedia.org/T237807) [13:17:48] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow3002.esams.wmnet [13:18:26] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Enable show nearby feature for ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854513 (https://phabricator.wikimedia.org/T321548) (owner: 10Svantje Lilienthal) [13:19:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P38650 and previous config saved to /var/cache/conftool/dbconfig/20221108-131927-marostegui.json [13:19:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P38651 and previous config saved to /var/cache/conftool/dbconfig/20221108-131936-ladsgroup.json [13:19:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T322618)', diff saved to https://phabricator.wikimedia.org/P38652 and previous config saved to /var/cache/conftool/dbconfig/20221108-131952-ladsgroup.json [13:19:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [13:19:56] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [13:20:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [13:20:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T322618)', diff saved to https://phabricator.wikimedia.org/P38653 and previous config saved to /var/cache/conftool/dbconfig/20221108-132014-ladsgroup.json [13:22:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T322618)', diff saved to https://phabricator.wikimedia.org/P38654 and previous config saved to 
/var/cache/conftool/dbconfig/20221108-132223-ladsgroup.json [13:23:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow3002.esams.wmnet [13:31:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P38655 and previous config saved to /var/cache/conftool/dbconfig/20221108-133149-marostegui.json [13:34:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T321130)', diff saved to https://phabricator.wikimedia.org/P38656 and previous config saved to /var/cache/conftool/dbconfig/20221108-133433-marostegui.json [13:34:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1197.eqiad.wmnet with reason: Maintenance [13:34:37] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [13:34:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P38657 and previous config saved to /var/cache/conftool/dbconfig/20221108-133442-ladsgroup.json [13:34:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1197.eqiad.wmnet with reason: Maintenance [13:35:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1197 (T321130)', diff saved to https://phabricator.wikimedia.org/P38658 and previous config saved to /var/cache/conftool/dbconfig/20221108-133505-marostegui.json [13:37:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T321130)', diff saved to https://phabricator.wikimedia.org/P38659 and previous config saved to /var/cache/conftool/dbconfig/20221108-133721-marostegui.json [13:37:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P38660 and previous config saved to 
/var/cache/conftool/dbconfig/20221108-133730-ladsgroup.json [13:44:53] (03PS2) 10Elukey: [WIP] - Add a basic puppetization for Benthos [puppet] - 10https://gerrit.wikimedia.org/r/854487 [13:44:55] (03PS3) 10Elukey: [WIP] First prototype of webrequest-live [puppet] - 10https://gerrit.wikimedia.org/r/854499 [13:46:15] (03CR) 10Elukey: [C: 03+1] cfssl-issuer: Move from single to multiple files for CRDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/838135 (https://phabricator.wikimedia.org/T310486) (owner: 10JMeybohm) [13:46:48] (03CR) 10Elukey: [C: 03+1] cfssl-issuer: Bump CRD chart version for cfssl-issuer update (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/838136 (owner: 10JMeybohm) [13:46:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T321123)', diff saved to https://phabricator.wikimedia.org/P38661 and previous config saved to /var/cache/conftool/dbconfig/20221108-134656-marostegui.json [13:46:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [13:47:02] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [13:47:11] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [13:47:12] (03CR) 10Elukey: [C: 03+1] cfssl-issuer: Bump version and fix dependencies [deployment-charts] - 10https://gerrit.wikimedia.org/r/838137 (https://phabricator.wikimedia.org/T310486) (owner: 10JMeybohm) [13:49:34] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:49:39] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[13:49:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T322618)', diff saved to https://phabricator.wikimedia.org/P38662 and previous config saved to /var/cache/conftool/dbconfig/20221108-134949-ladsgroup.json [13:49:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2171.codfw.wmnet with reason: Maintenance [13:49:53] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [13:50:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2171.codfw.wmnet with reason: Maintenance [13:50:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3316 (T322618)', diff saved to https://phabricator.wikimedia.org/P38663 and previous config saved to /var/cache/conftool/dbconfig/20221108-135011-ladsgroup.json [13:50:21] (03CR) 10Filippo Giunchedi: "LGTM overall! See inline too" [puppet] - 10https://gerrit.wikimedia.org/r/854487 (owner: 10Elukey) [13:51:10] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2105.codfw.wmnet with reason: Maintenance [13:51:23] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2105.codfw.wmnet with reason: Maintenance [13:51:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2105 (T321123)', diff saved to https://phabricator.wikimedia.org/P38664 and previous config saved to /var/cache/conftool/dbconfig/20221108-135129-marostegui.json [13:52:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T322618)', diff saved to https://phabricator.wikimedia.org/P38665 and previous config saved to /var/cache/conftool/dbconfig/20221108-135224-ladsgroup.json [13:52:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P38666 and previous config saved to 
/var/cache/conftool/dbconfig/20221108-135234-marostegui.json [13:53:14] (03CR) 10Btullis: [C: 03+1] "Looks good to me. Can we please arrange to do a deployment under test conditions and verify that traffic is passed to the canary after the" [deployment-charts] - 10https://gerrit.wikimedia.org/r/854507 (owner: 10Clément Goubert) [13:54:17] !log dcausse@deploy1002 Started deploy [wikimedia/discovery/analytics@248d897]: import_cirrus_indexes: increase driver mem [13:56:40] !log dcausse@deploy1002 Finished deploy [wikimedia/discovery/analytics@248d897]: import_cirrus_indexes: increase driver mem (duration: 02m 23s) [14:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: gettimeofday() says it's time for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221108T1400) [14:00:04] duesen: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:04] Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221108T1400) [14:00:26] o/ [14:00:26] I can deploy today [14:00:28] duesen: hi [14:00:37] (03PS1) 10JMeybohm: calico: Typha does no longer ship with dedicated RBAC rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/854520 (https://phabricator.wikimedia.org/T307943) [14:01:43] urbanecm: hi! [14:01:48] let's get started [14:02:01] urbanecm: ok. can i do it? [14:02:10] duesen: certainly [14:02:16] ok [14:02:32] duesen: fwiw there's a new `scap backport` , in case you haven't heard back [14:02:35] 10SRE-OnFire, 10observability: Provide mechanism to join/leave oncall - https://phabricator.wikimedia.org/T322636 (10ayounsi) [14:02:44] *that [14:02:50] (it takes the change number as its parameter, ie. 
`scap backport 854073`) [14:03:18] (03PS1) 10Volans: json-webrequests-stats: add -t/--time-range [puppet] - 10https://gerrit.wikimedia.org/r/854521 [14:04:57] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by daniel@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/854073 (https://phabricator.wikimedia.org/T321862) (owner: 10Daniel Kinzler) [14:05:07] (03PS2) 10JMeybohm: calico: Typha does no longer ship with dedicated RBAC rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/854520 (https://phabricator.wikimedia.org/T307943) [14:05:23] urbanecm: yes, scap backport is awesome :) [14:05:27] indeed it is [14:05:40] merging will take a while, i'll go and fix myself a coffee [14:06:19] btw, what would happen if my connection breaks while scap backport is running? [14:06:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T321123)', diff saved to https://phabricator.wikimedia.org/P38667 and previous config saved to /var/cache/conftool/dbconfig/20221108-140628-marostegui.json [14:06:29] 10SRE-OnFire, 10observability: Provide mechanism to join/leave oncall - https://phabricator.wikimedia.org/T322636 (10ayounsi) [14:06:31] I'm using screen now, but i sometimes forget [14:06:33] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [14:07:15] duesen: i'm not sure. someone from releng would know that.
[14:07:24] it's ok if it happens before the patch is merged, not sure after that [14:07:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P38668 and previous config saved to /var/cache/conftool/dbconfig/20221108-140730-ladsgroup.json [14:07:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P38669 and previous config saved to /var/cache/conftool/dbconfig/20221108-140741-marostegui.json [14:07:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [14:07:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [14:08:02] fwiw it happened to me in the past, and i experienced both results (deployment finished w/o me seeing the output, and deployment stopped half way through) [14:08:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T322618)', diff saved to https://phabricator.wikimedia.org/P38670 and previous config saved to /var/cache/conftool/dbconfig/20221108-140803-ladsgroup.json [14:08:07] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [14:08:08] and I'm not sure what's the expected result [14:08:23] (03CR) 10Hnowlan: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/854511 (owner: 10Filippo Giunchedi) [14:08:56] yea, not sure either [14:09:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T322618)', diff saved to https://phabricator.wikimedia.org/P38671 and previous config saved to /var/cache/conftool/dbconfig/20221108-140912-ladsgroup.json [14:10:50] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: use pg-resync-replica in postgres-init [cookbooks] - 
10https://gerrit.wikimedia.org/r/854511 (owner: 10Filippo Giunchedi) [14:11:01] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [14:11:10] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [14:13:11] urbanecm: crud, I see a bunch of test failures [14:13:18] :-/ [14:13:37] that's...a lot of failures [14:13:43] AutoLoaderStructureTest [14:13:48] it's just one failure [14:13:59] but why didn't it fail in CI?... [14:14:06] it's an easy fix [14:14:15] and why did it pass in master [14:14:37] you probably needed to rebase I think [14:14:58] (03CR) 10JMeybohm: [C: 03+2] "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/826810 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [14:14:58] I deployed something about that an hour ago or so [14:15:02] aha [14:15:19] (03CR) 10Klausman: [C: 03+1] Upgrade to 1.15.3 [debs/istio] - 10https://gerrit.wikimedia.org/r/853936 (https://phabricator.wikimedia.org/T322193) (owner: 10Elukey) [14:15:20] T274041 [14:15:20] T274041: Reduce performance impact of HookRunner.php loading 500+ interfaces - https://phabricator.wikimedia.org/T274041 [14:15:24] (03PS4) 10Urbanecm: Stash original wikitext when rendering unsaved content. [core] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/854073 (https://phabricator.wikimedia.org/T321862) (owner: 10Daniel Kinzler) [14:15:28] (03CR) 10Urbanecm: [C: 03+2] Stash original wikitext when rendering unsaved content. [core] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/854073 (https://phabricator.wikimedia.org/T321862) (owner: 10Daniel Kinzler) [14:15:41] Amir1: oh, I see [14:15:42] ok [14:15:44] duesen: i rebased & kicked the jobs, scap backport should survive that. 
[14:15:46] (03CR) 10Vlad.shapik: [C: 03+1] thumbor: Use environment variables for config [deployment-charts] - 10https://gerrit.wikimedia.org/r/854512 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [14:17:09] (03PS1) 10Marostegui: pc1011: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/854523 [14:17:20] urbanecm: ah, that's why my attempt to rebase failed ;) thanks! [14:17:38] heh, sorry for that :)) [14:18:07] (03CR) 10Marostegui: [C: 03+2] pc1011: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/854523 (owner: 10Marostegui) [14:21:08] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [14:21:24] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [14:21:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P38672 and previous config saved to /var/cache/conftool/dbconfig/20221108-142135-marostegui.json [14:22:11] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [14:22:26] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [14:22:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P38673 and previous config saved to /var/cache/conftool/dbconfig/20221108-142236-ladsgroup.json [14:22:39] (03CR) 10JMeybohm: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/838137 (https://phabricator.wikimedia.org/T310486) (owner: 10JMeybohm) [14:22:46] (03PS1) 10Marostegui: Revert "pc1011: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/854078 [14:22:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T321130)', diff saved to https://phabricator.wikimedia.org/P38674 and previous config saved to /var/cache/conftool/dbconfig/20221108-142247-marostegui.json [14:22:49] 
!log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [14:22:52] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [14:22:52] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [14:23:05] (03CR) 10Volans: "Given that two people suggested the same feature request when using it for the first time, I thought to add it." [puppet] - 10https://gerrit.wikimedia.org/r/854521 (owner: 10Volans) [14:23:50] (03CR) 10Marostegui: [C: 03+2] Revert "pc1011: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/854078 (owner: 10Marostegui) [14:24:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P38675 and previous config saved to /var/cache/conftool/dbconfig/20221108-142419-ladsgroup.json [14:24:28] urbanecm: I still see the same failures... is the old zuul job still running? or did the rebase not help? [14:24:58] duesen: https://integration.wikimedia.org/zuul/ says it runs for 854073,4, and PS4 is post-rebase [14:25:04] so it looks like rebase didn't do the trick :/ [14:25:11] I'm confused about the error as well... [14:25:15] what gives?... 
[14:25:56] (03CR) 10Awight: [C: 03+1] Enable show nearby feature for ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854513 (https://phabricator.wikimedia.org/T321548) (owner: 10Svantje Lilienthal) [14:27:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2097.codfw.wmnet with reason: Maintenance [14:27:22] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2097.codfw.wmnet with reason: Maintenance [14:27:47] duesen: load the patch locally and run the maint script to regenerate autoload.php [14:28:39] (03CR) 10Hnowlan: [C: 03+2] thumbor: Use environment variables for config [deployment-charts] - 10https://gerrit.wikimedia.org/r/854512 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [14:29:03] urbanecm: sorry, i got myself into a complete mess with gerrit [14:29:10] duesen: can i help somehow? [14:29:20] i keep forgetting that i have to be very careful about the extension submodules when working with deployment branches :( [14:29:53] (03CR) 10CI reject: [V: 04-1] Stash original wikitext when rendering unsaved content. [core] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/854073 (https://phabricator.wikimedia.org/T321862) (owner: 10Daniel Kinzler) [14:30:07] urbanecm: i'll just start over [14:30:09] * urbanecm usually just does a fresh clone to not mess my main clone [14:30:10] ack [14:30:39] (03PS5) 10Daniel Kinzler: Stash original wikitext when rendering unsaved content. 
[core] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/854073 (https://phabricator.wikimedia.org/T321862) [14:31:24] urbanecm: gah, found another issue :( [14:31:52] :( [14:32:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2104.codfw.wmnet with reason: Maintenance [14:32:12] (03Merged) 10jenkins-bot: thumbor: Use environment variables for config [deployment-charts] - 10https://gerrit.wikimedia.org/r/854512 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [14:32:14] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2104.codfw.wmnet with reason: Maintenance [14:32:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2104 (T321130)', diff saved to https://phabricator.wikimedia.org/P38676 and previous config saved to /var/cache/conftool/dbconfig/20221108-143220-marostegui.json [14:32:24] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [14:33:18] urbanecm: autoload generation seems to be completely messed up, no idea what's happening... 
[14:33:47] 10SRE, 10MW-on-K8s, 10Observability-Logging, 10serviceops: Keep calculating latencies for MediaWiki requests in the WikiKube environment - https://phabricator.wikimedia.org/T276095 (10akosiaris) [14:33:48] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [14:33:53] :( [14:33:57] * urbanecm neither [14:34:41] (03PS3) 10Elukey: [WIP] Add a basic puppetization for Benthos [puppet] - 10https://gerrit.wikimedia.org/r/854487 [14:34:43] (03PS4) 10Elukey: [WIP] First prototype of webrequest-live [puppet] - 10https://gerrit.wikimedia.org/r/854499 [14:34:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T321130)', diff saved to https://phabricator.wikimedia.org/P38677 and previous config saved to /var/cache/conftool/dbconfig/20221108-143457-marostegui.json [14:35:25] 10SRE: Access request to run bulk operations in phabricator for user lmata - https://phabricator.wikimedia.org/T322638 (10lmata) [14:35:38] (03CR) 10CI reject: [V: 04-1] [WIP] Add a basic puppetization for Benthos [puppet] - 10https://gerrit.wikimedia.org/r/854487 (owner: 10Elukey) [14:36:22] RECOVERY - Check systemd state on dispatch-be2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:36:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P38678 and previous config saved to /var/cache/conftool/dbconfig/20221108-143642-marostegui.json [14:37:15] (03PS4) 10Elukey: [WIP] Add a basic puppetization for Benthos [puppet] - 10https://gerrit.wikimedia.org/r/854487 [14:37:17] (03PS5) 10Elukey: [WIP] First prototype of webrequest-live [puppet] - 10https://gerrit.wikimedia.org/r/854499 [14:37:28] !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
[14:37:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T322618)', diff saved to https://phabricator.wikimedia.org/P38679 and previous config saved to /var/cache/conftool/dbconfig/20221108-143743-ladsgroup.json [14:37:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance [14:37:48] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [14:37:51] (03CR) 10CI reject: [V: 04-1] [WIP] Add a basic puppetization for Benthos [puppet] - 10https://gerrit.wikimedia.org/r/854487 (owner: 10Elukey) [14:37:56] (03PS6) 10Daniel Kinzler: Stash original wikitext when rendering unsaved content. [core] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/854073 (https://phabricator.wikimedia.org/T321862) [14:38:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance [14:38:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2180 (T322618)', diff saved to https://phabricator.wikimedia.org/P38680 and previous config saved to /var/cache/conftool/dbconfig/20221108-143815-ladsgroup.json [14:39:07] (03CR) 10Gmodena: "LGTM." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/854507 (owner: 10Clément Goubert) [14:39:16] (03PS1) 10Hnowlan: thumbor: correct use-environment parameter [deployment-charts] - 10https://gerrit.wikimedia.org/r/854527 [14:39:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P38681 and previous config saved to /var/cache/conftool/dbconfig/20221108-143925-ladsgroup.json [14:39:43] urbanecm: ok I *think* it might be ok now [14:40:11] this was three minor screw-ups coming together, making me really really confused :( [14:40:23] makes sense [14:40:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T322618)', diff saved to https://phabricator.wikimedia.org/P38682 and previous config saved to /var/cache/conftool/dbconfig/20221108-144028-ladsgroup.json [14:40:35] duesen: do you want me to look at the diff and re-+2, so there's some review? [14:40:42] urbanecm: am i good to try again? [14:40:58] urbanecm: yes, sure. The diff is big though. [14:41:19] I was going to let this ride the train, but there is no train this week, and we need the fix... [14:41:21] (03CR) 10Urbanecm: [C: 03+2] Stash original wikitext when rendering unsaved content. [core] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/854073 (https://phabricator.wikimedia.org/T321862) (owner: 10Daniel Kinzler) [14:41:24] yeah [14:41:30] spot-checked, +2'ed, and let's hope [14:41:38] thank you! 
[14:41:44] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by daniel@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/854073 (https://phabricator.wikimedia.org/T321862) (owner: 10Daniel Kinzler) [14:42:04] 10SRE-OnFire, 10observability: Provide mechanism to join/leave oncall - https://phabricator.wikimedia.org/T322636 (10Joe) I would think that we might want to keep the VO rotation shifts automated, but allow people an easy way to create an override to devnull, and remove it. Ideally the bot should check if an... [14:43:52] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [14:44:01] (03PS5) 10Elukey: [WIP] Add a basic puppetization for Benthos [puppet] - 10https://gerrit.wikimedia.org/r/854487 [14:44:03] (03PS6) 10Elukey: [WIP] First prototype of webrequest-live [puppet] - 10https://gerrit.wikimedia.org/r/854499 [14:44:37] (03CR) 10CI reject: [V: 04-1] [WIP] Add a basic puppetization for Benthos [puppet] - 10https://gerrit.wikimedia.org/r/854487 (owner: 10Elukey) [14:45:59] mmm weird run_ci_locally didn't return me any issue [14:50:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P38683 and previous config saved to /var/cache/conftool/dbconfig/20221108-145003-marostegui.json [14:50:32] urbanecm: jenkins seems happy now [14:50:40] yep [14:50:43] fingers crossed [14:51:11] (03CR) 10Hnowlan: [C: 03+2] thumbor: correct use-environment parameter [deployment-charts] - 10https://gerrit.wikimedia.org/r/854527 (owner: 10Hnowlan) [14:51:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T321123)', diff saved to https://phabricator.wikimedia.org/P38684 and previous config saved to /var/cache/conftool/dbconfig/20221108-145148-marostegui.json [14:51:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2109.codfw.wmnet with reason: Maintenance 
[14:51:53] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [14:52:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2109.codfw.wmnet with reason: Maintenance [14:52:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T321123)', diff saved to https://phabricator.wikimedia.org/P38685 and previous config saved to /var/cache/conftool/dbconfig/20221108-145210-marostegui.json [14:54:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T322618)', diff saved to https://phabricator.wikimedia.org/P38686 and previous config saved to /var/cache/conftool/dbconfig/20221108-145432-ladsgroup.json [14:54:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance [14:54:36] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [14:54:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance [14:54:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T322618)', diff saved to https://phabricator.wikimedia.org/P38687 and previous config saved to /var/cache/conftool/dbconfig/20221108-145453-ladsgroup.json [14:54:54] (03Merged) 10jenkins-bot: thumbor: correct use-environment parameter [deployment-charts] - 10https://gerrit.wikimedia.org/r/854527 (owner: 10Hnowlan) [14:55:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P38688 and previous config saved to /var/cache/conftool/dbconfig/20221108-145535-ladsgroup.json [14:56:59] (03Merged) 10jenkins-bot: Stash original wikitext when rendering unsaved content. 
[core] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/854073 (https://phabricator.wikimedia.org/T321862) (owner: 10Daniel Kinzler) [14:57:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T322618)', diff saved to https://phabricator.wikimedia.org/P38689 and previous config saved to /var/cache/conftool/dbconfig/20221108-145702-ladsgroup.json [14:57:15] !log daniel@deploy1002 Started scap: Backport for [[gerrit:854073|Stash original wikitext when rendering unsaved content. (T321862)]] [14:57:18] T321862: Switching from source to VE loses edit (un-savable and un-recoverable) - https://phabricator.wikimedia.org/T321862 [14:57:35] !log daniel@deploy1002 daniel and daniel: Backport for [[gerrit:854073|Stash original wikitext when rendering unsaved content. (T321862)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [14:57:51] checking on debug now [14:59:11] (03PS6) 10Elukey: [WIP] Add a basic puppetization for Benthos [puppet] - 10https://gerrit.wikimedia.org/r/854487 [14:59:13] (03PS7) 10Elukey: [WIP] First prototype of webrequest-live [puppet] - 10https://gerrit.wikimedia.org/r/854499 [14:59:25] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence-Backup: Q1:rack/setup/install dbprov2004 - https://phabricator.wikimedia.org/T321128 (10Volans) >>! In T321128#8376809, @Papaul wrote: > to make the provission cookbook works , login to the idrac using the secure password and change the password to our mgmt... 
[14:59:46] (03CR) 10CI reject: [V: 04-1] [WIP] Add a basic puppetization for Benthos [puppet] - 10https://gerrit.wikimedia.org/r/854487 (owner: 10Elukey) [15:00:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [15:00:44] confirmed, merging [15:00:55] err, syncing [15:01:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [15:01:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [15:01:26] (03PS7) 10Elukey: [WIP] Add a basic puppetization for Benthos [puppet] - 10https://gerrit.wikimedia.org/r/854487 [15:01:28] (03PS8) 10Elukey: [WIP] First prototype of webrequest-live [puppet] - 10https://gerrit.wikimedia.org/r/854499 [15:01:36] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [15:01:52] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [15:01:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [15:03:00] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [15:03:13] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [15:04:40] !log daniel@deploy1002 Finished scap: Backport for [[gerrit:854073|Stash original wikitext when rendering unsaved content. 
(T321862)]] (duration: 07m 25s) [15:04:46] T321862: Switching from source to VE loses edit (un-savable and un-recoverable) - https://phabricator.wikimedia.org/T321862 [15:05:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P38690 and previous config saved to /var/cache/conftool/dbconfig/20221108-150509-marostegui.json [15:06:07] (PS1) Hnowlan: thumbor: move SWIFT_RETRIES to within define [deployment-charts] - https://gerrit.wikimedia.org/r/854531 [15:07:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T321123)', diff saved to https://phabricator.wikimedia.org/P38691 and previous config saved to /var/cache/conftool/dbconfig/20221108-150709-marostegui.json [15:07:15] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [15:07:59] urbanecm: scap is done. i'll do some more manual testing [15:09:26] all good!
[15:10:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P38692 and previous config saved to /var/cache/conftool/dbconfig/20221108-151041-ladsgroup.json [15:10:52] (03CR) 10Hnowlan: [C: 03+2] thumbor: move SWIFT_RETRIES to within define [deployment-charts] - 10https://gerrit.wikimedia.org/r/854531 (owner: 10Hnowlan) [15:12:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P38693 and previous config saved to /var/cache/conftool/dbconfig/20221108-151208-ladsgroup.json [15:14:11] (03Merged) 10jenkins-bot: thumbor: move SWIFT_RETRIES to within define [deployment-charts] - 10https://gerrit.wikimedia.org/r/854531 (owner: 10Hnowlan) [15:14:37] (03PS1) 10Vgutierrez: deployment-prep: Remove ms-be06 from swift::storagehosts [puppet] - 10https://gerrit.wikimedia.org/r/854533 (https://phabricator.wikimedia.org/T322231) [15:16:42] (03CR) 10Vgutierrez: [C: 03+2] deployment-prep: Remove ms-be06 from swift::storagehosts [puppet] - 10https://gerrit.wikimedia.org/r/854533 (https://phabricator.wikimedia.org/T322231) (owner: 10Vgutierrez) [15:16:50] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [15:17:09] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [15:18:20] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [15:18:27] 10SRE, 10Observability-Metrics, 10serviceops, 10Patch-For-Review: Strongswan Icinga check: do not report issues about depooled hosts - https://phabricator.wikimedia.org/T148976 (10jijiki) 05Open→03Declined Strongswan is going away because we do not need it anymore. We were using it for redis_sessions T... 
[15:18:33] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [15:19:26] (03CR) 10Ottomata: [C: 03+1] "TY" [deployment-charts] - 10https://gerrit.wikimedia.org/r/854507 (owner: 10Clément Goubert) [15:20:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T321130)', diff saved to https://phabricator.wikimedia.org/P38694 and previous config saved to /var/cache/conftool/dbconfig/20221108-152016-marostegui.json [15:20:18] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2125.codfw.wmnet with reason: Maintenance [15:20:20] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [15:20:31] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2125.codfw.wmnet with reason: Maintenance [15:20:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2125 (T321130)', diff saved to https://phabricator.wikimedia.org/P38695 and previous config saved to /var/cache/conftool/dbconfig/20221108-152037-marostegui.json [15:22:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P38696 and previous config saved to /var/cache/conftool/dbconfig/20221108-152216-marostegui.json [15:24:48] (03CR) 10Vlad.shapik: [C: 03+1] "I've checked. Everything looks good to me." 
[software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/854506 (owner: 10Hnowlan) [15:25:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T322618)', diff saved to https://phabricator.wikimedia.org/P38697 and previous config saved to /var/cache/conftool/dbconfig/20221108-152548-ladsgroup.json [15:25:54] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [15:26:34] (03PS8) 10Elukey: [WIP] Add a basic puppetization for Benthos [puppet] - 10https://gerrit.wikimedia.org/r/854487 [15:26:37] (03PS9) 10Elukey: [WIP] First prototype of webrequest-live [puppet] - 10https://gerrit.wikimedia.org/r/854499 [15:27:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P38698 and previous config saved to /var/cache/conftool/dbconfig/20221108-152715-ladsgroup.json [15:27:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T321130)', diff saved to https://phabricator.wikimedia.org/P38699 and previous config saved to /var/cache/conftool/dbconfig/20221108-152752-marostegui.json [15:27:56] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [15:29:00] jouncebot: nowandnext [15:29:00] No deployments scheduled for the next 1 hour(s) and 30 minute(s) [15:29:00] In 1 hour(s) and 30 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221108T1700) [15:31:06] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Open magnum and heat apis to the greater internet [puppet] - 10https://gerrit.wikimedia.org/r/854092 (https://phabricator.wikimedia.org/T319312) (owner: 10Andrew Bogott) [15:34:43] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [15:34:54] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [15:35:18] !log 
hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [15:35:39] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [15:37:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P38700 and previous config saved to /var/cache/conftool/dbconfig/20221108-153722-marostegui.json [15:38:10] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] P:kubernetes::deployment_server: fix absenting [puppet] - 10https://gerrit.wikimedia.org/r/854059 (https://phabricator.wikimedia.org/T322298) (owner: 10Clément Goubert) [15:42:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T322618)', diff saved to https://phabricator.wikimedia.org/P38701 and previous config saved to /var/cache/conftool/dbconfig/20221108-154221-ladsgroup.json [15:42:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [15:42:26] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [15:42:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [15:42:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [15:42:44] (03PS1) 10Majavah: P:openstack::designate: remove separate profile for firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/854539 [15:42:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [15:42:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [15:43:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff 
saved to https://phabricator.wikimedia.org/P38702 and previous config saved to /var/cache/conftool/dbconfig/20221108-154259-marostegui.json [15:43:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [15:43:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T322618)', diff saved to https://phabricator.wikimedia.org/P38703 and previous config saved to /var/cache/conftool/dbconfig/20221108-154317-ladsgroup.json [15:43:29] (03PS2) 10Hnowlan: Preserve message when setting HTTP status, install wmf-certificates [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/854506 [15:43:37] (03PS1) 10Vgutierrez: swift: Ensure that bzip2 is installed [puppet] - 10https://gerrit.wikimedia.org/r/854541 [15:44:45] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38014/console" [puppet] - 10https://gerrit.wikimedia.org/r/854539 (owner: 10Majavah) [15:45:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T322618)', diff saved to https://phabricator.wikimedia.org/P38704 and previous config saved to /var/cache/conftool/dbconfig/20221108-154525-ladsgroup.json [15:45:54] (03PS2) 10Majavah: P:openstack::designate: remove separate profile for firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/854539 [15:46:53] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38016/console" [puppet] - 10https://gerrit.wikimedia.org/r/854539 (owner: 10Majavah) [15:51:57] (03PS9) 10Elukey: [WIP] Add a basic puppetization for Benthos [puppet] - 10https://gerrit.wikimedia.org/r/854487 [15:51:59] (03PS10) 10Elukey: [WIP] First prototype of webrequest-live [puppet] - 10https://gerrit.wikimedia.org/r/854499 [15:52:01] (03PS1) 10Vgutierrez: labs: Add 
profile::swift::account_keys data [labs/private] - 10https://gerrit.wikimedia.org/r/854544 [15:52:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T321123)', diff saved to https://phabricator.wikimedia.org/P38705 and previous config saved to /var/cache/conftool/dbconfig/20221108-155229-marostegui.json [15:52:31] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2139.codfw.wmnet with reason: Maintenance [15:52:34] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [15:52:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2139.codfw.wmnet with reason: Maintenance [15:52:47] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38017/console" [puppet] - 10https://gerrit.wikimedia.org/r/854499 (owner: 10Elukey) [15:55:53] (03CR) 10Vgutierrez: [V: 03+2 C: 03+2] labs: Add profile::swift::account_keys data [labs/private] - 10https://gerrit.wikimedia.org/r/854544 (owner: 10Vgutierrez) [15:56:41] (03PS10) 10Elukey: [WIP] Add a basic puppetization for Benthos [puppet] - 10https://gerrit.wikimedia.org/r/854487 [15:56:43] (03PS11) 10Elukey: [WIP] First prototype of webrequest-live [puppet] - 10https://gerrit.wikimedia.org/r/854499 [15:57:42] (03PS1) 10Volans: sre.hosts.provision: use default if in UEFI mode [cookbooks] - 10https://gerrit.wikimedia.org/r/854545 (https://phabricator.wikimedia.org/T321128) [15:58:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P38706 and previous config saved to /var/cache/conftool/dbconfig/20221108-155805-marostegui.json [15:58:16] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38019/console" [puppet] - 
https://gerrit.wikimedia.org/r/854499 (owner: Elukey) [15:58:22] (PS1) Vgutierrez: labs: add profile::swift::replication_keys data [labs/private] - https://gerrit.wikimedia.org/r/854546 [15:58:37] (CR) Vgutierrez: [V: +2 C: +2] labs: add profile::swift::replication_keys data [labs/private] - https://gerrit.wikimedia.org/r/854546 (owner: Vgutierrez) [15:59:37] (CR) Hnowlan: [C: +2] Preserve message when setting HTTP status, install wmf-certificates [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/854506 (owner: Hnowlan) [15:59:46] (CR) Elukey: "Tried to implement the suggestions, the only thing left is geoip, lemme know your thoughts!" [puppet] - https://gerrit.wikimedia.org/r/854487 (owner: Elukey) [16:00:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P38707 and previous config saved to /var/cache/conftool/dbconfig/20221108-160032-ladsgroup.json [16:02:37] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38021/console" [puppet] - https://gerrit.wikimedia.org/r/854541 (owner: Vgutierrez) [16:03:15] (CR) Vgutierrez: [V: +1] "PCC is happy: https://puppet-compiler.wmflabs.org/pcc-worker1001/38015/ (pcc error in deployment-ms-fe04 is unrelated to this)" [puppet] - https://gerrit.wikimedia.org/r/854541 (owner: Vgutierrez) [16:06:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2149.codfw.wmnet with reason: Maintenance [16:06:26] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2149.codfw.wmnet with reason: Maintenance [16:06:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2149 (T321123)', diff saved to https://phabricator.wikimedia.org/P38708 and previous config saved to /var/cache/conftool/dbconfig/20221108-160632-marostegui.json
[16:06:36] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [16:07:55] (Merged) jenkins-bot: Preserve message when setting HTTP status, install wmf-certificates [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/854506 (owner: Hnowlan) [16:08:33] SRE: Access request to run bulk operations in phabricator for user lmata - https://phabricator.wikimedia.org/T322638 (RhinosF1) That looks like a file permissions issue. Can you check what group the file is owned by and run it as that? [16:10:16] !log upload wmf-beamer-style version 0.4 to apt [16:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:56] SRE: Access request to run bulk operations in phabricator for user lmata - https://phabricator.wikimedia.org/T322638 (lmata) yup sudo worked, thanks! [16:12:18] SRE: Access request to run bulk operations in phabricator for user lmata - https://phabricator.wikimedia.org/T322638 (lmata) Open→Resolved a:lmata [16:13:01] (CR) MVernon: [C: +1] "Thanks for this, looks good to me."
[puppet] - 10https://gerrit.wikimedia.org/r/854541 (owner: 10Vgutierrez) [16:13:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T321130)', diff saved to https://phabricator.wikimedia.org/P38709 and previous config saved to /var/cache/conftool/dbconfig/20221108-161312-marostegui.json [16:13:15] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2126.codfw.wmnet with reason: Maintenance [16:13:17] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [16:13:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2126.codfw.wmnet with reason: Maintenance [16:13:29] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2095.codfw.wmnet with reason: Maintenance [16:13:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2095.codfw.wmnet with reason: Maintenance [16:13:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2126 (T321130)', diff saved to https://phabricator.wikimedia.org/P38710 and previous config saved to /var/cache/conftool/dbconfig/20221108-161338-marostegui.json [16:13:53] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] swift: Ensure that bzip2 is installed [puppet] - 10https://gerrit.wikimedia.org/r/854541 (owner: 10Vgutierrez) [16:15:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P38711 and previous config saved to /var/cache/conftool/dbconfig/20221108-161538-ladsgroup.json [16:16:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T321130)', diff saved to https://phabricator.wikimedia.org/P38712 and previous config saved to /var/cache/conftool/dbconfig/20221108-161616-marostegui.json [16:19:33] (03PS1) 10Hnowlan: thumbor: bump version [deployment-charts] - 
https://gerrit.wikimedia.org/r/854547 [16:21:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T321123)', diff saved to https://phabricator.wikimedia.org/P38713 and previous config saved to /var/cache/conftool/dbconfig/20221108-162155-marostegui.json [16:22:02] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [16:22:17] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow6001.drmrs.wmnet [16:22:41] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ganeti1018.eqiad.wmnet with reason: Remove from cluster for eventual reimage [16:22:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ganeti1018.eqiad.wmnet with reason: Remove from cluster for eventual reimage [16:24:09] !log drain ganeti1024 for eventual reimage to bullseye T311687 [16:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:13] T311687: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 [16:24:22] (CR) Hnowlan: [C: +2] thumbor: bump version [deployment-charts] - https://gerrit.wikimedia.org/r/854547 (owner: Hnowlan) [16:26:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow6001.drmrs.wmnet [16:26:10] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Ilooremeta - https://phabricator.wikimedia.org/T322147 (ILooremeta-WMF) Hello, I am yet to receive the kerberos credentials on email. Please assist. [16:26:31] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Hghani - https://phabricator.wikimedia.org/T322145 (Hghani) Hi, I have not received an e-mail with the Kerberos credentials.
[16:27:01] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow5002.eqsin.wmnet [16:27:18] SRE: Access request to run bulk operations in phabricator for user lmata - https://phabricator.wikimedia.org/T322638 (RhinosF1) Can you share which doc you looked at so it can be updated? [16:27:33] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Hibashaath - https://phabricator.wikimedia.org/T322146 (HShaath-WMF) @jbond I have not received the kerberos credentials by e-mail [16:27:41] SRE, Phabricator: Access request to run bulk operations in phabricator for user lmata - https://phabricator.wikimedia.org/T322638 (RhinosF1) [16:28:07] (Merged) jenkins-bot: thumbor: bump version [deployment-charts] - https://gerrit.wikimedia.org/r/854547 (owner: Hnowlan) [16:28:26] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Hghani - https://phabricator.wikimedia.org/T322145 (Hghani) Hi, I have not received an e-mail with the Kerberos credentials. [16:29:19] SRE, Phabricator: Access request to run bulk operations in phabricator for user lmata - https://phabricator.wikimedia.org/T322638 (lmata) This was a dialogue/modal out of phabricator’s confirmation page for bulk change. I can find one and take a screengrab if that’s helpful. [16:30:39] SRE, Phabricator: Access request to run bulk operations in phabricator for user lmata - https://phabricator.wikimedia.org/T322638 (RhinosF1) >>! In T322638#8380193, @lmata wrote: > This was a dialogue/modal out of phabricator’s confirmation page for bulk change. I can find one and take a screengrab if th...
[16:30:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T322618)', diff saved to https://phabricator.wikimedia.org/P38714 and previous config saved to /var/cache/conftool/dbconfig/20221108-163045-ladsgroup.json [16:30:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [16:30:50] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [16:30:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST nodes) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:31:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [16:31:06] lmata: thanks! [16:31:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T322618)', diff saved to https://phabricator.wikimedia.org/P38715 and previous config saved to /var/cache/conftool/dbconfig/20221108-163107-ladsgroup.json [16:31:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P38716 and previous config saved to /var/cache/conftool/dbconfig/20221108-163122-marostegui.json [16:31:35] RhinosF1: :D [16:33:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T322618)', diff saved to https://phabricator.wikimedia.org/P38717 and previous config saved to /var/cache/conftool/dbconfig/20221108-163316-ladsgroup.json [16:33:18] I added it to my long to do list of things to wonder about [16:33:27] I have an install I can play with [16:33:35] (03CR) 10Volans: "I forgot to mention that if you want to try it, the version of this CR is available on centrallog2002 in 
/home/volans/stats.py" [puppet] - 10https://gerrit.wikimedia.org/r/854521 (owner: 10Volans) [16:34:35] (03PS1) 10Filippo Giunchedi: toil: add bandaid for ifupdown race [puppet] - 10https://gerrit.wikimedia.org/r/854553 (https://phabricator.wikimedia.org/T273026) [16:34:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow5002.eqsin.wmnet [16:35:10] (03CR) 10CI reject: [V: 04-1] toil: add bandaid for ifupdown race [puppet] - 10https://gerrit.wikimedia.org/r/854553 (https://phabricator.wikimedia.org/T273026) (owner: 10Filippo Giunchedi) [16:35:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST nodes) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:36:14] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [16:36:49] (03PS2) 10Filippo Giunchedi: toil: add bandaid for ifupdown race [puppet] - 10https://gerrit.wikimedia.org/r/854553 (https://phabricator.wikimedia.org/T273026) [16:37:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P38718 and previous config saved to /var/cache/conftool/dbconfig/20221108-163702-marostegui.json [16:37:07] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [16:38:53] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [16:38:55] (03CR) 10CI reject: [V: 04-1] toil: add bandaid for ifupdown race [puppet] - 10https://gerrit.wikimedia.org/r/854553 (https://phabricator.wikimedia.org/T273026) (owner: 10Filippo Giunchedi) [16:39:08] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [16:40:42] 10SRE, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Sustainability (Incident Followup): Review sizing 
of maps cluster - https://phabricator.wikimedia.org/T228497 (Jgiannelos) We didn't have any load specific issues lately and some of our infra concerns for maps are already tracked in other tickets. O... [16:41:12] (PS3) Filippo Giunchedi: toil: add bandaid for ifupdown race [puppet] - https://gerrit.wikimedia.org/r/854553 (https://phabricator.wikimedia.org/T273026) [16:41:34] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow2002.codfw.wmnet [16:44:31] (PS1) Muehlenhoff: profile::java: Add support for bookworm [puppet] - https://gerrit.wikimedia.org/r/854555 (https://phabricator.wikimedia.org/T321783) [16:45:41] (CR) Filippo Giunchedi: "I kept the bandaid scoped to ganeti/production as we've seen failures only there (even though technically it can happen on any host as per" [puppet] - https://gerrit.wikimedia.org/r/854553 (https://phabricator.wikimedia.org/T273026) (owner: Filippo Giunchedi) [16:46:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P38719 and previous config saved to /var/cache/conftool/dbconfig/20221108-164629-marostegui.json [16:46:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow2002.codfw.wmnet [16:48:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P38720 and previous config saved to /var/cache/conftool/dbconfig/20221108-164822-ladsgroup.json [16:49:00] (CR) Filippo Giunchedi: "Untested but LGTM!"
[puppet] - https://gerrit.wikimedia.org/r/854487 (owner: Elukey) [16:49:17] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host rpki2002.codfw.wmnet [16:52:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P38721 and previous config saved to /var/cache/conftool/dbconfig/20221108-165208-marostegui.json [16:53:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2002.codfw.wmnet [16:56:15] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet [16:58:52] (JobUnavailable) firing: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:59:48] (PS3) JMeybohm: calico: More calico 3.23.3 additions. [deployment-charts] - https://gerrit.wikimedia.org/r/854520 (https://phabricator.wikimedia.org/T307943) [17:00:04] jbond and rzl: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221108T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. [17:00:10] (CR) Muehlenhoff: [C: +1] "Looks good!"
[puppet] - 10https://gerrit.wikimedia.org/r/854553 (https://phabricator.wikimedia.org/T273026) (owner: 10Filippo Giunchedi) [17:00:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet [17:01:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T321130)', diff saved to https://phabricator.wikimedia.org/P38722 and previous config saved to /var/cache/conftool/dbconfig/20221108-170136-marostegui.json [17:01:38] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2138.codfw.wmnet with reason: Maintenance [17:01:43] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [17:01:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2138.codfw.wmnet with reason: Maintenance [17:01:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3312 (T321130)', diff saved to https://phabricator.wikimedia.org/P38723 and previous config saved to /var/cache/conftool/dbconfig/20221108-170157-marostegui.json [17:03:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P38724 and previous config saved to /var/cache/conftool/dbconfig/20221108-170329-ladsgroup.json [17:03:45] (03PS1) 10Clément Goubert: mwdebug: Final cleanup [puppet] - 10https://gerrit.wikimedia.org/r/854559 (https://phabricator.wikimedia.org/T321201) [17:03:52] (JobUnavailable) firing: (2) Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:05:12] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38022/console" [puppet] - 
10https://gerrit.wikimedia.org/r/854559 (https://phabricator.wikimedia.org/T321201) (owner: 10Clément Goubert) [17:07:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T321123)', diff saved to https://phabricator.wikimedia.org/P38725 and previous config saved to /var/cache/conftool/dbconfig/20221108-170715-marostegui.json [17:07:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2156.codfw.wmnet with reason: Maintenance [17:07:21] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [17:07:31] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2156.codfw.wmnet with reason: Maintenance [17:07:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on db2094.codfw.wmnet with reason: Maintenance [17:07:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2094.codfw.wmnet with reason: Maintenance [17:07:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2156 (T321123)', diff saved to https://phabricator.wikimedia.org/P38726 and previous config saved to /var/cache/conftool/dbconfig/20221108-170752-marostegui.json [17:08:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T321130)', diff saved to https://phabricator.wikimedia.org/P38727 and previous config saved to /var/cache/conftool/dbconfig/20221108-170844-marostegui.json [17:08:51] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [17:12:55] (03PS1) 10JMeybohm: calico: Align formatting with k8s module and profiles [puppet] - 10https://gerrit.wikimedia.org/r/854562 (https://phabricator.wikimedia.org/T307943) [17:13:52] (JobUnavailable) firing: (2) Reduced availability for job routinator in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:15:27] (03PS1) 10Hnowlan: Make swift connector aware of cacert file [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/854563 (https://phabricator.wikimedia.org/T312104) [17:15:58] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38023/console" [puppet] - 10https://gerrit.wikimedia.org/r/854562 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [17:16:36] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38024/console" [puppet] - 10https://gerrit.wikimedia.org/r/854553 (https://phabricator.wikimedia.org/T273026) (owner: 10Filippo Giunchedi) [17:17:49] (03CR) 10JMeybohm: calico: Align formatting with k8s module and profiles [puppet] - 10https://gerrit.wikimedia.org/r/854562 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [17:18:00] (03CR) 10Clément Goubert: eventgate: Fix canary release routing (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/854507 (owner: 10Clément Goubert) [17:18:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T322618)', diff saved to https://phabricator.wikimedia.org/P38728 and previous config saved to /var/cache/conftool/dbconfig/20221108-171835-ladsgroup.json [17:18:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [17:18:46] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [17:18:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [17:18:52] 
(JobUnavailable) resolved: (2) Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:18:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T322618)', diff saved to https://phabricator.wikimedia.org/P38729 and previous config saved to /var/cache/conftool/dbconfig/20221108-171857-ladsgroup.json [17:20:00] (03CR) 10JMeybohm: [V: 03+1] calico: Align formatting with k8s module and profiles [puppet] - 10https://gerrit.wikimedia.org/r/854562 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [17:20:37] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] toil: add bandaid for ifupdown race [puppet] - 10https://gerrit.wikimedia.org/r/854553 (https://phabricator.wikimedia.org/T273026) (owner: 10Filippo Giunchedi) [17:21:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T322618)', diff saved to https://phabricator.wikimedia.org/P38730 and previous config saved to /var/cache/conftool/dbconfig/20221108-172107-ladsgroup.json [17:22:12] 10SRE, 10MW-on-K8s, 10serviceops: Sandbox/limit child processes within a container runtime - https://phabricator.wikimedia.org/T252745 (10Joe) 05Open→03Resolved a:03Joe This task can be considered resolved given we've deployed shellbox. 
[17:22:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T321123)', diff saved to https://phabricator.wikimedia.org/P38731 and previous config saved to /var/cache/conftool/dbconfig/20221108-172227-marostegui.json [17:22:33] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [17:23:03] (03PS12) 10Elukey: [WIP] First prototype of webrequest-live [puppet] - 10https://gerrit.wikimedia.org/r/854499 [17:23:45] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host an-presto1013.eqiad.wmnet [17:23:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P38732 and previous config saved to /var/cache/conftool/dbconfig/20221108-172351-marostegui.json [17:24:33] (03PS1) 10Muehlenhoff: Buster tracking updates [puppet] - 10https://gerrit.wikimedia.org/r/854565 [17:26:51] (03CR) 10Vlad.shapik: [C: 03+1] Make swift connector aware of cacert file [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/854563 (https://phabricator.wikimedia.org/T312104) (owner: 10Hnowlan) [17:30:43] PROBLEM - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:31:32] (03CR) 10Muehlenhoff: [C: 03+2] Buster tracking updates [puppet] - 10https://gerrit.wikimedia.org/r/854565 (owner: 10Muehlenhoff) [17:31:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-presto1013.eqiad.wmnet [17:31:38] (03PS2) 10Muehlenhoff: Buster tracking updates [puppet] - 10https://gerrit.wikimedia.org/r/854565 [17:32:21] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host an-presto1014.eqiad.wmnet [17:33:32] (03CR) 10Ladsgroup: "The points I mentioned in 
https://gerrit.wikimedia.org/r/c/operations/puppet/+/850446/1#message-a40974d6599bca858b4037de7bb5dc699b8c33a4 s" [puppet] - 10https://gerrit.wikimedia.org/r/854142 (https://phabricator.wikimedia.org/T322541) (owner: 10Gergő Tisza) [17:34:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:36:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P38733 and previous config saved to /var/cache/conftool/dbconfig/20221108-173613-ladsgroup.json [17:37:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P38734 and previous config saved to /var/cache/conftool/dbconfig/20221108-173734-marostegui.json [17:38:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P38735 and previous config saved to /var/cache/conftool/dbconfig/20221108-173857-marostegui.json [17:39:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-presto1014.eqiad.wmnet [17:39:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:39:41] PROBLEM - Check systemd state on serpens is CRITICAL: CRITICAL - degraded: The following units failed: ganeti-ifupdown-bandaid.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:39:58] (03CR) 
10Muehlenhoff: [C: 03+2] Enable profile::auto_restarts::service for haveged [puppet] - 10https://gerrit.wikimedia.org/r/852830 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [17:40:39] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host an-presto1015.eqiad.wmnet [17:45:48] 10SRE-swift-storage, 10ConfirmEdit (CAPTCHA extension), 10Beta-Cluster-reproducible: Beta: Create an account pops up with an Internal Error - https://phabricator.wikimedia.org/T322667 (10GMikesell-WMF) p:05Triage→03Unbreak! [17:46:11] PROBLEM - Check systemd state on seaborgium is CRITICAL: CRITICAL - degraded: The following units failed: ganeti-ifupdown-bandaid.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:48:39] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Ilooremeta - https://phabricator.wikimedia.org/T322147 (10Dzahn) 05Resolved→03Open [17:48:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-presto1015.eqiad.wmnet [17:50:49] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Ilooremeta - https://phabricator.wikimedia.org/T322147 (10Dzahn) @ILooremeta-WMF Ah, sure, I reopened the ticket. @jbond Could you check for the kerberos part of this? 
[17:51:00] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Hghani - https://phabricator.wikimedia.org/T322145 (10Dzahn) 05Resolved→03Open [17:51:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P38736 and previous config saved to /var/cache/conftool/dbconfig/20221108-175120-ladsgroup.json [17:52:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P38737 and previous config saved to /var/cache/conftool/dbconfig/20221108-175240-marostegui.json [17:54:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T321130)', diff saved to https://phabricator.wikimedia.org/P38738 and previous config saved to /var/cache/conftool/dbconfig/20221108-175404-marostegui.json [17:54:06] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2148.codfw.wmnet with reason: Maintenance [17:54:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2148.codfw.wmnet with reason: Maintenance [17:54:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2148 (T321130)', diff saved to https://phabricator.wikimedia.org/P38739 and previous config saved to /var/cache/conftool/dbconfig/20221108-175425-marostegui.json [18:01:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T321130)', diff saved to https://phabricator.wikimedia.org/P38740 and previous config saved to /var/cache/conftool/dbconfig/20221108-180101-marostegui.json [18:05:13] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-usersfor David.pujol - https://phabricator.wikimedia.org/T322670 (10David.pujol) [18:06:08] (03PS1) 10Phuedx: EditAttemptStep sampling rate to 1 everywhere [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/854570 (https://phabricator.wikimedia.org/T312016) [18:06:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T322618)', diff saved to https://phabricator.wikimedia.org/P38741 and previous config saved to /var/cache/conftool/dbconfig/20221108-180626-ladsgroup.json [18:06:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1187.eqiad.wmnet with reason: Maintenance [18:06:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1187.eqiad.wmnet with reason: Maintenance [18:06:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1187 (T322618)', diff saved to https://phabricator.wikimedia.org/P38742 and previous config saved to /var/cache/conftool/dbconfig/20221108-180648-ladsgroup.json [18:06:53] (03PS1) 10Jbond: deployment-prep: use FQDN so pcc can resolve [puppet] - 10https://gerrit.wikimedia.org/r/854571 [18:07:38] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38025/console" [puppet] - 10https://gerrit.wikimedia.org/r/854571 (owner: 10Jbond) [18:07:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T321123)', diff saved to https://phabricator.wikimedia.org/P38743 and previous config saved to /var/cache/conftool/dbconfig/20221108-180747-marostegui.json [18:07:49] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2177.codfw.wmnet with reason: Maintenance [18:08:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2177.codfw.wmnet with reason: Maintenance [18:08:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2177 (T321123)', diff saved to https://phabricator.wikimedia.org/P38744 and previous config saved to /var/cache/conftool/dbconfig/20221108-180808-marostegui.json [18:08:40] T322618: Fix renamed 
indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [18:08:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T322618)', diff saved to https://phabricator.wikimedia.org/P38745 and previous config saved to /var/cache/conftool/dbconfig/20221108-180856-ladsgroup.json [18:09:27] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [18:12:45] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops, 10serviceops: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10jijiki) [18:16:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P38746 and previous config saved to /var/cache/conftool/dbconfig/20221108-181607-marostegui.json [18:23:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T321123)', diff saved to https://phabricator.wikimedia.org/P38747 and previous config saved to /var/cache/conftool/dbconfig/20221108-182307-marostegui.json [18:23:31] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [18:24:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P38748 and previous config saved to /var/cache/conftool/dbconfig/20221108-182403-ladsgroup.json [18:31:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P38749 and previous config saved to /var/cache/conftool/dbconfig/20221108-183114-marostegui.json [18:38:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P38750 and previous config saved to 
/var/cache/conftool/dbconfig/20221108-183814-marostegui.json [18:39:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P38751 and previous config saved to /var/cache/conftool/dbconfig/20221108-183909-ladsgroup.json [18:46:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T321130)', diff saved to https://phabricator.wikimedia.org/P38752 and previous config saved to /var/cache/conftool/dbconfig/20221108-184620-marostegui.json [18:46:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2170.codfw.wmnet with reason: Maintenance [18:46:28] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [18:46:36] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2170.codfw.wmnet with reason: Maintenance [18:46:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3312 (T321130)', diff saved to https://phabricator.wikimedia.org/P38753 and previous config saved to /var/cache/conftool/dbconfig/20221108-184642-marostegui.json [18:53:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P38754 and previous config saved to /var/cache/conftool/dbconfig/20221108-185320-marostegui.json [18:53:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T321130)', diff saved to https://phabricator.wikimedia.org/P38755 and previous config saved to /var/cache/conftool/dbconfig/20221108-185326-marostegui.json [18:53:31] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [18:54:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T322618)', diff saved to https://phabricator.wikimedia.org/P38756 and previous config saved to 
/var/cache/conftool/dbconfig/20221108-185416-ladsgroup.json [18:54:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1201.eqiad.wmnet with reason: Maintenance [18:54:22] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [18:54:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1201.eqiad.wmnet with reason: Maintenance [18:54:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1201 (T322618)', diff saved to https://phabricator.wikimedia.org/P38757 and previous config saved to /var/cache/conftool/dbconfig/20221108-185437-ladsgroup.json [18:56:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T322618)', diff saved to https://phabricator.wikimedia.org/P38758 and previous config saved to /var/cache/conftool/dbconfig/20221108-185646-ladsgroup.json [18:58:41] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ganeti1024.eqiad.wmnet with reason: Remove from cluster for eventual reimage [18:58:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ganeti1024.eqiad.wmnet with reason: Remove from cluster for eventual reimage [19:08:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T321123)', diff saved to https://phabricator.wikimedia.org/P38759 and previous config saved to /var/cache/conftool/dbconfig/20221108-190827-marostegui.json [19:08:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P38760 and previous config saved to /var/cache/conftool/dbconfig/20221108-190832-marostegui.json [19:08:34] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [19:11:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P38761 and previous config saved to /var/cache/conftool/dbconfig/20221108-191152-ladsgroup.json [19:13:18] (03PS1) 10Muehlenhoff: grafana: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/854573 (https://phabricator.wikimedia.org/T308013) [19:13:20] (03PS1) 10Muehlenhoff: archiva/piwik: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/854574 (https://phabricator.wikimedia.org/T308013) [19:13:22] (03PS1) 10Muehlenhoff: Add SPDX headers to various IF profiles [puppet] - 10https://gerrit.wikimedia.org/r/854575 (https://phabricator.wikimedia.org/T308013) [19:14:48] (03CR) 10CI reject: [V: 04-1] Add SPDX headers to various IF profiles [puppet] - 10https://gerrit.wikimedia.org/r/854575 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [19:23:24] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for prometheus-ganeti-exporter [puppet] - 10https://gerrit.wikimedia.org/r/854578 (https://phabricator.wikimedia.org/T135991) [19:23:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P38762 and previous config saved to /var/cache/conftool/dbconfig/20221108-192339-marostegui.json [19:25:01] (03CR) 10Muehlenhoff: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/854575 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [19:26:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P38763 and previous config saved to /var/cache/conftool/dbconfig/20221108-192659-ladsgroup.json [19:27:46] (03CR) 10Slyngshede: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/854578 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [19:31:37] RECOVERY - SSH on mw1325.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:33:03] (03PS4) 10Bartosz Dziewoński: Enable history page visual diffs on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833831 (https://phabricator.wikimedia.org/T314588) (owner: 10Esanders) [19:35:32] 10SRE, 10CampaignEvents, 10Wikimedia-Site-requests, 10Campaign-Registration, 10Campaign-Tools (Campaign-Tools-Sprint-24): Run the timezone update script periodically in prod and in beta - https://phabricator.wikimedia.org/T320403 (10Daimona) [19:36:21] 10SRE, 10CampaignEvents, 10Wikimedia-Site-requests, 10Campaign-Registration, 10Campaign-Tools (Campaign-Tools-Sprint-24): Run the timezone update script periodically in prod and in beta - https://phabricator.wikimedia.org/T320403 (10Daimona) Hi SREs! I'm not sure if these are the right tag, please feel f... [19:38:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T321130)', diff saved to https://phabricator.wikimedia.org/P38764 and previous config saved to /var/cache/conftool/dbconfig/20221108-193845-marostegui.json [19:38:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2175.codfw.wmnet with reason: Maintenance [19:38:52] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [19:39:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2175.codfw.wmnet with reason: Maintenance [19:39:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2175 (T321130)', diff saved to https://phabricator.wikimedia.org/P38765 and previous config saved to /var/cache/conftool/dbconfig/20221108-193907-marostegui.json [19:42:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T322618)', diff saved to https://phabricator.wikimedia.org/P38766 and previous config saved to 
/var/cache/conftool/dbconfig/20221108-194206-ladsgroup.json [19:42:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [19:42:11] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [19:42:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [19:45:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T321130)', diff saved to https://phabricator.wikimedia.org/P38767 and previous config saved to /var/cache/conftool/dbconfig/20221108-194551-marostegui.json [19:46:00] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [19:50:33] PROBLEM - SSH on mw1337.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:57:41] (03PS1) 10Bartosz Dziewoński: ThreadItemStore: Fix setting parent IDs when parent already existed [extensions/DiscussionTools] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/854592 (https://phabricator.wikimedia.org/T322599) [20:00:45] (03PS1) 10Bartosz Dziewoński: Enable wgDiscussionToolsEnablePermalinksBackend on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854606 (https://phabricator.wikimedia.org/T315353) [20:00:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P38768 and previous config saved to /var/cache/conftool/dbconfig/20221108-200058-marostegui.json [20:02:00] (03PS1) 10Ssingh: varnish::common: set Python version for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/854607 (https://phabricator.wikimedia.org/T321309) [20:03:28] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38026/console" [puppet] - 10https://gerrit.wikimedia.org/r/854607 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [20:08:09] (03CR) 10Ssingh: [V: 03+1] "Resolves the following error on cp hosts bullseye:" [puppet] - 10https://gerrit.wikimedia.org/r/854607 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [20:16:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P38769 and previous config saved to /var/cache/conftool/dbconfig/20221108-201604-marostegui.json [20:20:22] (03PS1) 10Ssingh: sslcert: refactor update-ocsp.py to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/854608 (https://phabricator.wikimedia.org/T321309) [20:21:13] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38027/console" [puppet] - 10https://gerrit.wikimedia.org/r/854608 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [20:24:13] (03PS2) 10DLynch: Bump sampling rate to 0.2 for various editing schemas on a/b test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851182 (https://phabricator.wikimedia.org/T321734) [20:31:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T321130)', diff saved to https://phabricator.wikimedia.org/P38770 and previous config saved to /var/cache/conftool/dbconfig/20221108-203111-marostegui.json [20:32:20] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [20:37:50] (03PS1) 10DLynch: ABtest for mobile, logged in users [extensions/DiscussionTools] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/854593 (https://phabricator.wikimedia.org/T320993) [20:38:26] (03PS1) 10DLynch: ABtest for mobile, logged in users [extensions/DiscussionTools] (wmf/1.40.0-wmf.9) - 
10https://gerrit.wikimedia.org/r/854594 (https://phabricator.wikimedia.org/T320993) [20:38:54] (03Abandoned) 10DLynch: ABtest for mobile, logged in users [extensions/DiscussionTools] (wmf/1.40.0-wmf.9) - 10https://gerrit.wikimedia.org/r/854594 (https://phabricator.wikimedia.org/T320993) (owner: 10DLynch) [20:39:37] (03PS1) 10DLynch: ABtest for mobile, logged out users [extensions/DiscussionTools] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/854595 (https://phabricator.wikimedia.org/T320993) [20:42:24] (03PS2) 10Ssingh: varnish::common: set Python version for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/854607 (https://phabricator.wikimedia.org/T321309) [20:43:14] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38028/console" [puppet] - 10https://gerrit.wikimedia.org/r/854607 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [20:49:45] (03PS1) 10Reedy: LabsServices: Update ms-fe host [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854611 (https://phabricator.wikimedia.org/T322667) [20:50:36] (03CR) 10Zabe: [C: 03+1] LabsServices: Update ms-fe host [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854611 (https://phabricator.wikimedia.org/T322667) (owner: 10Reedy) [20:51:23] RECOVERY - SSH on mw1337.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:51:56] (03CR) 10Reedy: [C: 03+2] LabsServices: Update ms-fe host [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854611 (https://phabricator.wikimedia.org/T322667) (owner: 10Reedy) [20:52:41] (03Merged) 10jenkins-bot: LabsServices: Update ms-fe host [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854611 (https://phabricator.wikimedia.org/T322667) (owner: 10Reedy) [20:56:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [20:57:20] !log reedy@deploy1002 Synchronized 
wmf-config/LabsServices.php: T322667 (duration: 04m 02s) [20:57:27] T322667: Beta: Create an account pops up with an Internal Error - https://phabricator.wikimedia.org/T322667 [20:57:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [20:57:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [20:58:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: My dear minions, it's time we take the moon! Just kidding. Time for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221108T2100). [21:00:05] MatmaRex and kemayo: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:14] hello! probably better if we do David first, then my changes with the maintenance script at the end [21:00:20] I can deploy today! [21:00:33] wow, this is a full window! Not sure if we can do all changes, but I'll try. [21:00:34] (although my first 3 config changes are all no-ops or beta-only, so you could deploy them while we wait for the backports to merge) [21:01:04] MatmaRex: I can start with David's changes. Should we wait for him (Kemayo ), or can you help with those too? [21:01:10] I'm here, too. [21:01:19] okay, great! [21:01:27] 10SRE-swift-storage, 10ConfirmEdit (CAPTCHA extension), 10Beta-Cluster-reproducible, 10Patch-For-Review: Beta: Create an account pops up with an Internal Error - https://phabricator.wikimedia.org/T322667 (10Zabe) 05Open→03Resolved a:03Reedy [21:01:39] let's start then :) [21:01:47] My two non-config ones shouldn't have any major effect until a config change that'll go out tomorrow or Thursday anyway, so they'll just be checking for errors. 
[21:02:40] (i hate filling up backport windows with no-ops and beta cluster config changes, but if i don't put them in a window, i can never convince anyone to click that +2 button :/ ) [21:03:04] (03CR) 10Urbanecm: [C: 03+2] ABtest for mobile, logged in users [extensions/DiscussionTools] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/854593 (https://phabricator.wikimedia.org/T320993) (owner: 10DLynch) [21:03:17] (03CR) 10Urbanecm: [C: 03+2] ABtest for mobile, logged out users [extensions/DiscussionTools] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/854595 (https://phabricator.wikimedia.org/T320993) (owner: 10DLynch) [21:03:43] ack Kemayo [21:04:40] (03PS5) 10Urbanecm: Enable history page visual diffs on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833831 (https://phabricator.wikimedia.org/T314588) (owner: 10Esanders) [21:04:45] (03CR) 10Urbanecm: [C: 03+2] Enable history page visual diffs on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833831 (https://phabricator.wikimedia.org/T314588) (owner: 10Esanders) [21:04:58] (03PS2) 10Urbanecm: Update wgSpecialContributeSkinsDisabled → wgSpecialContributeSkinsEnabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851132 (https://phabricator.wikimedia.org/T319327) (owner: 10Bartosz Dziewoński) [21:05:04] (03CR) 10Urbanecm: [C: 03+2] Update wgSpecialContributeSkinsDisabled → wgSpecialContributeSkinsEnabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851132 (https://phabricator.wikimedia.org/T319327) (owner: 10Bartosz Dziewoński) [21:05:09] "max 6 patches". >Window has nine patches/scripts recorded. [21:05:34] (03Merged) 10jenkins-bot: Enable history page visual diffs on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833831 (https://phabricator.wikimedia.org/T314588) (owner: 10Esanders) [21:05:37] it happens D: [21:05:38] perryprog: I'm aware of that, but thank you for noting it. 
[21:05:46] i am happy to drop some of them if it's a problem [21:06:07] just teasing, of course. Didn't mean it in a negative way. :) [21:06:15] :) [21:06:16] me neither, just acknowledging the comment :) [21:06:28] MatmaRex: the only risk is that we won't have time for all. if there are less urgent, perhaps you can note which ones are those? [21:06:54] (03PS3) 10Urbanecm: Update wgSpecialContributeSkinsDisabled → wgSpecialContributeSkinsEnabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851132 (https://phabricator.wikimedia.org/T319327) (owner: 10Bartosz Dziewoński) [21:07:04] none are urgent. we can just go in order and see where we end up [21:07:20] (03CR) 10Urbanecm: Update wgSpecialContributeSkinsDisabled → wgSpecialContributeSkinsEnabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851132 (https://phabricator.wikimedia.org/T319327) (owner: 10Bartosz Dziewoński) [21:07:27] (03CR) 10Urbanecm: [C: 03+2] Update wgSpecialContributeSkinsDisabled → wgSpecialContributeSkinsEnabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851132 (https://phabricator.wikimedia.org/T319327) (owner: 10Bartosz Dziewoński) [21:07:34] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833831 (https://phabricator.wikimedia.org/T314588) (owner: 10Esanders) [21:07:36] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851132 (https://phabricator.wikimedia.org/T319327) (owner: 10Bartosz Dziewoński) [21:07:38] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853991 (https://phabricator.wikimedia.org/T322494) (owner: 10Esanders) [21:07:49] ack [21:08:09] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/833831 (https://phabricator.wikimedia.org/T314588) (owner: 10Esanders) [21:08:11] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851132 (https://phabricator.wikimedia.org/T319327) (owner: 10Bartosz Dziewoński) [21:08:13] (03Merged) 10jenkins-bot: Update wgSpecialContributeSkinsDisabled → wgSpecialContributeSkinsEnabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851132 (https://phabricator.wikimedia.org/T319327) (owner: 10Bartosz Dziewoński) [21:08:27] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:833831|Enable history page visual diffs on beta cluster (T314588)]], [[gerrit:851132|Update wgSpecialContributeSkinsDisabled → wgSpecialContributeSkinsEnabled (T319327)]] [21:08:34] T314588: Launch visual diffs on history pages out of beta and provide it to all users - https://phabricator.wikimedia.org/T314588 [21:08:34] T319327: [S] Make Special:Contribute the default entry point in the menu - https://phabricator.wikimedia.org/T319327 [21:08:45] doing the first two now [21:09:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [21:09:03] (i have to sync them, as IS.php gets changed two) [21:09:04] *too [21:09:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [21:09:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [21:10:21] (03Merged) 10jenkins-bot: ABtest for mobile, logged in users [extensions/DiscussionTools] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/854593 (https://phabricator.wikimedia.org/T320993) (owner: 10DLynch) [21:10:23] (03Merged) 10jenkins-bot: ABtest for mobile, logged out users [extensions/DiscussionTools] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/854595 (https://phabricator.wikimedia.org/T320993) (owner: 10DLynch) [21:10:53] just in time, sync's almost 
finished [21:10:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [21:11:16] Kemayo: since those two are no-ops now, should i put it at mwdebug first anyway? [21:11:31] No real need. [21:11:39] or should i just sync, and you'll monitor the logs and shout if it needs to be reverted? [21:11:39] (I'm fine with both, fwiw) [21:11:52] ack, I'll skip mwdebug then [21:12:00] urbanecm: Yeah, go ahead and sync and I'll watch out for errors. [21:12:10] perfect [21:13:00] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:833831|Enable history page visual diffs on beta cluster (T314588)]], [[gerrit:851132|Update wgSpecialContributeSkinsDisabled → wgSpecialContributeSkinsEnabled (T319327)]] (duration: 04m 33s) [21:13:09] !log urbanecm@deploy1002 backport aborted: (duration: 00m 01s) [21:13:17] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/DiscussionTools] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/854593 (https://phabricator.wikimedia.org/T320993) (owner: 10DLynch) [21:13:19] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/DiscussionTools] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/854595 (https://phabricator.wikimedia.org/T320993) (owner: 10DLynch) [21:13:31] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:854593|ABtest for mobile, logged in users (T320993)]], [[gerrit:854595|ABtest for mobile, logged out users (T320993)]] [21:13:36] Kemayo: it's syncing now ^^ [21:13:40] T320993: Implement mobile DiscussionTools A/B test bucketing - https://phabricator.wikimedia.org/T320993 [21:15:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [21:16:52] (03CR) 10Urbanecm: [C: 03+2] ThreadItemStore: Fix setting parent IDs when parent already existed [extensions/DiscussionTools] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/854592 
(https://phabricator.wikimedia.org/T322599) (owner: 10Bartosz Dziewoński) [21:16:54] eh, forgot to +2 this one :D [21:16:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [21:16:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [21:17:42] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:854593|ABtest for mobile, logged in users (T320993)]], [[gerrit:854595|ABtest for mobile, logged out users (T320993)]] (duration: 04m 10s) [21:17:44] MatmaRex: just double-checking, https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/853991 is a no-op two? [21:17:46] *too [21:17:48] (03CR) 10Volans: [C: 03+1] "LGTM, I've added some optional additional possible changes inline." [puppet] - 10https://gerrit.wikimedia.org/r/854608 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [21:17:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [21:18:06] urbanecm: yes. the code using that is not deployed yet [21:18:10] ack [21:18:17] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853991 (https://phabricator.wikimedia.org/T322494) (owner: 10Esanders) [21:18:25] (03PS4) 10Urbanecm: Keep DiscussionTools "Share feedback..." links on WMF wikis for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853991 (https://phabricator.wikimedia.org/T322494) (owner: 10Esanders) [21:18:27] (03CR) 10TrainBranchBot: "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853991 (https://phabricator.wikimedia.org/T322494) (owner: 10Esanders) [21:18:32] syncing wo mwdebug [21:19:11] (03Merged) 10jenkins-bot: Keep DiscussionTools "Share feedback..." 
links on WMF wikis for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/853991 (https://phabricator.wikimedia.org/T322494) (owner: 10Esanders) [21:19:23] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:853991|Keep DiscussionTools "Share feedback..." links on WMF wikis for now (T322494)]] [21:19:27] (03PS3) 10Urbanecm: Bump sampling rate to 0.2 for various editing schemas on a/b test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851182 (https://phabricator.wikimedia.org/T321734) (owner: 10DLynch) [21:19:30] T322494: Remove the "Share feedback about this feature" link - https://phabricator.wikimedia.org/T322494 [21:20:04] (03CR) 10Urbanecm: [C: 03+2] Bump sampling rate to 0.2 for various editing schemas on a/b test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851182 (https://phabricator.wikimedia.org/T321734) (owner: 10DLynch) [21:20:50] (03Merged) 10jenkins-bot: Bump sampling rate to 0.2 for various editing schemas on a/b test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851182 (https://phabricator.wikimedia.org/T321734) (owner: 10DLynch) [21:22:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [21:23:26] (03Merged) 10jenkins-bot: ThreadItemStore: Fix setting parent IDs when parent already existed [extensions/DiscussionTools] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/854592 (https://phabricator.wikimedia.org/T322599) (owner: 10Bartosz Dziewoński) [21:23:37] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:853991|Keep DiscussionTools "Share feedback..." 
links on WMF wikis for now (T322494)]] (duration: 04m 14s) [21:23:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [21:23:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [21:23:53] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851182 (https://phabricator.wikimedia.org/T321734) (owner: 10DLynch) [21:24:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [21:25:08] !log urbanecm@deploy1002 backport aborted: (duration: 01m 16s) [21:26:09] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851182 (https://phabricator.wikimedia.org/T321734) (owner: 10DLynch) [21:26:11] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/DiscussionTools] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/854592 (https://phabricator.wikimedia.org/T322599) (owner: 10Bartosz Dziewoński) [21:26:22] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:851182|Bump sampling rate to 0.2 for various editing schemas on a/b test wikis (T321734)]], [[gerrit:854592|ThreadItemStore: Fix setting parent IDs when parent already existed (T322599)]] [21:26:33] T321734: Extend the MobileWebUIActions sampling rate to A/B test wiki - https://phabricator.wikimedia.org/T321734 [21:26:33] T322599: PHP Notice: Undefined index: h-Pages_locked_from_recreation - https://phabricator.wikimedia.org/T322599 [21:26:41] !log urbanecm@deploy1002 urbanecm and kemayo and matmarex: Backport for [[gerrit:851182|Bump sampling rate to 0.2 for various editing schemas on a/b test wikis (T321734)]], [[gerrit:854592|ThreadItemStore: Fix setting parent IDs when parent already existed (T322599)]] synced to the testservers: mwdebug2001.codfw.wmnet, 
mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [21:26:58] Kemayo: MatmaRex: can you test ^^ at mwdebug1001, please? [21:27:18] Checking now [21:27:19] (config "Bump sampling rate to 0.2 for various editing schemas on a/b test wikis" and backport "ThreadItemStore: Fix setting parent IDs when parent already existed" are there) [21:27:36] urbanecm: Okay, mine looks good. [21:27:41] ack, waiting for MatmaRex [21:27:57] backport can't be easily tested, but we'll see the results in the maint script [21:28:03] (which should stop emitting notices) [21:28:10] makes sense. syncing then. [21:29:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [21:30:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [21:30:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [21:31:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [21:32:07] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:851182|Bump sampling rate to 0.2 for various editing schemas on a/b test wikis (T321734)]], [[gerrit:854592|ThreadItemStore: Fix setting parent IDs when parent already existed (T322599)]] (duration: 05m 45s) [21:32:13] T321734: Extend the MobileWebUIActions sampling rate to A/B test wiki - https://phabricator.wikimedia.org/T321734 [21:32:15] T322599: PHP Notice: Undefined index: h-Pages_locked_from_recreation - https://phabricator.wikimedia.org/T322599 [21:33:53] so, all except https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/854606 done now [21:34:02] in only 33 minutes, that's very good time [21:34:07] (03PS2) 10Urbanecm: Enable wgDiscussionToolsEnablePermalinksBackend on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854606 (https://phabricator.wikimedia.org/T315353) (owner: 10Bartosz Dziewoński) [21:34:13] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by 
urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854606 (https://phabricator.wikimedia.org/T315353) (owner: 10Bartosz Dziewoński) [21:35:02] (03Merged) 10jenkins-bot: Enable wgDiscussionToolsEnablePermalinksBackend on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854606 (https://phabricator.wikimedia.org/T315353) (owner: 10Bartosz Dziewoński) [21:35:15] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:854606|Enable wgDiscussionToolsEnablePermalinksBackend on group1 wikis (T315353)]] [21:35:21] T315353: Create database tables for permalinks in production wikis, and enable the feature - https://phabricator.wikimedia.org/T315353 [21:35:33] :D thanks [21:35:34] !log urbanecm@deploy1002 urbanecm and matmarex: Backport for [[gerrit:854606|Enable wgDiscussionToolsEnablePermalinksBackend on group1 wikis (T315353)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [21:35:54] MatmaRex: it's at mwdebug1001 now, can you test? [21:36:02] yeah [21:37:42] urbanecm: seems good [21:37:50] great, syncing [21:41:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [21:41:52] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:854606|Enable wgDiscussionToolsEnablePermalinksBackend on group1 wikis (T315353)]] (duration: 06m 36s) [21:41:58] T315353: Create database tables for permalinks in production wikis, and enable the feature - https://phabricator.wikimedia.org/T315353 [21:42:22] when running the maintenance script, can you redirect the output to a file? i'm curious to see any other errors it prints. 
(also curious to see how long it takes on each wiki, if we can record that… should be a couple of days in total) [21:42:37] yeah, i will do that this time [21:42:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [21:42:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [21:43:03] MatmaRex: what resources does the script use? is it mainly parsoid? or mainly db servers? [21:43:14] I'm wondering if it makes sense to run it in parallel for each DB section [21:43:33] both, but parsoid is the slow part [21:43:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [21:43:43] okay [21:44:12] it might be okay to run in parallel, but i don't want to accidentally cause some issues by doing that [21:44:16] yeah [21:44:18] i'll do it serial [21:47:12] urbanecm: ehh, i'm seeing some errors [21:47:21] MatmaRex: which errors? [21:47:26] do i need to revert something? [21:47:27] (03PS1) 10Jon Harald Søby: Add no=>nb to wgInterlanguageLinkCodeMap for some multilingual wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854618 (https://phabricator.wikimedia.org/T322696) [21:48:02] urbanecm: from my config deployment. not sure [21:48:04] Wikimedia\Rdbms\DBQueryError: Error 1062: Duplicate entry 'h-EranBot-2021-01-23T21:18:00.000Z' for key 'it_itemname' [21:48:04] Function: MediaWiki\Extension\DiscussionTools\ThreadItemStore::insertThreadItems [21:48:24] (03PS2) 10Jon Harald Søby: Add no=>nb to wgInterlanguageLinkCodeMap for some multilingual wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854618 (https://phabricator.wikimedia.org/T322696) [21:48:37] that's quite frequent [21:49:18] MatmaRex: do you know why that's happening? 
[21:49:26] if not, i prefer reverting the config patch and re-deploying later, when known [21:50:09] let's revert i guess [21:50:26] i think it's harmless but i should document why [21:50:42] we're not doing the script either, then. sorry about that [21:50:59] no problem [21:51:13] i prepared a wrapper script to capture output&runtimes [21:51:16] but we can use it later [21:51:26] oh neat [21:51:42] MatmaRex: double-checking, I'm reverting https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/854606, right? [21:52:21] yes [21:52:53] (03PS1) 10Urbanecm: Revert "Enable wgDiscussionToolsEnablePermalinksBackend on group1 wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854626 [21:52:55] (03CR) 10TrainBranchBot: "urbanecm@deploy1002 created a revert of this change as Ie0970b1c7ea06ecc45992723dbbe1ced823c762e" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854606 (https://phabricator.wikimedia.org/T315353) (owner: 10Bartosz Dziewoński) [21:53:06] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854626 (owner: 10Urbanecm) [21:53:18] (03CR) 10Urbanecm: [V: 03+2] Revert "Enable wgDiscussionToolsEnablePermalinksBackend on group1 wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854626 (owner: 10Urbanecm) [21:53:33] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:854626|Revert "Enable wgDiscussionToolsEnablePermalinksBackend on group1 wikis"]] [21:53:52] !log urbanecm@deploy1002 urbanecm and urbanecm: Backport for [[gerrit:854626|Revert "Enable wgDiscussionToolsEnablePermalinksBackend on group1 wikis"]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [21:54:36] (03PS3) 10Jon Harald Søby: Add no=>nb to wgInterlanguageLinkCodeMap for some multilingual wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854618 
(https://phabricator.wikimedia.org/T322696) [21:54:48] syncing [21:54:56] MatmaRex: fwiw the script's https://phabricator.wikimedia.org/P38771, in case you're interested in it [21:55:57] urbanecm: thanks. i thought there was some wrapper for running scripts on wikis from a list already! [21:56:06] there is, but it doesn't capture runtime :D [21:56:18] oh [21:56:34] maybe i should just add that to the script itself [21:57:01] yeah, it'd make stuff easier a bit [21:58:37] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:854626|Revert "Enable wgDiscussionToolsEnablePermalinksBackend on group1 wikis"]] (duration: 05m 04s) [21:58:41] MatmaRex: and, reverted [21:58:43] anything else? [21:58:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [21:58:54] urbanecm: thanks [21:59:09] databases are the worst [21:59:28] heh :) [21:59:35] !log UTC late evening B&C window done [21:59:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [21:59:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [22:00:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [22:00:48] MatmaRex: fwiw, for true beta-only patches (-labs.php changes only), i don't mind +2'ing them whenever (and other deployers probably don't either), feel free to ping one of deployers here and it can be done quickly. For patches that touch the prod IS.php, those need to be synced (even if no-op), and those are better within a window. [22:01:15] hm, okay [22:01:32] as a deployer, it's useful for me to know they're no-op from the calendar, as i can skip mwdebug for them, making the sync a bit faster. 
[22:01:34] hope it helps :) [22:02:20] looks like exceptions stopped appearing [22:17:13] 10SRE, 10Fundraising-Backlog, 10Traffic-Icebox, 10fr-donorservices, and 3 others: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (10EWilfong_WMF) Thanks for the feedback and requirements documentation, @Vgutierrez. Acoustic, the vendor in this case, doesn't have specifi... [23:29:22] (03Abandoned) 10Krinkle: mediawiki: Remove imports redundant with `profile::mediawiki::common` [puppet] - 10https://gerrit.wikimedia.org/r/842934 (owner: 10Krinkle) [23:46:07] (03PS1) 10Andrea Denisse: netmon: Add the netmon role to netmon2002 [puppet] - 10https://gerrit.wikimedia.org/r/854624 [23:49:27] (03PS2) 10Andrea Denisse: netmon: Add the netmon role to netmon2002 [puppet] - 10https://gerrit.wikimedia.org/r/854624