[00:01:50] SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install lists2001.codfw.wmnet - https://phabricator.wikimedia.org/T342375 (Papaul)
[00:09:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P50020 and previous config saved to /var/cache/conftool/dbconfig/20230803-000904-ladsgroup.json
[00:11:44] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host titan2001.mgmt.codfw.wmnet with reboot policy FORCED
[00:13:30] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host titan2002.mgmt.codfw.wmnet with reboot policy FORCED
[00:20:56] RECOVERY - cinder-volume process on cloudcontrol1007 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python.* /usr/bin/cinder-volume https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[00:24:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P50021 and previous config saved to /var/cache/conftool/dbconfig/20230803-002410-ladsgroup.json
[00:26:44] PROBLEM - dump of es5 in codfw on backupmon1001 is CRITICAL: dump for es5 at codfw (es2025) taken more than a week ago: Most recent backup 2023-07-25 00:00:05 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:33:20] SRE, noc.wikimedia.org, serviceops: make noc.wikimedia.org active/active (was: improve mw maintenance server switch over and discovery names) - https://phabricator.wikimedia.org/T265936 (Krinkle) Open→Resolved a:Joe Presumed fixed by {T341859}. In particular: * [operations/puppet] servi...
[00:36:23] (PS3) Krinkle: noc: Fix various PHP errors that prevent db.php from working locally [mediawiki-config] - https://gerrit.wikimedia.org/r/944355 (https://phabricator.wikimedia.org/T341859)
[00:38:41] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/945006
[00:38:47] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/945006 (owner: TrainBranchBot)
[00:39:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T342617)', diff saved to https://phabricator.wikimedia.org/P50022 and previous config saved to /var/cache/conftool/dbconfig/20230803-003916-ladsgroup.json
[00:39:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[00:39:21] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[00:39:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[00:39:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1202 (T342617)', diff saved to https://phabricator.wikimedia.org/P50023 and previous config saved to /var/cache/conftool/dbconfig/20230803-003939-ladsgroup.json
[00:54:46] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/945006 (owner: TrainBranchBot)
[00:59:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T342617)', diff saved to https://phabricator.wikimedia.org/P50024 and previous config saved to /var/cache/conftool/dbconfig/20230803-005908-ladsgroup.json
[00:59:12] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[01:14:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P50025 and previous config saved to /var/cache/conftool/dbconfig/20230803-011414-ladsgroup.json
[01:29:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P50026 and previous config saved to /var/cache/conftool/dbconfig/20230803-012920-ladsgroup.json
[01:31:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T342617)', diff saved to https://phabricator.wikimedia.org/P50027 and previous config saved to /var/cache/conftool/dbconfig/20230803-013123-ladsgroup.json
[01:31:27] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[01:44:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T342617)', diff saved to https://phabricator.wikimedia.org/P50028 and previous config saved to /var/cache/conftool/dbconfig/20230803-014426-ladsgroup.json
[01:44:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2159.codfw.wmnet with reason: Maintenance
[01:44:30] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[01:44:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2159.codfw.wmnet with reason: Maintenance
[01:44:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[01:44:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[01:45:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2159 (T342617)', diff saved to https://phabricator.wikimedia.org/P50029 and previous config saved to /var/cache/conftool/dbconfig/20230803-014503-ladsgroup.json
[01:46:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P50030 and previous config saved to /var/cache/conftool/dbconfig/20230803-014629-ladsgroup.json
[02:01:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P50031 and previous config saved to /var/cache/conftool/dbconfig/20230803-020137-ladsgroup.json
[02:04:12] PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:04:20] PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:16:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T342617)', diff saved to https://phabricator.wikimedia.org/P50032 and previous config saved to /var/cache/conftool/dbconfig/20230803-021643-ladsgroup.json
[02:16:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[02:16:47] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[02:16:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[02:32:57] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host titan2002.mgmt.codfw.wmnet with reboot policy FORCED
[02:41:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[02:46:17] SRE, All-and-every-Wikisource, Product-Analytics, Bengali-Sites, SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607 (Bodhisattwa) Mail threads have been started on this issue at [[ https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wi...
[02:49:40] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:00:04] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:11:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[03:14:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T342617)', diff saved to https://phabricator.wikimedia.org/P50033 and previous config saved to /var/cache/conftool/dbconfig/20230803-031359-ladsgroup.json
[03:14:04] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[03:29:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P50034 and previous config saved to /var/cache/conftool/dbconfig/20230803-032905-ladsgroup.json
[03:44:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P50035 and previous config saved to /var/cache/conftool/dbconfig/20230803-034411-ladsgroup.json
[03:52:10] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:52:38] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:56:30] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50420 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:56:58] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.276 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:59:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T342617)', diff saved to https://phabricator.wikimedia.org/P50036 and previous config saved to /var/cache/conftool/dbconfig/20230803-035917-ladsgroup.json
[03:59:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance
[03:59:26] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[03:59:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance
[03:59:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3317 (T342617)', diff saved to https://phabricator.wikimedia.org/P50037 and previous config saved to /var/cache/conftool/dbconfig/20230803-035940-ladsgroup.json
[04:42:31] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42764/console" [puppet] - https://gerrit.wikimedia.org/r/944247 (https://phabricator.wikimedia.org/T297815) (owner: Giuseppe Lavagetto)
[04:42:49] (CR) Giuseppe Lavagetto: [V: +1 C: +2] mediawiki::wanrouter_cache: add wikifunctions [puppet] - https://gerrit.wikimedia.org/r/944247 (https://phabricator.wikimedia.org/T297815) (owner: Giuseppe Lavagetto)
[04:48:16] (PS1) KartikMistry: Update MinT to 2023-08-02-142037-production [deployment-charts] - https://gerrit.wikimedia.org/r/945036 (https://phabricator.wikimedia.org/T338292)
[04:48:40] (PS1) Giuseppe Lavagetto: mw-on-k8s: add wikifunction pools [puppet] - https://gerrit.wikimedia.org/r/945037 (https://phabricator.wikimedia.org/T297815)
[04:50:54] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42765/console" [puppet] - https://gerrit.wikimedia.org/r/945037 (https://phabricator.wikimedia.org/T297815) (owner: Giuseppe Lavagetto)
[04:53:34] ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (phaultfinder)
[04:56:40] (CR) Giuseppe Lavagetto: [V: +1 C: +2] mw-on-k8s: add wikifunction pools [puppet] - https://gerrit.wikimedia.org/r/945037 (https://phabricator.wikimedia.org/T297815) (owner: Giuseppe Lavagetto)
[05:07:45] (PS1) Giuseppe Lavagetto: MediaWiki: add wikifunctions pool to mcrouter [deployment-charts] - https://gerrit.wikimedia.org/r/945040 (https://phabricator.wikimedia.org/T297815)
[05:08:50] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:21:47] <_joe_> jouncebot: nowandnext
[05:21:47] No deployments scheduled for the next 0 hour(s) and 38 minute(s)
[05:21:47] In 0 hour(s) and 38 minute(s): MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230803T0600)
[05:21:47] In 0 hour(s) and 38 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230803T0600)
[05:29:46] (PS2) Giuseppe Lavagetto: MediaWiki: add wikifunctions pool to mcrouter [deployment-charts] - https://gerrit.wikimedia.org/r/945040 (https://phabricator.wikimedia.org/T297815)
[05:32:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T342617)', diff saved to https://phabricator.wikimedia.org/P50038 and previous config saved to /var/cache/conftool/dbconfig/20230803-053259-ladsgroup.json
[05:33:05] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[05:33:42] (CR) Giuseppe Lavagetto: [C: +2] MediaWiki: add wikifunctions pool to mcrouter [deployment-charts] - https://gerrit.wikimedia.org/r/945040 (https://phabricator.wikimedia.org/T297815) (owner: Giuseppe Lavagetto)
[05:34:15] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[05:34:21] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[05:34:29] (Merged) jenkins-bot: MediaWiki: add wikifunctions pool to mcrouter [deployment-charts] - https://gerrit.wikimedia.org/r/945040 (https://phabricator.wikimedia.org/T297815) (owner: Giuseppe Lavagetto)
[05:43:44] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 26 hosts with reason: Primary switchover s6 T343296
[05:43:47] T343296: Switchover s6 master (db2129 -> db2114) - https://phabricator.wikimedia.org/T343296
[05:44:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 26 hosts with reason: Primary switchover s6 T343296
[05:44:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db2114 with weight 0 T343296', diff saved to https://phabricator.wikimedia.org/P50039 and previous config saved to /var/cache/conftool/dbconfig/20230803-054418-marostegui.json
[05:46:37] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[05:46:43] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[05:48:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P50040 and previous config saved to /var/cache/conftool/dbconfig/20230803-054805-ladsgroup.json
[05:50:12] (CR) Stevemunene: airflow-wmde: Create scap deployment source for wmde (1 comment) [puppet] - https://gerrit.wikimedia.org/r/940939 (https://phabricator.wikimedia.org/T340648) (owner: Stevemunene)
[05:51:34] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - mwdebug_4444: Servers kubernetes1022.eqiad.wmnet, kubernetes1012.eqiad.wmnet, kubernetes1020.eqiad.wmnet, kubernetes1018.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1021.eqiad.wmnet, kubernetes1019.eqiad.wmnet, kubernetes1008.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:51:54] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - mwdebug_4444: Servers kubernetes1008.eqiad.wmnet, kubernetes1022.eqiad.wmnet, kubernetes1019.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1018.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1017.eqiad.wmnet, kubernetes1006.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:52:19] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[05:52:29] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[05:57:04] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - mwdebug_4444: Servers kubernetes2010.codfw.wmnet, kubernetes2011.codfw.wmnet, kubernetes2023.codfw.wmnet, kubernetes2015.codfw.wmnet, kubernetes2019.codfw.wmnet, kubernetes2012.codfw.wmnet, kubernetes2021.codfw.wmnet, kubernetes2022.codfw.wmnet, kubernetes2016.codfw.wmnet, kubernetes2008.codfw.wmnet are marked down but pooled https://wikitech.wikimed
[05:57:04] iki/PyBal
[05:57:46] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - mwdebug_4444: Servers kubernetes2010.codfw.wmnet, kubernetes2013.codfw.wmnet, kubernetes2024.codfw.wmnet, kubernetes2011.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2023.codfw.wmnet, kubernetes2015.codfw.wmnet, kubernetes2014.codfw.wmnet, kubernetes2022.codfw.wmnet, kubernetes2016.codfw.wmnet are marked down but pooled https://wikitech.wikimed
[05:57:46] iki/PyBal
[05:58:38] _joe_: marostegui OK to deploy MinT now (not sure about above errors happening..)
[05:58:53] yeah
[05:58:54] go for it
[05:58:59] Thanks!
[05:59:00] ah wait
[05:59:02] the above errors
[05:59:05] I don't know
[05:59:07] <_joe_> marostegui: it's ok
[05:59:08] _joe_: ^?
[05:59:12] right
[05:59:14] <_joe_> yes it's me
[05:59:19] <_joe_> I am fixing it
[05:59:34] <_joe_> it's mw-debug so no big deal
[05:59:39] <_joe_> that's what I am using it for :)
[06:00:04] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230803T0600)
[06:00:04] kormat, marostegui, and Amir1: I, the Bot under the Fountain, call upon thee, The Deployer, to do Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230803T0600).
[06:00:10] (CR) Marostegui: [C: +2] mariadb: Promote db2114 to s6 master [puppet] - https://gerrit.wikimedia.org/r/944342 (https://phabricator.wikimedia.org/T343296) (owner: Gerrit maintenance bot)
[06:00:28] !log Starting s6 codfw failover from db2129 to db2114 - T343296
[06:00:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:35] (CR) KartikMistry: [C: +2] Update MinT to 2023-08-02-142037-production [deployment-charts] - https://gerrit.wikimedia.org/r/945036 (https://phabricator.wikimedia.org/T338292) (owner: KartikMistry)
[06:00:36] T343296: Switchover s6 master (db2129 -> db2114) - https://phabricator.wikimedia.org/T343296
[06:00:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db2114 to s6 primary T343296', diff saved to https://phabricator.wikimedia.org/P50041 and previous config saved to /var/cache/conftool/dbconfig/20230803-060055-marostegui.json
[06:01:22] (Merged) jenkins-bot: Update MinT to 2023-08-02-142037-production [deployment-charts] - https://gerrit.wikimedia.org/r/945036 (https://phabricator.wikimedia.org/T338292) (owner: KartikMistry)
[06:01:31] I'm around ish but this is codfw so not much needed
[06:01:49] (PS1) Giuseppe Lavagetto: MediaWiki: use proper format for mcrouter route [deployment-charts] - https://gerrit.wikimedia.org/r/945150
[06:02:05] (CR) Giuseppe Lavagetto: [V: +2 C: +2] MediaWiki: use proper format for mcrouter route [deployment-charts] - https://gerrit.wikimedia.org/r/945150 (owner: Giuseppe Lavagetto)
[06:02:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2129 T343296', diff saved to https://phabricator.wikimedia.org/P50042 and previous config saved to /var/cache/conftool/dbconfig/20230803-060241-marostegui.json
[06:02:46] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[06:03:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P50043 and previous config saved to /var/cache/conftool/dbconfig/20230803-060311-ladsgroup.json
[06:03:44] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[06:03:51] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[06:04:59] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[06:05:09] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[06:05:59] (CR) Marostegui: [C: +1] mariadb: Switch candidate host of s1 [puppet] - https://gerrit.wikimedia.org/r/944897 (https://phabricator.wikimedia.org/T342284) (owner: Ladsgroup)
[06:06:32] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:06:50] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:07:31] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[06:07:34] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:08:14] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:11:26] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply
[06:13:44] (PS1) Marostegui: db2129: Migrate to 10.6 [puppet] - https://gerrit.wikimedia.org/r/945462 (https://phabricator.wikimedia.org/T334650)
[06:17:08] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply
[06:18:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T342617)', diff saved to https://phabricator.wikimedia.org/P50044 and previous config saved to /var/cache/conftool/dbconfig/20230803-061817-ladsgroup.json
[06:18:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance
[06:18:21] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[06:18:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance
[06:18:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3317 (T342617)', diff saved to https://phabricator.wikimedia.org/P50045 and previous config saved to /var/cache/conftool/dbconfig/20230803-061827-ladsgroup.json
[06:21:23] (PS1) Elukey: services: bump eventgate-main's Docker image [deployment-charts] - https://gerrit.wikimedia.org/r/945487 (https://phabricator.wikimedia.org/T343002)
[06:22:04] (CR) Marostegui: [C: +2] db2129: Migrate to 10.6 [puppet] - https://gerrit.wikimedia.org/r/945462 (https://phabricator.wikimedia.org/T334650) (owner: Marostegui)
[06:25:09] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply
[06:28:15] SRE, ops-codfw, DBA: codfw: es2025 lost System Board Fan6 - https://phabricator.wikimedia.org/T343254 (Marostegui) Thank you. I am leaving the host off MariaDB started until @Jhancock.wm has finished the onsite checks. @Jhancock.wm the host has been powered off again now so you can check. Please brin...
[06:30:07] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply
[06:31:50] !log Updated MinT to 2023-08-02-142037-production (T338292)
[06:31:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:31:53] T338292: Add sentence segmenter feature - https://phabricator.wikimedia.org/T338292
[06:33:36] !log oblivian@deploy1002 helmfile [codfw] [canary] START helmfile.d/services/mw-jobrunner : sync
[06:33:36] !log oblivian@deploy1002 helmfile [codfw] [main] START helmfile.d/services/mw-jobrunner : sync
[06:33:43] !log oblivian@deploy1002 helmfile [codfw] [canary] DONE helmfile.d/services/mw-jobrunner : sync
[06:33:44] !log oblivian@deploy1002 helmfile [codfw] [main] DONE helmfile.d/services/mw-jobrunner : sync
[06:35:20] !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: GitLab 16 major version upgrade
[06:35:27] (CR) Elukey: C:bigtop::hadoop move net-topology.py to files. (2 comments) [puppet] - https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) (owner: Slyngshede)
[06:36:14] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:37:04] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:37:08] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 1.670 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:37:12] (PS4) Giuseppe Lavagetto: mediawiki::wancache: add the wikifunctions pools and routes [puppet] - https://gerrit.wikimedia.org/r/944248 (https://phabricator.wikimedia.org/T297815)
[06:38:00] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50420 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:40:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:40:07] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42766/console" [puppet] - https://gerrit.wikimedia.org/r/944248 (https://phabricator.wikimedia.org/T297815) (owner: Giuseppe Lavagetto)
[06:41:54] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42767/console" [puppet] - https://gerrit.wikimedia.org/r/944248 (https://phabricator.wikimedia.org/T297815) (owner: Giuseppe Lavagetto)
[06:42:31] (CR) Giuseppe Lavagetto: [V: +1 C: +2] mediawiki::wancache: add the wikifunctions pools and routes [puppet] - https://gerrit.wikimedia.org/r/944248 (https://phabricator.wikimedia.org/T297815) (owner: Giuseppe Lavagetto)
[06:45:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:55:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2129 (re)pooling @ 1%: Repooling after migration', diff saved to https://phabricator.wikimedia.org/P50046 and previous config saved to /var/cache/conftool/dbconfig/20230803-065529-root.json
[06:58:06] (PS1) Giuseppe Lavagetto: Add wikifunctions object cache [mediawiki-config] - https://gerrit.wikimedia.org/r/945534 (https://phabricator.wikimedia.org/T297815)
[07:00:05] Amir1, apergos, and jnuche: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230803T0700).
[07:00:11] morning! there are no trainees signed up today, although two people are yet to be (re)scheduled. and there are no patches scheduled for deployment during this sleepy August morning window.
[07:01:10] if any self-deployers want to sneak something in at the last minute, now's your chance.
[07:04:10] (CR) Elukey: [C: +2] services: bump eventgate-main's Docker image [deployment-charts] - https://gerrit.wikimedia.org/r/945487 (https://phabricator.wikimedia.org/T343002) (owner: Elukey)
[07:06:42] SRE, Wikimedia-Apache-configuration, serviceops: Incorrect handling of ETags taking precedence over timestamps in conditional requests - https://phabricator.wikimedia.org/T320241 (Ifrahkhanyaree)
[07:06:46] SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install pc201[56] - https://phabricator.wikimedia.org/T342163 (Marostegui) Thank you @Papaul and @Jhancock.wm!
[07:08:00] SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install pc201[56] - https://phabricator.wikimedia.org/T342163 (Marostegui)
[07:09:09] SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install pc101[56] - https://phabricator.wikimedia.org/T342164 (Marostegui)
[07:10:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2129 (re)pooling @ 3%: Repooling after migration', diff saved to https://phabricator.wikimedia.org/P50047 and previous config saved to /var/cache/conftool/dbconfig/20230803-071034-root.json
[07:11:10] (PS1) Marostegui: site: Add pc101[56] as in setup [puppet] - https://gerrit.wikimedia.org/r/945535 (https://phabricator.wikimedia.org/T342164)
[07:12:07] (CR) Marostegui: [C: +2] site: Add pc101[56] as in setup [puppet] - https://gerrit.wikimedia.org/r/945535 (https://phabricator.wikimedia.org/T342164) (owner: Marostegui)
[07:14:07] SRE, Infrastructure-Foundations: Integrate Bookworm 12.1 point update - https://phabricator.wikimedia.org/T343121 (MoritzMuehlenhoff)
[07:24:43] (PS1) Marostegui: install_server: Allow install db12[26-33] [puppet] - https://gerrit.wikimedia.org/r/945536 (https://phabricator.wikimedia.org/T342176)
[07:25:24] (CR) Marostegui: [C: +2] install_server: Allow install db12[26-33] [puppet] - https://gerrit.wikimedia.org/r/945536 (https://phabricator.wikimedia.org/T342176) (owner: Marostegui)
[07:25:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2129 (re)pooling @ 5%: Repooling after migration', diff saved to https://phabricator.wikimedia.org/P50048 and previous config saved to /var/cache/conftool/dbconfig/20230803-072539-root.json
[07:28:23] SRE, LDAP-Access-Requests: Grant wmf and turnilo/superset access for Rae Adimer - https://phabricator.wikimedia.org/T342591 (fgiunchedi) Open→Resolved a:fgiunchedi
[07:36:04] !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-main: sync
[07:36:19] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-main: sync
[07:38:37] !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-main: sync
[07:39:04] !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: sync
[07:40:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2129 (re)pooling @ 10%: Repooling after migration', diff saved to https://phabricator.wikimedia.org/P50049 and previous config saved to /var/cache/conftool/dbconfig/20230803-074044-root.json
[07:50:58] PROBLEM - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 3326 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring
[07:51:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[07:51:19] !log jelto@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: GitLab 16 major version upgrade
[07:51:24] (ProbeDown) firing: (2) Service gitlab1004:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:52:26] RECOVERY - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 155618 bytes in 0.624 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring
[07:53:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T342617)', diff saved to https://phabricator.wikimedia.org/P50050 and previous config saved to /var/cache/conftool/dbconfig/20230803-075305-ladsgroup.json
[07:53:13] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[07:53:14] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:54:44] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:55:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2129 (re)pooling @ 25%: Repooling after migration', diff saved to https://phabricator.wikimedia.org/P50051 and previous config saved to /var/cache/conftool/dbconfig/20230803-075548-root.json
[07:56:24] (ProbeDown) resolved: (2) Service gitlab1004:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:57:07] ^ gitlab should recover soon
[07:57:23] ack
[07:58:43] ah this is a resolved actually not a firing. So yes recovered :)
[07:59:12] RECOVERY - Check systemd state on config-master2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:59:38] !log installing Linux 5.10.179 on Buster hosts with Linux 5.10
[07:59:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:45] (PS2) Ladsgroup: mariadb: Switch candidate host of s1 [puppet] - https://gerrit.wikimedia.org/r/944897 (https://phabricator.wikimedia.org/T342284)
[07:59:52] (CR) Ladsgroup: [V: +2 C: +2] mariadb: Switch candidate host of s1 [puppet] - https://gerrit.wikimedia.org/r/944897 (https://phabricator.wikimedia.org/T342284) (owner: Ladsgroup)
[08:00:04] dancy and jnuche: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230803T0800).
[08:08:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P50052 and previous config saved to /var/cache/conftool/dbconfig/20230803-080812-ladsgroup.json
[08:10:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2129 (re)pooling @ 50%: Repooling after migration', diff saved to https://phabricator.wikimedia.org/P50053 and previous config saved to /var/cache/conftool/dbconfig/20230803-081053-root.json
[08:13:21] (CR) Muehlenhoff: [C: +2] docker_registry_ha: Remove Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/944934 (owner: Muehlenhoff)
[08:19:44] (CR) Muehlenhoff: [C: +2] cassandra: Pass ports in firewall-agnostic format [puppet] - https://gerrit.wikimedia.org/r/944896 (owner: Muehlenhoff)
[08:22:33] (PS1) Jgiannelos: wikifeeds: Bumps service to latest image [deployment-charts] - https://gerrit.wikimedia.org/r/945539
[08:23:18] !log ladsgroup@cumin1001 dbctl
commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P50054 and previous config saved to /var/cache/conftool/dbconfig/20230803-082318-ladsgroup.json [08:24:24] (03CR) 10Jgiannelos: [C: 03+2] wikifeeds: Bumps service to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/945539 (owner: 10Jgiannelos) [08:25:28] (03Merged) 10jenkins-bot: wikifeeds: Bumps service to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/945539 (owner: 10Jgiannelos) [08:25:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2129 (re)pooling @ 75%: Repooling after migration', diff saved to https://phabricator.wikimedia.org/P50055 and previous config saved to /var/cache/conftool/dbconfig/20230803-082558-root.json [08:28:28] (03CR) 10Volans: Modify install and apt server config to support Juniper ZTP via HTTP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/942682 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney) [08:28:47] (03CR) 10Muehlenhoff: [C: 03+2] PCC: Pass ports without ferm-specific service constants [puppet] - 10https://gerrit.wikimedia.org/r/931578 (owner: 10Muehlenhoff) [08:31:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [08:35:32] (03CR) 10Ayounsi: Policy and definition updates for post-migration esams ranges (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/944216 (https://phabricator.wikimedia.org/T343214) (owner: 10Cathal Mooney) [08:37:21] (03CR) 10Ayounsi: [C: 03+1] Adjust network prepare-upgrade cookbook to use TCP 8080 [cookbooks] - 10https://gerrit.wikimedia.org/r/942638 (https://phabricator.wikimedia.org/T337028) (owner: 10Cathal Mooney) [08:38:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T342617)', 
diff saved to https://phabricator.wikimedia.org/P50056 and previous config saved to /var/cache/conftool/dbconfig/20230803-083824-ladsgroup.json [08:38:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance [08:38:28] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [08:38:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance [08:38:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2182 (T342617)', diff saved to https://phabricator.wikimedia.org/P50057 and previous config saved to /var/cache/conftool/dbconfig/20230803-083845-ladsgroup.json [08:41:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2129 (re)pooling @ 100%: Repooling after migration', diff saved to https://phabricator.wikimedia.org/P50058 and previous config saved to /var/cache/conftool/dbconfig/20230803-084103-root.json [08:44:32] !log installing yajl security updates [08:44:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:28] !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-main: sync [08:53:57] !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: sync [08:55:25] !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-main: sync [08:55:58] !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: sync [09:03:04] !log installing systemd bugfix updates from Bookworm 12.1 point release [09:03:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:43] !log Deploying rename changes for mw149[7-8] to kubernetes102[5-6] - T343306 [09:03:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:45] T343306: Wikikube CPU capacity issue - https://phabricator.wikimedia.org/T343306 
[09:03:46] (03CR) 10Clément Goubert: [C: 03+2] Rename mw149[7-8] to kubernetes102[5-6] [puppet] - 10https://gerrit.wikimedia.org/r/944895 (https://phabricator.wikimedia.org/T343306) (owner: 10Clément Goubert) [09:07:32] I think someone ought to take a look at T337649 and assess whether big or small red buttons need pushing. [09:07:33] T337649: Thumbor fails to render thumbnails of djvu/tiff/pdf files quite often in eqiad - https://phabricator.wikimedia.org/T337649 [09:07:42] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS2914/IPv4: Idle - NTT, AS2914/IPv6: Idle - NTT https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:09:27] Granted I'm just a dumb end user, but that looks kinda exclamation-marks-and-all-caps to me. [09:10:00] hnowlan: ^ [09:11:19] 10SRE, 10SRE-Access-Requests, 10Product-Analytics, 10CommRel-Specialists-Support (Jul-Sep-2023): Request Access to Superset querying presto_analytics_hive datasets - https://phabricator.wikimedia.org/T343320 (10fgiunchedi) Thank you @SNowick_WMF ! @odimitrijevic @Milimetric we're seeking approval for this... 
[09:11:50] 10SRE, 10Traffic: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10Fabfur) [09:11:58] (03PS1) 10Filippo Giunchedi: profile: remove jmxtrans mention from zookeeper, obsolete [puppet] - 10https://gerrit.wikimedia.org/r/945541 [09:12:00] (03PS1) 10Filippo Giunchedi: admin: add amalr [puppet] - 10https://gerrit.wikimedia.org/r/945542 (https://phabricator.wikimedia.org/T343320) [09:12:05] (03PS1) 10AikoChou: ml-services: update outlink docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/945543 (https://phabricator.wikimedia.org/T343002) [09:12:43] (03CR) 10Elukey: [C: 03+1] ml-services: update outlink docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/945543 (https://phabricator.wikimedia.org/T343002) (owner: 10AikoChou) [09:13:10] (03CR) 10AikoChou: [C: 03+2] ml-services: update outlink docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/945543 (https://phabricator.wikimedia.org/T343002) (owner: 10AikoChou) [09:13:12] (03CR) 10Elukey: [C: 03+1] profile: remove jmxtrans mention from zookeeper, obsolete [puppet] - 10https://gerrit.wikimedia.org/r/945541 (owner: 10Filippo Giunchedi) [09:13:59] (03Merged) 10jenkins-bot: ml-services: update outlink docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/945543 (https://phabricator.wikimedia.org/T343002) (owner: 10AikoChou) [09:14:29] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for adee_wmde - https://phabricator.wikimedia.org/T342969 (10fgiunchedi) @adee_wmde we're missing out of band verification of your ssh key for this request and then we're good to go -- could you publish the ssh key to your user page on wiki?... 
[09:15:05] (03CR) 10Filippo Giunchedi: [C: 03+2] profile: remove jmxtrans mention from zookeeper, obsolete [puppet] - 10https://gerrit.wikimedia.org/r/945541 (owner: 10Filippo Giunchedi) [09:16:42] !log aikochou@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [09:17:01] (03PS1) 10Muehlenhoff: Add Cumin alias for zookeeper-flink [puppet] - 10https://gerrit.wikimedia.org/r/945544 [09:17:12] (03PS1) 10Filippo Giunchedi: admin: add maryana to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/945546 (https://phabricator.wikimedia.org/T342797) [09:17:20] (03PS2) 10Muehlenhoff: Add Cumin alias for zookeeper-flink [puppet] - 10https://gerrit.wikimedia.org/r/945544 [09:17:50] !log aikochou@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [09:19:14] (03PS1) 10Clément Goubert: Add kubernetes102[5,6] to its k8s_neighbors list [homer/public] - 10https://gerrit.wikimedia.org/r/945547 (https://phabricator.wikimedia.org/T343306) [09:19:47] (03PS1) 10Fabfur: Version 1.1.0-3 [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/945548 (https://phabricator.wikimedia.org/T342154) [09:20:13] (03PS2) 10Fabfur: Release 1.1.0-3 [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/945548 (https://phabricator.wikimedia.org/T342154) [09:20:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [09:21:07] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/945542 (https://phabricator.wikimedia.org/T343320) (owner: 10Filippo Giunchedi) [09:21:25] !log aikochou@deploy1002 helmfile 
[ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [09:22:29] (03CR) 10CI reject: [V: 04-1] Release 1.1.0-3 [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/945548 (https://phabricator.wikimedia.org/T342154) (owner: 10Fabfur) [09:23:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance [09:23:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance [09:23:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2129 (T342617)', diff saved to https://phabricator.wikimedia.org/P50059 and previous config saved to /var/cache/conftool/dbconfig/20230803-092338-ladsgroup.json [09:23:41] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [09:25:37] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [09:25:41] (03PS1) 10Ayounsi: Rename protocol icmpv6 to icmp6 [homer/public] - 10https://gerrit.wikimedia.org/r/945550 [09:26:38] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:26:48] (03CR) 10Muehlenhoff: [C: 03+2] Add Cumin alias for zookeeper-flink [puppet] - 10https://gerrit.wikimedia.org/r/945544 (owner: 10Muehlenhoff) [09:27:12] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:33:00] (03CR) 10Filippo Giunchedi: [C: 03+2] admin: add amalr [puppet] - 10https://gerrit.wikimedia.org/r/945542 (https://phabricator.wikimedia.org/T343320) (owner: 10Filippo Giunchedi) [09:33:54] 10SRE, 10SRE-Access-Requests, 10Product-Analytics, 
10CommRel-Specialists-Support (Jul-Sep-2023), 10Patch-For-Review: Request Access to Superset querying presto_analytics_hive datasets - https://phabricator.wikimedia.org/T343320 (10fgiunchedi) >>! In T343320#9065669, @fgiunchedi wrote: > Thank you @SNowic... [09:34:48] 10SRE, 10SRE-Access-Requests, 10Product-Analytics, 10CommRel-Specialists-Support (Jul-Sep-2023), 10Patch-For-Review: Request Access to Superset querying presto_analytics_hive datasets - https://phabricator.wikimedia.org/T343320 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi @ARamadan-WMF access wi... [09:38:34] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:40:02] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:40:38] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:43:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:43:35] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10phaultfinder) [09:44:47] (03PS1) 10Muehlenhoff: installserver: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/945552 [09:50:26] 10SRE, 10Infrastructure-Foundations: Integrate Bookworm 12.1 point update - https://phabricator.wikimedia.org/T343121 (10MoritzMuehlenhoff) [09:51:34] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:52:26] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:53:50] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:59:36] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: sync [10:00:05] mvolz: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Citoid / Zotero . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230803T1000). [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230803T1000) [10:01:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:01:34] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync [10:04:26] (03PS1) 10Hnowlan: thumbor: include workers for djvu files, limit thumbnailrender concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/945553 (https://phabricator.wikimedia.org/T337649) [10:05:54] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:06:46] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:07:39] (03PS2) 10Muehlenhoff: installserver: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/945552 [10:08:46] (03PS2) 10Clément Goubert: thumbor: include workers for 
djvu files, limit thumbnailrender concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/945553 (https://phabricator.wikimedia.org/T337649) (owner: 10Hnowlan) [10:09:16] (03CR) 10Clément Goubert: [C: 03+1] thumbor: include workers for djvu files, limit thumbnailrender concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/945553 (https://phabricator.wikimedia.org/T337649) (owner: 10Hnowlan) [10:09:45] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [10:09:45] (03CR) 10Hnowlan: [C: 03+2] thumbor: include workers for djvu files, limit thumbnailrender concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/945553 (https://phabricator.wikimedia.org/T337649) (owner: 10Hnowlan) [10:10:19] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [10:10:23] !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [10:10:27] (03Merged) 10jenkins-bot: thumbor: include workers for djvu files, limit thumbnailrender concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/945553 (https://phabricator.wikimedia.org/T337649) (owner: 10Hnowlan) [10:10:37] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/945552 (owner: 10Muehlenhoff) [10:11:00] !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [10:11:05] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [10:11:32] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [10:11:33] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [10:13:02] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [10:13:15] (03PS1) 10Muehlenhoff: openldap: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/945554 [10:14:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db2129 (T342617)', diff saved to https://phabricator.wikimedia.org/P50061 and previous config saved to /var/cache/conftool/dbconfig/20230803-101441-ladsgroup.json [10:14:45] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [10:15:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T342617)', diff saved to https://phabricator.wikimedia.org/P50062 and previous config saved to /var/cache/conftool/dbconfig/20230803-101509-ladsgroup.json [10:16:32] (03PS1) 10Hnowlan: images: add memcache throttle metric [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/945556 [10:18:29] (03CR) 10Clément Goubert: [C: 03+1] images: add memcache throttle metric [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/945556 (owner: 10Hnowlan) [10:22:42] (03CR) 10Fabfur: [V: 03+1] "Skipping CI checks" [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/945548 (https://phabricator.wikimedia.org/T342154) (owner: 10Fabfur) [10:23:08] (03CR) 10Fabfur: [V: 03+2] Release 1.1.0-3 [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/945548 (https://phabricator.wikimedia.org/T342154) (owner: 10Fabfur) [10:23:52] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/945554 (owner: 10Muehlenhoff) [10:27:40] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:28:20] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:29:03] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency 
[10:29:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P50065 and previous config saved to /var/cache/conftool/dbconfig/20230803-102948-ladsgroup.json [10:30:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P50066 and previous config saved to /var/cache/conftool/dbconfig/20230803-103015-ladsgroup.json [10:31:01] (03PS1) 10Muehlenhoff: firewall: Make more Ferm-specific setup conditional to the ferm provider [puppet] - 10https://gerrit.wikimedia.org/r/945557 (https://phabricator.wikimedia.org/T336497) [10:34:03] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:36:05] 10SRE, 10LDAP-Access-Requests: Grant Access to Superset for Mpossoupe - https://phabricator.wikimedia.org/T343432 (10Mpossoupe) [10:38:33] (03PS1) 10Hnowlan: trafficserver: route wikifeeds requests via the rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/945558 (https://phabricator.wikimedia.org/T339119) [10:40:56] (03CR) 10Jgiannelos: [C: 04-1] "Lets hold on this because it looks like wikifeeds doesn't handle onthisday URL routing right." 
[puppet] - 10https://gerrit.wikimedia.org/r/945558 (https://phabricator.wikimedia.org/T339119) (owner: 10Hnowlan) [10:41:20] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:42:10] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:43:53] 10SRE, 10MW-on-K8s, 10serviceops: MW-on-k8s traffic logs fewer errors than expected (increase in jsonTruncated) - https://phabricator.wikimedia.org/T343390 (10Clement_Goubert) After a quick check, it appears we are setting ` $MaxMessageSize 64k ` on mw bare metal hosts and not on kubernetes. Patch incoming. [10:44:15] 10SRE, 10MW-on-K8s, 10serviceops: MW-on-k8s traffic logs fewer errors than expected (increase in jsonTruncated) - https://phabricator.wikimedia.org/T343390 (10Clement_Goubert) 05Open→03In progress p:05Triage→03High a:03Clement_Goubert [10:44:26] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert) [10:44:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P50067 and previous config saved to /var/cache/conftool/dbconfig/20230803-104454-ladsgroup.json [10:45:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P50068 and previous config saved to /var/cache/conftool/dbconfig/20230803-104521-ladsgroup.json [10:49:34] (03CR) 10Cathal Mooney: Modify install and apt server config to support Juniper ZTP via HTTP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/942682 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney) [10:52:15] (03PS1) 10Jforrester: WikiLambda: Add PHP code for Z2K5/'short descriptions' [extensions/WikiLambda] 
(wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/944858 (https://phabricator.wikimedia.org/T343396) [10:54:10] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Spicerack: don't write logs to disk - https://phabricator.wikimedia.org/T342079 (10Volans) Those are various different requests: 1) `logger.debug("cfssl output: %s.", str(cfssl_raw.stdout))` is part of the cookbook, if you don't want it logged to disk... [10:55:30] (03PS1) 10Clément Goubert: mediawiki: Raise syslog max_message_size [deployment-charts] - 10https://gerrit.wikimedia.org/r/945561 (https://phabricator.wikimedia.org/T343390) [10:58:26] (03CR) 10Jelto: vrts: add test VM to site (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/939349 (https://phabricator.wikimedia.org/T340027) (owner: 10AOkoth) [10:59:52] (03CR) 10Volans: [C: 04-1] "Logic it's ok but it would not work as-is, see inline for the details." [cookbooks] - 10https://gerrit.wikimedia.org/r/935687 (https://phabricator.wikimedia.org/T307792) (owner: 10Slyngshede) [11:00:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T342617)', diff saved to https://phabricator.wikimedia.org/P50069 and previous config saved to /var/cache/conftool/dbconfig/20230803-110000-ladsgroup.json [11:00:10] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [11:00:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T342617)', diff saved to https://phabricator.wikimedia.org/P50070 and previous config saved to /var/cache/conftool/dbconfig/20230803-110028-ladsgroup.json [11:07:50] (03CR) 10Hnowlan: [C: 03+1] mediawiki: Raise syslog max_message_size [deployment-charts] - 10https://gerrit.wikimedia.org/r/945561 (https://phabricator.wikimedia.org/T343390) (owner: 10Clément Goubert) [11:09:41] (03CR) 10Clément Goubert: [C: 03+2] mediawiki: Raise syslog max_message_size [deployment-charts] - 10https://gerrit.wikimedia.org/r/945561 
(https://phabricator.wikimedia.org/T343390) (owner: 10Clément Goubert) [11:10:43] (03Merged) 10jenkins-bot: mediawiki: Raise syslog max_message_size [deployment-charts] - 10https://gerrit.wikimedia.org/r/945561 (https://phabricator.wikimedia.org/T343390) (owner: 10Clément Goubert) [11:13:10] (03PS1) 10Jgiannelos: rest-gateway: Fix onthisday feed routing [deployment-charts] - 10https://gerrit.wikimedia.org/r/945567 [11:13:43] 10SRE, 10SRE-Access-Requests: Requesting access to Wiki Replicas end-to-end tiers for dr0ptp4kt - https://phabricator.wikimedia.org/T343039 (10dr0ptp4kt) Thanks @fgiunchedi and thanks for working through this with me!. Should I Phab-mention anyone or tag the ticket, or do the WMCS folks normally catch stuff w... [11:14:26] (03CR) 10Hnowlan: [C: 03+1] rest-gateway: Fix onthisday feed routing [deployment-charts] - 10https://gerrit.wikimedia.org/r/945567 (owner: 10Jgiannelos) [11:14:33] (03PS2) 10Jgiannelos: rest-gateway: Fix onthisday feed routing [deployment-charts] - 10https://gerrit.wikimedia.org/r/945567 (https://phabricator.wikimedia.org/T339119) [11:16:26] (03CR) 10Jgiannelos: "Is the matching happening with priority? I assumed it worked like that so /all/ should match first, if not fallback to the next rule." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/945567 (https://phabricator.wikimedia.org/T339119) (owner: 10Jgiannelos) [11:20:45] (03CR) 10Hnowlan: rest-gateway: Fix onthisday feed routing (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/945567 (https://phabricator.wikimedia.org/T339119) (owner: 10Jgiannelos) [11:22:00] (03PS1) 10Muehlenhoff: firewall: Ship a base profile for the nftables provider [puppet] - 10https://gerrit.wikimedia.org/r/945573 (https://phabricator.wikimedia.org/T336497) [11:22:34] (03CR) 10Hnowlan: [C: 03+2] rest-gateway: Fix onthisday feed routing [deployment-charts] - 10https://gerrit.wikimedia.org/r/945567 (https://phabricator.wikimedia.org/T339119) (owner: 10Jgiannelos) [11:22:50] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/945010 [11:23:00] (03CR) 10Jgiannelos: [C: 03+2] rest-gateway: Fix onthisday feed routing [deployment-charts] - 10https://gerrit.wikimedia.org/r/945567 (https://phabricator.wikimedia.org/T339119) (owner: 10Jgiannelos) [11:23:21] (03Merged) 10jenkins-bot: rest-gateway: Fix onthisday feed routing [deployment-charts] - 10https://gerrit.wikimedia.org/r/945567 (https://phabricator.wikimedia.org/T339119) (owner: 10Jgiannelos) [11:23:47] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [11:23:59] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [11:24:42] (03CR) 10CI reject: [V: 04-1] firewall: Ship a base profile for the nftables provider [puppet] - 10https://gerrit.wikimedia.org/r/945573 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [11:28:31] (03PS2) 10Muehlenhoff: firewall: Ship a base profile for the nftables provider [puppet] - 10https://gerrit.wikimedia.org/r/945573 (https://phabricator.wikimedia.org/T336497) [11:30:04] (03PS1) 10Clément Goubert: mediawiki::logging: k8s syslog max message size [puppet] - 
10https://gerrit.wikimedia.org/r/945584 (https://phabricator.wikimedia.org/T343390) [11:30:13] (03CR) 10Hnowlan: [C: 03+2] images: add memcache throttle metric [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/945556 (owner: 10Hnowlan) [11:31:08] (03CR) 10Hnowlan: [C: 03+1] restbase: Upgrade restbase2013 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/944959 (https://phabricator.wikimedia.org/T339298) (owner: 10Eevans) [11:31:13] (03CR) 10Hnowlan: [C: 03+1] restbase: Upgrade restbase2014 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/944960 (https://phabricator.wikimedia.org/T339298) (owner: 10Eevans) [11:31:16] (03CR) 10Hnowlan: [C: 03+1] restbase: Upgrade restbase2019 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/944961 (https://phabricator.wikimedia.org/T339298) (owner: 10Eevans) [11:31:22] (03CR) 10Hnowlan: [C: 03+1] restbase: Upgrade restbase2024 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/944963 (https://phabricator.wikimedia.org/T339298) (owner: 10Eevans) [11:31:26] (03CR) 10Hnowlan: [C: 03+1] restbase: Upgrade restbase2021 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/944962 (https://phabricator.wikimedia.org/T339298) (owner: 10Eevans) [11:33:56] (03PS4) 10Hnowlan: api-gateway: add route for metrics/knowledge-gap [deployment-charts] - 10https://gerrit.wikimedia.org/r/939656 (https://phabricator.wikimedia.org/T342213) (owner: 10Milimetric) [11:34:17] (03Merged) 10jenkins-bot: images: add memcache throttle metric [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/945556 (owner: 10Hnowlan) [11:35:06] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42768/console" [puppet] - 10https://gerrit.wikimedia.org/r/945584 (https://phabricator.wikimedia.org/T343390) (owner: 10Clément Goubert) [11:40:08] (03CR) 10Hnowlan: [C: 03+1] mediawiki::logging: k8s syslog max message 
size [puppet] - 10https://gerrit.wikimedia.org/r/945584 (https://phabricator.wikimedia.org/T343390) (owner: 10Clément Goubert) [11:40:47] (03CR) 10Hnowlan: [C: 03+2] api-gateway: add route for metrics/knowledge-gap [deployment-charts] - 10https://gerrit.wikimedia.org/r/939656 (https://phabricator.wikimedia.org/T342213) (owner: 10Milimetric) [11:40:50] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] mediawiki::logging: k8s syslog max message size [puppet] - 10https://gerrit.wikimedia.org/r/945584 (https://phabricator.wikimedia.org/T343390) (owner: 10Clément Goubert) [11:41:29] (03Merged) 10jenkins-bot: api-gateway: add route for metrics/knowledge-gap [deployment-charts] - 10https://gerrit.wikimedia.org/r/939656 (https://phabricator.wikimedia.org/T342213) (owner: 10Milimetric) [11:41:55] (03PS1) 10Muehlenhoff: nftables::file: Expand prefix to three digits [puppet] - 10https://gerrit.wikimedia.org/r/945586 [11:43:07] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/945586 (owner: 10Muehlenhoff) [11:44:24] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [11:44:34] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [11:44:44] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [11:44:51] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [11:44:59] (03CR) 10Jgiannelos: "Issue looks fixed on staging." 
[puppet] - 10https://gerrit.wikimedia.org/r/945558 (https://phabricator.wikimedia.org/T339119) (owner: 10Hnowlan) [11:45:05] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [11:45:15] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [11:45:30] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [11:45:38] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [11:46:27] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [11:46:38] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [11:46:47] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [11:46:54] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [11:47:01] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: apply [11:47:21] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [11:47:23] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [11:47:33] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [11:47:39] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [11:47:45] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [11:48:03] !log cgoubert@deploy1002 helmfile [codfw] [main] START helmfile.d/services/mw-jobrunner : sync [11:48:03] !log cgoubert@deploy1002 helmfile [codfw] [canary] START helmfile.d/services/mw-jobrunner : sync [11:48:10] !log cgoubert@deploy1002 helmfile [codfw] [canary] DONE helmfile.d/services/mw-jobrunner : sync [11:48:10] !log cgoubert@deploy1002 helmfile [codfw] [main] DONE helmfile.d/services/mw-jobrunner : sync [11:48:17] !log cgoubert@deploy1002 helmfile [eqiad] [main] START helmfile.d/services/mw-jobrunner : sync [11:48:17] 
!log cgoubert@deploy1002 helmfile [eqiad] [canary] START helmfile.d/services/mw-jobrunner : sync [11:48:21] !log cgoubert@deploy1002 helmfile [eqiad] [canary] DONE helmfile.d/services/mw-jobrunner : sync [11:48:22] !log cgoubert@deploy1002 helmfile [eqiad] [main] DONE helmfile.d/services/mw-jobrunner : sync [11:48:31] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-misc: apply [11:48:49] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-misc: apply [11:48:55] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-misc: apply [11:49:02] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-misc: apply [11:51:10] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: MW-on-k8s traffic logs fewer errors than expected (increase in jsonTruncated) - https://phabricator.wikimedia.org/T343390 (10Clement_Goubert) Deployed, I'll check logstash periodically to see if it was enough to fix the majority of cases. [11:54:20] (03PS5) 10Slyngshede: sre.ganeti.reboot_vm: Allow users to reenable Puppet. [cookbooks] - 10https://gerrit.wikimedia.org/r/935687 (https://phabricator.wikimedia.org/T307792) [11:57:00] (03CR) 10CI reject: [V: 04-1] sre.ganeti.reboot_vm: Allow users to reenable Puppet. 
[cookbooks] - 10https://gerrit.wikimedia.org/r/935687 (https://phabricator.wikimedia.org/T307792) (owner: 10Slyngshede) [12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230803T1200) [12:00:54] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [12:01:07] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [12:01:08] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [12:01:22] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [12:01:23] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [12:01:34] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [12:01:35] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [12:01:43] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [12:01:44] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [12:01:58] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [12:01:59] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [12:02:07] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [12:02:09] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [12:02:20] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [12:02:21] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [12:02:29] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [12:02:30] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-misc: apply [12:02:41] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-misc: apply [12:02:42] !log cgoubert@deploy1002 
helmfile [eqiad] START helmfile.d/services/mw-misc: apply [12:02:49] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-misc: apply [12:03:24] (03CR) 10Jgiannelos: "By matching everything under `/feed` we will also match this URL used by android apps:" [puppet] - 10https://gerrit.wikimedia.org/r/945558 (https://phabricator.wikimedia.org/T339119) (owner: 10Hnowlan) [12:04:37] (03PS1) 10Clément Goubert: mediawiki::logging: MaxMessageSize unit is lowercase [puppet] - 10https://gerrit.wikimedia.org/r/945589 (https://phabricator.wikimedia.org/T343390) [12:06:49] (03PS6) 10Slyngshede: sre.ganeti.reboot_vm: Allow users to reenable Puppet. [cookbooks] - 10https://gerrit.wikimedia.org/r/935687 (https://phabricator.wikimedia.org/T307792) [12:09:09] (03CR) 10CI reject: [V: 04-1] sre.ganeti.reboot_vm: Allow users to reenable Puppet. [cookbooks] - 10https://gerrit.wikimedia.org/r/935687 (https://phabricator.wikimedia.org/T307792) (owner: 10Slyngshede) [12:15:11] !log jnuche@deploy1002 Started deploy [releng/jenkins-deploy@54c0898] (releasing): (no justification provided) [12:15:35] (03PS7) 10Slyngshede: sre.ganeti.reboot_vm: Allow users to reenable Puppet. 
[cookbooks] - 10https://gerrit.wikimedia.org/r/935687 (https://phabricator.wikimedia.org/T307792) [12:15:53] !log jnuche@deploy1002 Finished deploy [releng/jenkins-deploy@54c0898] (releasing): (no justification provided) (duration: 00m 42s) [12:19:15] jouncebot: nowandnext [12:19:15] For the next 0 hour(s) and 40 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230803T1200) [12:19:15] In 0 hour(s) and 40 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230803T1300) [12:19:55] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy1002 using scap backport" [extensions/WikiLambda] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/944858 (https://phabricator.wikimedia.org/T343396) (owner: 10Jforrester) [12:20:46] James_F: see _security [12:23:19] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: MW-on-k8s traffic logs fewer errors than expected (increase in jsonTruncated) - https://phabricator.wikimedia.org/T343390 (10Clement_Goubert) The configuration option hasn't been taken up by the rsyslog containers in the pod, because it's a configmap ch... [12:23:45] (03Merged) 10jenkins-bot: WikiLambda: Add PHP code for Z2K5/'short descriptions' [extensions/WikiLambda] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/944858 (https://phabricator.wikimedia.org/T343396) (owner: 10Jforrester) [12:26:13] (03CR) 10Slyngshede: sre.ganeti.reboot_vm: Allow users to reenable Puppet. 
(031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/935687 (https://phabricator.wikimedia.org/T307792) (owner: 10Slyngshede) [12:31:18] !log jforrester@deploy1002 Started scap: Backport for [[gerrit:944858|WikiLambda: Add PHP code for Z2K5/'short descriptions' (T343396)]] [12:31:24] T343396: Implementation reordering deletes short descriptions - https://phabricator.wikimedia.org/T343396 [12:31:58] !log updated T343294 mitigations [12:31:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:00] !log jforrester@deploy1002 jforrester: Backport for [[gerrit:944858|WikiLambda: Add PHP code for Z2K5/'short descriptions' (T343396)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [12:34:32] !log jforrester@deploy1002 jforrester: Continuing with sync [12:34:40] (03PS1) 10Jgiannelos: rest-gateway: Add route for wikifeeds availability [deployment-charts] - 10https://gerrit.wikimedia.org/r/945591 (https://phabricator.wikimedia.org/T339119) [12:35:40] (03CR) 10Jgiannelos: "This endpoint is only enabled on `wikimedia.org` domain on restbase/deploy config. That said its very lightweight and domain agnostic. 
May" [deployment-charts] - 10https://gerrit.wikimedia.org/r/945591 (https://phabricator.wikimedia.org/T339119) (owner: 10Jgiannelos) [12:40:59] !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:944858|WikiLambda: Add PHP code for Z2K5/'short descriptions' (T343396)]] (duration: 09m 41s) [12:41:05] T343396: Implementation reordering deletes short descriptions - https://phabricator.wikimedia.org/T343396 [12:45:24] (03CR) 10Ayounsi: [C: 03+1] Add kubernetes102[5,6] to its k8s_neighbors list [homer/public] - 10https://gerrit.wikimedia.org/r/945547 (https://phabricator.wikimedia.org/T343306) (owner: 10Clément Goubert) [12:47:31] (03CR) 10Ayounsi: [C: 03+1] installserver: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/945552 (owner: 10Muehlenhoff) [12:48:10] 10SRE, 10Abstract Wikipedia team, 10Traffic, 10WikiLambda: Varnish/ATS are occasionally responding to Wikifunctions object page reads with a 404 even though `cache;desc="pass"` is set on normal requests - https://phabricator.wikimedia.org/T343440 (10Vgutierrez) ` vgutierrez@carrot:~$ for i in {1..100}; do... [12:50:51] (03CR) 10Ayounsi: [C: 03+1] "Probably a good time to update modules/ferm/types/port.pp to prevent someone from using the old notation." [puppet] - 10https://gerrit.wikimedia.org/r/945552 (owner: 10Muehlenhoff) [12:52:48] 10SRE, 10Abstract Wikipedia team, 10MW-on-K8s, 10Traffic, and 2 others: Varnish/ATS are occasionally responding to Wikifunctions object page reads with a 404 even though `cache;desc="pass"` is set on normal requests - https://phabricator.wikimedia.org/T343440 (10Clement_Goubert) a:03Clement_Goubert [12:54:41] 10SRE, 10Abstract Wikipedia team, 10MW-on-K8s, 10Traffic, and 2 others: Varnish/ATS are occasionally responding to Wikifunctions object page reads with a 404 even though `cache;desc="pass"` is set on normal requests - https://phabricator.wikimedia.org/T343440 (10Clement_Goubert) I'll prepare a patch to exc... 
[12:54:49] (03PS1) 10Kamila Součková: benthos: add wmf-certificates for Kafka [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/945595 (https://phabricator.wikimedia.org/T324200) [12:54:49] (03CR) 10Elukey: [C: 03+1] Release 1.1.0-3 [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/945548 (https://phabricator.wikimedia.org/T342154) (owner: 10Fabfur) [12:54:55] (03CR) 10Clément Goubert: [C: 03+2] mediawiki::logging: MaxMessageSize unit is lowercase [puppet] - 10https://gerrit.wikimedia.org/r/945589 (https://phabricator.wikimedia.org/T343390) (owner: 10Clément Goubert) [12:56:31] (03CR) 10Clément Goubert: [C: 03+1] benthos: add wmf-certificates for Kafka [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/945595 (https://phabricator.wikimedia.org/T324200) (owner: 10Kamila Součková) [12:56:58] (03CR) 10Kamila Součková: [C: 03+2] benthos: add wmf-certificates for Kafka [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/945595 (https://phabricator.wikimedia.org/T324200) (owner: 10Kamila Součková) [12:57:18] (03CR) 10Kamila Součková: [V: 03+2 C: 03+2] benthos: add wmf-certificates for Kafka [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/945595 (https://phabricator.wikimedia.org/T324200) (owner: 10Kamila Součková) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230803T1300) [13:00:05] aanzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:03:47] 10SRE, 10ops-codfw, 10DC-Ops, 10observability: Q1:rack/setup/install titan200[12] - https://phabricator.wikimedia.org/T342300 (10Papaul) [13:04:00] 10SRE, 10Abstract Wikipedia team, 10MW-on-K8s, 10Traffic, and 2 others: Varnish/ATS are occasionally responding to Wikifunctions object page reads with a 404 even though `cache;desc="pass"` is set on normal requests - https://phabricator.wikimedia.org/T343440 (10Jdforrester-WMF) p:05Triage→03High [13:05:21] (03CR) 10Jforrester: [C: 03+1] "LGTM. Do you want to deploy?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/945534 (https://phabricator.wikimedia.org/T297815) (owner: 10Giuseppe Lavagetto) [13:06:45] aanzx: around? [13:06:53] yes [13:08:11] where did you get the talk namespace translation? [13:08:50] it was above on work talk And other talk pages [13:09:37] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944857 (https://phabricator.wikimedia.org/T343410) (owner: 10Anzx) [13:10:18] (03Merged) 10jenkins-bot: pawikisource: create audiobook namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944857 (https://phabricator.wikimedia.org/T343410) (owner: 10Anzx) [13:10:43] !log taavi@deploy1002 Started scap: Backport for [[gerrit:944857|pawikisource: create audiobook namespace (T343410)]] [13:10:46] T343410: Requesting a new Namespace on pawikisource - Audiobook - https://phabricator.wikimedia.org/T343410 [13:11:42] (03PS1) 10Papaul: Add titan do site.pp and netboot [puppet] - 10https://gerrit.wikimedia.org/r/945598 (https://phabricator.wikimedia.org/T342300) [13:12:03] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: apply [13:12:06] (03CR) 10CI reject: [V: 04-1] Add titan do site.pp and netboot [puppet] - 10https://gerrit.wikimedia.org/r/945598 (https://phabricator.wikimedia.org/T342300) (owner: 10Papaul) [13:12:22] !log hnowlan@deploy1002 helmfile [staging] DONE 
helmfile.d/services/api-gateway: apply [13:12:27] !log taavi@deploy1002 taavi and anzx: Backport for [[gerrit:944857|pawikisource: create audiobook namespace (T343410)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:12:34] testing [13:12:58] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['titan2001'] [13:13:24] 10SRE, 10Abstract Wikipedia team, 10MW-on-K8s, 10Traffic, and 2 others: mw-on-k8s responds 404 for Wikifunctions view pages - https://phabricator.wikimedia.org/T343440 (10Clement_Goubert) [13:13:54] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['titan2002'] [13:16:45] (03PS2) 10Papaul: Add titan to site.pp and netboot [puppet] - 10https://gerrit.wikimedia.org/r/945598 (https://phabricator.wikimedia.org/T342300) [13:17:07] (03CR) 10CI reject: [V: 04-1] Add titan to site.pp and netboot [puppet] - 10https://gerrit.wikimedia.org/r/945598 (https://phabricator.wikimedia.org/T342300) (owner: 10Papaul) [13:17:38] taavi: looks good [13:17:50] !log taavi@deploy1002 taavi and anzx: Continuing with sync [13:18:03] logstash looks good too, syncing. 
will do a namespaceDupes run afterwards [13:18:40] (03PS1) 10Clément Goubert: wikifunctions: Add view rewrite rule for mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/945600 (https://phabricator.wikimedia.org/T343440) [13:19:12] 10SRE, 10Abstract Wikipedia team, 10MW-on-K8s, 10Traffic, and 3 others: mw-on-k8s responds 404 for Wikifunctions view pages - https://phabricator.wikimedia.org/T343440 (10Clement_Goubert) It's a missing rewrite rule in mediawiki::sites [13:19:38] (03PS3) 10Papaul: Add titan to site.pp and netboot [puppet] - 10https://gerrit.wikimedia.org/r/945598 (https://phabricator.wikimedia.org/T342300) [13:20:52] (03CR) 10Papaul: [C: 03+2] Add titan to site.pp and netboot [puppet] - 10https://gerrit.wikimedia.org/r/945598 (https://phabricator.wikimedia.org/T342300) (owner: 10Papaul) [13:21:33] (03PS2) 10Clément Goubert: wikifunctions: Add view rewrite rule for mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/945600 (https://phabricator.wikimedia.org/T343440) [13:21:38] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - No response from remote host 91.198.174.244 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:23:12] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['titan2001'] [13:23:41] (03CR) 10Bking: [C: 03+2] flink-zk: Enable prometheus scrapes [puppet] - 10https://gerrit.wikimedia.org/r/942494 (https://phabricator.wikimedia.org/T341792) (owner: 10Bking) [13:23:45] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:944857|pawikisource: create audiobook namespace (T343410)]] (duration: 13m 01s) [13:23:48] T343410: Requesting a new Namespace on pawikisource - Audiobook - https://phabricator.wikimedia.org/T343410 [13:25:00] PROBLEM - BGP status on pfw3-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:25:20] (03PS4) 10Func: Revert "Adding 
Movepage-summary to wgForceUIMsgAsContentMsg to allow" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941424 (https://phabricator.wikimedia.org/T183848) [13:26:24] !log taavi@mwmaint1002 ~ $ mwscript namespaceDupes.php pawikisource --fix --add-prefix "BROKEN " # T343410 [13:26:26] RECOVERY - BGP status on pfw3-codfw is OK: BGP OK - up: 5, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:26:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:13] thanks [13:28:40] (03CR) 10Ssingh: "I think we should not override the CI failure. My reason for thinking so:" [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/945548 (https://phabricator.wikimedia.org/T342154) (owner: 10Fabfur) [13:30:36] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['titan2002'] [13:30:59] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host titan2001.codfw.wmnet with OS bookworm [13:31:05] 10SRE, 10ops-codfw, 10DC-Ops, 10observability: Q1:rack/setup/install titan200[12] - https://phabricator.wikimedia.org/T342300 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host titan2001.codfw.wmnet with OS bookworm [13:31:29] (03PS3) 10Clément Goubert: wikifunctions: Add view rewrite rule for mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/945600 (https://phabricator.wikimedia.org/T343440) [13:33:36] (03CR) 10CI reject: [V: 04-1] wikifunctions: Add view rewrite rule for mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/945600 (https://phabricator.wikimedia.org/T343440) (owner: 10Clément Goubert) [13:33:45] (03CR) 10Volans: sre.ganeti.reboot_vm: Allow users to reenable Puppet. 
(032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/935687 (https://phabricator.wikimedia.org/T307792) (owner: 10Slyngshede) [13:33:56] (03PS4) 10Clément Goubert: wikifunctions: Add view rewrite rule for mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/945600 (https://phabricator.wikimedia.org/T343440) [13:35:04] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['titan2001'] [13:40:27] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['titan2001'] [13:43:10] (03CR) 10Fabfur: Release 1.1.0-3 (031 comment) [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/945548 (https://phabricator.wikimedia.org/T342154) (owner: 10Fabfur) [13:43:54] (03PS1) 10Jforrester: [Wikifunctions] Allow logged-in users to make function calls again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/945603 [13:45:50] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['titan2002'] [13:45:58] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['titan2002'] [13:46:21] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['titan2002'] [13:46:32] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['titan2002'] [13:47:57] taavi: You done with deploy? [13:48:25] James_F: yes [13:48:29] Ace. 
[13:48:39] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/945603 (owner: 10Jforrester) [13:49:23] (03Merged) 10jenkins-bot: [Wikifunctions] Allow logged-in users to make function calls again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/945603 (owner: 10Jforrester) [13:49:49] !log jforrester@deploy1002 Started scap: Backport for [[gerrit:945603|[Wikifunctions] Allow logged-in users to make function calls again]] [13:51:30] !log jforrester@deploy1002 jforrester: Backport for [[gerrit:945603|[Wikifunctions] Allow logged-in users to make function calls again]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:51:45] !log jforrester@deploy1002 jforrester: Continuing with sync [13:52:51] 10SRE, 10ops-codfw, 10DBA: codfw: es2025 lost System Board Fan6 - https://phabricator.wikimedia.org/T343254 (10Jhancock.wm) @RobH The fan plugs directly into the main board. I did one more fan swap to be sure. repeatable results etc. It will probably fail by lunch and then we can be sure. 
[13:58:13] !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:945603|[Wikifunctions] Allow logged-in users to make function calls again]] (duration: 08m 24s) [14:00:04] RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:00:36] RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:02:30] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on titan2001.codfw.wmnet with reason: host reimage [14:05:44] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on titan2001.codfw.wmnet with reason: host reimage [14:06:33] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:07:25] (03CR) 10Hnowlan: [C: 03+2] rest-gateway: Add route for wikifeeds availability [deployment-charts] - 10https://gerrit.wikimedia.org/r/945591 (https://phabricator.wikimedia.org/T339119) (owner: 10Jgiannelos) [14:08:16] (03Merged) 10jenkins-bot: rest-gateway: Add route for wikifeeds availability [deployment-charts] - 10https://gerrit.wikimedia.org/r/945591 (https://phabricator.wikimedia.org/T339119) (owner: 10Jgiannelos) [14:11:26] (03PS1) 10Kamila Součková: benthos-cache-invalidator: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/945606 (https://phabricator.wikimedia.org/T324200) [14:11:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:13:04] (03PS1) 10Hnowlan: api-gateway: allow non-/metrics paths in AQS case [deployment-charts] - 10https://gerrit.wikimedia.org/r/945607 (https://phabricator.wikimedia.org/T342213) [14:14:28] (03CR) 10Giuseppe Lavagetto: [C: 03+1] benthos-cache-invalidator: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/945606 (https://phabricator.wikimedia.org/T324200) (owner: 10Kamila Součková) [14:15:17] (03CR) 10Kamila Součková: [C: 03+2] benthos-cache-invalidator: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/945606 (https://phabricator.wikimedia.org/T324200) (owner: 10Kamila Součková) [14:16:00] (03Merged) 10jenkins-bot: benthos-cache-invalidator: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/945606 (https://phabricator.wikimedia.org/T324200) (owner: 10Kamila Součková) [14:16:33] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:17:57] PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:58] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [14:21:08] (03PS1) 10Stevemunene: Prevent removal of py2 on bullseye hadoop client and worker [puppet] - 10https://gerrit.wikimedia.org/r/945608 (https://phabricator.wikimedia.org/T332570) [14:21:16] !log kamila@deploy1002 helmfile [staging] START helmfile.d/services/benthos-cache-invalidator: apply [14:21:29] !log kamila@deploy1002 helmfile [staging] DONE 
helmfile.d/services/benthos-cache-invalidator: apply [14:22:16] !log kamila@deploy1002 helmfile [staging] START helmfile.d/services/benthos-cache-invalidator: apply [14:22:39] !log kamila@deploy1002 helmfile [staging] DONE helmfile.d/services/benthos-cache-invalidator: apply [14:22:48] 10SRE, 10LDAP-Access-Requests: Grant Access to Superset for Mpossoupe - https://phabricator.wikimedia.org/T343432 (10fgiunchedi) @Mpossoupe I take it you can't access superset dashboards with private (PII) data ? In that case we'll need to add you to `analytics-privatedata-users`, I'll send reviews to that effect [14:25:14] (03PS1) 10Kamila Součková: Revert "benthos: temporarily disable readiness probe" [deployment-charts] - 10https://gerrit.wikimedia.org/r/944861 [14:27:28] (03PS2) 10Kamila Součková: Revert "benthos: temporarily disable readiness probe" [deployment-charts] - 10https://gerrit.wikimedia.org/r/944861 [14:27:32] (03PS3) 10Fabfur: Release 1.1.0-3 [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/945548 (https://phabricator.wikimedia.org/T342154) [14:27:36] (03CR) 10CI reject: [V: 04-1] Revert "benthos: temporarily disable readiness probe" [deployment-charts] - 10https://gerrit.wikimedia.org/r/944861 (owner: 10Kamila Součková) [14:28:25] (03PS1) 10Filippo Giunchedi: admin: add mpossoupe to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/945610 (https://phabricator.wikimedia.org/T343432) [14:28:57] (03CR) 10Btullis: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/945608 (https://phabricator.wikimedia.org/T332570) (owner: 10Stevemunene) [14:29:55] (03PS3) 10Kamila Součková: Revert "benthos: temporarily disable readiness probe" [deployment-charts] - 10https://gerrit.wikimedia.org/r/944861 [14:30:43] RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:52] (03CR) 10Kamila Součková: [C: 03+2] 
Revert "benthos: temporarily disable readiness probe" [deployment-charts] - 10https://gerrit.wikimedia.org/r/944861 (owner: 10Kamila Součková) [14:31:35] (03Merged) 10jenkins-bot: Revert "benthos: temporarily disable readiness probe" [deployment-charts] - 10https://gerrit.wikimedia.org/r/944861 (owner: 10Kamila Součková) [14:34:55] (03CR) 10Ssingh: "LGTM! One minor not." [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/945548 (https://phabricator.wikimedia.org/T342154) (owner: 10Fabfur) [14:35:09] (03CR) 10Ssingh: Release 1.1.0-3 (031 comment) [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/945548 (https://phabricator.wikimedia.org/T342154) (owner: 10Fabfur) [14:36:48] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/945610 (https://phabricator.wikimedia.org/T343432) (owner: 10Filippo Giunchedi) [14:38:36] (03CR) 10Muehlenhoff: admin: add maryana to analytics-privatedata-users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/945546 (https://phabricator.wikimedia.org/T342797) (owner: 10Filippo Giunchedi) [14:41:19] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to Superset for Mpossoupe - https://phabricator.wikimedia.org/T343432 (10Mpossoupe) Thanks @fgiunchedi [14:42:29] (03PS4) 10Fabfur: Release 1.1.0-3 [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/945548 (https://phabricator.wikimedia.org/T342154) [14:42:58] (03CR) 10Fabfur: Release 1.1.0-3 (031 comment) [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/945548 (https://phabricator.wikimedia.org/T342154) (owner: 10Fabfur) [14:43:34] (03CR) 10Ssingh: [C: 03+1] "LGTM! Nice work!" 
[software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/945548 (https://phabricator.wikimedia.org/T342154) (owner: 10Fabfur) [14:45:50] (03CR) 10Fabfur: [C: 03+2] Release 1.1.0-3 [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/945548 (https://phabricator.wikimedia.org/T342154) (owner: 10Fabfur) [14:45:55] (03PS2) 10Hnowlan: rest-gateway: move knowledge-gap endpoint from api-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/945607 (https://phabricator.wikimedia.org/T342213) [14:46:21] (03PS3) 10Hnowlan: rest-gateway: move knowledge-gap endpoint from api-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/945607 (https://phabricator.wikimedia.org/T342213) [14:46:45] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [14:46:46] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host titan2001.codfw.wmnet with OS bookworm [14:46:52] 10SRE, 10ops-codfw, 10DC-Ops, 10observability: Q1:rack/setup/install titan200[12] - https://phabricator.wikimedia.org/T342300 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host titan2001.codfw.wmnet with OS bookworm completed: - titan2001 (**WARN**) - Removed... 
[14:47:20] (03CR) 10Muehlenhoff: installserver: Avoid Ferm-specific syntax (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/945552 (owner: 10Muehlenhoff) [14:47:22] (03CR) 10Muehlenhoff: [C: 03+2] installserver: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/945552 (owner: 10Muehlenhoff) [14:49:16] (03PS1) 10Jelto: gitlab: enable ldap group sync on active GitLab server [puppet] - 10https://gerrit.wikimedia.org/r/945612 (https://phabricator.wikimedia.org/T319211) [14:50:46] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42773/console" [puppet] - 10https://gerrit.wikimedia.org/r/945612 (https://phabricator.wikimedia.org/T319211) (owner: 10Jelto) [14:51:38] (03PS2) 10Muehlenhoff: Kerberos: Pass firewall settings in tool-agnostic form [puppet] - 10https://gerrit.wikimedia.org/r/931633 [14:54:35] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert) [14:55:17] 10SRE, 10MW-on-K8s, 10serviceops: MW-on-k8s traffic logs fewer errors than expected (increase in jsonTruncated) - https://phabricator.wikimedia.org/T343390 (10Clement_Goubert) 05In progress→03Resolved The patch has been live for a few hours, and jsontruncated messages from mw-on-k8s are now on the same b... 
[14:55:59] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/931633 (owner: 10Muehlenhoff) [15:00:18] (03CR) 10Clément Goubert: [C: 03+2] Add kubernetes102[5,6] to its k8s_neighbors list [homer/public] - 10https://gerrit.wikimedia.org/r/945547 (https://phabricator.wikimedia.org/T343306) (owner: 10Clément Goubert) [15:00:53] (03Merged) 10jenkins-bot: Add kubernetes102[5,6] to its k8s_neighbors list [homer/public] - 10https://gerrit.wikimedia.org/r/945547 (https://phabricator.wikimedia.org/T343306) (owner: 10Clément Goubert) [15:01:38] (03CR) 10Jforrester: [C: 03+1] wikifunctions: Add view rewrite rule for mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/945600 (https://phabricator.wikimedia.org/T343440) (owner: 10Clément Goubert) [15:02:17] !log Run homer on lsw1-f3-eqiad for kubernetes102[5-6] imaging - T343306 [15:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:20] T343306: Wikikube CPU capacity issue - https://phabricator.wikimedia.org/T343306 [15:02:51] (03PS4) 10Hnowlan: rest-gateway: move knowledge-gap endpoint from api-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/945607 (https://phabricator.wikimedia.org/T342213) [15:03:50] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install lists2001.codfw.wmnet - https://phabricator.wikimedia.org/T342375 (10Marostegui) @papaul RAID1 should be good [15:05:24] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1025.eqiad.wmnet with OS bullseye [15:06:55] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install lists2001.codfw.wmnet - https://phabricator.wikimedia.org/T342375 (10MoritzMuehlenhoff) Can you install the server with Bookworm, please? The current test setup on a VM (lists1003) is also on Bookworm already. 
[15:07:08] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['titan2002'] [15:07:23] PROBLEM - BGP status on lsw1-f3-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:07:47] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['titan2002'] [15:07:56] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/945013 [15:08:22] (03CR) 10Jgiannelos: [C: 03+2] wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/945013 (owner: 10PipelineBot) [15:08:27] (03CR) 10Muehlenhoff: [C: 03+2] Kerberos: Pass firewall settings in tool-agnostic form [puppet] - 10https://gerrit.wikimedia.org/r/931633 (owner: 10Muehlenhoff) [15:08:35] (03CR) 10Clément Goubert: [C: 03+2] wikifunctions: Add view rewrite rule for mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/945600 (https://phabricator.wikimedia.org/T343440) (owner: 10Clément Goubert) [15:09:02] (03Merged) 10jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/945013 (owner: 10PipelineBot) [15:09:45] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [15:09:49] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [15:10:05] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [15:10:22] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [15:10:29] !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [15:11:00] !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: 
apply [15:11:06] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [15:11:36] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [15:11:52] (03CR) 10Filippo Giunchedi: [C: 03+2] admin: add mpossoupe to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/945610 (https://phabricator.wikimedia.org/T343432) (owner: 10Filippo Giunchedi) [15:12:21] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [15:13:11] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [15:13:12] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [15:13:50] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [15:13:51] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [15:13:58] 10SRE, 10SRE-Access-Requests: Requesting access to Wiki Replicas end-to-end tiers for dr0ptp4kt - https://phabricator.wikimedia.org/T343039 (10bd808) >>! In T343039#9060092, @dr0ptp4kt wrote: > As far as commands, generally the ones listed in https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Wiki_... 
[15:15:25] (03PS2) 10Filippo Giunchedi: admin: add maryana to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/945546 (https://phabricator.wikimedia.org/T342797) [15:15:38] (03CR) 10Filippo Giunchedi: admin: add maryana to analytics-privatedata-users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/945546 (https://phabricator.wikimedia.org/T342797) (owner: 10Filippo Giunchedi) [15:19:04] (03PS3) 10Muehlenhoff: aptrepo: Pass ports without Ferm-specific service identifiers [puppet] - 10https://gerrit.wikimedia.org/r/944211 [15:19:17] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [15:19:19] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [15:20:10] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [15:20:11] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [15:20:36] !log installing glibc security updates on bookworm [15:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:56] 10SRE, 10Abstract Wikipedia team, 10MW-on-K8s, 10Traffic, and 2 others: mw-on-k8s responds 404 for Wikifunctions view pages - https://phabricator.wikimedia.org/T343440 (10Clement_Goubert) ` cgoubert@deploy2002:~$ curl -s --insecure -v -H "Host: www.wikifunctions.org" https://mwdebug.discovery.wmnet:4444/vi... 
[15:21:21] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [15:21:22] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [15:22:11] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/944211 (owner: 10Muehlenhoff) [15:22:11] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [15:22:12] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [15:22:44] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [15:22:45] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [15:23:54] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [15:23:55] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-misc: apply [15:23:59] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-misc: apply [15:24:00] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-misc: apply [15:24:03] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-misc: apply [15:24:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST replicasets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:25:58] (03CR) 10Muehlenhoff: [C: 03+2] aptrepo: Pass ports without Ferm-specific service identifiers [puppet] - 10https://gerrit.wikimedia.org/r/944211 (owner: 10Muehlenhoff) [15:27:53] 10SRE, 10Abstract Wikipedia team, 10MW-on-K8s, 10Traffic, and 2 others: mw-on-k8s responds 404 for Wikifunctions view pages - https://phabricator.wikimedia.org/T343440 (10Clement_Goubert) 05Open→03Resolved ` ❯ for i in {1..100}; do curl -s -v https://www.wikifunctions.org/view/en/Z10000 -o /dev/null 2>... 
[15:29:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST replicasets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:30:02] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['titan2002'] [15:30:45] 10SRE, 10Abstract Wikipedia team, 10MW-on-K8s, 10Traffic, and 2 others: mw-on-k8s responds 404 for Wikifunctions view pages - https://phabricator.wikimedia.org/T343440 (10Jdforrester-WMF) Thank you! [15:32:25] 10SRE, 10ops-codfw, 10DC-Ops, 10observability: Q1:rack/setup/install titan200[12] - https://phabricator.wikimedia.org/T342300 (10Papaul) [15:33:30] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install lists2001.codfw.wmnet - https://phabricator.wikimedia.org/T342375 (10Papaul) @Marostegui thanks , @MoritzMuehlenhoff yes I can. [15:34:17] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install lists2001.codfw.wmnet - https://phabricator.wikimedia.org/T342375 (10MoritzMuehlenhoff) [15:39:01] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['titan2002'] [15:40:16] !log imported `varnishkafka` package in bookworm-wikimedia (T342154) [15:40:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:19] T342154: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 [15:40:49] 10SRE, 10Traffic: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10Fabfur) [15:42:19] 10SRE, 10ops-codfw, 10DBA: codfw: es2025 lost System Board Fan6 - https://phabricator.wikimedia.org/T343254 (10Jhancock.wm) error is on fan2 now [15:42:42] 10SRE, 10Infrastructure-Foundations: Integrate Bookworm 12.1 point update - https://phabricator.wikimedia.org/T343121 (10MoritzMuehlenhoff) [15:43:04] 
!log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host titan2002.codfw.wmnet with OS bookworm [15:43:09] 10SRE, 10ops-codfw, 10DC-Ops, 10observability: Q1:rack/setup/install titan200[12] - https://phabricator.wikimedia.org/T342300 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host titan2002.codfw.wmnet with OS bookworm [15:47:58] !log installing pandoc security updates [15:47:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:05] jbond and rzl: (Dis)respected human, time to deploy Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230803T1600). Please do the needful. [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:01:39] !log kamila@deploy1002 helmfile [staging] START helmfile.d/services/benthos-cache-invalidator: apply [16:02:21] !log kamila@deploy1002 helmfile [staging] DONE helmfile.d/services/benthos-cache-invalidator: apply [16:02:46] (03PS1) 10Jforrester: Fix unsafe validator to not reach into undefined keys [extensions/WikiLambda] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/944862 (https://phabricator.wikimedia.org/T343393) [16:04:44] kamila_: \o/ benthos on k8s! [16:04:46] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on titan2002.codfw.wmnet with reason: host reimage [16:05:01] CC: godog: --^ [16:05:43] woot woot, good times! [16:06:19] godog: titan2001 is ready working on 2002 [16:06:36] papaul: amazing, thank you <3 [16:06:55] elukey: definitely getting there, but now I may have broken something '^^ [16:07:01] godog: np [16:07:56] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on titan2002.codfw.wmnet with reason: host reimage [16:08:04] kamila_: standard procedure with new k8s stuff, keep going! [16:11:42] I've got (another!) high-priority back-port for Wikifunctions. OK to go? 
[16:13:08] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy1002 using scap backport" [extensions/WikiLambda] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/944862 (https://phabricator.wikimedia.org/T343393) (owner: 10Jforrester) [16:13:14] Taking silence as assent. [16:13:32] !log cgoubert@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Rename kubernetes10[25-26] - cgoubert@cumin1001 - T343306" [16:13:35] T343306: Wikikube CPU capacity issue - https://phabricator.wikimedia.org/T343306 [16:14:28] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Rename kubernetes10[25-26] - cgoubert@cumin1001 - T343306" [16:15:31] elukey: actually, I think it's working, I was just a bit too conservative in order to not fill up kafka with test stuff, so the output topic isn't getting a lot of messages, but it is getting some :D so yeah, it appears it works [16:15:43] seee! nice :) [16:15:51] _joe_, hnowlan: so, benthos works. what now? :D [16:15:59] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1025.eqiad.wmnet with reason: host reimage [16:16:56] (03Merged) 10jenkins-bot: Fix unsafe validator to not reach into undefined keys [extensions/WikiLambda] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/944862 (https://phabricator.wikimedia.org/T343393) (owner: 10Jforrester) [16:17:00] kamila_: Now make it do something useful *handwaves* [16:17:03] kamila_: completely replace restbase purge behaviour, shouldn't take very long [16:17:08] ;P [16:17:11] x) [16:17:15] yep :D [16:17:24] great news though! 
[16:17:26] !log jforrester@deploy1002 Started scap: Backport for [[gerrit:944862|Fix unsafe validator to not reach into undefined keys (T343393)]] [16:17:29] T343393: Trying to publish a composition fails, throwing `Cannot read properties of undefined (reading 'Z9K1')` JS error - https://phabricator.wikimedia.org/T343393 [16:17:40] agreed with h.nowlan, good job :D [16:18:39] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [16:18:39] thanks <3 [16:18:54] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [16:19:02] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1025.eqiad.wmnet with reason: host reimage [16:19:03] !log jforrester@deploy1002 jforrester: Backport for [[gerrit:944862|Fix unsafe validator to not reach into undefined keys (T343393)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [16:22:15] !log jforrester@deploy1002 jforrester: Continuing with sync [16:24:24] 10SRE, 10ops-codfw, 10DBA: codfw: es2025 lost System Board Fan6 - https://phabricator.wikimedia.org/T343254 (10RobH) >>! In T343254#9063837, @Papaul wrote: > @robh you right it is not the fan that has the issue it is the board i will have @Jhancock.wm check if it is the daughter board or if it is the mainboa... 
[16:26:50] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [16:28:24] !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:944862|Fix unsafe validator to not reach into undefined keys (T343393)]] (duration: 10m 57s) [16:28:28] T343393: Trying to publish a composition fails, throwing `Cannot read properties of undefined (reading 'Z9K1')` JS error - https://phabricator.wikimedia.org/T343393 [16:40:14] !log cgoubert@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - cgoubert@cumin1001" [16:54:35] (03PS5) 10Hnowlan: rest-gateway: move knowledge-gap endpoint from api-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/945607 (https://phabricator.wikimedia.org/T342213) [16:56:48] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - cgoubert@cumin1001" [16:56:53] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1025.eqiad.wmnet with OS bullseye [16:59:16] 10SRE, 10ops-codfw, 10DBA: codfw: es2025 lost System Board Fan6 - https://phabricator.wikimedia.org/T343254 (10Jhancock.wm) @RobH Yes, let's order the fan. Thank you! [16:59:52] papaul: fyi my end-of-reimage netbox cookbook run committed your changes for titan2002 [17:00:06] bd808: My dear minions, it's time we take the moon! Just kidding. Time for Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230803T1700). 
[17:00:06] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230803T1700) [17:00:31] * bd808 looks to see if anything needs to roll out today [17:00:34] (03CR) 10Herron: [C: 03+2] pyrra: add pyrra::(api|filesystem) modules [puppet] - 10https://gerrit.wikimedia.org/r/929719 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [17:02:17] (03PS3) 10BCornwall: init: Optimize puppet disabling on reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/943620 (https://phabricator.wikimedia.org/T342182) [17:02:28] (03CR) 10BCornwall: init: Optimize puppet disabling on reboot (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/943620 (https://phabricator.wikimedia.org/T342182) (owner: 10BCornwall) [17:03:13] (03PS1) 10BryanDavis: developer-portal: Bump container to 2023-07-31-112756-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/945618 [17:04:17] (03CR) 10Herron: [C: 03+2] profile::pyrra::api: create profile [puppet] - 10https://gerrit.wikimedia.org/r/929729 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [17:06:00] (03CR) 10Herron: [C: 03+2] profile::pyrra::filesystem: add profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/929731 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [17:06:26] (03CR) 10BryanDavis: [C: 03+2] developer-portal: Bump container to 2023-07-31-112756-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/945618 (owner: 10BryanDavis) [17:07:17] (03Merged) 10jenkins-bot: developer-portal: Bump container to 2023-07-31-112756-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/945618 (owner: 10BryanDavis) [17:07:51] (03CR) 10Ahmon Dancy: [C: 03+1] gitlab: enable ldap group sync on active GitLab server [puppet] - 10https://gerrit.wikimedia.org/r/945612 (https://phabricator.wikimedia.org/T319211) (owner: 10Jelto) [17:07:56] 10SRE, 10ops-codfw, 10DBA: codfw: es2025 lost System Board Fan6 - 
https://phabricator.wikimedia.org/T343254 (10RobH) [17:09:28] 10SRE, 10ops-codfw, 10DBA: codfw: es2025 lost System Board Fan6 - https://phabricator.wikimedia.org/T343254 (10RobH) >>! In T343254#9067280, @Jhancock.wm wrote: > @RobH Yes, let's order the fan. Thank you! Understood, filed T343477 for the order and requested a quote. Order progress can be tracked on that... [17:11:50] !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/developer-portal: apply [17:12:06] !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [17:12:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp1[098-113] - https://phabricator.wikimedia.org/T342159 (10ssingh) >>! In T342159#9025176, @RobH wrote: > Please note parent task 341588 has the range of cp1[090-105] however, cp1090 is already live/in use. Additionally, we have 4 cp hosts fr... [17:14:46] !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [17:17:06] !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [17:17:14] !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [17:17:37] !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [17:18:00] 10SRE, 10ops-codfw, 10DBA: codfw: es2025 lost System Board Fan6 - https://phabricator.wikimedia.org/T343254 (10Marostegui) Thank you all so much <3 [17:19:54] (03CR) 10Herron: [C: 03+2] pyrra: deploy to thanos-fe hosts [puppet] - 10https://gerrit.wikimedia.org/r/929734 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [17:27:51] (03PS1) 10Herron: pyrra: fix typo in apache config [puppet] - 10https://gerrit.wikimedia.org/r/945620 (https://phabricator.wikimedia.org/T302995) [17:31:35] (03CR) 10Herron: [C: 03+2] pyrra: fix typo in apache config [puppet] - 10https://gerrit.wikimedia.org/r/945620 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [17:31:43] (03PS1) 
10Ahmon Dancy: mediawiki::wanrouter_cache: add wikifunctions placeholder [puppet] - 10https://gerrit.wikimedia.org/r/945621 (https://phabricator.wikimedia.org/T297815) [17:32:08] (03CR) 10CI reject: [V: 04-1] mediawiki::wanrouter_cache: add wikifunctions placeholder [puppet] - 10https://gerrit.wikimedia.org/r/945621 (https://phabricator.wikimedia.org/T297815) (owner: 10Ahmon Dancy) [17:32:46] (03PS1) 10Ahmon Dancy: mediawiki::wanrouter_cache: add wikifunctions placeholder [puppet] - 10https://gerrit.wikimedia.org/r/945622 (https://phabricator.wikimedia.org/T297815) [17:33:11] (03CR) 10CI reject: [V: 04-1] mediawiki::wanrouter_cache: add wikifunctions placeholder [puppet] - 10https://gerrit.wikimedia.org/r/945622 (https://phabricator.wikimedia.org/T297815) (owner: 10Ahmon Dancy) [17:33:44] (03Abandoned) 10Ahmon Dancy: mediawiki::wanrouter_cache: add wikifunctions placeholder [puppet] - 10https://gerrit.wikimedia.org/r/945622 (https://phabricator.wikimedia.org/T297815) (owner: 10Ahmon Dancy) [17:34:05] (03PS1) 10Jforrester: Add 'wikilambda-edit-object-description' to granular authorization rules [extensions/WikiLambda] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/944864 (https://phabricator.wikimedia.org/T343400) [17:34:23] (03PS2) 10Ahmon Dancy: mediawiki::wanrouter_cache: add wikifunctions placeholder [puppet] - 10https://gerrit.wikimedia.org/r/945621 (https://phabricator.wikimedia.org/T297815) [17:34:46] (03CR) 10CI reject: [V: 04-1] mediawiki::wanrouter_cache: add wikifunctions placeholder [puppet] - 10https://gerrit.wikimedia.org/r/945621 (https://phabricator.wikimedia.org/T297815) (owner: 10Ahmon Dancy) [17:35:24] claime: yes please do thanks [17:35:34] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [17:35:35] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
titan2002.codfw.wmnet with OS bookworm [17:35:40] 10SRE, 10ops-codfw, 10DC-Ops, 10observability: Q1:rack/setup/install titan200[12] - https://phabricator.wikimedia.org/T342300 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host titan2002.codfw.wmnet with OS bookworm completed: - titan2002 (**WARN**) - Removed... [17:37:29] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: pyrra-filesystem.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:38:11] (03PS1) 10Herron: pyrra: remove apache rewrite config [puppet] - 10https://gerrit.wikimedia.org/r/945623 (https://phabricator.wikimedia.org/T302995) [17:38:45] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db21[88-95] - https://phabricator.wikimedia.org/T342174 (10Jhancock.wm) [17:39:42] (03PS1) 10Jforrester: Wikifunctions: Allow logged-in users to edit object labels, aliases, and descriptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/945624 (https://phabricator.wikimedia.org/T343400) [17:41:53] (03CR) 10Herron: [C: 03+2] pyrra: remove apache rewrite config [puppet] - 10https://gerrit.wikimedia.org/r/945623 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [17:45:53] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 82, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:49:13] (03PS1) 10Herron: pyrra-filesystem: update prometheus folder arg name [puppet] - 10https://gerrit.wikimedia.org/r/945625 (https://phabricator.wikimedia.org/T302995) [17:50:46] (03CR) 10Herron: [C: 03+2] pyrra-filesystem: update prometheus folder arg name [puppet] - 10https://gerrit.wikimedia.org/r/945625 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [17:55:21] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The 
system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:59:18] (03PS1) 10Herron: pyrra-filesystem: ensure config directory [puppet] - 10https://gerrit.wikimedia.org/r/945627 (https://phabricator.wikimedia.org/T302995) [18:00:06] dancy and jnuche: That opportune time is upon us again. Time for a MediaWiki train - Utc-7+Utc-0 Version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230803T1800). [18:03:49] (03CR) 10Herron: [C: 03+2] pyrra-filesystem: ensure config directory [puppet] - 10https://gerrit.wikimedia.org/r/945627 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [18:09:31] Train is blocked at the moment [18:13:09] (03PS1) 10Herron: pyrra-filesystem: add package require for file resource [puppet] - 10https://gerrit.wikimedia.org/r/945630 (https://phabricator.wikimedia.org/T302995) [18:15:41] (03CR) 10Herron: [C: 03+2] pyrra-filesystem: add package require for file resource [puppet] - 10https://gerrit.wikimedia.org/r/945630 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [18:21:36] (03PS3) 10Ahmon Dancy: mediawiki::wanrouter_cache: add wikifunctions placeholder [puppet] - 10https://gerrit.wikimedia.org/r/945621 (https://phabricator.wikimedia.org/T297815) [18:22:31] (03CR) 10Ahmon Dancy: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/945621 (https://phabricator.wikimedia.org/T297815) (owner: 10Ahmon Dancy) [18:25:07] (03PS1) 10Ssingh: Release 0.9.1-1+wmf12u1 [debs/python-anycast-healthchecker] - 10https://gerrit.wikimedia.org/r/945633 (https://phabricator.wikimedia.org/T342154) [18:28:58] (03PS2) 10Ssingh: Release 0.9.1-1+wmf12u1 [debs/python-anycast-healthchecker] - 10https://gerrit.wikimedia.org/r/945633 (https://phabricator.wikimedia.org/T342154) [18:32:43] (03CR) 10Jforrester: [C: 03+1] "Oops." 
[puppet] - 10https://gerrit.wikimedia.org/r/945621 (https://phabricator.wikimedia.org/T297815) (owner: 10Ahmon Dancy) [18:33:49] (03PS3) 10Ssingh: Release 0.9.1-1+wmf12u1 [debs/python-anycast-healthchecker] - 10https://gerrit.wikimedia.org/r/945633 (https://phabricator.wikimedia.org/T342154) [18:35:51] (03CR) 10Ssingh: "Ready for review." [debs/python-anycast-healthchecker] - 10https://gerrit.wikimedia.org/r/945633 (https://phabricator.wikimedia.org/T342154) (owner: 10Ssingh) [18:38:43] (03CR) 10Ssingh: "Note: This commit includes the previous builds as we never put them in git, which is why you will see the changelog from the previous two " [debs/python-anycast-healthchecker] - 10https://gerrit.wikimedia.org/r/945633 (https://phabricator.wikimedia.org/T342154) (owner: 10Ssingh) [18:45:15] 10SRE, 10ops-codfw, 10DC-Ops, 10observability: Q1:rack/setup/install titan200[12] - https://phabricator.wikimedia.org/T342300 (10Papaul) [18:46:12] 10SRE, 10ops-codfw, 10DC-Ops, 10observability: Q1:rack/setup/install titan200[12] - https://phabricator.wikimedia.org/T342300 (10Papaul) 05Open→03Resolved @fgiunchedi this is complete [18:50:08] (03PS1) 10Jdlrobson: Fix mobile search text overlapping [extensions/MobileFrontend] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/945646 (https://phabricator.wikimedia.org/T343397) [19:00:41] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by dancy@deploy1002 using scap backport" [extensions/MobileFrontend] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/945646 (https://phabricator.wikimedia.org/T343397) (owner: 10Jdlrobson) [19:02:43] (03PS1) 10Papaul: Add lists2001 to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/945636 (https://phabricator.wikimedia.org/T342375) [19:04:09] (03CR) 10Papaul: [C: 03+2] Add lists2001 to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/945636 (https://phabricator.wikimedia.org/T342375) (owner: 10Papaul) [19:11:55] !log pt1979@cumin2002 START - Cookbook 
sre.hosts.reimage for host lists2001.codfw.wmnet with OS bookworm [19:12:00] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.restart [19:12:02] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install lists2001.codfw.wmnet - https://phabricator.wikimedia.org/T342375 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host lists2001.codfw.wmnet with OS bookworm [19:12:19] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.wdqs.restart (exit_code=99) [19:14:22] (03PS1) 10Ssingh: Release 3.99.0~alpha2-2 [debs/gdnsd] - 10https://gerrit.wikimedia.org/r/945637 (https://phabricator.wikimedia.org/T342154) [19:16:03] (03Merged) 10jenkins-bot: Fix mobile search text overlapping [extensions/MobileFrontend] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/945646 (https://phabricator.wikimedia.org/T343397) (owner: 10Jdlrobson) [19:16:29] (03CR) 10CI reject: [V: 04-1] Release 3.99.0~alpha2-2 [debs/gdnsd] - 10https://gerrit.wikimedia.org/r/945637 (https://phabricator.wikimedia.org/T342154) (owner: 10Ssingh) [19:16:30] !log dancy@deploy1002 Started scap: Backport for [[gerrit:945646|Fix mobile search text overlapping (T343397)]] [19:16:33] T343397: Wikidata descriptions in Search result overlay can overlap - https://phabricator.wikimedia.org/T343397 [19:20:08] !log dancy@deploy1002 jdlrobson and dancy: Backport for [[gerrit:945646|Fix mobile search text overlapping (T343397)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [19:20:19] !log dancy@deploy1002 jdlrobson and dancy: Continuing with sync [19:22:32] (03PS2) 10Ssingh: Release 3.99.0~alpha2-2 [debs/gdnsd] - 10https://gerrit.wikimedia.org/r/945637 (https://phabricator.wikimedia.org/T342154) [19:23:01] (03PS1) 10Southparkfan: Cloud VPS: enable rsyslog subject name validation in eqiad1 [puppet] - 
10https://gerrit.wikimedia.org/r/945638 (https://phabricator.wikimedia.org/T127717) [19:25:36] (03PS2) 10Southparkfan: Cloud VPS: enable rsyslog subject name validation in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/945638 (https://phabricator.wikimedia.org/T127717) [19:26:04] !log dancy@deploy1002 Finished scap: Backport for [[gerrit:945646|Fix mobile search text overlapping (T343397)]] (duration: 09m 33s) [19:26:09] T343397: Wikidata descriptions in Search result overlay can overlap - https://phabricator.wikimedia.org/T343397 [19:28:09] (03PS1) 10Jforrester: Add restriction and warning message in Function Evaluator widget for logged out users [extensions/WikiLambda] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/945647 (https://phabricator.wikimedia.org/T343402) [19:28:28] (03PS1) 10Jforrester: Don't clear the About edit fields when we pick a new language [extensions/WikiLambda] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/945648 (https://phabricator.wikimedia.org/T343380) [19:30:14] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.restart [19:31:14] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0) [19:33:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:35:45] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lists2001.codfw.wmnet with reason: host reimage [19:38:53] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lists2001.codfw.wmnet with reason: host reimage [19:43:59] (03PS1) 10TrainBranchBot: group2 wikis to 1.41.0-wmf.20 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/945639 (https://phabricator.wikimedia.org/T340248) [19:44:01] (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.41.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/945639 (https://phabricator.wikimedia.org/T340248) (owner: 10TrainBranchBot) [19:44:41] (03Merged) 10jenkins-bot: group2 wikis to 1.41.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/945639 (https://phabricator.wikimedia.org/T340248) (owner: 10TrainBranchBot) [19:45:45] (03CR) 10Andrew Bogott: [C: 03+2] Cloud VPS: enable rsyslog subject name validation in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/945638 (https://phabricator.wikimedia.org/T127717) (owner: 10Southparkfan) [19:53:01] !log dancy@deploy1002 rebuilt and synchronized wikiversions files group2 wikis to 1.41.0-wmf.20 refs T340248 [19:53:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:04] T340248: 1.41.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T340248 [19:53:08] (03PS3) 10Anzx: pawikisource: add audiobook namespace alias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944863 (https://phabricator.wikimedia.org/T343410) [19:54:41] (03PS1) 10Bking: prometheus-analytics: create alerts for new ZK cluster [alerts] - 10https://gerrit.wikimedia.org/r/945640 (https://phabricator.wikimedia.org/T341792) [19:54:58] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [19:56:14] (03CR) 10CI reject: [V: 04-1] prometheus-analytics: create alerts for new ZK cluster [alerts] - 10https://gerrit.wikimedia.org/r/945640 (https://phabricator.wikimedia.org/T341792) (owner: 10Bking) [20:00:05] brennen and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230803T2000). 
[20:00:05] Dreamy_Jazz and aanzx: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:12] \o [20:00:41] o/ [20:03:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:03:32] I can deploy o/ [20:03:41] :D [20:04:08] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [20:04:09] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lists2001.codfw.wmnet with OS bookworm [20:04:17] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install lists2001.codfw.wmnet - https://phabricator.wikimedia.org/T342375 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host lists2001.codfw.wmnet with OS bookworm completed: - lists2001 (**WARN**)... [20:04:35] thcipriani: Once you're done please shout. 
:-) [20:05:19] James_F: will do [20:05:23] <3 [20:05:50] (03PS3) 10Thcipriani: Write new on group1 except wikidatawiki for event table migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944350 (https://phabricator.wikimedia.org/T330158) (owner: 10Dreamy Jazz) [20:06:45] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by thcipriani@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944350 (https://phabricator.wikimedia.org/T330158) (owner: 10Dreamy Jazz) [20:07:28] (03Merged) 10jenkins-bot: Write new on group1 except wikidatawiki for event table migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944350 (https://phabricator.wikimedia.org/T330158) (owner: 10Dreamy Jazz) [20:07:43] !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:944350|Write new on group1 except wikidatawiki for event table migration (T330158)]] [20:07:46] T330158: Enable write new for the event table migration - https://phabricator.wikimedia.org/T330158 [20:09:12] !log thcipriani@deploy1002 dreamyjazz and thcipriani: Backport for [[gerrit:944350|Write new on group1 except wikidatawiki for event table migration (T330158)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:09:38] Dreamy_Jazz: ^ your change is on mwdebug, check please :) [20:09:45] Thanks. Testing now. [20:10:19] Will, if possible, need you to check the DB on hewiki and wikidatawiki to verify the tests were fine. [20:10:35] k, what am I checking for? [20:10:51] That rows exist in the tables "cu_private_event" and "cu_log_event". [20:11:15] I will say once that check can be done. [20:12:00] okie doke [20:15:02] Okay. Not able to perform a move on hewiki as I need autoconfirmed. 
[20:15:23] So, instead please check that rows exist in "cu_private_event" [20:15:38] I can confirm there is an event in cu_private_event on hewiki and there wasn't one previously [20:15:46] Thanks [20:15:47] or a row, rather :) [20:15:51] :D [20:16:00] anything else to check? Good to go? [20:16:02] Will just check no writing to wikidata [20:16:10] k [20:17:04] Please check that there are no rows in "cu_private_event" on wikidata [20:17:19] confirmed: Empty set [20:17:23] Good. Test complete. [20:17:38] cool, thanks for checking, going live now. [20:17:43] !log thcipriani@deploy1002 dreamyjazz and thcipriani: Continuing with sync [20:20:10] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:23:38] !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:944350|Write new on group1 except wikidatawiki for event table migration (T330158)]] (duration: 15m 54s) [20:23:41] T330158: Enable write new for the event table migration - https://phabricator.wikimedia.org/T330158 [20:23:50] ^ Dreamy_Jazz should be live now [20:23:58] Thanks [20:24:30] aanzx: you're up next [20:24:40] Ok [20:25:11] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by thcipriani@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944863 (https://phabricator.wikimedia.org/T343410) (owner: 10Anzx) [20:25:53] (03Merged) 10jenkins-bot: pawikisource: add audiobook namespace alias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944863 (https://phabricator.wikimedia.org/T343410) (owner: 10Anzx) [20:26:10] !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:944863|pawikisource: add audiobook namespace alias (T343410)]] [20:26:13] T343410: Requesting a new Namespace on pawikisource - Audiobook - https://phabricator.wikimedia.org/T343410 [20:27:45] !log thcipriani@deploy1002 anzx and thcipriani: 
Backport for [[gerrit:944863|pawikisource: add audiobook namespace alias (T343410)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:27:57] Testing [20:28:07] <3 [20:30:08] thcipriani: works fine, good to sync [20:30:22] aanzx: perfect, thanks for checking, going live [20:30:34] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:30:56] !log thcipriani@deploy1002 anzx and thcipriani: Continuing with sync [20:31:13] (03CR) 10Krinkle: rsyslog: ingest 'excimer' logs from webperf to Logstash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/937504 (https://phabricator.wikimedia.org/T339137) (owner: 10Krinkle) [20:36:49] !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:944863|pawikisource: add audiobook namespace alias (T343410)]] (duration: 10m 39s) [20:36:52] T343410: Requesting a new Namespace on pawikisource - Audiobook - https://phabricator.wikimedia.org/T343410 [20:36:57] thcipriani: can you run namespaceDupes.php on pawikisource [20:37:06] yep, sure [20:39:09] aanzx: without --fix: 0 pages to fix, 0 were resolvable. [20:39:17] Looks good! [20:39:32] Thanks 👍 [20:39:38] thanks for the patch [20:39:57] !log end UTC late backport [20:39:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:08] James_F: taavi I'm clear ^ [20:41:50] thanks! 
[20:44:26] (03PS2) 10Jforrester: Wikifunctions: Add TODO task numbers where appropriate [deployment-charts] - 10https://gerrit.wikimedia.org/r/944992 [20:46:31] (03CR) 10Jforrester: [C: 03+2] Add 'wikilambda-edit-object-description' to granular authorization rules [extensions/WikiLambda] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/944864 (https://phabricator.wikimedia.org/T343400) (owner: 10Jforrester) [20:46:37] (03CR) 10Jforrester: [C: 03+2] Add restriction and warning message in Function Evaluator widget for logged out users [extensions/WikiLambda] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/945647 (https://phabricator.wikimedia.org/T343402) (owner: 10Jforrester) [20:46:43] (03CR) 10Jforrester: [C: 03+2] Don't clear the About edit fields when we pick a new language [extensions/WikiLambda] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/945648 (https://phabricator.wikimedia.org/T343380) (owner: 10Jforrester) [20:47:21] (03PS1) 10Jforrester: Wikifunctions: Use image that cascades validation state [deployment-charts] - 10https://gerrit.wikimedia.org/r/945668 [20:47:31] (03CR) 10Jforrester: [C: 03+2] "This is a no-op doc change, to land alongside my next." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/944992 (owner: 10Jforrester) [20:47:40] (03CR) 10Jforrester: [C: 03+2] Wikifunctions: Use image that cascades validation state [deployment-charts] - 10https://gerrit.wikimedia.org/r/945668 (owner: 10Jforrester) [20:48:31] (03Merged) 10jenkins-bot: Wikifunctions: Add TODO task numbers where appropriate [deployment-charts] - 10https://gerrit.wikimedia.org/r/944992 (owner: 10Jforrester) [20:48:33] (03Merged) 10jenkins-bot: Wikifunctions: Use image that cascades validation state [deployment-charts] - 10https://gerrit.wikimedia.org/r/945668 (owner: 10Jforrester) [20:49:35] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [20:49:38] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [20:49:46] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [20:49:46] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:49:48] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [20:50:03] (03Merged) 10jenkins-bot: Add 'wikilambda-edit-object-description' to granular authorization rules [extensions/WikiLambda] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/944864 (https://phabricator.wikimedia.org/T343400) (owner: 10Jforrester) [20:50:20] (03Merged) 10jenkins-bot: Add restriction and warning message in Function Evaluator widget for logged out users [extensions/WikiLambda] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/945647 (https://phabricator.wikimedia.org/T343402) (owner: 10Jforrester) [20:50:22] (03Merged) 10jenkins-bot: Don't clear the About edit fields when we pick a new language [extensions/WikiLambda] (wmf/1.41.0-wmf.20) - 10https://gerrit.wikimedia.org/r/945648 
(https://phabricator.wikimedia.org/T343380) (owner: 10Jforrester) [20:50:29] * James_F waits for chartmuseum [20:51:56] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [20:52:25] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [20:54:04] !log jforrester@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [20:55:16] !log jforrester@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [20:55:20] !log jforrester@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [20:56:27] !log jforrester@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [20:57:39] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install lists2001.codfw.wmnet - https://phabricator.wikimedia.org/T342375 (10Papaul) [20:59:18] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install lists2001.codfw.wmnet - https://phabricator.wikimedia.org/T342375 (10Papaul) 05Open→03Resolved @Marostegui complete [20:59:43] !log jforrester@deploy1002 Synchronized php-1.41.0-wmf.20/extensions/WikiLambda/: T343402 and T343380 (duration: 07m 50s) [20:59:47] T343380: Label editor throws away input when selecting language - https://phabricator.wikimedia.org/T343380 [20:59:48] T343402: Not logged in users should get a warning that they cannot run functions - https://phabricator.wikimedia.org/T343402 [21:00:08] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:00:12] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/943620 (https://phabricator.wikimedia.org/T342182) (owner: 10BCornwall) [21:00:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:00:37] (03CR) 10BCornwall: [C: 03+2] init: Optimize puppet disabling on reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/943620 (https://phabricator.wikimedia.org/T342182) (owner: 10BCornwall) [21:05:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:06:38] (I'm clear from prod.) [21:07:20] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10thcipriani) Noticed today that display names changed to using `cn` instead of `uid` (discussed back in {T288392}): {F37163889 siz... [21:10:46] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install lists2001.codfw.wmnet - https://phabricator.wikimedia.org/T342375 (10Ladsgroup) Thanks @Papaul! 
[21:22:00] (03CR) 10CI reject: [V: 04-1] webperf,arclamp: Rename to clarify as separate roles [puppet] - 10https://gerrit.wikimedia.org/r/935523 (owner: 10Krinkle) [21:22:25] (03PS5) 10Krinkle: xhgui: remove 'xhgui' module, role and profile [puppet] - 10https://gerrit.wikimedia.org/r/935522 (https://phabricator.wikimedia.org/T342724) [21:22:51] (03PS7) 10Krinkle: webperf,arclamp: Rename to clarify as separate roles [puppet] - 10https://gerrit.wikimedia.org/r/935523 [21:23:58] (03CR) 10jenkins-bot: webperf,arclamp: Rename to clarify as separate roles [puppet] - 10https://gerrit.wikimedia.org/r/935523 (owner: 10Krinkle) [21:26:29] (03PS8) 10Krinkle: webperf,arclamp: Rename to clarify as separate roles [puppet] - 10https://gerrit.wikimedia.org/r/935523 [21:26:41] (03CR) 10Krinkle: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/935522 (https://phabricator.wikimedia.org/T342724) (owner: 10Krinkle) [21:26:57] (03CR) 10CI reject: [V: 04-1] webperf,arclamp: Rename to clarify as separate roles [puppet] - 10https://gerrit.wikimedia.org/r/935523 (owner: 10Krinkle) [21:27:24] (03CR) 10Muehlenhoff: xhgui: remove 'xhgui' module, role and profile (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/935522 (https://phabricator.wikimedia.org/T342724) (owner: 10Krinkle) [21:29:02] (03CR) 10Krinkle: xhgui: remove 'xhgui' module, role and profile (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/935522 (https://phabricator.wikimedia.org/T342724) (owner: 10Krinkle) [21:29:09] (03PS6) 10Krinkle: xhgui: remove 'xhgui' module, role and profile [puppet] - 10https://gerrit.wikimedia.org/r/935522 (https://phabricator.wikimedia.org/T342724) [21:30:01] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10bd808) >>! In T320390#9067911, @thcipriani wrote: > Noticed today that display names changed to using `cn` instead of `uid` (discu... 
[21:31:47] moritzm: could you take a look at https://integration.wikimedia.org/ci/job/operations-puppet-tests-buster-docker/68913/console ? I'm not getting why it fails. [21:32:08] Among the various unrelated bash errors and general Jenkins wrappers, the main thing that stands out is the Python KeyError [21:32:16] but I don't get how that relates to my patch :) [21:35:08] (03PS9) 10Krinkle: webperf,arclamp: Rename to clarify as separate roles [puppet] - 10https://gerrit.wikimedia.org/r/935523 [21:35:16] (03CR) 10Krinkle: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/935523 (owner: 10Krinkle) [21:36:44] Krinkle: that indeed looks like unrelated breakage, I don't see how this could be related to your patch [21:36:54] I'll have a closer look tomorrow [21:36:57] moritzm: it only happens on that patch it seems, not on the parent patch [21:37:03] so I guess I must be doing something to trigger it [21:37:19] okay, no problem! [21:38:16] the latest revision worked, though? [21:38:50] not sure, some transient breakage maybe, at least I haven't seen that specific error so far [21:42:23] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!! 
😊" [puppet] - 10https://gerrit.wikimedia.org/r/935522 (https://phabricator.wikimedia.org/T342724) (owner: 10Krinkle) [22:16:42] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [22:18:38] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: setup switch port and DNS for db2188-db2195 - pt1979@cumin2002" [22:19:23] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: setup switch port and DNS for db2188-db2195 - pt1979@cumin2002" [22:19:23] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:22:05] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2188.mgmt.codfw.wmnet with reboot policy FORCED [22:22:50] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2189.mgmt.codfw.wmnet with reboot policy FORCED [22:32:08] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10brennen) To summarize discussion from Slack and libera.chat `#wikimedia-gitlab`: - Apart from T343485, we don't believe this has... 
[22:35:10] (03PS10) 10Krinkle: webperf,arclamp: Rename to clarify as separate roles [puppet] - 10https://gerrit.wikimedia.org/r/935523 [22:35:18] (03CR) 10Krinkle: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/935523 (owner: 10Krinkle) [22:37:05] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db21[88-95] - https://phabricator.wikimedia.org/T342174 (10Papaul) [22:37:22] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db21[88-95] - https://phabricator.wikimedia.org/T342174 (10Papaul) [22:38:28] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2190.mgmt.codfw.wmnet with reboot policy FORCED [22:39:10] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2191.mgmt.codfw.wmnet with reboot policy FORCED [22:49:50] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2189.mgmt.codfw.wmnet with reboot policy FORCED [22:50:08] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2188.mgmt.codfw.wmnet with reboot policy FORCED [23:04:50] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2188'] [23:05:23] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2189'] [23:08:49] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db21[88-95] - https://phabricator.wikimedia.org/T342174 (10Papaul) [23:15:25] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['db2189'] [23:15:39] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['db2188'] [23:18:57] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2188'] [23:19:17] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade 
firmware for hosts ['db2188'] [23:19:23] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2189'] [23:19:36] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db2189'] [23:21:43] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2191.mgmt.codfw.wmnet with reboot policy FORCED [23:22:25] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2190.mgmt.codfw.wmnet with reboot policy FORCED [23:26:24] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2188.codfw.wmnet with OS bullseye [23:26:31] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db21[88-95] - https://phabricator.wikimedia.org/T342174 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2188.codfw.wmnet with OS bullseye [23:27:03] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db21[88-95] - https://phabricator.wikimedia.org/T342174 (10Papaul) [23:27:23] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2190'] [23:27:56] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2191'] [23:36:16] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2189.codfw.wmnet with OS bullseye [23:36:22] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db21[88-95] - https://phabricator.wikimedia.org/T342174 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2189.codfw.wmnet with OS bullseye [23:39:19] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db2191'] [23:39:24] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware 
for hosts ['db2190'] [23:39:52] (03PS1) 10Jforrester: Wikifunctions: Use orchestrator image that double-checks validation state too [deployment-charts] - 10https://gerrit.wikimedia.org/r/945684 [23:41:23] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2190.codfw.wmnet with OS bullseye [23:41:34] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db21[88-95] - https://phabricator.wikimedia.org/T342174 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2190.codfw.wmnet with OS bullseye [23:41:52] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db21[88-95] - https://phabricator.wikimedia.org/T342174 (10Papaul) [23:46:14] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2191.codfw.wmnet with OS bullseye [23:46:21] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db21[88-95] - https://phabricator.wikimedia.org/T342174 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2191.codfw.wmnet with OS bullseye [23:47:15] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2188.codfw.wmnet with reason: host reimage [23:50:23] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2188.codfw.wmnet with reason: host reimage [23:56:25] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2189.codfw.wmnet with reason: host reimage [23:59:34] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2189.codfw.wmnet with reason: host reimage