[00:00:04] twentyafterfour: (Dis)respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210527T0000). Please do the needful.
[00:04:14] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: drop_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:15:44] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:49:10] SRE, SRE-Access-Requests: access to analytics data for wdqs for jmixter - https://phabricator.wikimedia.org/T283632 (jmixter) sorry here is the information: * Wikitech full username: Jeff Mixter * Wikitech username: Jeff Mixter * Wikitech shell username: jmixter * Email address: jmixter-ctr@wikimedia.or...
[00:58:44] (CR) Bstorm: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/682259 (https://phabricator.wikimedia.org/T127717) (owner: Southparkfan)
[01:12:34] (PS12) Ottomata: Airflow puppetization + airflow@analytics on an-test-coord1001 [puppet] - https://gerrit.wikimedia.org/r/694514 (https://phabricator.wikimedia.org/T272973)
[01:16:18] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:10:18] SRE, Okapi [Wikimedia Enterprise], Traffic: "wikimedia.com" DNS transfer to Wikimedia Enterprise's AWS infra - https://phabricator.wikimedia.org/T281428 (RBrounley_WMF) a:Eugene.chernov
[04:18:06] SRE, DC-Ops, netops: allow mgmt network to access tftp servers for firmware updates - https://phabricator.wikimedia.org/T283771 (Papaul) @RobH I have only one question for now. what is or will be your approach on keeping the TFTP server up to date with the latest firmware.
[04:18:46] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:51:11] (PS1) Muehlenhoff: Extend access for iflorez [puppet] - https://gerrit.wikimedia.org/r/695818
[04:54:42] (CR) Muehlenhoff: [C: +2] Extend access for iflorez [puppet] - https://gerrit.wikimedia.org/r/695818 (owner: Muehlenhoff)
[05:13:29] SRE: Icinga alerts mention the wrong data center - https://phabricator.wikimedia.org/T283762 (Marostegui) p:Triage→High
[05:15:55] (PS1) Marostegui: Revert "db1148: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/695827
[05:19:28] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:23:15] (CR) Marostegui: "Thanks a lot daniel! :)" [puppet] - https://gerrit.wikimedia.org/r/695371 (https://phabricator.wikimedia.org/T283486) (owner: Marostegui)
[05:23:23] (PS3) Marostegui: data.yaml: Add Kay Wong to analytics-privatedata-users (with kerberos) [puppet] - https://gerrit.wikimedia.org/r/695371 (https://phabricator.wikimedia.org/T283486)
[05:23:56] (CR) Marostegui: [C: +2] Revert "db1148: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/695827 (owner: Marostegui)
[05:24:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1148 (re)pooling @ 25%: Repool db1148', diff saved to https://phabricator.wikimedia.org/P16223 and previous config saved to /var/cache/conftool/dbconfig/20210527-052442-root.json
[05:24:45] (CR) Marostegui: [C: +2] data.yaml: Add Kay Wong to analytics-privatedata-users (with kerberos) [puppet] - https://gerrit.wikimedia.org/r/695371 (https://phabricator.wikimedia.org/T283486) (owner: Marostegui)
[05:24:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:25:58] PROBLEM - puppet last run on
cloudelastic1003 is CRITICAL: CRITICAL: Puppet has been disabled for 604871 seconds, message: cloudelastic reboot - ryankemper@cumin1001 - T283223, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[05:28:45] (PS1) Giuseppe Lavagetto: service::catalog: stop using %{::site} in interpolations [puppet] - https://gerrit.wikimedia.org/r/695870 (https://phabricator.wikimedia.org/T283762)
[05:29:22] !log `ryankemper@cloudelastic1003:~$ sudo run-puppet-agent --force`
[05:29:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:29:56] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to Analytic Cluster for Research Contractor (Kay Wong) - https://phabricator.wikimedia.org/T283486 (Marostegui)
[05:30:47] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to Analytic Cluster for Research Contractor (Kay Wong) - https://phabricator.wikimedia.org/T283486 (Marostegui) Open→Resolved Patch merged. User added to nda ldap group Kerberos principal user created and emailed ayyenwong@gmail.co...
[05:31:42] RECOVERY - puppet last run on cloudelastic1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[05:39:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1148 (re)pooling @ 50%: Repool db1148', diff saved to https://phabricator.wikimedia.org/P16224 and previous config saved to /var/cache/conftool/dbconfig/20210527-053946-root.json
[05:39:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:39:56] (PS1) Marostegui: data.yaml: Add Bumeh-ctr to analytics groups [puppet] - https://gerrit.wikimedia.org/r/695872 (https://phabricator.wikimedia.org/T283648)
[05:41:04] (CR) Marostegui: [C: -2] "Waiting for ssh key verification" [puppet] - https://gerrit.wikimedia.org/r/695872 (https://phabricator.wikimedia.org/T283648) (owner: Marostegui)
[05:41:30] SRE, Analytics, SRE-Access-Requests, Patch-For-Review: Requesting access to production shell for BUmeh - https://phabricator.wikimedia.org/T283648 (Marostegui)
[05:41:49] SRE, Analytics, SRE-Access-Requests, Patch-For-Review: Requesting access to production shell for BUmeh - https://phabricator.wikimedia.org/T283648 (Marostegui) Patch uploaded, this is only waiting for the ssh key verification.
[05:44:02] SRE, SRE-Access-Requests: access to analytics data for wdqs for jmixter - https://phabricator.wikimedia.org/T283632 (Marostegui) a:Marostegui
[05:44:32] SRE, SRE-Access-Requests: access to analytics data for wdqs for jmixter - https://phabricator.wikimedia.org/T283632 (Marostegui) @CBogen we'd need the expiry date of the contract. We'd also need to verify the ssh key, this could be done via video call with you or with @jmixter directly pasting his ssh ke...
[05:47:32] SRE, Analytics-Radar, Patch-For-Review, Services (watching), User-herron: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (elukey)
[05:53:33] (PS1) Marostegui: db1147: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/695884
[05:54:27] (CR) Marostegui: [C: +2] db1147: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/695884 (owner: Marostegui)
[05:54:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1148 (re)pooling @ 75%: Repool db1148', diff saved to https://phabricator.wikimedia.org/P16225 and previous config saved to /var/cache/conftool/dbconfig/20210527-055450-root.json
[05:54:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:55:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1147', diff saved to https://phabricator.wikimedia.org/P16226 and previous config saved to /var/cache/conftool/dbconfig/20210527-055507-marostegui.json
[05:55:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:47] SRE, Analytics-Radar, Patch-For-Review, Services (watching), User-herron: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (elukey) Thanks to Keith's work we now have two 5-nodes clusters! \o/ The last step before...
[06:06:24] (PS2) Giuseppe Lavagetto: service::catalog: stop using %{::site} in interpolations [puppet] - https://gerrit.wikimedia.org/r/695870 (https://phabricator.wikimedia.org/T283762)
[06:07:14] (PS3) Jcrespo: Revert "Revert "Revert "bacula: Reenable read-write ES database backups, disable read-only""" [puppet] - https://gerrit.wikimedia.org/r/695030
[06:08:37] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29716/console" [puppet] - https://gerrit.wikimedia.org/r/695870 (https://phabricator.wikimedia.org/T283762) (owner: Giuseppe Lavagetto)
[06:08:49] (PS2) Marostegui: data.yaml: Add Bumeh-ctr to analytics groups [puppet] - https://gerrit.wikimedia.org/r/695872 (https://phabricator.wikimedia.org/T283648)
[06:09:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1148 (re)pooling @ 100%: Repool db1148', diff saved to https://phabricator.wikimedia.org/P16227 and previous config saved to /var/cache/conftool/dbconfig/20210527-060953-root.json
[06:09:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:19:37] (CR) Jcrespo: [C: +2] Revert "Revert "Revert "bacula: Reenable read-write ES database backups, disable read-only""" [puppet] - https://gerrit.wikimedia.org/r/695030 (owner: Jcrespo)
[06:22:01] (PS3) Giuseppe Lavagetto: service::catalog: fix the use of %{::site} in interpolations [puppet] - https://gerrit.wikimedia.org/r/695870 (https://phabricator.wikimedia.org/T283762)
[06:22:37] (PS1) Jcrespo: Revert "Revert "Revert "Revert "bacula: Reenable read-write ES database backups, disable read-only"""" [puppet] - https://gerrit.wikimedia.org/r/695830
[06:22:44] (CR) jerkins-bot: [V: -1] service::catalog: fix the use of %{::site} in interpolations [puppet] - https://gerrit.wikimedia.org/r/695870 (https://phabricator.wikimedia.org/T283762) (owner: Giuseppe Lavagetto)
[06:23:35] (CR)
Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29717/console" [puppet] - https://gerrit.wikimedia.org/r/695870 (https://phabricator.wikimedia.org/T283762) (owner: Giuseppe Lavagetto)
[06:27:43] (PS4) Giuseppe Lavagetto: service::catalog: fix the use of %{::site} in interpolations [puppet] - https://gerrit.wikimedia.org/r/695870 (https://phabricator.wikimedia.org/T283762)
[06:30:07] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29718/console" [puppet] - https://gerrit.wikimedia.org/r/695870 (https://phabricator.wikimedia.org/T283762) (owner: Giuseppe Lavagetto)
[06:32:33] (PS2) Jcrespo: Revert "Revert "Revert "Revert "bacula: Reenable read-write ES database backups, disable read-only"""" [puppet] - https://gerrit.wikimedia.org/r/695830
[06:33:02] (CR) Jcrespo: [C: -2] "Waiting for backup to run, should finish in ~2 days." [puppet] - https://gerrit.wikimedia.org/r/695830 (owner: Jcrespo)
[06:36:09] (CR) Ayounsi: [C: +1] service::catalog: fix the use of %{::site} in interpolations [puppet] - https://gerrit.wikimedia.org/r/695870 (https://phabricator.wikimedia.org/T283762) (owner: Giuseppe Lavagetto)
[06:38:58] (CR) Giuseppe Lavagetto: [V: +1 C: +2] service::catalog: fix the use of %{::site} in interpolations [puppet] - https://gerrit.wikimedia.org/r/695870 (https://phabricator.wikimedia.org/T283762) (owner: Giuseppe Lavagetto)
[06:40:25] SRE, ops-codfw, DBA: Degraded RAID on db2107 - https://phabricator.wikimedia.org/T282072 (Marostegui) Open→Resolved RAID back to optimal ` root@db2107:~# megacli -LDInfo -Lall -aALL Adapter 0 -- Virtual Drive Information: Virtual Drive: 0 (Target Id: 0) Name : RAID Level...
[06:40:53] jouncebot: now
[06:40:53] No deployments scheduled for the next 3 hour(s) and 19 minute(s)
[06:40:56] jouncebot: next
[06:40:57] In 3 hour(s) and 19 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210527T1000)
[06:40:59] SRE, SRE-Access-Requests: Requesting access to Analytic Cluster for Research Contractor (Kay Wong) - https://phabricator.wikimedia.org/T283486 (0xkaywong) Thanks @Marostegui!
[06:41:10] (PS1) Urbanecm: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/695932
[06:41:12] (CR) Urbanecm: [C: +2] Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/695932 (owner: Urbanecm)
[06:42:02] (Merged) jenkins-bot: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/695932 (owner: Urbanecm)
[06:43:17] !log urbanecm@deploy1002 Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 02m 13s)
[06:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:43:22] (PS1) Elukey: hadoop: tune some HDFS Namenode GC settings [puppet] - https://gerrit.wikimedia.org/r/695933 (https://phabricator.wikimedia.org/T283733)
[06:43:33] (PS4) Muehlenhoff: ssh: Remove deprecated option UsePrivilegeSeparation sandbox [puppet] - https://gerrit.wikimedia.org/r/635288 (https://phabricator.wikimedia.org/T170298) (owner: Jcrespo)
[06:44:34] (CR) Muehlenhoff: "With all jessie hosts gone, this is good to go now, I'll take care of the rollout."
[puppet] - https://gerrit.wikimedia.org/r/635288 (https://phabricator.wikimedia.org/T170298) (owner: Jcrespo)
[06:47:29] !log ryankemper@puppetmaster2001 conftool action : set/pooled=no; selector: name=wdqs1003.eqiad.wmnet
[06:47:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:54] (PS1) Muehlenhoff: Remove old compat code [puppet] - https://gerrit.wikimedia.org/r/695940
[06:50:06] SRE: bacula restore job waiting on higher jobs - https://phabricator.wikimedia.org/T95705 (jcrespo) Resolved→Open I'd like to reopen this, and enable "Allow Mixed Priority" to allow jobs with different priorities to run at the same time. If I have running backups (which is true 90% of the time now)...
[06:52:09] (CR) Joal: [C: +1] "Thanks luca :)" [puppet] - https://gerrit.wikimedia.org/r/695933 (https://phabricator.wikimedia.org/T283733) (owner: Elukey)
[06:53:05] SRE, Patch-For-Review: Icinga alerts mention the wrong data center - https://phabricator.wikimedia.org/T283762 (Joe) Open→Resolved a:Joe
[06:53:50] (CR) Muehlenhoff: ssh: Remove deprecated option UsePrivilegeSeparation sandbox (1 comment) [puppet] - https://gerrit.wikimedia.org/r/635288 (https://phabricator.wikimedia.org/T170298) (owner: Jcrespo)
[06:55:27] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:59:56] (CR) Elukey: [C: +2] hadoop: tune some HDFS Namenode GC settings [puppet] - https://gerrit.wikimedia.org/r/695933 (https://phabricator.wikimedia.org/T283733) (owner: Elukey)
[07:07:44] checking backups
[07:09:07] (PS5) Muehlenhoff: ssh: Remove deprecated option UsePrivilegeSeparation sandbox [puppet] - https://gerrit.wikimedia.org/r/635288 (https://phabricator.wikimedia.org/T170298) (owner: Jcrespo)
[07:11:49] !log cmooney@cumin1001 Gerrit 694305: Add
Wikidough Anycast range to aggregate config to cr2-codfw
[07:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:05] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:14:39] !log cmooney@cumin1001 Gerrit 694305: Add Wikidough Anycast range to aggregate config to cr1-eqdfw
[07:14:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:49] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (backup2002), Fresh: 101 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[07:17:56] (PS5) Jcrespo: mediabackup: Install minio on the storage hosts and open port 9000 [puppet] - https://gerrit.wikimedia.org/r/694332 (https://phabricator.wikimedia.org/T276442)
[07:17:58] (PS1) Jcrespo: backup: Ignore one-time read only backups staleness while they are enabled [puppet] - https://gerrit.wikimedia.org/r/695967 (https://phabricator.wikimedia.org/T282249)
[07:18:11] (PS2) Jcrespo: backup: Ignore one-time read only backups staleness while they are enabled [puppet] - https://gerrit.wikimedia.org/r/695967 (https://phabricator.wikimedia.org/T282249)
[07:19:58] (CR) Filippo Giunchedi: [C: +1] alertmanager: route Traffic team alerts [puppet] - https://gerrit.wikimedia.org/r/695367 (https://phabricator.wikimedia.org/T282806) (owner: Ema)
[07:20:04] !log cmooney@cumin1001 Gerrit 694305: Run homer to announce Wikidough Anycast range from cr's in ulsfo
[07:20:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:17] (CR) Jcrespo: [C: +2] backup: Ignore one-time read only backups staleness while they are enabled [puppet] - https://gerrit.wikimedia.org/r/695967 (https://phabricator.wikimedia.org/T282249) (owner: Jcrespo)
[07:25:03] SRE, DC-Ops, netops: allow mgmt network to access tftp servers for firmware updates -
https://phabricator.wikimedia.org/T283771 (ayounsi) As you said it would be a good idea to see how it fits in the big automation picture. First by detailing precisely the current workflows, identifying the pain poin...
[07:25:07] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 100 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[07:28:07] (PS3) Jcrespo: Revert "Revert "Revert "Revert "bacula: Reenable read-write ES database backups, disable read-only"""" [puppet] - https://gerrit.wikimedia.org/r/695830
[07:33:32] (CR) Muehlenhoff: "> Patch Set 4:" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/694332 (https://phabricator.wikimedia.org/T276442) (owner: Jcrespo)
[07:42:11] !log cmooney@cumin1001 Gerrit 694305: Run homer to add Wikidough prefix aggregate config on cr2-eqord
[07:42:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:27] !log cmooney@cumin1001 Gerrit 694305: Run homer to add Wikidough prefix aggregate config on cr's in eqsin
[07:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:11] !log adding stephane at kiwix as owner of offline-l per email
[07:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:23] !log cmooney@cumin1001 Gerrit 694305: Run homer to add Wikidough prefix aggregate config on cr's in AMS
[07:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:27] (PS1) WMDE-Fisch: Don't update backButton visibility if not set [extensions/VisualEditor] (wmf/1.37.0-wmf.7) - https://gerrit.wikimedia.org/r/695831 (https://phabricator.wikimedia.org/T283511)
[07:49:55] (PS1) WMDE-Fisch: Don't update backButton visibility if not set [extensions/VisualEditor] (wmf/1.37.0-wmf.6) - https://gerrit.wikimedia.org/r/695832 (https://phabricator.wikimedia.org/T283511)
[07:51:46] SRE, vm-requests: : of VMs requested for - https://phabricator.wikimedia.org/T283796
(MGBnetwork)
[07:52:17] SRE, vm-requests: : of VMs requested for - https://phabricator.wikimedia.org/T283796 (MGBnetwork) a:akosiaris→None
[07:54:36] (CR) Jcrespo: "> Patch Set 5:" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/694332 (https://phabricator.wikimedia.org/T276442) (owner: Jcrespo)
[08:01:12] (CR) Jcrespo: "> Ok. I will try." [puppet] - https://gerrit.wikimedia.org/r/694332 (https://phabricator.wikimedia.org/T276442) (owner: Jcrespo)
[08:06:48] SRE, ops-codfw, DBA, serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (Marostegui) Everything is done from either dbs and backup hosts side of things. Removing DBA tag
[08:06:56] SRE, ops-codfw, serviceops: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (Marostegui)
[08:17:40] (CR) Muehlenhoff: mediabackup: Install minio on the storage hosts and open port 9000 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/694332 (https://phabricator.wikimedia.org/T276442) (owner: Jcrespo)
[08:23:49] (CR) Hashar: [C: +1] "Can be done anytime, Puppet will not restart the service, they are explicitly unmanaged :]" [puppet] - https://gerrit.wikimedia.org/r/695940 (owner: Muehlenhoff)
[08:24:13] (Abandoned) David Caro: prometheus: Override retention also when specifying retention by size [puppet] - https://gerrit.wikimedia.org/r/695194 (owner: David Caro)
[08:24:15] (PS1) Jbond: site.pp: add snapshot101[45] to insetup [puppet] - https://gerrit.wikimedia.org/r/695985 (https://phabricator.wikimedia.org/T283545)
[08:25:39] (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29719/console" [puppet] - https://gerrit.wikimedia.org/r/695985 (https://phabricator.wikimedia.org/T283545) (owner: Jbond)
[08:26:15] (CR) Jbond: [V: +1 C: +2] site.pp: add snapshot101[45] to insetup [puppet] -
https://gerrit.wikimedia.org/r/695985 (https://phabricator.wikimedia.org/T283545) (owner: Jbond)
[08:27:00] (CR) ArielGlenn: [C: +1] "Hey, that works for me, as long as it doesn't cause problems for dc ops." [puppet] - https://gerrit.wikimedia.org/r/695985 (https://phabricator.wikimedia.org/T283545) (owner: Jbond)
[08:27:18] rats
[08:27:29] you merged before my +1 got in. oh well the intent was there!
[08:28:33] SRE, Dumps-Generation, Patch-For-Review: snapshot101[45] have no role, break puppet run - https://phabricator.wikimedia.org/T283545 (jbond) i have re-added role::insetup to these hosts to stop the email spam
[08:30:04] !log installing libx11 security updates
[08:30:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:02] !log removing stale peers (AS8674 / Netnod and AS57695 / Misaka) from cr2-esams
[08:35:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:31] (CR) Jbond: "> Patch Set 1: Code-Review+1" [puppet] - https://gerrit.wikimedia.org/r/695985 (https://phabricator.wikimedia.org/T283545) (owner: Jbond)
[08:35:43] (CR) David Caro: "I don't think I know enough to approve the patch, said that, a minor puppet code comment and LGTM" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/695447 (https://phabricator.wikimedia.org/T224747) (owner: Bstorm)
[08:36:04] apergos: sorry was a pretty harmless PS so i just self merged
[08:39:47] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:40:11] jbond: it was!
I'm not complaining about you, just my poor timing :-)
[08:40:46] (PS2) David Caro: ceph: send logs to kafka [puppet] - https://gerrit.wikimedia.org/r/695329
[08:40:49] (CR) David Caro: ceph: send logs to kafka (1 comment) [puppet] - https://gerrit.wikimedia.org/r/695329 (owner: David Caro)
[08:41:05] :) ack
[08:41:33] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:49:42] (PS1) Kosta Harlan: Help panel: SwitchEditorPanel fixes [extensions/GrowthExperiments] (wmf/1.37.0-wmf.6) - https://gerrit.wikimedia.org/r/695833 (https://phabricator.wikimedia.org/T282800)
[08:49:52] (PS1) Kosta Harlan: Help panel: SwitchEditorPanel fixes [extensions/GrowthExperiments] (wmf/1.37.0-wmf.7) - https://gerrit.wikimedia.org/r/695834 (https://phabricator.wikimedia.org/T282800)
[08:53:44] (PS1) Filippo Giunchedi: swift: group-writable log directory [puppet] - https://gerrit.wikimedia.org/r/696013
[08:53:46] (PS1) Filippo Giunchedi: pontoon: add hiera settings for swift [puppet] - https://gerrit.wikimedia.org/r/696014
[08:55:04] seeking folks for those quick reviews above ^
[08:55:36] (PS1) Jbond: idp: add gitlab to production idp [puppet] - https://gerrit.wikimedia.org/r/696015 (https://phabricator.wikimedia.org/T279545)
[08:56:19] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29720/console" [puppet] - https://gerrit.wikimedia.org/r/696015 (https://phabricator.wikimedia.org/T279545) (owner: Jbond)
[08:57:54] (CR) Ema: [C: +2] alertmanager: route Traffic team alerts [puppet] - https://gerrit.wikimedia.org/r/695367 (https://phabricator.wikimedia.org/T282806) (owner: Ema)
[08:58:08] (CR) Jbond: [V: +1 C: +2] idp: add gitlab to production idp [puppet] -
https://gerrit.wikimedia.org/r/696015 (https://phabricator.wikimedia.org/T279545) (owner: Jbond)
[09:03:05] SRE, Gerrit, LDAP-Access-Requests: Add dancy to `archiva-deployers` LDAP group - https://phabricator.wikimedia.org/T283347 (hashar) p:Triage→Medium
[09:06:10] (CR) Kormat: [C: +1] swift: group-writable log directory [puppet] - https://gerrit.wikimedia.org/r/696013 (owner: Filippo Giunchedi)
[09:06:23] (CR) Kormat: [C: +1] pontoon: add hiera settings for swift [puppet] - https://gerrit.wikimedia.org/r/696014 (owner: Filippo Giunchedi)
[09:06:39] godog: you should have called them "swift reviews" ;) done
[09:07:00] kormat: hahaha indeed! thank you very much
[09:07:27] (CR) Filippo Giunchedi: [C: +2] swift: group-writable log directory [puppet] - https://gerrit.wikimedia.org/r/696013 (owner: Filippo Giunchedi)
[09:07:33] (CR) Filippo Giunchedi: [C: +2] pontoon: add hiera settings for swift [puppet] - https://gerrit.wikimedia.org/r/696014 (owner: Filippo Giunchedi)
[09:08:50] (PS1) Jbond: idp: Also allow NDA group to access gitlab [puppet] - https://gerrit.wikimedia.org/r/696017
[09:08:54] (CR) Klausman: [C: +1] Add istio base images build support [docker-images/production-images] - https://gerrit.wikimedia.org/r/688211 (https://phabricator.wikimedia.org/T278192) (owner: Elukey)
[09:09:02] (CR) Klausman: [C: +1] Add knative serving and net-istio images [docker-images/production-images] - https://gerrit.wikimedia.org/r/692899 (https://phabricator.wikimedia.org/T278194) (owner: Elukey)
[09:09:23] (CR) Jbond: [C: +2] idp: Also allow NDA group to access gitlab [puppet] - https://gerrit.wikimedia.org/r/696017 (owner: Jbond)
[09:09:40] (CR) Klausman: [C: +1] Add base kubeflow kfserving images and kube-rbac-proxy [docker-images/production-images] - https://gerrit.wikimedia.org/r/693644 (https://phabricator.wikimedia.org/T272919) (owner: Elukey)
[09:10:05] (CR)
Klausman: [C: +1] Add Jetstack's cert-manager base go images. [docker-images/production-images] - https://gerrit.wikimedia.org/r/693826 (https://phabricator.wikimedia.org/T280661) (owner: Elukey)
[09:12:04] (CR) Klausman: [C: +1] Add new golang 1.15 image [docker-images/production-images] - https://gerrit.wikimedia.org/r/695173 (owner: Elukey)
[09:15:11] (PS1) Jbond: P:gitlab: open SSH port to the world [puppet] - https://gerrit.wikimedia.org/r/696024 (https://phabricator.wikimedia.org/T276144)
[09:19:21] (CR) David Caro: [C: +2] ceph: send logs to kafka (1 comment) [puppet] - https://gerrit.wikimedia.org/r/695329 (owner: David Caro)
[09:21:13] (PS1) Jcrespo: dbbackups: Move s2 from db2098 to db2097, reimage db2098 to buster [puppet] - https://gerrit.wikimedia.org/r/696027 (https://phabricator.wikimedia.org/T280979)
[09:22:49] (CR) Muehlenhoff: "ready for review" [puppet] - https://gerrit.wikimedia.org/r/695203 (https://phabricator.wikimedia.org/T283242) (owner: Muehlenhoff)
[09:26:16] (CR) Jbond: mediabackup: Install minio on the storage hosts and open port 9000 (2 comments) [puppet] - https://gerrit.wikimedia.org/r/694332 (https://phabricator.wikimedia.org/T276442) (owner: Jcrespo)
[09:29:15] (PS1) Giuseppe Lavagetto: httpd-fcgi: fixes to functionality and tests [docker-images/production-images] - https://gerrit.wikimedia.org/r/696029 (https://phabricator.wikimedia.org/T283774)
[09:29:27] (PS1) Matthias Mullie: Rename Special:MediaSearch to Special:OldMediaSearch [extensions/WikibaseMediaInfo] (wmf/1.37.0-wmf.7) - https://gerrit.wikimedia.org/r/695835 (https://phabricator.wikimedia.org/T265939)
[09:31:25] (CR) Giuseppe Lavagetto: [V: +2 C: +2] httpd-fcgi: fixes to functionality and tests [docker-images/production-images] - https://gerrit.wikimedia.org/r/696029 (https://phabricator.wikimedia.org/T283774) (owner: Giuseppe Lavagetto)
[09:34:24] SRE, Services, Patch-For-Review,
Service-deployment-requests: New Service Request Shellbox - https://phabricator.wikimedia.org/T281423 (Joe)
[09:35:56] (CR) Arturo Borrero Gonzalez: [C: +1] wmcs.openstack: add safe_reboot cloudvirt cookbook [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/695221 (https://phabricator.wikimedia.org/T280641) (owner: David Caro)
[09:37:24] (CR) Arturo Borrero Gonzalez: [C: +1] "I think the next patch renames the python class. I wonder if the two changes could be squashed together." [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/695220 (https://phabricator.wikimedia.org/T280641) (owner: David Caro)
[09:37:50] (CR) Arturo Borrero Gonzalez: [C: +1] gerrit: using wmcs as the default branch [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/695199 (owner: David Caro)
[09:39:29] (CR) Arturo Borrero Gonzalez: [C: +1] wmcs.ceph: Added mon upgrade cookbooks [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/695201 (owner: David Caro)
[09:40:28] (CR) Arturo Borrero Gonzalez: [C: +1] wmcs.ceph: add cookbook to upgrade all osds [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/695202 (https://phabricator.wikimedia.org/T280641) (owner: David Caro)
[09:41:34] (CR) Arturo Borrero Gonzalez: [C: +1] wmcs.cloudvirt.safe_reboot: add log to SAL [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/695223 (https://phabricator.wikimedia.org/T279076) (owner: David Caro)
[09:43:04] (PS1) JMeybohm: Add dummy secrets for httpbb tests [labs/private] - https://gerrit.wikimedia.org/r/696035 (https://phabricator.wikimedia.org/T264209)
[09:44:04] (CR) Arturo Borrero Gonzalez: [C: +1] wmcs.openstack: add live_upgrade cloudvirt cookbook (2 comments) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/695222 (https://phabricator.wikimedia.org/T280641) (owner: David Caro)
[09:44:16] (CR) JMeybohm: [V: +2 C: +2] Add dummy secrets for httpbb tests [labs/private] - https://gerrit.wikimedia.org/r/696035
(https://phabricator.wikimedia.org/T264209) (owner: JMeybohm)
[09:45:11] (PS1) Matthias Mullie: Rename to OldMediaSearch & remove duplicate preferences & hooks [extensions/WikibaseMediaInfo] (wmf/1.37.0-wmf.7) - https://gerrit.wikimedia.org/r/696039 (https://phabricator.wikimedia.org/T265939)
[09:46:36] (Abandoned) Matthias Mullie: Rename Special:MediaSearch to Special:OldMediaSearch [extensions/WikibaseMediaInfo] (wmf/1.37.0-wmf.7) - https://gerrit.wikimedia.org/r/695835 (https://phabricator.wikimedia.org/T265939) (owner: Matthias Mullie)
[09:46:54] !log restarting mariadb on pc1007 to upgrade it
[09:46:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:10] (CR) David Caro: [C: +2] gerrit: using wmcs as the default branch [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/695199 (owner: David Caro)
[09:50:35] (CR) David Caro: [C: +2] wmcs: add cloudvirt drain cookbook [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/695220 (https://phabricator.wikimedia.org/T280641) (owner: David Caro)
[09:50:38] (CR) David Caro: [C: +2] wmcs.openstack: add safe_reboot cloudvirt cookbook [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/695221 (https://phabricator.wikimedia.org/T280641) (owner: David Caro)
[09:50:55] (PS1) Matthias Mullie: Rename to MediaSearch & activate preferences & hooks [extensions/MediaSearch] (wmf/1.37.0-wmf.7) - https://gerrit.wikimedia.org/r/696042 (https://phabricator.wikimedia.org/T265939)
[09:52:56] (CR) David Caro: [C: +2] wmcs.openstack: add live_upgrade cloudvirt cookbook (2 comments) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/695222 (https://phabricator.wikimedia.org/T280641) (owner: David Caro)
[09:53:05] (CR) David Caro: [C: +2] wmcs.cloudvirt.safe_reboot: add log to SAL [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/695223 (https://phabricator.wikimedia.org/T279076) (owner: David Caro)
[09:53:15] (CR) Elukey: [V:
03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29723/console" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/694514 (https://phabricator.wikimedia.org/T272973) (owner: 10Ottomata) [09:54:24] (03Merged) 10jenkins-bot: gerrit: using wmcs as the default branch [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/695199 (owner: 10David Caro) [09:54:26] (03Merged) 10jenkins-bot: wmcs: add cloudvirt drain cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/695220 (https://phabricator.wikimedia.org/T280641) (owner: 10David Caro) [09:54:38] (03Merged) 10jenkins-bot: wmcs.openstack: add safe_reboot cloudvirt cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/695221 (https://phabricator.wikimedia.org/T280641) (owner: 10David Caro) [09:57:11] (03Merged) 10jenkins-bot: wmcs.openstack: add live_upgrade cloudvirt cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/695222 (https://phabricator.wikimedia.org/T280641) (owner: 10David Caro) [09:57:13] (03Merged) 10jenkins-bot: wmcs.cloudvirt.safe_reboot: add log to SAL [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/695223 (https://phabricator.wikimedia.org/T279076) (owner: 10David Caro) [10:00:04] mvolz: May I have your attention please! Services – Citoid / Zotero. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210527T1000) [10:07:42] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29727/console" [puppet] - 10https://gerrit.wikimedia.org/r/695439 (https://phabricator.wikimedia.org/T264209) (owner: 10JMeybohm) [10:09:33] (03PS1) 10Gergő Tisza: fixLinkRecommendationData.php: also fix search index for old DB entries [extensions/GrowthExperiments] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/695837 (https://phabricator.wikimedia.org/T283606) [10:10:12] (03PS1) 10Gergő Tisza: fixLinkRecommendationData.php: also fix search index for old DB entries [extensions/GrowthExperiments] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/695838 (https://phabricator.wikimedia.org/T283606) [10:12:55] urbanecm: do you happen to be around? [10:13:04] apergos: somehow [10:13:05] what's up [10:13:11] I have some consulting questions on the upcoming deployment backport window [10:13:20] yes? [10:13:24] we have no trainees for the eu version btw but there is one for the US one! [10:13:42] so, there are some patches in the window with multiple files and we will have the usual issue of [10:13:48] 'what's the right order' [10:14:35] how should this be handled? [10:14:45] i see only one patch in the US window (the one that's 13 hours from now), https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/693951, which changes only one file [10:14:48] am i missing something? [10:14:52] ah no, for the eu one [10:15:23] sorry, even though there are no trainees I'll still show up [10:15:45] so, going through the backports for EU window...
[10:15:54] ...https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikibaseMediaInfo/+/696039 cannot be backported easily at all [10:16:03] (it has i18n changes, and that requires full scap) [10:16:06] yeah, there's like 4 of them [10:16:11] (03CR) 10Jbond: [V: 03+1 C: 03+2] apereo_cas: rename config properties [puppet] - 10https://gerrit.wikimedia.org/r/661713 (https://phabricator.wikimedia.org/T273867) (owner: 10Jbond) [10:16:19] ok, well that's... the way it is then. good to know [10:16:26] same with https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MediaSearch/+/696042 [10:16:41] yep [10:16:45] leaving us with 2 [10:17:07] (03PS1) 10MMandere: prometheus: Add dependency between varnish exporter and varnish service [puppet] - 10https://gerrit.wikimedia.org/r/696282 (https://phabricator.wikimedia.org/T283660) [10:17:42] `Add Link: Prevent double-opening of the post-edit dialog` is frontend, where it really doesn't matter (as it has some cache on frontend anyway, so the order in which the files arrive usually doesn't matter much, if they arrive at almost the same time) [10:17:55] good good [10:18:06] and all the other patches seem to change only one file [10:18:14] I swear I'll write these guidelines down someplace so people don't have to fumble around in the dark every time [10:18:27] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/695833 [10:18:52] that's a couple files.
ah but one is js and so front end cache and etc [10:18:52] ok [10:18:56] yeah [10:19:09] all righty then [10:19:20] for the patch you linked the order shouldn't matter, as the patch actually does two things: changes how the JS var is determined, and changes how it is used [10:20:47] btw, https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikibaseMediaInfo/+/696039 wouldn't be backportable even if i18n rebuilds weren't an issue [10:21:11] if you sync the special page first, it'll complain that it doesn't have i18n info for the new name [10:21:17] if you sync i18n first, it'll do the same [10:21:26] same with extension.json [10:21:32] so what happens to those two patchsets then? [10:21:56] the backporter needs to veto them and ask the requesting developer to submit different patches, that can be backported safely [10:21:58] do we tell them 'wait for the train'? [10:22:02] ah ha [10:22:04] or that [10:22:07] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. 
https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [10:22:21] i'll -1 them with those comments [10:22:26] thank you very much [10:23:13] but for i18n changes, we usually tell them "wait for train" unless it's really really needed for some reason [10:23:19] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dborch1001.wikimedia.org with reason: Rebuilding db2094:s8 from db2082 12:19:41 i thought also i might directly move pc1010 to pc2, so that it'll have a few days of pc2 cache available when we make it pc2 primary next week [10:23:20] (rebuilding cache can take like 40 minutes) [10:23:20] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dborch1001.wikimedia.org with reason: Rebuilding db2094:s8 from db2082 12:19:41 i thought also i might directly move pc1010 to pc2, so that it'll have a few days of pc2 cache available when we make it pc2 primary next week [10:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:30] well crap. [10:23:33] which is most of the window right there....
[10:23:35] yeah [10:23:43] yeah sorry kormat [10:23:45] i guess that could have been a much worse mispaste to send to the SAL forever [10:23:46] now we know I guess [10:24:13] all kinds of bizarre things have gotten into the SAL in the past though :-) [10:24:24] hehe [10:26:59] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2082.codfw.wmnet with reason: Rebuilding db2094:s8 from db2082 T283793 [10:26:59] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2082.codfw.wmnet with reason: Rebuilding db2094:s8 from db2082 T283793 [10:27:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:03] T283793: db2094:3318 (sanitarium on codfw) needs recloning - https://phabricator.wikimedia.org/T283793 [10:27:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:04] (03CR) 10Urbanecm: [C: 04-1] "This patch cannot be backported, because it changes i18n, which requires a full scap sync-world (which can easily take over 40 minutes, ie" [extensions/WikibaseMediaInfo] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/696039 (https://phabricator.wikimedia.org/T265939) (owner: 10Matthias Mullie) [10:29:23] (03CR) 10Urbanecm: "This patch cannot be backported, because it changes i18n, which requires a full scap sync-world (which can easily take over 40 minutes, ie" [extensions/MediaSearch] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/696042 (https://phabricator.wikimedia.org/T265939) (owner: 10Matthias Mullie) [10:29:27] (03CR) 10Urbanecm: [C: 04-1] Rename to MediaSearch & activate preferences & hooks [extensions/MediaSearch] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/696042 (https://phabricator.wikimedia.org/T265939) (owner: 10Matthias Mullie) [10:29:30] apergos: ^^ [10:29:52] noted! 
[10:33:09] (03Abandoned) 10Jbond: puppetdb: add site specific cnames for puppetdb [dns] - 10https://gerrit.wikimedia.org/r/693159 (https://phabricator.wikimedia.org/T283185) (owner: 10Jbond) [10:34:26] (03PS1) 10Kormat: Revert "db-eqiad.php: Set pc1010 as pc1 primary." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/695839 [10:34:32] jouncebot: now [10:34:32] For the next 0 hour(s) and 25 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210527T1000) [10:34:37] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [10:35:51] 10SRE, 10CFSSL-PKI, 10Patch-For-Review: Investigate Check for expired certificates debmonitor - https://phabricator.wikimedia.org/T283185 (10jbond) 05Open→03Resolved Closing, it was decided to remove this check as there are too many variables to make it useful, further we already have expiry checking for... [10:40:44] (03CR) 10Jcrespo: "I will deploy this now. I haven't recovered s7 and s8 yet, but I did with s2 from yesterday backups. 
Will ping you when they are ready to " [puppet] - 10https://gerrit.wikimedia.org/r/696027 (https://phabricator.wikimedia.org/T280979) (owner: 10Jcrespo) [10:41:44] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Move s2 from db2098 to db2097, reimage db2098 to buster [puppet] - 10https://gerrit.wikimedia.org/r/696027 (https://phabricator.wikimedia.org/T280979) (owner: 10Jcrespo) [10:52:38] (03PS1) 10Effie Mouzeli: Add redis password for mw:nutcracker:redis_password [labs/private] - 10https://gerrit.wikimedia.org/r/696309 [10:53:10] (03CR) 10Matthias Mullie: "> Patch Set 1:" [extensions/MediaSearch] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/696042 (https://phabricator.wikimedia.org/T265939) (owner: 10Matthias Mullie) [10:57:43] urbanecm: apergos: I abandoned those patches & removed from this backports window - expect we'll wait for train [10:57:59] ack, thanks matthiasmullie :) [10:59:22] (03PS10) 10Effie Mouzeli: (WIP) mwdebug: add helmfile configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/693875 [10:59:53] thanks! [11:00:04] Amir1, Lucas_WMDE, apergos, and duesen: #bothumor My software never has bugs. It just develops random features. Rise for EU Backport and Config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210527T1100). [11:00:05] tgr and WMDE-Fisch: A patch you scheduled for EU Backport and Config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:12] o/ [11:00:14] \o [11:00:23] hello hello! [11:00:29] wow that’s a lot of patches [11:00:41] two branches, not as bad as it might seem [11:00:41] yeah probably too many to get them all through [11:00:52] and different extensions [11:00:53] :-) [11:01:07] lets see [11:01:25] o/ I have a ton of patches but they don't require much individual work. [11:01:30] can self-serve them. [11:01:43] self-serve it is. [11:02:03] anyone here to be trained btw?
[11:02:11] otherwise go ahead and self-serve [11:02:12] no one signed up on the board [11:02:25] so I'm not even in the hangout. there will be someone for the US slot however! [11:04:40] a reminder matthiasmullie, no train next week. [11:05:36] apergos: yeah... guess it'll have to wait another week then :p [11:05:46] :-) [11:10:29] I'll probably add some more last-minute patches - we are trying to get a feature ready for deployment. Just ignore that block of patches, I'll deploy it. (The train is two hours from now so there is no risk of running into their slot.) [11:13:23] please make sure you leave time for WMDE-Fisch's patches to go before you add any new ones, and please do actually add the new ones to the calendar, tgr [11:13:42] 😬 [11:13:55] oh, sorry, didn't realize you were waiting for me [11:14:17] you're first in the list... [11:14:19] please go first, I'll do it when everything else is done [11:14:34] yeah, didn't notice some of the patches were removed [11:14:39] WMDE-Fisch: do you need someone to do the deploys or are you self serve? [11:14:43] I can't self-serve today [11:14:47] okay [11:14:58] I can do those as well if you prefer [11:15:07] feel free [11:15:42] doing [11:16:38] (03CR) 10Gergő Tisza: [C: 03+2] Don't update backButton visibility if not set [extensions/VisualEditor] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/695832 (https://phabricator.wikimedia.org/T283511) (owner: 10WMDE-Fisch) [11:16:39] (03CR) 10Gergő Tisza: [C: 03+2] Don't update backButton visibility if not set [extensions/VisualEditor] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/695831 (https://phabricator.wikimedia.org/T283511) (owner: 10WMDE-Fisch) [11:16:54] (03PS6) 10Jbond: C:package_builder: Add Script for building debian packages from git [puppet] - 10https://gerrit.wikimedia.org/r/681445 [11:17:47] at least 15 mins. all right, zuul, get a move on...
[11:18:25] (03CR) 10jerkins-bot: [V: 04-1] C:package_builder: Add Script for building debian packages from git [puppet] - 10https://gerrit.wikimedia.org/r/681445 (owner: 10Jbond) [11:18:47] (03CR) 10Muehlenhoff: "> Patch Set 5:" [puppet] - 10https://gerrit.wikimedia.org/r/694332 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [11:26:37] (03CR) 10David Caro: [C: 03+2] wmcs.ceph: add cookbook to upgrade all osds [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/695202 (https://phabricator.wikimedia.org/T280641) (owner: 10David Caro) [11:26:42] (03CR) 10David Caro: [C: 03+2] wmcs.ceph: Added mon upgrade cookbooks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/695201 (owner: 10David Caro) [11:26:51] (03CR) 10jerkins-bot: [V: 04-1] wmcs.ceph: Added mon upgrade cookbooks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/695201 (owner: 10David Caro) [11:26:53] (03CR) 10jerkins-bot: [V: 04-1] wmcs.ceph: add cookbook to upgrade all osds [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/695202 (https://phabricator.wikimedia.org/T280641) (owner: 10David Caro) [11:32:23] (03PS1) 10David Caro: ceph: don't log to file as syslog works already [puppet] - 10https://gerrit.wikimedia.org/r/696330 (https://phabricator.wikimedia.org/T281247) [11:35:53] (03PS1) 10David Caro: live_upgrade_ussury_to_victoria: reword log message [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/696331 [11:38:02] (03CR) 10Gergő Tisza: [C: 03+2] Add Link: Prevent double-opening of the post-edit dialog [extensions/GrowthExperiments] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/695436 (https://phabricator.wikimedia.org/T283120) (owner: 10MewOphaswongse) [11:38:05] (03CR) 10Gergő Tisza: [C: 03+2] Add Link: Prevent double-opening of the post-edit dialog [extensions/GrowthExperiments] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/695437 (https://phabricator.wikimedia.org/T283120) (owner: 10MewOphaswongse) [11:38:08] (03CR) 10Gergő Tisza: [C: 03+2] Always 
delete from search index in AddLinkSubmissionHandler [extensions/GrowthExperiments] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/695478 (https://phabricator.wikimedia.org/T283606) (owner: 10MewOphaswongse) [11:38:11] (03CR) 10Gergő Tisza: [C: 03+2] Always delete from search index in AddLinkSubmissionHandler [extensions/GrowthExperiments] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/695479 (https://phabricator.wikimedia.org/T283606) (owner: 10MewOphaswongse) [11:41:56] 10SRE: Mapping of servers to stakeholders - https://phabricator.wikimedia.org/T216088 (10ayounsi) > Does this sound like an accurate description of the various angles here? Accurate enough to stall the task for 2 years :) Following John progress on https://gerrit.wikimedia.org/r/c/operations/puppet/+/695230/ he... [11:42:45] (03Merged) 10jenkins-bot: Don't update backButton visibility if not set [extensions/VisualEditor] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/695832 (https://phabricator.wikimedia.org/T283511) (owner: 10WMDE-Fisch) [11:42:48] (03CR) 10jerkins-bot: [V: 04-1] Don't update backButton visibility if not set [extensions/VisualEditor] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/695831 (https://phabricator.wikimedia.org/T283511) (owner: 10WMDE-Fisch) [11:44:18] tgr: the first of those two patches failed to merge, jenkins -1 [11:44:21] see ^^ [11:44:21] `npm WARN tar ENOSPC: no space left on device` [11:44:26] oh ffs [11:44:31] hmpf [11:44:56] (03CR) 10Gergő Tisza: [C: 03+2] Don't update backButton visibility if not set [extensions/VisualEditor] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/695831 (https://phabricator.wikimedia.org/T283511) (owner: 10WMDE-Fisch) [11:45:01] I wonder who is around that can poke. hashar? [11:45:18] I've seen that error yesterday, too. It's sporadic but should probably be looked into.
[11:45:54] (03PS2) 10David Caro: wmcs.ceph: Added mon upgrade cookbooks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/695201 [11:45:56] (03PS2) 10David Caro: wmcs.ceph: add cookbook to upgrade all osds [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/695202 (https://phabricator.wikimedia.org/T280641) [11:46:23] I have asked in -releng if anyone can look into it [11:46:52] yeah I also saw that somewhere the other day [11:46:58] 90% sure it is one of the caches filled up with garbage somehow [11:47:29] I think we can reuse https://phabricator.wikimedia.org/T283497 [11:47:30] WMDE-Fisch: it's on mwdebug1001 (for wmf.6) [11:47:36] +1 [11:49:15] tgr: seems good, no error, go on [11:50:35] (03CR) 10David Caro: [C: 03+2] wmcs.ceph: Added mon upgrade cookbooks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/695201 (owner: 10David Caro) [11:50:39] (03CR) 10David Caro: [C: 03+2] live_upgrade_ussury_to_victoria: reword log message [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/696331 (owner: 10David Caro) [11:51:03] !log tgr@deploy1002 Synchronized php-1.37.0-wmf.6/extensions/VisualEditor/modules/ve-mw/ui/dialogs/ve.ui.MWTransclusionDialog.js: Backport: [[gerrit:695832|Don't update backButton visibility if not set (T283511)]] (duration: 01m 06s) [11:51:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:07] T283511: TypeError: backButton is undefined ve.ui.MWTransclusionDialog.prototype.updateActionSet - https://phabricator.wikimedia.org/T283511 [11:54:10] (03Merged) 10jenkins-bot: live_upgrade_ussury_to_victoria: reword log message [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/696331 (owner: 10David Caro) [11:54:12] (03Merged) 10jenkins-bot: wmcs.ceph: Added mon upgrade cookbooks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/695201 (owner: 10David Caro) [11:55:55] (03PS1) 10Ladsgroup: statistics: Migrate cron to systemd timer in rsync_job [puppet] - 10https://gerrit.wikimedia.org/r/696348
(https://phabricator.wikimedia.org/T273673) [11:56:23] (03CR) 10jerkins-bot: [V: 04-1] statistics: Migrate cron to systemd timer in rsync_job [puppet] - 10https://gerrit.wikimedia.org/r/696348 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [11:57:22] (03Merged) 10jenkins-bot: wmcs.ceph: add cookbook to upgrade all osds [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/695202 (https://phabricator.wikimedia.org/T280641) (owner: 10David Caro) [11:58:38] (03PS2) 10Ladsgroup: statistics: Migrate cron to systemd timer in rsync_job [puppet] - 10https://gerrit.wikimedia.org/r/696348 (https://phabricator.wikimedia.org/T273673) [11:59:06] (03CR) 10jerkins-bot: [V: 04-1] statistics: Migrate cron to systemd timer in rsync_job [puppet] - 10https://gerrit.wikimedia.org/r/696348 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [11:59:59] 10SRE: Create cookbook to add a node to a Ganeti cluster - https://phabricator.wikimedia.org/T274527 (10MoritzMuehlenhoff) [12:00:43] tgr: are you ok to continue here? the official window is done and I am just signed up for the one hour [12:00:48] (03PS3) 10Ladsgroup: statistics: Migrate cron to systemd timer in rsync_job [puppet] - 10https://gerrit.wikimedia.org/r/696348 (https://phabricator.wikimedia.org/T273673) [12:01:28] apergos: yeah, there is another hour till the next window, I'll use that. [12:01:41] let's hope Jenkins cooperates. 
[12:01:45] (03Merged) 10jenkins-bot: Add Link: Prevent double-opening of the post-edit dialog [extensions/GrowthExperiments] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/695436 (https://phabricator.wikimedia.org/T283120) (owner: 10MewOphaswongse) [12:01:48] (03Merged) 10jenkins-bot: Add Link: Prevent double-opening of the post-edit dialog [extensions/GrowthExperiments] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/695437 (https://phabricator.wikimedia.org/T283120) (owner: 10MewOphaswongse) [12:01:50] (03Merged) 10jenkins-bot: Always delete from search index in AddLinkSubmissionHandler [extensions/GrowthExperiments] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/695478 (https://phabricator.wikimedia.org/T283606) (owner: 10MewOphaswongse) [12:01:51] crossing fingers! [12:02:17] (03CR) 10Volans: [V: 03+1] Add python-build-bullseye image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/685462 (owner: 10Volans) [12:02:19] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/696348 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [12:03:06] (03CR) 10Ayounsi: "The overall approach seems to be the right way to me." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/695236 (owner: 10Jbond) [12:05:38] tgr: I'm around for the next 25 minutes (then meetings) if you need some manual QA assistance [12:05:39] (03CR) 10Muehlenhoff: [C: 03+2] Remove old compat code [puppet] - 10https://gerrit.wikimedia.org/r/695940 (owner: 10Muehlenhoff) [12:06:02] (03CR) 10Jbond: (Test): Example PR demonstrating the contacts profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/695236 (owner: 10Jbond) [12:06:08] ack, thx [12:07:34] (03PS1) 10Gergő Tisza: Avoid session loading when loading task types in help panel RL data [extensions/GrowthExperiments] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/695841 (https://phabricator.wikimedia.org/T282800) [12:08:13] (03PS1) 10Gergő Tisza: Avoid session loading when loading task types in help panel RL data [extensions/GrowthExperiments] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/695842 (https://phabricator.wikimedia.org/T282800) [12:09:24] (03PS1) 10Gergő Tisza: Fix Ie9a1018c198 for external cluster [extensions/GrowthExperiments] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/695843 (https://phabricator.wikimedia.org/T283606) [12:10:01] (03PS1) 10Gergő Tisza: Fix Ie9a1018c198 for external cluster [extensions/GrowthExperiments] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/695844 (https://phabricator.wikimedia.org/T283606) [12:10:37] (03CR) 10Ladsgroup: "This doesn't seem to be used in production, should we just drop the whole thing? https://puppet-compiler.wmflabs.org/compiler1002/29728/" [puppet] - 10https://gerrit.wikimedia.org/r/696348 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [12:10:48] (03CR) 10Ayounsi: [C: 03+1] "Chatted with Riccardo and LGTM!" 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/695399 (https://phabricator.wikimedia.org/T276760) (owner: 10Volans) [12:11:14] (03PS1) 10Muehlenhoff: Cookbook to add a new node to a Ganeti cluster (WIP) [cookbooks] - 10https://gerrit.wikimedia.org/r/696377 (https://phabricator.wikimedia.org/T274527) [12:12:25] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_eventstreams_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:12:50] (03CR) 10Elukey: [V: 03+1] "Left some minor comments but I think it is a great work, the airflow instance define is a little dense to visually parse/understand but it" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/694514 (https://phabricator.wikimedia.org/T272973) (owner: 10Ottomata) [12:14:11] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:15:27] (03CR) 10jerkins-bot: [V: 04-1] Cookbook to add a new node to a Ganeti cluster (WIP) [cookbooks] - 10https://gerrit.wikimedia.org/r/696377 (https://phabricator.wikimedia.org/T274527) (owner: 10Muehlenhoff) [12:17:57] (03PS1) 10Jbond: concat: Add puppetlabs-concat module [puppet] - 10https://gerrit.wikimedia.org/r/696380 [12:19:16] (03CR) 10jerkins-bot: [V: 04-1] concat: Add puppetlabs-concat module [puppet] - 10https://gerrit.wikimedia.org/r/696380 (owner: 10Jbond) [12:20:30] (03PS2) 10Jbond: concat: Add puppetlabs-concat module [puppet] - 10https://gerrit.wikimedia.org/r/696380 [12:21:19] (03Merged) 10jenkins-bot: Always delete from search index in AddLinkSubmissionHandler [extensions/GrowthExperiments] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/695479 (https://phabricator.wikimedia.org/T283606) (owner: 10MewOphaswongse) [12:21:21] (03Merged) 10jenkins-bot: Don't update backButton visibility if not set [extensions/VisualEditor] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/695831 (https://phabricator.wikimedia.org/T283511) (owner: 10WMDE-Fisch) [12:21:25] (03CR) 10jerkins-bot: [V: 04-1] concat: Add puppetlabs-concat module [puppet] - 10https://gerrit.wikimedia.org/r/696380 (owner: 10Jbond) [12:23:01] WMDE-Fisch: the wmf.7 patch is on mwdebug1001. [12:23:06] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Add new golang 1.15 image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/695173 (owner: 10Elukey) [12:23:28] tgr all good, go on [12:23:29] and thanks! [12:23:47] Sorry for the wait. I didn't count on a CI failure; in hindsight I should have held the other patches until yours was fully merged.
[12:25:19] !log tgr@deploy1002 Synchronized php-1.37.0-wmf.7/extensions/VisualEditor/modules/ve-mw/ui/dialogs/ve.ui.MWTransclusionDialog.js: Backport: [[gerrit:695831|Don't update backButton visibility if not set (T283511)]] (duration: 01m 06s) [12:25:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:32] T283511: TypeError: backButton is undefined ve.ui.MWTransclusionDialog.prototype.updateActionSet - https://phabricator.wikimedia.org/T283511 [12:30:39] (03PS2) 10Muehlenhoff: Cookbook to add a new node to a Ganeti cluster (WIP) [cookbooks] - 10https://gerrit.wikimedia.org/r/696377 (https://phabricator.wikimedia.org/T274527) [12:31:08] 10SRE, 10SRE-Access-Requests: access to analytics data for wdqs for jmixter - https://phabricator.wikimedia.org/T283632 (10CBogen) >>! In T283632#7118358, @Marostegui wrote: > @CBogen we'd need the expiry date of the contract. We'd also need to verify the ssh key, this could be done via video call with you or... [12:31:51] (03PS1) 10Cathal Mooney: Remove Anycast IPv4 ranges from bgp_out policy in eqord [homer/public] - 10https://gerrit.wikimedia.org/r/696383 [12:32:29] 10SRE, 10SRE-Access-Requests: access to analytics data for wdqs for jmixter - https://phabricator.wikimedia.org/T283632 (10Marostegui) Thanks! Having @jmixter posting his ssh key on his wikitech userpage is also fine. 
(I am out tomorrow and Monday too :-) ) [12:33:54] (03CR) 10Ayounsi: [C: 03+1] Remove Anycast IPv4 ranges from bgp_out policy in eqord [homer/public] - 10https://gerrit.wikimedia.org/r/696383 (owner: 10Cathal Mooney) [12:34:07] (03CR) 10jerkins-bot: [V: 04-1] Cookbook to add a new node to a Ganeti cluster (WIP) [cookbooks] - 10https://gerrit.wikimedia.org/r/696377 (https://phabricator.wikimedia.org/T274527) (owner: 10Muehlenhoff) [12:35:52] (03CR) 10Cathal Mooney: [C: 03+2] Remove Anycast IPv4 ranges from bgp_out policy in eqord [homer/public] - 10https://gerrit.wikimedia.org/r/696383 (owner: 10Cathal Mooney) [12:36:41] (03PS3) 10Jbond: concat: Add puppetlabs-concat module [puppet] - 10https://gerrit.wikimedia.org/r/696380 [12:36:55] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add new golang 1.15 image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/695173 (owner: 10Elukey) [12:37:09] (03Merged) 10jenkins-bot: Remove Anycast IPv4 ranges from bgp_out policy in eqord [homer/public] - 10https://gerrit.wikimedia.org/r/696383 (owner: 10Cathal Mooney) [12:37:12] (03CR) 10jerkins-bot: [V: 04-1] concat: Add puppetlabs-concat module [puppet] - 10https://gerrit.wikimedia.org/r/696380 (owner: 10Jbond) [12:37:27] (03PS1) 10Ema: icinga: remove Grafana alerts for Traffic/Netops [puppet] - 10https://gerrit.wikimedia.org/r/696384 (https://phabricator.wikimedia.org/T282806) [12:39:17] !log tgr@deploy1002 Synchronized php-1.37.0-wmf.6/extensions/GrowthExperiments/: Backport: [[gerrit:695436|Add Link: Prevent double-opening of the post-edit dialog (T283120)]] [[gerrit:695437|Add Link: Prevent double-opening of the post-edit dialog (T283120)]] (duration: 01m 06s) [12:39:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:24] T283120: [wmf.6] Add link: post-edit dialog displayed twice - https://phabricator.wikimedia.org/T283120 [12:39:32] (03PS4) 10Jbond: concat: Add puppetlabs-concat module [puppet] - 
10https://gerrit.wikimedia.org/r/696380 [12:40:30] !log cr2-eqord: Gerrit 696383: Removing IPv4 Anycast ranges from bgp_out policy. [12:40:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:42] !log tgr@deploy1002 Synchronized php-1.37.0-wmf.7/extensions/GrowthExperiments/: Backport: [[gerrit:695437|Add Link: Prevent double-opening of the post-edit dialog (T283120)]] [[gerrit:695479|Always delete from search index in AddLinkSubmissionHandler (T283606)]] (duration: 01m 06s) [12:40:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:49] T283606: Add a link: too many articles have no suggestions upon arrival - https://phabricator.wikimedia.org/T283606 [12:40:54] (03PS13) 10Jbond: profile::contacts: add a profile and define for adding contact metadata [puppet] - 10https://gerrit.wikimedia.org/r/695230 [12:42:10] (03PS1) 10Urbanecm: Enable Growth's community configuration on the pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/696390 (https://phabricator.wikimedia.org/T283809) [12:45:23] (03CR) 10Giuseppe Lavagetto: "LGTM but remove the whitespace." (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/696309 (owner: 10Effie Mouzeli) [12:47:13] !log EU deploys done [12:47:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:32] jouncebot: now [12:47:32] No deployments scheduled for the next 0 hour(s) and 12 minute(s) [12:47:37] jouncebot: next [12:47:37] In 0 hour(s) and 12 minute(s): MediaWiki train - American+European Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210527T1300) [12:47:40] (03PS2) 10Gergő Tisza: GrowthExperiments: Enable Add Links for 50% of new users and all old ones [mediawiki-config] - 10https://gerrit.wikimedia.org/r/695364 (https://phabricator.wikimedia.org/T277356) [12:48:12] (03CR) 10Kormat: [C: 03+2] Revert "db-eqiad.php: Set pc1010 as pc1 primary." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/695839 (owner: 10Kormat) [12:48:28] (03CR) 10Giuseppe Lavagetto: (WIP) mwdebug: add helmfile configuration (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/693875 (owner: 10Effie Mouzeli) [12:49:09] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Set pc1010 as pc1 primary." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/695839 (owner: 10Kormat) [12:49:12] 10SRE, 10netops: Netbox has incorrect email address for GTT - https://phabricator.wikimedia.org/T246564 (10ayounsi) 05Open→03Resolved Thanks all set and updated. noc@gtt.net is a valid email according do their website (so maybe it was a temporary issue?), and I added their 2nd level escalation email as we... [12:50:36] (03CR) 10Giuseppe Lavagetto: (WIP) mwdebug: add helmfile configuration (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/693875 (owner: 10Effie Mouzeli) [12:50:39] !log kormat@deploy1002 Synchronized wmf-config/db-eqiad.php: Repool pc1007 as pc1 master T282761 (duration: 01m 04s) [12:50:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:43] T282761: purgeParserCache.php should not take over 24 hours for its daily run - https://phabricator.wikimedia.org/T282761 [12:53:21] (03CR) 10Effie Mouzeli: (WIP) mwdebug: add helmfile configuration (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/693875 (owner: 10Effie Mouzeli) [12:54:14] (03CR) 10Volans: [V: 03+1 C: 03+2] script interface automation: fix re-assign of IP [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/695399 (https://phabricator.wikimedia.org/T276760) (owner: 10Volans) [12:55:08] (03PS1) 10Kormat: pc1010: Move to pc2. 
[puppet] - 10https://gerrit.wikimedia.org/r/696398 (https://phabricator.wikimedia.org/T282761) [12:55:42] (03Merged) 10jenkins-bot: script interface automation: fix re-assign of IP [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/695399 (https://phabricator.wikimedia.org/T276760) (owner: 10Volans) [12:55:43] !log T283606: running mwscript extensions/GrowthExperiments/maintenance/fixLinkRecommendationData.php --wiki={ar,bn,cs,vi}wiki --verbose --search-index [12:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:51] (03PS1) 10Hnowlan: cassandra: drop support for 2.1 in metrics. Fix version of metrics collector for cassandra 3 [puppet] - 10https://gerrit.wikimedia.org/r/696399 (https://phabricator.wikimedia.org/T275353) [12:55:52] T283606: Add a link: too many articles have no suggestions upon arrival - https://phabricator.wikimedia.org/T283606 [12:56:34] (03CR) 10jerkins-bot: [V: 04-1] cassandra: drop support for 2.1 in metrics. Fix version of metrics collector for cassandra 3 [puppet] - 10https://gerrit.wikimedia.org/r/696399 (https://phabricator.wikimedia.org/T275353) (owner: 10Hnowlan) [12:56:36] (03PS2) 10Kormat: pc1010: Move to pc2. [puppet] - 10https://gerrit.wikimedia.org/r/696398 (https://phabricator.wikimedia.org/T282761) [12:57:22] !log T283606: running mwscript extensions/GrowthExperiments/maintenance/fixLinkRecommendationData.php --wiki={ar,bn,cs,vi}wiki --verbose --search-index with gerrit:696307 applied [12:57:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:52] (03PS2) 10Hnowlan: cassandra: drop support for 2.1 in metrics. 
Fix collector version [puppet] - 10https://gerrit.wikimedia.org/r/696399 (https://phabricator.wikimedia.org/T275353) [12:58:38] (03PS1) 10JMeybohm: docker_registry_ha: Enable local nginx cache by default [puppet] - 10https://gerrit.wikimedia.org/r/696403 (https://phabricator.wikimedia.org/T256762) [12:59:06] (03PS1) 10Volans: script interface automation: fix log message [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/696404 (https://phabricator.wikimedia.org/T276760) [12:59:19] 10SRE, 10netops: Netbox has incorrect email address for GTT - https://phabricator.wikimedia.org/T246564 (10ayounsi) Actually looks like they don't want emails. So I left a note in Netbox saying that it's phone or portal only. [13:00:04] twentyafterfour and hashar: Dear deployers, time to do the MediaWiki train - American+European Version (secondary timeslot) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210527T1300). [13:00:40] (03CR) 10Marostegui: [C: 03+1] pc1010: Move to pc2. [puppet] - 10https://gerrit.wikimedia.org/r/696398 (https://phabricator.wikimedia.org/T282761) (owner: 10Kormat) [13:01:08] (03CR) 10Kormat: [C: 03+2] pc1010: Move to pc2. 
[puppet] - 10https://gerrit.wikimedia.org/r/696398 (https://phabricator.wikimedia.org/T282761) (owner: 10Kormat) [13:02:04] (03CR) 10Volans: [C: 03+2] script interface automation: fix log message [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/696404 (https://phabricator.wikimedia.org/T276760) (owner: 10Volans) [13:02:45] (03Merged) 10jenkins-bot: script interface automation: fix log message [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/696404 (https://phabricator.wikimedia.org/T276760) (owner: 10Volans) [13:06:30] (03PS14) 10Jbond: profile::contacts: add a profile and define for adding contact metadata [puppet] - 10https://gerrit.wikimedia.org/r/695230 [13:06:39] (03PS12) 10Jbond: (Test): Example PR demonstrating the contacts profile [puppet] - 10https://gerrit.wikimedia.org/r/695236 [13:07:16] (03PS13) 10Jbond: (Test): Example PR demonstrating the contacts profile [puppet] - 10https://gerrit.wikimedia.org/r/695236 [13:08:48] (03CR) 10jerkins-bot: [V: 04-1] profile::contacts: add a profile and define for adding contact metadata [puppet] - 10https://gerrit.wikimedia.org/r/695230 (owner: 10Jbond) [13:09:16] (03CR) 10jerkins-bot: [V: 04-1] (Test): Example PR demonstrating the contacts profile [puppet] - 10https://gerrit.wikimedia.org/r/695236 (owner: 10Jbond) [13:09:55] (03CR) 10Jbond: (Test): Example PR demonstrating the contacts profile (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/695236 (owner: 10Jbond) [13:11:23] PROBLEM - MariaDB Replica Lag: pc1 on pc2007 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 417.24 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:12:48] marostegui: hurm. pc2007 is falling behind on replication ^ [13:13:08] kormat: applying the optimizes I guess? [13:13:12] from pc1007? 
[13:13:36] marostegui: those all finished yesterday [13:14:12] (03PS1) 10Hnowlan: maps: make maps2009 a buster imposm-based master in codfw [puppet] - 10https://gerrit.wikimedia.org/r/696418 (https://phabricator.wikimedia.org/T269582) [13:14:55] (03PS14) 10Jbond: (Test): Example PR demonstrating the contacts profile [puppet] - 10https://gerrit.wikimedia.org/r/695236 [13:15:01] (03PS15) 10Jbond: profile::contacts: add a profile and define for adding contact metadata [puppet] - 10https://gerrit.wikimedia.org/r/695230 [13:15:20] (03PS15) 10Jbond: (Test): Example PR demonstrating the contacts profile [puppet] - 10https://gerrit.wikimedia.org/r/695236 [13:15:32] kormat: ah pc1007 was repooled? [13:15:33] ACKNOWLEDGEMENT - MariaDB Replica Lag: pc1 on pc2007 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 481.44 seconds Kormat Investigating. https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:15:34] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29730/console" [puppet] - 10https://gerrit.wikimedia.org/r/696418 (https://phabricator.wikimedia.org/T269582) (owner: 10Hnowlan) [13:15:44] marostegui: yeah, about 25 mins ago [13:16:08] kormat: we've seen pc on codfw lagging when the cache is not as warm as normal, let's give it some time [13:16:16] huh, ok [13:17:29] kormat: https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=1&orgId=1&var-server=pc2007&var-port=9104&from=now-24h&to=now [13:17:39] kormat: https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=12&orgId=1&var-server=pc2007&var-port=9104&from=now-24h&to=now [13:18:02] (03PS16) 10Jbond: profile::contacts: add a profile and define for adding contact metadata [puppet] - 10https://gerrit.wikimedia.org/r/695230 [13:18:22] (03PS16) 10Jbond: (Test): Example PR demonstrating the contacts profile [puppet] - 10https://gerrit.wikimedia.org/r/695236 [13:19:01] kormat: I cannot see anything wrong HW 
side, let's wait for the innodb buffer pool size to recover [13:19:44] marostegui: alright. it confuses me that writes would be slow due to lack of buffering, but mariadb confusing me is definitely not news :) [13:22:22] (03CR) 10Volans: Cookbook to add a new node to a Ganeti cluster (WIP) (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/696377 (https://phabricator.wikimedia.org/T274527) (owner: 10Muehlenhoff) [13:23:05] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29733/console" [puppet] - 10https://gerrit.wikimedia.org/r/695236 (owner: 10Jbond) [13:23:50] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/694523 (https://phabricator.wikimedia.org/T268225) (owner: 10Hashar) [13:25:36] (03CR) 10Muehlenhoff: gerrit: switch to Java 11 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/694524 (https://phabricator.wikimedia.org/T268225) (owner: 10Hashar) [13:26:00] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29734/console" [puppet] - 10https://gerrit.wikimedia.org/r/696399 (https://phabricator.wikimedia.org/T275353) (owner: 10Hnowlan) [13:27:43] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29735/console" [puppet] - 10https://gerrit.wikimedia.org/r/696399 (https://phabricator.wikimedia.org/T275353) (owner: 10Hnowlan) [13:31:01] (03CR) 10Jbond: [C: 03+1] Add logout script for sretest [puppet] - 10https://gerrit.wikimedia.org/r/695203 (https://phabricator.wikimedia.org/T283242) (owner: 10Muehlenhoff) [13:34:50] (03CR) 10RLazarus: [C: 03+1] "Looks good! Small comments but feel free to merge without waiting for another round." 
(033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/695439 (https://phabricator.wikimedia.org/T264209) (owner: 10JMeybohm) [13:37:57] (03PS1) 10David Caro: unset_maintenance: don't set downtime on icinga [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/696446 [13:37:59] (03PS1) 10David Caro: cloudvirt.*maintenante: use a default cloudcontrol node [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/696447 [13:38:01] (03PS1) 10David Caro: cloudvirt.{drain|safe_reboot}: use default control node [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/696448 [13:42:15] RECOVERY - DPKG on cumin2001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [13:43:44] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29736/console" [puppet] - 10https://gerrit.wikimedia.org/r/696418 (https://phabricator.wikimedia.org/T269582) (owner: 10Hnowlan) [13:44:41] (03PS7) 10JMeybohm: httpbb: Allow tests to be templates [puppet] - 10https://gerrit.wikimedia.org/r/695439 (https://phabricator.wikimedia.org/T264209) [13:44:43] (03PS1) 10JMeybohm: httpbb: Test docker-registry catalog [puppet] - 10https://gerrit.wikimedia.org/r/696452 [13:45:11] (03PS1) 10David Caro: cloudvirt.*: adding sal messages to all the cookbooks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/696453 [13:45:17] (03CR) 10JMeybohm: "Thanks for the review!" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/695439 (https://phabricator.wikimedia.org/T264209) (owner: 10JMeybohm) [13:45:30] (03PS1) 10Kormat: db-repliation-tree: Display circular replication reasonably. 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/696454 (https://phabricator.wikimedia.org/T283239) [13:47:06] 10SRE, 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for schoenbaechler - https://phabricator.wikimedia.org/T283190 (10Ottomata) Approved. FYI to SRE: this is addition to the analytics-privatedata-users posix group without... [13:47:39] (03PS2) 10Kormat: db-repliation-tree: Display circular replication reasonably. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/696454 (https://phabricator.wikimedia.org/T283239) [13:48:12] 10SRE, 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for schoenbaechler - https://phabricator.wikimedia.org/T283190 (10Ottomata) @schoenbaechler can you get Lucy Blackwell to approve this access here on this ticket? [13:48:20] 10SRE, 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for schoenbaechler - https://phabricator.wikimedia.org/T283190 (10Marostegui) @Ottomata this ticket is assigned to you, will you take care of it? [13:49:17] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:50:39] 10SRE, 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for schoenbaechler - https://phabricator.wikimedia.org/T283190 (10Ottomata) a:05Ottomata→03None Ah oops, it was assigned to me because the process was not clear, fixed... 
[13:50:41] (03CR) 10JMeybohm: [C: 03+2] httpbb: Allow tests to be templates [puppet] - 10https://gerrit.wikimedia.org/r/695439 (https://phabricator.wikimedia.org/T264209) (owner: 10JMeybohm) [13:50:45] (03CR) 10JMeybohm: [C: 03+2] httpbb: Test docker-registry catalog [puppet] - 10https://gerrit.wikimedia.org/r/696452 (owner: 10JMeybohm) [13:52:01] (03PS5) 10Giuseppe Lavagetto: modules::conftool add safe-service-restart scap option [puppet] - 10https://gerrit.wikimedia.org/r/682141 (https://phabricator.wikimedia.org/T266055) (owner: 10Effie Mouzeli) [13:52:51] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={atlas_exporter,routinator} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:54:39] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:54:54] (03PS3) 10Kormat: db-repliation-tree: Display circular replication reasonably. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/696454 (https://phabricator.wikimedia.org/T283239) [13:55:22] 10SRE, 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for schoenbaechler - https://phabricator.wikimedia.org/T283190 (10Marostegui) p:05Triage→03Medium a:03Marostegui @schoenbaechler can you please confirm you've read an... 
[13:55:26] 10SRE, 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for schoenbaechler - https://phabricator.wikimedia.org/T283190 (10Marostegui) [14:00:17] (03CR) 10Effie Mouzeli: [C: 03+1] mediawiki: add egress policies to databases [deployment-charts] - 10https://gerrit.wikimedia.org/r/693871 (owner: 10Giuseppe Lavagetto) [14:00:42] 10SRE, 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for schoenbaechler - https://phabricator.wikimedia.org/T283190 (10Marostegui) >>! In T283190#7119409, @Ottomata wrote: > Ah oops, it was assigned to me because the process... [14:01:16] 10SRE, 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for schoenbaechler - https://phabricator.wikimedia.org/T283190 (10Marostegui) [14:14:26] !log bounce keyholder-agent on cumin2001 to drop homer key (now on 2002 only) [14:14:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:49] RECOVERY - Keyholder SSH agent on cumin2001 is OK: OK: Keyholder is armed with all configured keys. https://wikitech.wikimedia.org/wiki/Keyholder [14:16:37] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:17:53] 10SRE, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production shell for BUmeh - https://phabricator.wikimedia.org/T283648 (10Marostegui) a:03Marostegui [14:18:25] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:21:12] (03PS1) 10Ottomata: Add schoenbaechler to analytics-privatedata-users, no ssh [puppet] - 10https://gerrit.wikimedia.org/r/696467 (https://phabricator.wikimedia.org/T283190) [14:22:02] (03PS3) 10Muehlenhoff: Cookbook to add a new node to a Ganeti cluster (WIP) [cookbooks] - 10https://gerrit.wikimedia.org/r/696377 (https://phabricator.wikimedia.org/T274527) [14:24:02] (03PS1) 10Ema: Traffic team alerts [alerts] - 10https://gerrit.wikimedia.org/r/696468 (https://phabricator.wikimedia.org/T282806) [14:24:51] (03CR) 10jerkins-bot: [V: 04-1] Traffic team alerts [alerts] - 10https://gerrit.wikimedia.org/r/696468 (https://phabricator.wikimedia.org/T282806) (owner: 10Ema) [14:25:10] (03PS2) 10Ema: Traffic team alerts [alerts] - 10https://gerrit.wikimedia.org/r/696468 (https://phabricator.wikimedia.org/T282806) [14:25:58] (03CR) 10jerkins-bot: [V: 04-1] Cookbook to add a new node to a Ganeti cluster (WIP) [cookbooks] - 10https://gerrit.wikimedia.org/r/696377 (https://phabricator.wikimedia.org/T274527) (owner: 10Muehlenhoff) [14:26:00] (03CR) 10jerkins-bot: [V: 04-1] Traffic team alerts [alerts] - 10https://gerrit.wikimedia.org/r/696468 (https://phabricator.wikimedia.org/T282806) (owner: 10Ema) [14:29:28] (03CR) 10Ottomata: Airflow puppetization + airflow@analytics on an-test-coord1001 (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/694514 (https://phabricator.wikimedia.org/T272973) (owner: 10Ottomata) [14:31:10] (03PS4) 10Muehlenhoff: Cookbook to add a new node to a Ganeti cluster (WIP) [cookbooks] - 10https://gerrit.wikimedia.org/r/696377 (https://phabricator.wikimedia.org/T274527) [14:31:32] (03CR) 10Elukey: [V: 03+1 C: 03+1] Airflow puppetization + airflow@analytics on an-test-coord1001 [puppet] - 10https://gerrit.wikimedia.org/r/694514 (https://phabricator.wikimedia.org/T272973) (owner: 10Ottomata) [14:31:44] 
(03CR) 10Marostegui: [C: 04-2] "-2 cause it needs L3 signed and manager approval first. The patch looks good itself though. I will take care of merging it once it is read" [puppet] - 10https://gerrit.wikimedia.org/r/696467 (https://phabricator.wikimedia.org/T283190) (owner: 10Ottomata) [14:32:18] (03CR) 10Marostegui: [C: 04-2] "self reminder: user needs to be added to WMF ldap group" [puppet] - 10https://gerrit.wikimedia.org/r/696467 (https://phabricator.wikimedia.org/T283190) (owner: 10Ottomata) [14:32:44] (03CR) 10Ottomata: "> the airflow instance define is a little dense to visually parse/understand but it makes sense to have things packed in one place." [puppet] - 10https://gerrit.wikimedia.org/r/694514 (https://phabricator.wikimedia.org/T272973) (owner: 10Ottomata) [14:33:05] PROBLEM - PHP opcache health on mwdebug1001 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:33:59] (03CR) 10Jcrespo: [C: 04-1] "> Patch Set 5:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/694332 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [14:35:30] (03CR) 10jerkins-bot: [V: 04-1] Cookbook to add a new node to a Ganeti cluster (WIP) [cookbooks] - 10https://gerrit.wikimedia.org/r/696377 (https://phabricator.wikimedia.org/T274527) (owner: 10Muehlenhoff) [14:35:38] (03CR) 10Muehlenhoff: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/695375 (owner: 10Muehlenhoff) [14:36:36] (03CR) 10Jcrespo: [C: 04-1] mediabackup: Install minio on the storage hosts and open port 9000 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/694332 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [14:37:27] (03PS4) 10Jbond: IDM: create new idm library with logoutd base class [software/pywmflib] - 10https://gerrit.wikimedia.org/r/695341 (https://phabricator.wikimedia.org/T283242) [14:39:10] (03CR) 10Ottomata: Airflow puppetization + airflow@analytics on 
an-test-coord1001 (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/694514 (https://phabricator.wikimedia.org/T272973) (owner: 10Ottomata) [14:40:58] (03PS5) 10Muehlenhoff: Cookbook to add a new node to a Ganeti cluster (WIP) [cookbooks] - 10https://gerrit.wikimedia.org/r/696377 (https://phabricator.wikimedia.org/T274527) [14:42:40] 10SRE, 10observability, 10Sustainability (Incident Followup): prometheus: upgrade to >= 2.12 - https://phabricator.wikimedia.org/T222113 (10fgiunchedi) I took the change of Bullseye upcoming upgrade to build a Prometheus 2.24.1 + k8s package in the `wmf/bullseye` in the `operations/debs/prometheus` repo, the... [14:55:51] (03PS13) 10Ottomata: Airflow puppetization + airflow@analytics on an-test-coord1001 [puppet] - 10https://gerrit.wikimedia.org/r/694514 (https://phabricator.wikimedia.org/T272973) [14:55:57] (03CR) 10Muehlenhoff: Cookbook to add a new node to a Ganeti cluster (WIP) (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/696377 (https://phabricator.wikimedia.org/T274527) (owner: 10Muehlenhoff) [14:56:03] PROBLEM - mcrouter process on mwdebug1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 997 (mcrouter), command name mcrouter https://wikitech.wikimedia.org/wiki/Mcrouter [14:56:17] (03PS6) 10Muehlenhoff: Cookbook to add a new node to a Ganeti cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/696377 (https://phabricator.wikimedia.org/T274527) [14:59:00] (03PS6) 10Cwhite: logstash: replace ECS allow list with filter_on_template [puppet] - 10https://gerrit.wikimedia.org/r/674718 (https://phabricator.wikimedia.org/T234565) [14:59:54] (03PS1) 10David Caro: wmcs.dologmsg: Fixed to use the new correct port [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/696505 [15:01:53] PROBLEM - puppet last run on grafana2001 is CRITICAL: CRITICAL: Puppet has been disabled for 604895 seconds, message: grafana 8 test upgrade - filippo, last run 7 days ago with 0 failures 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:02:45] (03CR) 10jerkins-bot: [V: 04-1] wmcs.dologmsg: Fixed to use the new correct port [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/696505 (owner: 10David Caro) [15:03:30] !log disable puppet mc2019 [15:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:50] godog: you might want to downtime or ack the alert above about grafana2001? ^ [15:04:29] PROBLEM - Disk space on wdqs2004 is CRITICAL: DISK CRITICAL - free space: /srv 100410 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=wdqs2004&var-datasource=codfw+prometheus/ops [15:04:52] ryankemper: ^ [15:04:58] Do you want me to create a task for it? [15:05:20] 10SRE, 10Analytics, 10Analytics-Kanban, 10Traffic: Traffic anomalies: Factor out list of countries into a dedicated Hive table - https://phabricator.wikimedia.org/T272052 (10mforns) [15:06:09] 10SRE, 10Analytics-Radar, 10Privacy Engineering, 10Traffic: Publishing project anomaly data for censorship researchers. Evaluate privacy threats - https://phabricator.wikimedia.org/T183990 (10mforns) [15:07:40] ACKNOWLEDGEMENT - Disk space on wdqs2004 is CRITICAL: DISK CRITICAL - free space: /srv 100410 MB (3% inode=99%): Ryan Kemper https://phabricator.wikimedia.org/T280382 https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=wdqs2004&var-datasource=codfw+prometheus/ops [15:08:12] marostegui: I acked with the existing ticket for wdqs low disk space issues, will take a look after this retro [15:08:17] marostegui: yeah I'll do that [15:09:22] than kyou both! [15:12:20] 10SRE, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production shell for BUmeh - https://phabricator.wikimedia.org/T283648 (10Bumeh-ctr) >>! 
In T283648#7117478, @Marostegui wrote: > @Bumeh-ctr can you post your ssh key on wikitech with your bumeh-ctr account on your, user pa... [15:12:26] !log test netconf over ssh on cr3-ulsfo [15:12:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:08] 10SRE, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production shell for BUmeh - https://phabricator.wikimedia.org/T283648 (10Marostegui) @Bumeh-ctr can you edit your wikitech page: https://wikitech.wikimedia.org/wiki/User:Bumeh-ctr (logged in with your Bumeh-ctr account) an... [15:18:46] (03PS7) 10Cwhite: logstash: replace ECS allow list with filter_on_templates [puppet] - 10https://gerrit.wikimedia.org/r/674718 (https://phabricator.wikimedia.org/T234565) [15:24:09] (03CR) 10Bstorm: nfs: fix the scratch mount setup (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/695447 (https://phabricator.wikimedia.org/T224747) (owner: 10Bstorm) [15:25:39] (03CR) 10David Caro: [C: 03+1] "LGTM, I think the confusion was with 'called', I interpreted it as 'named' when it meant 'executed'. The new wording is clearer, thanks." 
[debs/nfsd-ldap] - 10https://gerrit.wikimedia.org/r/695462 (owner: 10Bstorm) [15:26:05] (03CR) 10Cwhite: [C: 03+2] logstash: replace ECS allow list with filter_on_templates [puppet] - 10https://gerrit.wikimedia.org/r/674718 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [15:29:02] (03PS1) 10Mforns: reportupdater::jobs: improve permits of logs rsynced to HDFS [puppet] - 10https://gerrit.wikimedia.org/r/696516 (https://phabricator.wikimedia.org/T274880) [15:30:27] (03CR) 10jerkins-bot: [V: 04-1] reportupdater::jobs: improve permits of logs rsynced to HDFS [puppet] - 10https://gerrit.wikimedia.org/r/696516 (https://phabricator.wikimedia.org/T274880) (owner: 10Mforns) [15:32:16] (03PS2) 10Mforns: reportupdater::jobs: improve permits of logs rsynced to HDFS [puppet] - 10https://gerrit.wikimedia.org/r/696516 (https://phabricator.wikimedia.org/T274880) [15:36:11] PROBLEM - Host an-worker1129 is DOWN: PING CRITICAL - Packet loss = 100% [15:36:37] 10SRE, 10DC-Ops, 10netops: allow mgmt network to access tftp servers for firmware updates - https://phabricator.wikimedia.org/T283771 (10RobH) >>! In T283771#7118281, @Papaul wrote: > @RobH > I have only one question for now. what is or will be your approach on keeping the TFTP server up to date with the la... [15:37:53] 10SRE, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install fran2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T282056 (10Jgreen) 05Resolved→03Open @Papaul Thanks. Somehow DNS for the DRAC did not populate. It's not urgent, I found the IP in netbox and got it ins... [15:41:41] !log T280382 `wdqs2004` inexplicably has a 2.5TB `wikidata.jnl`. 
By comparison `wdqs1006` has a 1.6T `wikidata.jnl` [15:41:43] RECOVERY - mcrouter process on mwdebug1001 is OK: PROCS OK: 1 process with UID = 997 (mcrouter), command name mcrouter https://wikitech.wikimedia.org/wiki/Mcrouter [15:41:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:46] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382 [15:41:47] 10SRE, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production shell for BUmeh - https://phabricator.wikimedia.org/T283648 (10Bumeh-ctr) @Marostegui It's done now. I hope I did it as you expected. [15:41:51] (03CR) 10Bstorm: [V: 03+2 C: 03+2] clarify the language in the README a bit [debs/nfsd-ldap] - 10https://gerrit.wikimedia.org/r/695462 (owner: 10Bstorm) [15:42:09] (03CR) 10Ladsgroup: [C: 03+1] "> Patch Set 2: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/693595 (owner: 10Ladsgroup) [15:42:36] an-woerker1129 is probably Chris moving the host to another rack [15:43:00] 10SRE, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production shell for BUmeh - https://phabricator.wikimedia.org/T283648 (10Marostegui) That works - thanks! [15:43:15] (03CR) 10Marostegui: "Key verified, this is ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/695872 (https://phabricator.wikimedia.org/T283648) (owner: 10Marostegui) [15:43:40] (03CR) 10Bstorm: "> Patch Set 1: Code-Review+1" [debs/nfsd-ldap] - 10https://gerrit.wikimedia.org/r/695462 (owner: 10Bstorm) [15:44:05] (03PS1) 10David Caro: role.wmcs.virt*: sorted and add extra comments [puppet] - 10https://gerrit.wikimedia.org/r/696553 [15:44:07] !log T280382 `wdqs2004` inexplicably has a 2.5TB `wikidata.jnl`. By comparison `wdqs1006` has a 1.6T `wikidata.jnl`, and `wdqs2004` and `wdqs2001` have a 975G `wikidata.jnl`. 
It's not clear why there's such a big divergence [15:44:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:16] (03PS2) 10David Caro: role.wmcs.virt*: sorted and add extra comments [puppet] - 10https://gerrit.wikimedia.org/r/696553 [15:45:50] (03CR) 10Andrew Bogott: [C: 03+1] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/696553 (owner: 10David Caro) [15:46:24] (03CR) 10David Caro: [C: 03+2] role.wmcs.virt*: sorted and add extra comments [puppet] - 10https://gerrit.wikimedia.org/r/696553 (owner: 10David Caro) [15:46:33] (03CR) 10Bstorm: nfs: fix the scratch mount setup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/695447 (https://phabricator.wikimedia.org/T224747) (owner: 10Bstorm) [15:49:49] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [15:49:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:44] !log T280382 (fixing couple wrong host names in last log line) `wdqs2004` inexplicably has a 2.5TB `wikidata.jnl`. 
By comparison `wdqs1006` has a 1.6T `wikidata.jnl`, and `wdqs2001`, `wdqs2002`, and `wdqs2008`, have a 975G `wikidata.jnl` [15:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:48] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382 [15:53:26] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:53:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:34] (03PS2) 10David Caro: wmcs.do_log_msg: Fixed to use the new correct port [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/696505 [15:56:10] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer [15:56:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:18] !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs2008.codfw.wmnet --dest wdqs2004.codfw.wmnet --reason "transferring fresh wikidata journal following runaway inflation of wdqs2004's wikidata.jnl" --blazegraph_instance blazegraph` on `ryankemper@cumin2002` tmux session `wdqs_disk` [15:56:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:25] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382 [15:57:07] (03CR) 10Wolfgang Kandek: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/695375 (owner: 10Muehlenhoff) [15:58:02] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [15:58:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:08] !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs1007.eqiad.wmnet --dest wdqs1006.eqiad.wmnet --reason "transferring fresh wikidata journal following runaway inflation of wdqs1006's wikidata.jnl" --blazegraph_instance blazegraph` on `ryankemper@cumin1001` tmux session `wdqs_disk` [15:58:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:04] jbond42 and cdanis: 
#bothumor I � Unicode. All rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210527T1600). [16:00:21] 10SRE, 10ops-eqiad, 10DC-Ops: Netbox Errors - https://phabricator.wikimedia.org/T283518 (10Cmjohnson) 05Open→03Resolved a:05Jclark-ctr→03Cmjohnson @wiki_willy The cable number issues have been resolved, the wmf5676 and wmf5677 are not eqiad. I believe you fixed the apple wifi issue. resolving this,... [16:01:01] 10SRE, 10ops-eqiad, 10Analytics-Radar: Try to move some new analytics worker nodes to different racks - https://phabricator.wikimedia.org/T276239 (10Cmjohnson) @elukey an-worker1129 has been moved to A2 [16:03:39] RECOVERY - Host an-worker1129 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [16:04:05] (03Abandoned) 10Effie Mouzeli: WIP: add notls support for external addresses to memcached (1) [puppet] - 10https://gerrit.wikimedia.org/r/693474 (https://phabricator.wikimedia.org/T271967) (owner: 10Effie Mouzeli) [16:05:01] 10SRE, 10ops-eqiad: Degraded RAID on ms-be1053 - https://phabricator.wikimedia.org/T282839 (10Cmjohnson) HPE is sending the part, they sent me an email requesting duplicate information that I missed. Taken care of and the part should be here tomorrow or Tuesday. [16:06:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1038: PCIe error - https://phabricator.wikimedia.org/T276922 (10Cmjohnson) Dell wants to take the server down to the minimum post. I asked that they send a dell technician to do go down this rabbit hole again. 
[16:06:27] (03PS1) 10Ayounsi: Ignore 192.168.0.0/16 subnets when importing IPs [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/696563 (https://phabricator.wikimedia.org/T283813) [16:06:51] RECOVERY - Disk space on wdqs2004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=wdqs2004&var-datasource=codfw+prometheus/ops [16:09:41] (03CR) 10Hnowlan: [V: 03+1] "```" [puppet] - 10https://gerrit.wikimedia.org/r/696399 (https://phabricator.wikimedia.org/T275353) (owner: 10Hnowlan) [16:14:32] (03CR) 10Volans: [C: 03+1] "LGTM, feel free to test it on netbox-next, using the cloudcephosd2002-dev as test host." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/696563 (https://phabricator.wikimedia.org/T283813) (owner: 10Ayounsi) [16:15:00] (03PS3) 10Effie Mouzeli: WIP: add notls support for external addresses to memcached (1) [puppet] - 10https://gerrit.wikimedia.org/r/695377 (https://phabricator.wikimedia.org/T271967) (owner: 10Jbond) [16:16:26] (03PS12) 10Effie Mouzeli: (WIP) profile::memcached::instance: Add TLS support (2) [puppet] - 10https://gerrit.wikimedia.org/r/694465 (https://phabricator.wikimedia.org/T271967) [16:16:57] (03PS9) 10Effie Mouzeli: (WIP) hieradata: enable tls on mc2019 (3) [puppet] - 10https://gerrit.wikimedia.org/r/694484 (https://phabricator.wikimedia.org/T271967) [16:20:19] 10Puppet, 10User-jbond: Upgrade puppet to use hiera version 5 - https://phabricator.wikimedia.org/T254248 (10jbond) 05Open→03Resolved [16:24:21] 10SRE, 10CAS-SSO, 10User-jbond: Cross data center setup for CAS - https://phabricator.wikimedia.org/T233931 (10jbond) [16:25:06] 10SRE, 10CAS-SSO, 10Patch-For-Review, 10User-jbond: CAS Store U2f tokens in a database - https://phabricator.wikimedia.org/T256113 (10jbond) 05Open→03Resolved a:03jbond [16:25:15] 10Puppet, 10SRE, 10Patch-For-Review, 10User-jbond: Add check for changes applied at all runs - 
https://phabricator.wikimedia.org/T242910 (10jbond) [16:25:58] 10Puppet, 10SRE, 10Patch-For-Review, 10User-jbond: reprepo user different on release1001 and release2001 - https://phabricator.wikimedia.org/T245612 (10jbond) 05Open→03Resolved a:03jbond [16:26:04] 10SRE, 10User-jbond: Audit our infrastructure for authenticated services - https://phabricator.wikimedia.org/T220361 (10jbond) 05Open→03Resolved a:03jbond [16:26:15] 10Puppet, 10SRE, 10User-jbond: configure and Test vaults capabilities as an ondemand CA - https://phabricator.wikimedia.org/T247509 (10jbond) 05Open→03Resolved [16:26:42] 10SRE, 10CAS-SSO, 10User-jbond: Cross data center setup for CAS - https://phabricator.wikimedia.org/T233931 (10jbond) [16:26:45] 10SRE, 10Security-Team, 10CAS-SSO, 10User-jbond: Further steps for CAS/web SSO - https://phabricator.wikimedia.org/T233921 (10jbond) [16:27:07] 10SRE, 10CAS-SSO, 10Patch-For-Review, 10User-jbond: Replicated ticket registry - https://phabricator.wikimedia.org/T233933 (10jbond) 05Open→03Resolved a:03jbond [16:28:22] 10SRE, 10Security-Team, 10CAS-SSO, 10User-jbond: Further steps for CAS/web SSO - https://phabricator.wikimedia.org/T233921 (10jbond) [16:28:59] 10SRE, 10CAS-SSO, 10Patch-For-Review, 10User-jbond: Add U2F/FIDO as second factor for CAS - https://phabricator.wikimedia.org/T233937 (10jbond) 05Open→03Resolved a:03jbond closing this i think the idea to support multiple options is out of scope of this task [16:29:59] (03CR) 10Cwhite: [C: 03+2] Remove ecs cleanup filter generator. [software/ecs] - 10https://gerrit.wikimedia.org/r/674712 (owner: 10Cwhite) [16:30:05] 10SRE, 10CAS-SSO, 10User-jbond: IDP failover improvments - https://phabricator.wikimedia.org/T268217 (10jbond) IdP no longer has the primary/secondery hiera addresses however we should move the services to use a DNS discovery address [16:30:31] (03Merged) 10jenkins-bot: Remove ecs cleanup filter generator. 
[software/ecs] - 10https://gerrit.wikimedia.org/r/674712 (owner: 10Cwhite) [16:31:15] 10SRE, 10Patch-For-Review, 10Performance-Team (Radar), 10Sustainability (MediaWiki-MultiDC), 10User-Joe: mcrouter production architecture - https://phabricator.wikimedia.org/T192771 (10jbond) [16:31:18] 10SRE, 10Traffic, 10User-jbond: Setup a new PKI software as an alternative to the puppet CA for managing services certificates - https://phabricator.wikimedia.org/T194031 (10jbond) 05Open→03Resolved Closing we now have https://wikitech.wikimedia.org/wiki/PKI/ [16:32:25] (03PS3) 10Jbond: Switch eqiad labsldapconfig to the read-only replicas [puppet] - 10https://gerrit.wikimedia.org/r/525220 (https://phabricator.wikimedia.org/T46722) (owner: 10Muehlenhoff) [16:32:45] (03CR) 10Jbond: [C: 03+1] "Anything blocking this?" [puppet] - 10https://gerrit.wikimedia.org/r/525220 (https://phabricator.wikimedia.org/T46722) (owner: 10Muehlenhoff) [16:34:03] 10SRE, 10User-MoritzMuehlenhoff, 10User-jbond: Investigate GID allocation for system users - https://phabricator.wikimedia.org/T235163 (10jbond) @MoritzMuehlenhoff i think we can close this now right? [16:35:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10Cmjohnson) @RobH mw1414-mw1422 in rack A3 are install ready, need password changed. 
[16:35:47] 10SRE, 10puppet-compiler, 10User-jbond: populate puppetdb fails for unknown hosts - https://phabricator.wikimedia.org/T248689 (10jbond) 05Open→03Resolved a:03jbond This is now fixed by running first with dev/null [16:37:54] 10Puppet, 10SRE, 10User-jbond: puppetmaster: clean up instances of the puppet-master package - https://phabricator.wikimedia.org/T276339 (10jbond) 05Open→03Resolved a:03jbond [16:41:11] 10Puppet, 10SRE, 10observability, 10User-jbond: PuppetDB grafana graphs not matching logs - https://phabricator.wikimedia.org/T265649 (10jbond) 05Open→03Resolved a:03jbond I made some changes to the metrics graphed and theses are looking much more accurate now [16:41:36] (03PS4) 10Effie Mouzeli: modules::memcached: add notls support for external addresses [puppet] - 10https://gerrit.wikimedia.org/r/695377 (https://phabricator.wikimedia.org/T271967) (owner: 10Jbond) [16:42:17] 10SRE, 10CAS-SSO, 10User-jbond: Apereo CAS expose CASCookieSameSite via profile::idp::client::http - https://phabricator.wikimedia.org/T264605 (10jbond) did this make it out? 
[16:45:51] (03PS1) 10Jbond: O:puppet_compiler: update redirects [puppet] - 10https://gerrit.wikimedia.org/r/696582 (https://phabricator.wikimedia.org/T264184) [16:48:06] 10Puppet, 10SRE, 10Patch-For-Review, 10User-jbond: hiera_lookup failing to preform lookups after hiera5 upgrade - https://phabricator.wikimedia.org/T258931 (10jbond) 05Open→03Resolved should now use `sudo puppet lookup` [16:50:12] (03PS5) 10Effie Mouzeli: modules::memcached: add notls support for external addresses [puppet] - 10https://gerrit.wikimedia.org/r/695377 (https://phabricator.wikimedia.org/T271967) (owner: 10Jbond) [16:50:44] 10Puppet, 10SRE, 10Patch-For-Review, 10User-jbond: puppetise pupet server copy of the public ca.pem - https://phabricator.wikimedia.org/T256721 (10jbond) 05Open→03Resolved [16:50:47] 10Puppet, 10SRE, 10Patch-For-Review, 10User-jbond: Extend Puppet CA Expiry date - https://phabricator.wikimedia.org/T236277 (10jbond) [16:51:04] (03PS13) 10Effie Mouzeli: (WIP) profile::memcached::instance: Add TLS support (2) [puppet] - 10https://gerrit.wikimedia.org/r/694465 (https://phabricator.wikimedia.org/T271967) [16:57:03] 10SRE, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review: Standardize the logging format - https://phabricator.wikimedia.org/T234565 (10colewhite) [16:57:56] 10SRE, 10Analytics-Radar, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production shell for BUmeh - https://phabricator.wikimedia.org/T283648 (10odimitrijevic) [16:59:13] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [16:59:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:29] (03PS14) 10Effie Mouzeli: profile::memcached::instance: Add TLS support [puppet] - 10https://gerrit.wikimedia.org/r/694465 (https://phabricator.wikimedia.org/T271967) [17:00:04] chrisalbon and accraze: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Graphoid / ORES deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210527T1700). [17:05:49] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:05:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:12] 10SRE, 10Traffic, 10netops, 10User-jbond: varnish filtering: should we automatically update public_cloud_nets - https://phabricator.wikimedia.org/T270391 (10cmooney) There is this script for AWS that @ema pointed me towards: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/pro... [17:17:33] (03PS4) 10Hashar: gerrit: switch to Java 11 [puppet] - 10https://gerrit.wikimedia.org/r/694524 (https://phabricator.wikimedia.org/T268225) [17:17:36] (03PS1) 10Hashar: gerrit: remove Java 8 packages [puppet] - 10https://gerrit.wikimedia.org/r/696591 (https://phabricator.wikimedia.org/T268225) [17:17:47] (03CR) 10Hashar: "Java 8 removal is now in https://gerrit.wikimedia.org/r/c/operations/puppet/+/696591" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/694524 (https://phabricator.wikimedia.org/T268225) (owner: 10Hashar) [17:18:46] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [17:18:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:59] PROBLEM - Disk space on an-airflow1001 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=80%): /tmp 0 MB (0% inode=80%): /var/tmp 0 MB (0% inode=80%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-airflow1001&var-datasource=eqiad+prometheus/ops [17:20:41] !log Running SecurePoll maintenance script cli/updateNotBlockedKey.php for all wikis T277079 [17:20:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:45] T277079: Clean up not-blocked property from securepoll_properties - https://phabricator.wikimedia.org/T277079 [17:23:18] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox 
(exit_code=0) [17:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:21] PROBLEM - Check systemd state on an-airflow1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:26:03] RECOVERY - Check systemd state on an-airflow1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:26:11] 10SRE, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install fran2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T282056 (10Papaul) 05Open→03Resolved This issue was that it was missing DNS name fran2001.mgmt.frack.codfw.wmnet in Netbox. It is good now [17:26:30] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [17:26:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:51] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [17:26:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:19] PROBLEM - WDQS high update lag on wdqs2008 is CRITICAL: 5288 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [17:31:13] PROBLEM - Check systemd state on an-airflow1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:34:41] RECOVERY - Check systemd state on an-airflow1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:39:43] RECOVERY - Disk space on an-airflow1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-airflow1001&var-datasource=eqiad+prometheus/ops [17:44:28] 10SRE, 10User-MoritzMuehlenhoff, 10User-jbond: Investigate GID allocation for system users - https://phabricator.wikimedia.org/T235163 (10MoritzMuehlenhoff) >>! In T235163#7120001, @jbond wrote: > @MoritzMuehlenhoff i think we can close this now right? The old system users which were created outside the 100... [17:52:43] (03PS1) 10Legoktm: shellbox: Use httpd-fcgi:2.4.38-4 for numerical UIDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/696598 [17:56:13] (03CR) 10Legoktm: [C: 03+2] shellbox: Use httpd-fcgi:2.4.38-4 for numerical UIDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/696598 (owner: 10Legoktm) [17:58:45] (03Merged) 10jenkins-bot: shellbox: Use httpd-fcgi:2.4.38-4 for numerical UIDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/696598 (owner: 10Legoktm) [18:00:04] RoanKattouw, Niharika, and Urbanecm: How many deployers does it take to do Morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210527T1800). [18:00:04] tgr and Urbanecm: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[18:00:12] o/ [18:00:55] I can do the deploys [18:01:22] tgr: go ahead :) [18:01:27] i'm around too [18:01:41] RECOVERY - WDQS high update lag on wdqs2008 is OK: (C)3600 ge (W)1200 ge 879.1 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [18:01:49] (03CR) 10Gergő Tisza: [C: 03+2] Help panel: SwitchEditorPanel fixes [extensions/GrowthExperiments] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/695833 (https://phabricator.wikimedia.org/T282800) (owner: 10Kosta Harlan) [18:01:51] (03CR) 10Gergő Tisza: [C: 03+2] Help panel: SwitchEditorPanel fixes [extensions/GrowthExperiments] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/695834 (https://phabricator.wikimedia.org/T282800) (owner: 10Kosta Harlan) [18:01:53] (03CR) 10Gergő Tisza: [C: 03+2] Avoid session loading when loading task types in help panel RL data [extensions/GrowthExperiments] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/695841 (https://phabricator.wikimedia.org/T282800) (owner: 10Gergő Tisza) [18:01:56] (03CR) 10Gergő Tisza: [C: 03+2] Avoid session loading when loading task types in help panel RL data [extensions/GrowthExperiments] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/695842 (https://phabricator.wikimedia.org/T282800) (owner: 10Gergő Tisza) [18:02:10] hi [18:02:19] (03PS2) 10Gergő Tisza: Enable Growth's community configuration on the pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/696390 (https://phabricator.wikimedia.org/T283809) (owner: 10Urbanecm) [18:02:21] there's one more backport coming for GrowthExperiments.... \o/ [18:03:53] (03PS10) 10Majavah: etcd: Use cfssl for peer-to-peer communication [puppet] - 10https://gerrit.wikimedia.org/r/674077 [18:03:54] 10SRE, 10SRE-Access-Requests: access to analytics data for wdqs for jmixter - https://phabricator.wikimedia.org/T283632 (10KFrancis) @Marostegui Hello! 
I am confirming Jeff Mixter has an NDA on file. Please proceed with the access request. Thanks! [18:04:52] (03CR) 10Gergő Tisza: [C: 03+2] Enable Growth's community configuration on the pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/696390 (https://phabricator.wikimedia.org/T283809) (owner: 10Urbanecm) [18:05:51] (03Merged) 10jenkins-bot: Enable Growth's community configuration on the pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/696390 (https://phabricator.wikimedia.org/T283809) (owner: 10Urbanecm) [18:06:43] urbanecm: it's on mwdebug1001 [18:06:53] thanks, looking [18:08:01] tgr: working, please sync [18:09:04] (03PS3) 10Gergő Tisza: GrowthExperiments: Enable Add Links for 50% of new users and all old ones [mediawiki-config] - 10https://gerrit.wikimedia.org/r/695364 (https://phabricator.wikimedia.org/T277356) [18:09:42] !log tgr@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:696390|Enable Growth's community configuration on the pilot wikis (T283809)]] (duration: 01m 06s) [18:09:43] RECOVERY - PHP opcache health on mwdebug1001 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [18:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:48] T283809: Enable community configuration on Growth pilot wikis - https://phabricator.wikimedia.org/T283809 [18:10:15] (03CR) 10Kosta Harlan: "This change is ready for review." 
[extensions/GrowthExperiments] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/696527 (https://phabricator.wikimedia.org/T283765) (owner: 10Kosta Harlan) [18:17:23] thanks for the deployment tgr :) [18:18:24] (03CR) 10Gergő Tisza: [C: 03+2] Add Link: Fix homepage PV token and newcomer task token logging [extensions/GrowthExperiments] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/696527 (https://phabricator.wikimedia.org/T283765) (owner: 10Kosta Harlan) [18:20:55] (03CR) 10Kosta Harlan: "This change is ready for review." [extensions/GrowthExperiments] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/696530 (https://phabricator.wikimedia.org/T283765) (owner: 10Kosta Harlan) [18:21:39] 10SRE, 10SRE-Access-Requests: access to analytics data for wdqs for jmixter - https://phabricator.wikimedia.org/T283632 (10jmixter) OK, I posted the Public SSH Key on my wikitech user page - https://wikitech.wikimedia.org/wiki/User:Jeff_Mixter Please let me know if you need anything else. [18:22:14] !log legoktm@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'shellbox' for release 'main' . [18:22:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:04] 10SRE, 10SRE-Access-Requests: Requesting access to gitlab1001 for Ahmon Dancy (@dancy) - https://phabricator.wikimedia.org/T283850 (10dancy) [18:23:53] calendar is updated with the additional GrowthExperiments backports [18:24:04] (03PS11) 10Ottomata: Initial debianization and 2.1.0-py3.7-1 release [debs/airflow] (debian) - 10https://gerrit.wikimedia.org/r/693222 (https://phabricator.wikimedia.org/T277012) [18:27:53] 10SRE, 10Traffic, 10vm-requests: Please create two Ganeti VMs for Wikidough - https://phabricator.wikimedia.org/T283852 (10ssingh) [18:28:20] ^ mutante: not urgent at all! 
let's do it after Monday [18:28:36] (I am also happy doing it myself but I remember the protocol was to just request and not create them :) [18:28:52] 10SRE, 10Traffic, 10vm-requests: Please create two Ganeti VMs for Wikidough - https://phabricator.wikimedia.org/T283852 (10ssingh) p:05Triage→03Medium [18:28:57] (03CR) 10Ottomata: [C: 03+2] Airflow puppetization + airflow@analytics on an-test-coord1001 [puppet] - 10https://gerrit.wikimedia.org/r/694514 (https://phabricator.wikimedia.org/T272973) (owner: 10Ottomata) [18:32:06] (03Merged) 10jenkins-bot: Help panel: SwitchEditorPanel fixes [extensions/GrowthExperiments] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/695833 (https://phabricator.wikimedia.org/T282800) (owner: 10Kosta Harlan) [18:32:08] (03Merged) 10jenkins-bot: Avoid session loading when loading task types in help panel RL data [extensions/GrowthExperiments] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/695841 (https://phabricator.wikimedia.org/T282800) (owner: 10Gergő Tisza) [18:32:11] (03Merged) 10jenkins-bot: Help panel: SwitchEditorPanel fixes [extensions/GrowthExperiments] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/695834 (https://phabricator.wikimedia.org/T282800) (owner: 10Kosta Harlan) [18:32:13] (03Merged) 10jenkins-bot: Avoid session loading when loading task types in help panel RL data [extensions/GrowthExperiments] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/695842 (https://phabricator.wikimedia.org/T282800) (owner: 10Gergő Tisza) [18:32:16] (03PS1) 10Legoktm: prometheus-exporters: Use numerical UIDs when running for apache and php-fpm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/696604 [18:34:21] (03CR) 10Gergő Tisza: [C: 03+2] Add Link: Fix homepage PV token and newcomer task token logging [extensions/GrowthExperiments] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/696530 (https://phabricator.wikimedia.org/T283765) (owner: 10Kosta Harlan) [18:35:27] (03PS1) 10Ssingh: site: 
add doh3001 and doh3002 with role insetup [puppet] - 10https://gerrit.wikimedia.org/r/696605 (https://phabricator.wikimedia.org/T283852) [18:37:37] (03Merged) 10jenkins-bot: Add Link: Fix homepage PV token and newcomer task token logging [extensions/GrowthExperiments] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/696527 (https://phabricator.wikimedia.org/T283765) (owner: 10Kosta Harlan) [18:38:57] 10SRE, 10Continuous-Integration-Config: operations/docker-images/production-images has no CI - https://phabricator.wikimedia.org/T283855 (10Legoktm) p:05Triage→03High [18:39:02] (03CR) 10Legoktm: [V: 03+2 C: 03+2] prometheus-exporters: Use numerical UIDs when running for apache and php-fpm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/696604 (owner: 10Legoktm) [18:43:24] 10SRE, 10DC-Ops, 10netops: allow mgmt network to access tftp servers for firmware updates - https://phabricator.wikimedia.org/T283771 (10RobH) I think either A or C, B seems problematic and allows for one person to serve as a blocker for updates being timely. Also I fear that B would make me the single poin... [18:43:57] (03PS1) 10Legoktm: shellbox: Use prometheus exporters with numerical UIDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/696606 [18:44:50] 10SRE, 10Traffic, 10vm-requests, 10Patch-For-Review: Please create two Ganeti VMs for Wikidough in esams - https://phabricator.wikimedia.org/T283852 (10ssingh) [18:45:10] 10SRE, 10DC-Ops, 10netops: allow mgmt network to access tftp servers for firmware updates - https://phabricator.wikimedia.org/T283771 (10RobH) >>! In T283771#7118462, @ayounsi wrote: > As you said it would be a good idea to see how it fits in the big automation picture. > First by detailing precisely the cur... [18:46:21] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/693208 (https://phabricator.wikimedia.org/T283266) (owner: 10Tks4Fish) [18:48:12] tgr: are you done with deployment? 
🙂 [18:48:25] (03CR) 10Legoktm: [C: 03+2] shellbox: Use prometheus exporters with numerical UIDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/696606 (owner: 10Legoktm) [18:48:32] still deploying? my that was a long window in the end :-D [18:48:50] urbanecm: no, but I have time for a config patch [18:49:00] tgr: can you do the one i just +1'ed? [18:49:04] !log tgr@deploy1002 Synchronized php-1.37.0-wmf.7/extensions/GrowthExperiments: Backport: [[gerrit:695834|Help panel: SwitchEditorPanel fixes (T282800)]] [[gerrit:695842|Avoid session loading when loading task types in help panel RL data (T282800)]] [[gerrit:696527|Add Link: Fix homepage PV token and newcomer task token logging (T283765)]] (duration: 01m 06s) [18:49:04] the last backport is taking its time [18:49:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:09] T283765: link_suggestion_interaction schema: `close` action token is different from token in preceding action `editsummary_save` - https://phabricator.wikimedia.org/T283765 [18:49:09] T282800: [wmf.5-regression] Help panel displays "growthexperiments-help-panel-suggested-edits-switch-editor-to-undefined" - https://phabricator.wikimedia.org/T282800 [18:49:18] tgr: or i can sync it, up2you [18:49:26] doing [18:49:31] thanks [18:49:36] (03PS2) 10Gergő Tisza: ptwiki: Add 'flow-delete' to 'eliminator' user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/693208 (https://phabricator.wikimedia.org/T283266) (owner: 10Tks4Fish) [18:49:54] I'm here if needed (Tks4Fish) [18:50:13] urbanecm: can you add it to the calendar? 
[18:50:18] tgr: certainly [18:51:03] done [18:51:23] (03PS7) 10Superyetkin: Enable ULS webfonts by default on trwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/694315 (https://phabricator.wikimedia.org/T283626) [18:51:30] (03Merged) 10jenkins-bot: shellbox: Use prometheus exporters with numerical UIDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/696606 (owner: 10Legoktm) [18:52:31] (03CR) 10Gergő Tisza: [C: 03+2] ptwiki: Add 'flow-delete' to 'eliminator' user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/693208 (https://phabricator.wikimedia.org/T283266) (owner: 10Tks4Fish) [18:53:27] dopanic: can you test it, once tgr asks you to? [18:53:29] (03Merged) 10jenkins-bot: ptwiki: Add 'flow-delete' to 'eliminator' user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/693208 (https://phabricator.wikimedia.org/T283266) (owner: 10Tks4Fish) [18:53:32] sure thing [18:53:35] thanks [18:53:38] * urbanecm afk for a while [18:53:40] ty [18:54:08] dopanic: it's on mwdebug1001 [18:54:28] looks good :) [18:55:05] had the page ready to go lol [18:55:35] (03PS4) 10Gergő Tisza: GrowthExperiments: Enable Add Links for 50% of new users and all old ones [mediawiki-config] - 10https://gerrit.wikimedia.org/r/695364 (https://phabricator.wikimedia.org/T277356) [18:56:08] !log tgr@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:693208|ptwiki: Add 'flow-delete' to 'eliminator' user group (T283266)]] (duration: 01m 04s) [18:56:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:12] T283266: Add `flow-delete` right to the `eliminator` user group at pt.wiki - https://phabricator.wikimedia.org/T283266 [18:56:41] (03Merged) 10jenkins-bot: Add Link: Fix homepage PV token and newcomer task token logging [extensions/GrowthExperiments] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/696530 (https://phabricator.wikimedia.org/T283765) (owner: 10Kosta Harlan) [18:57:07] thanks a ton tgr 
::) [18:57:24] !log legoktm@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'shellbox' for release 'main' . [18:57:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:33] (03PS1) 10Ottomata: airflow - use airflow.cfg for webserver port [puppet] - 10https://gerrit.wikimedia.org/r/696607 (https://phabricator.wikimedia.org/T272973) [18:58:43] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search: hw troubleshooting: failure to power up for elastic2043.codfw.wmnet - https://phabricator.wikimedia.org/T281327 (10Papaul) Dell Called me today about this server and recommend that we do a minimum to post on the server to find out which part is causing the... [19:00:04] twentyafterfour and hashar: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - American+European Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210527T1900). [19:00:15] (03CR) 10Ottomata: [C: 03+2] airflow - use airflow.cfg for webserver port [puppet] - 10https://gerrit.wikimedia.org/r/696607 (https://phabricator.wikimedia.org/T272973) (owner: 10Ottomata) [19:00:51] o/ I'll need a couple more minutes for the deploy window, sorry [19:03:22] !log tgr@deploy1002 Synchronized php-1.37.0-wmf.6/extensions/GrowthExperiments: Backport: [[gerrit:695833|Help panel: SwitchEditorPanel fixes (T282800)]] [[gerrit:695841|Avoid session loading when loading task types in help panel RL data (T282800)]] [[gerrit:696530|Add Link: Fix homepage PV token and newcomer task token logging (T283765)]] (duration: 01m 05s) [19:03:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:28] T283765: link_suggestion_interaction schema: `close` action token is different from token in preceding action `editsummary_save` - https://phabricator.wikimedia.org/T283765 [19:03:28] T282800: [wmf.5-regression] Help panel displays "growthexperiments-help-panel-suggested-edits-switch-editor-to-undefined" - 
https://phabricator.wikimedia.org/T282800 [19:03:30] (03CR) 10Gergő Tisza: [C: 03+2] GrowthExperiments: Enable Add Links for 50% of new users and all old ones [mediawiki-config] - 10https://gerrit.wikimedia.org/r/695364 (https://phabricator.wikimedia.org/T277356) (owner: 10Gergő Tisza) [19:04:39] (03Merged) 10jenkins-bot: GrowthExperiments: Enable Add Links for 50% of new users and all old ones [mediawiki-config] - 10https://gerrit.wikimedia.org/r/695364 (https://phabricator.wikimedia.org/T277356) (owner: 10Gergő Tisza) [19:07:47] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:09:31] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:12:03] !log tgr@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:695364|GrowthExperiments: Enable Add Links for 50% of new users and all old ones (T277356)]] (duration: 01m 04s) [19:12:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:07] T277356: Add a link: experiment - https://phabricator.wikimedia.org/T277356 [19:15:04] !log US morning deploys done [19:15:04] tgr: ping when clear? [19:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:07] ah, cool [19:15:11] sorry for the delay! 
[19:15:16] twentyafterfour is having some intense weather, i'll go ahead and roll forward [19:17:53] (03PS1) 10Brennen Bearnes: all wikis to 1.37.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/696615 [19:17:55] (03CR) 10Brennen Bearnes: [C: 03+2] all wikis to 1.37.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/696615 (owner: 10Brennen Bearnes) [19:19:39] (03Merged) 10jenkins-bot: all wikis to 1.37.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/696615 (owner: 10Brennen Bearnes) [19:21:03] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.37.0-wmf.7 [19:21:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:43] did get a `19:20:19 Check 'Logstash Error rate for mw1278.eqiad.wmnet' failed: ERROR: 50% OVER_THRESHOLD (Avg. Error rate: Before: 0.01, After: 2.00, Threshold: 1.00) [19:21:57] things look ok, however. [19:22:08] hmm [19:25:04] indeed logstash looks clean, not sure why it triggered a threshold [19:25:29] well my power and internet haven't gone out yet even though it's raining harder than it has in a year or two [19:26:00] good sign [19:26:05] of course, as soon as you say that... [19:31:19] logstash still looking pretty chill. 
[19:31:55] (03PS1) 10Legoktm: prometheus-apache-exporter: Don't use unsupported -log.format [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/696619 (https://phabricator.wikimedia.org/T283861) [19:33:21] thanks for the roll forward brennen [19:33:26] any time [19:33:28] (03CR) 10Legoktm: [V: 03+2 C: 03+2] prometheus-apache-exporter: Don't use unsupported -log.format [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/696619 (https://phabricator.wikimedia.org/T283861) (owner: 10Legoktm) [19:38:02] (03PS1) 10Legoktm: shellbox: Use fixed prometheus-apache-exporter:0.0.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/696622 [19:38:29] (03PS1) 10Cwhite: logstash: filter_on_templates to use only latest template available [puppet] - 10https://gerrit.wikimedia.org/r/696623 [19:42:26] (03PS1) 10Herron: grafana: fetch operations/grafana-grizzly as /srv/grafana-grizzly [puppet] - 10https://gerrit.wikimedia.org/r/696626 [19:42:28] (03PS1) 10Herron: grafana: add wrapper to call grr with environment vars set [puppet] - 10https://gerrit.wikimedia.org/r/696627 [19:42:30] (03CR) 10Legoktm: [C: 03+2] shellbox: Use fixed prometheus-apache-exporter:0.0.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/696622 (owner: 10Legoktm) [19:43:24] (03CR) 10jerkins-bot: [V: 04-1] grafana: add wrapper to call grr with environment vars set [puppet] - 10https://gerrit.wikimedia.org/r/696627 (owner: 10Herron) [19:44:11] (03PS2) 10Herron: grafana: add wrapper to call grr with environment vars set [puppet] - 10https://gerrit.wikimedia.org/r/696627 [19:45:23] (03Merged) 10jenkins-bot: shellbox: Use fixed prometheus-apache-exporter:0.0.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/696622 (owner: 10Legoktm) [19:45:43] herron: what's "grr"? 
[19:46:34] legoktm: that's the command line for grizzly, I'm not sure why they went with the shorthand [19:47:27] grr is my Gerrit helper tool :P https://gitlab.com/legoktm/rust-grr/ [19:47:50] (because Gerrit makes you go grrrr) [19:48:19] ha! collision! [19:48:37] !log legoktm@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'shellbox' for release 'main' . [19:48:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:04] !log Manually create missing SecurePoll DB tables on mnwwiktionary, taywiki, and trvwiki for T283844 [19:53:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:08] T283844: Some wikis have the SecurePoll extension installed, but not its tables - https://phabricator.wikimedia.org/T283844 [19:59:47] (03CR) 10Herron: "for some additional context about this" [puppet] - 10https://gerrit.wikimedia.org/r/696626 (owner: 10Herron) [20:37:23] !log add eugene-chernov, strofimovsky01, il to ldap nda [20:37:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:43] !log add eugene-chernov, strofimovsky01, il to ldap nda #T279545 [20:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:47] T279545: Gitlab Installation Procedure - https://phabricator.wikimedia.org/T279545 [20:49:13] 10SRE, 10Continuous-Integration-Config: operations/docker-images/production-images has no CI - https://phabricator.wikimedia.org/T283855 (10hashar) If the tests from integration/config can be generalized, maybe they could be added as a new `lint` subcommand to docker-pkg?
[20:58:01] (03PS8) 10Superyetkin: Enable ULS webfonts by default on trwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/694315 (https://phabricator.wikimedia.org/T283626) [21:10:13] 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): cloudvirt1040 primary NIC disconnected - https://phabricator.wikimedia.org/T281399 (10Jclark-ctr) @Andrew updated firmware shows connection on icinga will monitor [21:23:31] RECOVERY - MariaDB Replica Lag: pc1 on pc2007 is OK: OK slave_sql_lag Replication lag: 44.87 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:25:29] PROBLEM - SSH on mw1279.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:33:03] (03CR) 10Bstorm: [C: 03+2] nfs: fix the scratch mount setup [puppet] - 10https://gerrit.wikimedia.org/r/695447 (https://phabricator.wikimedia.org/T224747) (owner: 10Bstorm) [21:58:57] (03PS6) 10Eric Gardner: Enable MediaSearch Assessment filter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/693951 (https://phabricator.wikimedia.org/T276257) [22:06:36] !log Invalidate bot password for `PKM@PKMbot` (T283839) [22:06:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:24] 10SRE, 10Wikimedia-Mailing-lists: Close old travel mailing list - https://phabricator.wikimedia.org/T283884 (10Ladsgroup) We added twist in mm3, now owner of the disabled list is removed and a blackhole email address called "disabled-lists@lists.wikimedia.org" would be the sole owner. That way we can have 1- m... 
[22:26:09] RECOVERY - SSH on mw1279.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:30:23] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Close old travel mailing list - https://phabricator.wikimedia.org/T283884 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup I changed the owner to disabled-lists@, added a general ban list to avoid more requests coming in, and set the message acceptance f... [23:00:05] brennen: I, the Bot under the Fountain, allow thee, The Deployer, to do US Backport and Config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210527T2300). [23:00:05] EricGardner: A patch you scheduled for US Backport and Config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:10] * thcipriani waves [23:02:39] howdy all [23:04:23] (03PS1) 10Dzahn: control: small change in package description [debs/helm3] - 10https://gerrit.wikimedia.org/r/696695 [23:05:34] (03CR) 10Thcipriani: [C: 03+2] "Config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/693951 (https://phabricator.wikimedia.org/T276257) (owner: 10Eric Gardner) [23:06:22] (03Merged) 10jenkins-bot: Enable MediaSearch Assessment filter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/693951 (https://phabricator.wikimedia.org/T276257) (owner: 10Eric Gardner) [23:11:38] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install phab1004 (was: phab1002) - https://phabricator.wikimedia.org/T280540 (10RobH) [23:15:51] 10SRE, 10Release Pipeline, 10Release-Engineering-Team (Watching / External): Update Debian package for Blubber - https://phabricator.wikimedia.org/T179984 (10Dzahn) hello from the future. I wanted to install this blubber package today but I am on buster and noticed it's stretch-only. 
so I made T283891 [23:17:29] 10SRE, 10serviceops-radar, 10Release Pipeline (Blubber): build and import blubber package for buster and bullseye - https://phabricator.wikimedia.org/T283891 (10Dzahn) [23:21:17] !log egardner@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:693951|Enable MediaSearch Assessment filter (T276257)]] (duration: 00m 57s) [23:21:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:21] T276257: [M] Create "assessment" filter in MediaSearch - https://phabricator.wikimedia.org/T276257 [23:22:48] (03CR) 10Cwhite: [C: 03+1] grafana: fetch operations/grafana-grizzly as /srv/grafana-grizzly [puppet] - 10https://gerrit.wikimedia.org/r/696626 (owner: 10Herron) [23:22:54] (03PS1) 10Thcipriani: README: deployment training [mediawiki-config] - 10https://gerrit.wikimedia.org/r/696706 [23:24:25] (03CR) 10Thcipriani: [C: 03+2] README: deployment training [mediawiki-config] - 10https://gerrit.wikimedia.org/r/696706 (owner: 10Thcipriani) [23:25:10] (03Merged) 10jenkins-bot: README: deployment training [mediawiki-config] - 10https://gerrit.wikimedia.org/r/696706 (owner: 10Thcipriani) [23:27:57] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/696627 (owner: 10Herron) [23:32:57] (03PS1) 10RobH: phab1004 setup params [puppet] - 10https://gerrit.wikimedia.org/r/696712 (https://phabricator.wikimedia.org/T280540) [23:33:45] (03CR) 10RobH: [C: 03+2] phab1004 setup params [puppet] - 10https://gerrit.wikimedia.org/r/696712 (https://phabricator.wikimedia.org/T280540) (owner: 10RobH) [23:36:18] 10SRE, 10serviceops-radar, 10Release Pipeline (Blubber): build and import blubber package for buster and bullseye - https://phabricator.wikimedia.org/T283891 (10Dzahn) I can confirm though that I could simply add the stretch APT sources, install blubber and then remove the stretch sources again and blubber w... 
[23:36:53] (03PS1) 10Thcipriani: Revert "README: deployment training" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/696713 [23:38:05] !log derick@deploy1002 Synchronized README: Config: [[gerrit:696706|README: deployment training]] (duration: 00m 55s) [23:38:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:41:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install phab1004 (was: phab1002) - https://phabricator.wikimedia.org/T280540 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` phab1004.eqiad.wmnet ` The log can be found in `... [23:42:16] (03CR) 10Thcipriani: [C: 03+2] "Revert training thing." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/696713 (owner: 10Thcipriani) [23:42:58] (03Merged) 10jenkins-bot: Revert "README: deployment training" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/696713 (owner: 10Thcipriani) [23:45:44] !log thcipriani@deploy1002 Synchronized README: Config: [[gerrit:696713|Revert "README: deployment training"]] (duration: 00m 55s) [23:45:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:52:09] 10SRE, 10serviceops-radar, 10Release Pipeline (Blubber): build and import blubber package for buster and bullseye - https://phabricator.wikimedia.org/T283891 (10Dzahn) But... that blubber version does not work with v4 config yaml. ` version: config version "v4" is unsupported ` I can only use v4 with the... [23:54:15] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on phab1004.eqiad.wmnet with reason: REIMAGE [23:54:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:56:27] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on phab1004.eqiad.wmnet with reason: REIMAGE [23:56:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log