[00:02:28] 10SRE, 10SRE-Access-Requests: access to analytics data for wdqs - https://phabricator.wikimedia.org/T283632 (10Dzahn) Hi @jmixter please see https://wikitech.wikimedia.org/wiki/Production_access#Access_Request_Process for the access request process Others might pick this up, but there are a few things that... [00:06:56] 10SRE, 10SRE-Access-Requests: access to analytics data for wdqs for jmixter - https://phabricator.wikimedia.org/T283632 (10Dzahn) [00:10:16] 10SRE, 10SRE-Access-Requests: access to analytics data for wdqs for jmixter - https://phabricator.wikimedia.org/T283632 (10Dzahn) @Ottomata Hi! I see this in wikitech "The project lead where your access will be granted. For Analytics and Data systems this is Andrew Otto.". Does this apply here because it's ana... [00:14:36] (03CR) 10Dzahn: [C: 03+2] "as Ema once said there is no maximum in the RFC https://phabricator.wikimedia.org/T209590#4856116" [puppet] - 10https://gerrit.wikimedia.org/r/694731 (https://phabricator.wikimedia.org/T281390) (owner: 1020after4) [00:22:11] (03CR) 10Dzahn: "to be fair http://httpd.apache.org/docs/2.4/mod/core.html#limitrequestline says "under normal circumstances" this should not be changed an" [puppet] - 10https://gerrit.wikimedia.org/r/694731 (https://phabricator.wikimedia.org/T281390) (owner: 1020after4) [00:27:17] !log phab2001 - restarted apache2 [00:27:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:30:03] (03CR) 10Dzahn: "apache2 was restarted on phab2001 and reload on phab1001, please test the actual limit if you can" [puppet] - 10https://gerrit.wikimedia.org/r/694731 (https://phabricator.wikimedia.org/T281390) (owner: 1020after4) [01:43:32] 10SRE, 10SRE-Access-Requests: access to analytics data for wdqs for jmixter - https://phabricator.wikimedia.org/T283632 (10Ottomata) Heya! Approval is the same as any SRE shell access, plus a designated Analytics team approver, which is currently me. 
@jmxixter maybe copy and paste from https://phabricator.... [01:50:34] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 102 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [02:10:16] 10SRE, 10wikimedia-irc-libera: Move SRE-related IRC channels to Libera - https://phabricator.wikimedia.org/T283230 (10bd808) [02:47:54] PROBLEM - Device not healthy -SMART- on labstore1007 is CRITICAL: cluster=wmcs device=1I:1:5 instance=labstore1007 job=node site=eqiad https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=labstore1007&var-datasource=eqiad+prometheus/ops [04:08:28] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:09:32] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:12:47] (Traffic on tunnel link) firing: Traffic on tunnel link - https://alerts.wikimedia.org [04:17:47] (Traffic on tunnel link) firing: (2) Traffic on tunnel link - https://alerts.wikimedia.org [04:22:47] (Traffic on tunnel link) resolved: Traffic on tunnel link - https://alerts.wikimedia.org [04:27:57] (03PS1) 10Marostegui: Revert "db1160: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/694322 [04:30:17] 10SRE, 10SRE-Access-Requests: Requesting access to production shell for BUmeh - https://phabricator.wikimedia.org/T283648 (10Marostegui) p:05Triage→03Medium L3 signed on: Wed, May 26, 02:35 @Bumeh-ctr you'd need to get the approval posted on this task from your manager and from @Ottomata [04:30:37] 10SRE, 10SRE-Access-Requests: Requesting access to production shell for BUmeh - https://phabricator.wikimedia.org/T283648 (10Marostegui) [04:32:46] 10SRE, 10SRE-Access-Requests: Requesting access to production shell for 
BUmeh - https://phabricator.wikimedia.org/T283648 (10Marostegui) @KFrancis can you confirm @Bumeh-ctr has signed the NDA? I cannot find them on the NDA spreadsheet. [04:33:49] (03CR) 10Marostegui: [C: 03+2] Revert "db1160: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/694322 (owner: 10Marostegui) [04:34:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1160 (re)pooling @ 25%: Repool db1160', diff saved to https://phabricator.wikimedia.org/P16210 and previous config saved to /var/cache/conftool/dbconfig/20210526-043424-root.json [04:34:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:34:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1106', diff saved to https://phabricator.wikimedia.org/P16211 and previous config saved to /var/cache/conftool/dbconfig/20210526-043439-marostegui.json [04:34:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:35:32] !log Deploy schema change on db1106, this will generate lag on s1 (enwiki) on wiki replicas T266486 T268392 T273360 [04:35:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:35:37] T273360: Schema change for dropping default of img_timestamp and making it binary(14) - https://phabricator.wikimedia.org/T273360 [04:35:37] T266486: Schema change to turn user_last_timestamp.user_newtalk to binary(14) - https://phabricator.wikimedia.org/T266486 [04:35:38] T268392: Schema change for watchlist.wl_notificationtimestamp going binary(14) from varbinary(14) - https://phabricator.wikimedia.org/T268392 [04:49:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1160 (re)pooling @ 50%: Repool db1160', diff saved to https://phabricator.wikimedia.org/P16212 and previous config saved to /var/cache/conftool/dbconfig/20210526-044928-root.json [04:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:03:16] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:04:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1160 (re)pooling @ 75%: Repool db1160', diff saved to https://phabricator.wikimedia.org/P16213 and previous config saved to /var/cache/conftool/dbconfig/20210526-050431-root.json [05:04:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:05:52] (03PS1) 10Marostegui: db1148: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/694995 [05:07:42] 10SRE, 10SRE-Access-Requests: access to analytics data for wdqs for jmixter - https://phabricator.wikimedia.org/T283632 (10Marostegui) p:05Triage→03Medium [05:08:51] (03CR) 10Marostegui: [C: 03+2] db1148: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/694995 (owner: 10Marostegui) [05:09:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1148', diff saved to https://phabricator.wikimedia.org/P16214 and previous config saved to /var/cache/conftool/dbconfig/20210526-050919-marostegui.json [05:09:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:19:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1160 (re)pooling @ 100%: Repool db1160', diff saved to https://phabricator.wikimedia.org/P16215 and previous config saved to /var/cache/conftool/dbconfig/20210526-051935-root.json [05:19:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:46:17] !log Stop MySQL on clouddb1021 to upgrade mysql [05:46:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:51:48] marostegui: <3 [05:52:56] elukey: :***** [05:55:19] 10SRE, 10wikimedia-irc-libera: Move SRE-related IRC channels to Libera - https://phabricator.wikimedia.org/T283230 (10ArielGlenn) I am tempted to set this to UBN given the developments of last night: wikimedia channels including checkuser ones have been taken over by freenode current staff. 
An example: https:... [05:56:50] (03PS1) 10Elukey: profile::reportupdater::jobs: fix dependency [puppet] - 10https://gerrit.wikimedia.org/r/695018 [05:58:58] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29690/console" [puppet] - 10https://gerrit.wikimedia.org/r/695018 (owner: 10Elukey) [06:00:53] (03CR) 10Elukey: [V: 03+1 C: 03+2] profile::reportupdater::jobs: fix dependency [puppet] - 10https://gerrit.wikimedia.org/r/695018 (owner: 10Elukey) [06:03:24] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:07:18] (03PS1) 10Elukey: profile::reportupdater::jobs: add systemd env variables [puppet] - 10https://gerrit.wikimedia.org/r/695024 [06:08:37] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29691/console" [puppet] - 10https://gerrit.wikimedia.org/r/695024 (owner: 10Elukey) [06:09:21] (03CR) 10Elukey: [V: 03+1 C: 03+2] profile::reportupdater::jobs: add systemd env variables [puppet] - 10https://gerrit.wikimedia.org/r/695024 (owner: 10Elukey) [06:13:24] (03PS1) 10Elukey: role::elasticsearch::cloudelastic: fix threshold hiera value [puppet] - 10https://gerrit.wikimedia.org/r/695046 [06:13:55] dcausse: o/ ok if I merge --^ [06:14:27] (03PS3) 10KartikMistry: Enable ULS webfonts by default on trwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/694315 (https://phabricator.wikimedia.org/T283626) (owner: 10Superyetkin) [06:14:36] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29692/console" [puppet] - 10https://gerrit.wikimedia.org/r/695046 (owner: 10Elukey) [06:15:37] (03CR) 10jerkins-bot: [V: 04-1] Enable ULS webfonts by default on trwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/694315 
(https://phabricator.wikimedia.org/T283626) (owner: 10Superyetkin) [06:20:29] (03PS1) 10Elukey: profile::reportupdater::jobs: move logs to /tmp/reportupdater-logs [puppet] - 10https://gerrit.wikimedia.org/r/695054 [06:21:44] (03CR) 10Elukey: [C: 03+2] profile::reportupdater::jobs: move logs to /tmp/reportupdater-logs [puppet] - 10https://gerrit.wikimedia.org/r/695054 (owner: 10Elukey) [06:32:44] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: reportupdater-published_cx2_translations_mysql.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:39:05] (03PS4) 10JMeybohm: docker-registry: Add caching config for nginx [puppet] - 10https://gerrit.wikimedia.org/r/694330 (https://phabricator.wikimedia.org/T264209) [06:39:07] (03PS2) 10JMeybohm: httpbb: Add tests for docker-registry [puppet] - 10https://gerrit.wikimedia.org/r/694552 (https://phabricator.wikimedia.org/T273521) [06:39:19] (03CR) 10JMeybohm: docker-registry: Add caching config for nginx (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/694330 (https://phabricator.wikimedia.org/T264209) (owner: 10JMeybohm) [06:41:42] (03CR) 10Zabe: [C: 04-1] "let the discussion on meta happen a bit" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/694686 (https://phabricator.wikimedia.org/T283625) (owner: 10Zabe) [06:45:51] (03CR) 10DCausse: role::elasticsearch::cloudelastic: fix threshold hiera value (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/695046 (owner: 10Elukey) [06:47:20] 10Puppet, 10SRE: Ensure that there are no firewall rules in modules - https://phabricator.wikimedia.org/T114209 (10Marostegui) [06:49:12] (03CR) 10Filippo Giunchedi: [C: 03+2] icinga: switch to LibreNMS AlertManager paging (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/685779 (https://phabricator.wikimedia.org/T281095) (owner: 10Filippo Giunchedi) [06:49:19] (03PS2) 10Filippo Giunchedi: icinga: switch to LibreNMS AlertManager 
paging [puppet] - 10https://gerrit.wikimedia.org/r/685779 (https://phabricator.wikimedia.org/T281095) [06:50:10] (03CR) 10Elukey: [V: 03+1] role::elasticsearch::cloudelastic: fix threshold hiera value (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/695046 (owner: 10Elukey) [06:50:20] (03PS3) 10JMeybohm: httpbb: Add tests for docker-registry [puppet] - 10https://gerrit.wikimedia.org/r/694552 (https://phabricator.wikimedia.org/T273521) [06:50:24] 10Puppet, 10observability, 10User-jbond: Add additional prometheus metrics to puppet runs - https://phabricator.wikimedia.org/T283585 (10Marostegui) [06:50:26] (03CR) 10Ryan Kemper: role::elasticsearch::cloudelastic: fix threshold hiera value (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/695046 (owner: 10Elukey) [06:51:15] 10SRE, 10netops, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Move paging for librenms from icinga to AM - https://phabricator.wikimedia.org/T281095 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi This is done! Paging for librenms is happening through AM now. 
[06:51:54] 10SRE, 10ops-eqiad, 10DC-Ops: Relabel db1125 to be dbstore1006 - https://phabricator.wikimedia.org/T283300 (10Marostegui) 05Open→03Declined Let's not do this per: https://phabricator.wikimedia.org/T283125#7110913 [06:52:19] (03PS2) 10Muehlenhoff: Add cumin2002 to mysql_root_clients [puppet] - 10https://gerrit.wikimedia.org/r/693376 (https://phabricator.wikimedia.org/T276589) [06:53:07] (03PS2) 10Elukey: role::elasticsearch::cloudelastic: fix threshold hiera value [puppet] - 10https://gerrit.wikimedia.org/r/695046 [06:54:01] (03CR) 10DCausse: [C: 03+1] role::elasticsearch::cloudelastic: fix threshold hiera value [puppet] - 10https://gerrit.wikimedia.org/r/695046 (owner: 10Elukey) [06:54:32] (03CR) 10Filippo Giunchedi: [C: 03+1] rsyslog: try to parse the msg field as json before shipping [puppet] - 10https://gerrit.wikimedia.org/r/694758 (owner: 10Cwhite) [06:55:57] (03CR) 10Elukey: [C: 03+2] role::elasticsearch::cloudelastic: fix threshold hiera value (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/695046 (owner: 10Elukey) [06:59:16] (03PS1) 10Elukey: reportupdater::job: swap ::path with ::source_path [puppet] - 10https://gerrit.wikimedia.org/r/695078 [06:59:21] (03CR) 10Muehlenhoff: [C: 03+2] Add cumin2002 to mysql_root_clients [puppet] - 10https://gerrit.wikimedia.org/r/693376 (https://phabricator.wikimedia.org/T276589) (owner: 10Muehlenhoff) [07:00:53] (03CR) 10JMeybohm: "This change is ready for review." 
[software/httpbb] - 10https://gerrit.wikimedia.org/r/694556 (owner: 10JMeybohm) [07:01:07] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29693/console" [puppet] - 10https://gerrit.wikimedia.org/r/695078 (owner: 10Elukey) [07:01:57] (03PS5) 10Jcrespo: Revert "mailman2: Generate a 5-year retention Archive backups of mailman" [puppet] - 10https://gerrit.wikimedia.org/r/694309 [07:02:03] (03CR) 10Elukey: [V: 03+1 C: 03+2] reportupdater::job: swap ::path with ::source_path [puppet] - 10https://gerrit.wikimedia.org/r/695078 (owner: 10Elukey) [07:02:55] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/693909 (owner: 10Filippo Giunchedi) [07:04:36] 10SRE, 10Data-Persistence-Backup, 10Wikimedia-Mailing-lists: The Great Clean Up of Mailman2 - https://phabricator.wikimedia.org/T282303 (10jcrespo) a:05jcrespo→03None The backup finished, JobId=338470: ` Elapsed time: 14 hours 53 mins 5 secs SD Files Written: 6,117,027 SD Bytes Written:... 
[07:04:56] (03CR) 10Jcrespo: [C: 03+2] Revert "mailman2: Generate a 5-year retention Archive backups of mailman" [puppet] - 10https://gerrit.wikimedia.org/r/694309 (owner: 10Jcrespo) [07:08:13] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/693907 (https://phabricator.wikimedia.org/T277064) (owner: 10Hnowlan) [07:09:43] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:13:58] !log ryankemper@puppetmaster2001 conftool action : set/pooled=no; selector: name=wdqs2003.codfw.wmnet [07:14:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:14:08] !log Pooled `wdqs1013` (caught up on lag), de-pooled `wdqs2003` (should not have been pooled due to reimage failure) [07:14:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:18] (03CR) 1020after4: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/694731 (https://phabricator.wikimedia.org/T281390) (owner: 1020after4) [07:22:18] 10SRE, 10Data-Persistence-Backup, 10Wikimedia-Mailing-lists: The Great Clean Up of Mailman2 - https://phabricator.wikimedia.org/T282303 (10Ladsgroup) It sounds good to me. [07:23:11] (03CR) 10Giuseppe Lavagetto: [C: 03+2] docker::baseimages: remove bootstrap-vz, build bullseye [puppet] - 10https://gerrit.wikimedia.org/r/693918 (https://phabricator.wikimedia.org/T281984) (owner: 10Giuseppe Lavagetto) [07:24:17] 10SRE, 10Data-Persistence-Backup, 10Wikimedia-Mailing-lists: The Great Clean Up of Mailman2 - https://phabricator.wikimedia.org/T282303 (10jcrespo) Could you give me some meaningful restore operation (subdir). I am guessing recovering all will not be wanted because of time and space available. I can recover... [07:25:27] (03CR) 1020after4: "Apparently that directive only allows reducing the length, not increasing it. Bummer." 
[puppet] - 10https://gerrit.wikimedia.org/r/694731 (https://phabricator.wikimedia.org/T281390) (owner: 1020after4) [07:29:07] (03CR) 10MSantos: [C: 03+1] postgresql::postgis: use latest packages on buster [puppet] - 10https://gerrit.wikimedia.org/r/693907 (https://phabricator.wikimedia.org/T277064) (owner: 10Hnowlan) [07:32:58] (03CR) 10Awight: [C: 03+1] "Verified in video chat that this comes from Andrew-WMDE" [puppet] - 10https://gerrit.wikimedia.org/r/693428 (https://phabricator.wikimedia.org/T283355) (owner: 10Andrew-WMDE) [07:40:29] 10SRE, 10wikimedia-irc-libera: Move SRE-related IRC channels to Libera - https://phabricator.wikimedia.org/T283230 (10ema) [07:41:46] 10SRE, 10Data-Persistence-Backup, 10netops: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (10jcrespo) Thanks, that analysis is very useful. I feel we are making lots of progress already on understan... [07:42:52] _joe_: what the hell happened [07:43:11] <_joe_> RhinosF1: what do you mean? [07:43:31] the freenode thing? [07:43:32] <_joe_> the ##wikimedia-* channels on the doomed network shouldn't be joined :) [07:44:03] <_joe_> what happened is that the crown prince is trying very hard to disrupt the FLOSS communities that moved to libera [07:45:49] <_< [07:46:50] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [07:47:52] _joe_: yeah I only joined because automatic. Won't be in there any longer. 
Miraheze had same treatment [07:47:57] _joe_: I see the "Crown Prince" thing repeated in a bunch of sketchy reporting, but he was "crowned" in ahem Los Angeles so I guess that bit is a joke: https://en.wikipedia.org/wiki/Andrew_Lee_(entrepreneur) [07:48:29] <_joe_> awight: that's why I call him that [07:48:37] /o\ [07:48:50] <_joe_> it's all fakery :) [07:49:12] RhinosF1: they closed everything that had "libera" in topic, including "We will NOT be moving to Libera.Chat" [07:49:30] (03PS1) 10Jcrespo: Revert "Revert "Revert "bacula: Reenable read-write ES database backups, disable read-only""" [puppet] - 10https://gerrit.wikimedia.org/r/695030 [07:49:31] majavah: that's Rodin [07:49:33] Makes sense. Just requires many clicks to confirm that neither Korea is a monarchy. [07:49:34] Ridiculous [07:52:55] (03PS2) 10Jcrespo: Revert "Revert "Revert "bacula: Reenable read-write ES database backups, disable read-only""" [puppet] - 10https://gerrit.wikimedia.org/r/695030 [07:53:26] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [07:54:07] 10SRE, 10wikimedia-irc-libera: Move SRE-related IRC channels to Libera - https://phabricator.wikimedia.org/T283230 (10ema) [07:54:36] 10SRE, 10wikimedia-irc-libera: Move SRE-related IRC channels to Libera - https://phabricator.wikimedia.org/T283230 (10ema) [07:59:22] (03PS1) 10Giuseppe Lavagetto: docker::baseimages: add proxy support to the systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/695169 [07:59:37] 10SRE, 10wikimedia-irc-libera: Move SRE-related IRC channels to Libera - https://phabricator.wikimedia.org/T283230 (10ema) [08:00:26] (03CR) 10Filippo Giunchedi: [C: 03+2] Revert "Revert "prometheus: Migrate node_file_count cron to systemd timer"" [puppet] - 10https://gerrit.wikimedia.org/r/691317 (owner: 10Ladsgroup) [08:00:33] (03PS5) 10Filippo Giunchedi: Revert "Revert "prometheus: Migrate node_file_count cron to systemd timer"" [puppet] - 
10https://gerrit.wikimedia.org/r/691317 (owner: 10Ladsgroup) [08:02:07] (03PS1) 10Kosta Harlan: Add a link: update button states after acceptance changes [extensions/GrowthExperiments] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/695031 (https://phabricator.wikimedia.org/T283544) [08:02:48] (03PS1) 10Kosta Harlan: Add a link: update button states after acceptance changes [extensions/GrowthExperiments] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/695032 (https://phabricator.wikimedia.org/T283544) [08:04:58] (03PS1) 10Kosta Harlan: Add a link: Hide surface highlight overlay [extensions/GrowthExperiments] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/695033 (https://phabricator.wikimedia.org/T282899) [08:05:29] (03PS1) 10Kosta Harlan: Add a link: Hide surface highlight overlay [extensions/GrowthExperiments] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/695034 (https://phabricator.wikimedia.org/T282899) [08:05:59] (03CR) 10Filippo Giunchedi: [C: 03+2] udev: Bullseye compatibility for udevadm [puppet] - 10https://gerrit.wikimedia.org/r/693909 (owner: 10Filippo Giunchedi) [08:11:24] !log running 'optimize table' over parsercache db on pc1007 with replication enabled T282761 [08:11:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:28] T282761: purgeParserCache.php should not take over 24 hours for its daily run - https://phabricator.wikimedia.org/T282761 [08:12:14] <_joe_> !log purging images on deneb [08:12:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:50] mmhh thanos/prometheus were unhappy for a moment there, seems recovered now though [08:13:51] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29694/console" [puppet] - 10https://gerrit.wikimedia.org/r/695169 (owner: 10Giuseppe Lavagetto) [08:14:54] (03PS2) 10Giuseppe Lavagetto: docker::baseimages: add proxy support to the systemd 
timer [puppet] - 10https://gerrit.wikimedia.org/r/695169 [08:16:06] (03PS3) 10Jcrespo: dbbackups: Remove s6 stretch backup source instance on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/692341 (https://phabricator.wikimedia.org/T280751) [08:17:41] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29695/console" [puppet] - 10https://gerrit.wikimedia.org/r/695169 (owner: 10Giuseppe Lavagetto) [08:17:57] (03CR) 10Elukey: [V: 03+1 C: 03+1] docker::baseimages: add proxy support to the systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/695169 (owner: 10Giuseppe Lavagetto) [08:18:21] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Remove s6 stretch backup source instance on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/692341 (https://phabricator.wikimedia.org/T280751) (owner: 10Jcrespo) [08:24:47] (03CR) 10jerkins-bot: [V: 04-1] Add a link: Hide surface highlight overlay [extensions/GrowthExperiments] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/695034 (https://phabricator.wikimedia.org/T282899) (owner: 10Kosta Harlan) [08:31:05] (03PS23) 10Elukey: Add istio base images build support [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/688211 (https://phabricator.wikimedia.org/T278192) [08:31:08] (03PS8) 10Elukey: Add knative serving and net-istio images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/692899 (https://phabricator.wikimedia.org/T278194) [08:31:09] (03PS6) 10Elukey: Add base kubeflow kfserving images and kube-rbac-proxy [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/693644 (https://phabricator.wikimedia.org/T272919) [08:31:12] (03PS5) 10Elukey: Add Jetstack's cert-manager base go images. 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/693826 (https://phabricator.wikimedia.org/T280661) [08:31:13] (03PS1) 10Elukey: Add new golang 1.15 image based [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/695173 [08:31:58] (03PS2) 10Elukey: Add new golang 1.15 image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/695173 [08:32:00] (03PS24) 10Elukey: Add istio base images build support [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/688211 (https://phabricator.wikimedia.org/T278192) [08:32:02] (03PS9) 10Elukey: Add knative serving and net-istio images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/692899 (https://phabricator.wikimedia.org/T278194) [08:32:04] (03PS7) 10Elukey: Add base kubeflow kfserving images and kube-rbac-proxy [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/693644 (https://phabricator.wikimedia.org/T272919) [08:32:06] (03PS6) 10Elukey: Add Jetstack's cert-manager base go images. 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/693826 (https://phabricator.wikimedia.org/T280661) [08:33:00] 10SRE, 10wikimedia-irc-libera: Move SRE-related IRC channels to Libera - https://phabricator.wikimedia.org/T283230 (10fgiunchedi) [08:35:59] (03PS3) 10Giuseppe Lavagetto: docker::baseimages: add proxy support to the systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/695169 [08:36:19] (03CR) 10Kosta Harlan: "recheck" [extensions/GrowthExperiments] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/695034 (https://phabricator.wikimedia.org/T282899) (owner: 10Kosta Harlan) [08:37:46] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29696/console" [puppet] - 10https://gerrit.wikimedia.org/r/695169 (owner: 10Giuseppe Lavagetto) [08:38:53] 10SRE, 10Data-Persistence-Backup, 10netops: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (10cmooney) Hi Jamie, Thanks for the feedback. I think given the desire to push the WAN links relatively h... [08:40:28] 10SRE, 10wikimedia-irc-libera: Move SRE-related IRC channels to Libera - https://phabricator.wikimedia.org/T283230 (10fgiunchedi) [08:41:00] urbanecm would you mind taking a look at https://phabricator.wikimedia.org/T283639 and https://phabricator.wikimedia.org/T283637 ? 
I suspect they can just be closed - they are about exception from your remove-flow-group.php script [08:41:20] looking [08:42:17] closed both [08:42:30] (03PS4) 10Giuseppe Lavagetto: docker::baseimages: add proxy support to the systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/695169 [08:44:12] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29697/console" [puppet] - 10https://gerrit.wikimedia.org/r/695169 (owner: 10Giuseppe Lavagetto) [08:47:07] (03CR) 10Elukey: [V: 03+1 C: 03+1] docker::baseimages: add proxy support to the systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/695169 (owner: 10Giuseppe Lavagetto) [08:48:45] 10SRE, 10wikimedia-irc-libera: Move SRE-related IRC channels to Libera - https://phabricator.wikimedia.org/T283230 (10fgiunchedi) [08:51:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1106 (re)pooling @ 25%: Repool db1106', diff saved to https://phabricator.wikimedia.org/P16219 and previous config saved to /var/cache/conftool/dbconfig/20210526-085137-root.json [08:51:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:34] (03CR) 10Klausman: Add new golang 1.15 image (032 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/695173 (owner: 10Elukey) [08:55:47] (03CR) 10Giuseppe Lavagetto: [C: 03+2] docker::baseimages: add proxy support to the systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/695169 (owner: 10Giuseppe Lavagetto) [08:57:05] (03CR) 10Elukey: Add new golang 1.15 image (032 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/695173 (owner: 10Elukey) [08:57:36] <_joe_> jbond: I'm getting a strange puppet failure [08:57:41] <_joe_> Warning: Unable to fetch my node definition, but the agent run will continue: [08:58:05] <_joe_> oh le sigh [08:58:08] <_joe_> nevermind [08:58:11] <_joe_> I had http_proxy set [08:58:19] <_joe_> and apparently the 
puppet agent respects that [08:58:22] <_joe_> :D [08:58:33] How annoyingly respectful [08:58:58] 10SRE, 10wikimedia-irc-libera: Move SRE-related IRC channels to Libera - https://phabricator.wikimedia.org/T283230 (10Volans) [08:59:31] :) [09:01:29] (03PS1) 10Ladsgroup: Wrap list of acceptable site ids with an APCu cache in API [extensions/Wikibase] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/695035 [09:03:51] (03PS1) 10Elukey: Add kafka-main[12]00[45] to analytics-in{4,6} filters [homer/public] - 10https://gerrit.wikimedia.org/r/695192 (https://phabricator.wikimedia.org/T225005) [09:05:01] (03CR) 10Ladsgroup: [C: 03+2] "To reach the train" [extensions/Wikibase] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/695035 (owner: 10Ladsgroup) [09:06:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1106 (re)pooling @ 50%: Repool db1106', diff saved to https://phabricator.wikimedia.org/P16220 and previous config saved to /var/cache/conftool/dbconfig/20210526-090640-root.json [09:06:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:06] (03CR) 10Ayounsi: [C: 03+1] Add kafka-main[12]00[45] to analytics-in{4,6} filters [homer/public] - 10https://gerrit.wikimedia.org/r/695192 (https://phabricator.wikimedia.org/T225005) (owner: 10Elukey) [09:11:33] XioNoX: <3 - ok if I deploy it? 
[09:11:56] elukey: yup [09:12:05] perfect [09:12:23] (03CR) 10Elukey: [C: 03+2] Add kafka-main[12]00[45] to analytics-in{4,6} filters [homer/public] - 10https://gerrit.wikimedia.org/r/695192 (https://phabricator.wikimedia.org/T225005) (owner: 10Elukey) [09:13:21] !log deploy https://gerrit.wikimedia.org/r/c/operations/homer/public/+/695192 on {cr1|cr2}-eqiad - T225005 [09:13:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:25] T225005: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 [09:21:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1106 (re)pooling @ 75%: Repool db1106', diff saved to https://phabricator.wikimedia.org/P16221 and previous config saved to /var/cache/conftool/dbconfig/20210526-092144-root.json [09:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:54] (03PS1) 10David Caro: prometheus: Override retention also when specifying retention by size [puppet] - 10https://gerrit.wikimedia.org/r/695194 [09:22:43] (03CR) 10jerkins-bot: [V: 04-1] prometheus: Override retention also when specifying retention by size [puppet] - 10https://gerrit.wikimedia.org/r/695194 (owner: 10David Caro) [09:24:33] 10SRE, 10wikimedia-irc-libera: Move SRE-related IRC channels to Libera - https://phabricator.wikimedia.org/T283230 (10Joe) [09:24:53] (03PS1) 10Ladsgroup: Wrap list of acceptable site ids with an APCu cache in API [extensions/Wikibase] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/695037 [09:25:57] (03PS2) 10David Caro: prometheus: Override retention also when specifying retention by size [puppet] - 10https://gerrit.wikimedia.org/r/695194 [09:26:36] (03CR) 10Ladsgroup: [C: 03+2] Wrap list of acceptable site ids with an APCu cache in API [extensions/Wikibase] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/695037 (owner: 10Ladsgroup) [09:28:35] 10SRE, 10wikimedia-irc-libera: Move SRE-related 
IRC channels to Libera - https://phabricator.wikimedia.org/T283230 (10elukey) [09:29:19] 10SRE, 10wikimedia-irc-libera: Move SRE-related IRC channels to Libera - https://phabricator.wikimedia.org/T283230 (10elukey) [09:31:04] (03Merged) 10jenkins-bot: Wrap list of acceptable site ids with an APCu cache in API [extensions/Wikibase] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/695035 (owner: 10Ladsgroup) [09:36:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1106 (re)pooling @ 100%: Repool db1106', diff saved to https://phabricator.wikimedia.org/P16222 and previous config saved to /var/cache/conftool/dbconfig/20210526-093647-root.json [09:36:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:45] (03PS2) 10Marostegui: admin: Add YubiKey SSH key for Andrew Kostka [puppet] - 10https://gerrit.wikimedia.org/r/693428 (https://phabricator.wikimedia.org/T283355) (owner: 10Andrew-WMDE) [09:39:17] (03CR) 10Marostegui: [C: 03+2] admin: Add YubiKey SSH key for Andrew Kostka [puppet] - 10https://gerrit.wikimedia.org/r/693428 (https://phabricator.wikimedia.org/T283355) (owner: 10Andrew-WMDE) [09:41:01] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Enroll Andrew Kostka’s YubiKey for production access - https://phabricator.wikimedia.org/T283355 (10Marostegui) 05Open→03Resolved a:03Marostegui This has been merged. Give it sometime so puppet runs everywhere before starting to use it. As the step #11... 
[09:42:02] !log ladsgroup@deploy1002 Synchronized php-1.37.0-wmf.7/extensions/Wikibase: Backport: [[gerrit:695035|Wrap list of acceptable site ids with an APCu cache in API]] (duration: 02m 12s) [09:42:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:31] !log rm /root/prometheus from prometheus5001 - old transition files [09:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:39] (03PS6) 10David Caro: wmcs.ceph: Added mon upgrade cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/682099 [09:46:41] (03PS3) 10David Caro: wmcs.ceph: add cookbook to upgrade all osds [cookbooks] - 10https://gerrit.wikimedia.org/r/682106 (https://phabricator.wikimedia.org/T280641) [09:46:43] (03PS1) 10David Caro: gerrit: using wmcs as the default branch [cookbooks] - 10https://gerrit.wikimedia.org/r/695198 [09:46:51] (03Merged) 10jenkins-bot: Wrap list of acceptable site ids with an APCu cache in API [extensions/Wikibase] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/695037 (owner: 10Ladsgroup) [09:47:24] (03CR) 10David Caro: "Don't merge yet" [cookbooks] - 10https://gerrit.wikimedia.org/r/695198 (owner: 10David Caro) [09:47:28] (03CR) 10David Caro: [C: 04-1] gerrit: using wmcs as the default branch [cookbooks] - 10https://gerrit.wikimedia.org/r/695198 (owner: 10David Caro) [09:49:56] (03PS1) 10David Caro: gerrit: using wmcs as the default branch [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/695199 [09:50:13] (03Abandoned) 10David Caro: gerrit: using wmcs as the default branch [cookbooks] - 10https://gerrit.wikimedia.org/r/695198 (owner: 10David Caro) [09:50:57] 10SRE, 10Wikimedia-Logstash, 10observability: Ingestion errors for production logs on ELK7 - https://phabricator.wikimedia.org/T240667 (10hashar) [09:52:05] (03PS1) 10David Caro: wmcs.ceph: Added mon upgrade cookbooks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/695201 [09:52:07] (03PS1) 10David Caro: wmcs.ceph: add cookbook 
to upgrade all osds [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/695202 (https://phabricator.wikimedia.org/T280641) [09:52:23] (03Abandoned) 10David Caro: wmcs.ceph: Added mon upgrade cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/682099 (owner: 10David Caro) [09:52:38] (03Abandoned) 10David Caro: wmcs.ceph: add cookbook to upgrade all osds [cookbooks] - 10https://gerrit.wikimedia.org/r/682106 (https://phabricator.wikimedia.org/T280641) (owner: 10David Caro) [09:53:08] (03PS1) 10Muehlenhoff: Add logout script for sretest [puppet] - 10https://gerrit.wikimedia.org/r/695203 (https://phabricator.wikimedia.org/T283242) [09:54:08] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [09:54:21] 10SRE, 10Wikimedia-Logstash, 10observability: Ingestion errors for production logs on ELK7 - https://phabricator.wikimedia.org/T240667 (10hashar) [09:54:30] PROBLEM - HTTP on labstore1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dumps/XML-SQL_Dumps%23A_labstore_host_dies_%28web_or_nfs_server_for_dumps%29 [09:55:05] 10SRE, 10Wikimedia-Logstash, 10observability: Ingestion errors for production logs on ELK7 - https://phabricator.wikimedia.org/T240667 (10hashar) [09:55:32] (03CR) 10jerkins-bot: [V: 04-1] Add logout script for sretest [puppet] - 10https://gerrit.wikimedia.org/r/695203 (https://phabricator.wikimedia.org/T283242) (owner: 10Muehlenhoff) [09:55:49] !log ladsgroup@deploy1002 Synchronized php-1.37.0-wmf.6/extensions/Wikibase: Backport: [[gerrit:695037|Wrap list of acceptable site ids with an APCu cache in API]] (duration: 01m 18s) [09:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:04] you might get an alert for 
during the time of deployment [09:56:08] but it should recover [09:56:12] (03CR) 10Elukey: [C: 04-1] "I am able to docker-pkg build images/golang/bullseye, but not the golang dir, that fails for "RuntimeError: Trying to reinstantiate the FS" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/695173 (owner: 10Elukey) [09:56:34] 10SRE, 10serviceops, 10Patch-For-Review: Publish wikimedia-bullseye base docker image - https://phabricator.wikimedia.org/T281596 (10Joe) 05Open→03Resolved [09:57:46] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [09:58:00] RECOVERY - HTTP on labstore1007 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 383 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Dumps/XML-SQL_Dumps%23A_labstore_host_dies_%28web_or_nfs_server_for_dumps%29 [09:58:32] (03PS2) 10Muehlenhoff: Add logout script for sretest [puppet] - 10https://gerrit.wikimedia.org/r/695203 (https://phabricator.wikimedia.org/T283242) [09:59:08] (03PS1) 10Kosta Harlan: PostEdit: Fix skip all suggestions on mobile, and don't reset session if task was cancelled [extensions/GrowthExperiments] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/695038 (https://phabricator.wikimedia.org/T282546) [09:59:10] what happened to labstore1007 I wonder [09:59:11] huh [10:00:01] (03PS1) 10Kosta Harlan: PostEdit: Fix skip all suggestions on mobile, and don't reset session if task was cancelled [extensions/GrowthExperiments] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/695039 (https://phabricator.wikimedia.org/T282546) [10:00:03] (03CR) 10jerkins-bot: [V: 04-1] Add logout script for sretest [puppet] - 10https://gerrit.wikimedia.org/r/695203 (https://phabricator.wikimedia.org/T283242) (owner: 10Muehlenhoff) [10:01:50] !log 
aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 180 days, 0:00:00 on labstore1007.wikimedia.org with reason: T281045 [10:01:51] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 180 days, 0:00:00 on labstore1007.wikimedia.org with reason: T281045 [10:01:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:54] T281045: labstore1007 crashed after storage controller errors--replace disk? - https://phabricator.wikimedia.org/T281045 [10:01:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:40] (03CR) 10Elukey: [C: 04-1] Add new golang 1.15 image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/695173 (owner: 10Elukey) [10:03:12] 10SRE, 10wikimedia-irc-libera: Move SRE-related IRC channels to Libera - https://phabricator.wikimedia.org/T283230 (10ArielGlenn) From conversation in irc with gehel and others: -discovery is now -search, an entry about public logging needs to be added to the topic and wikibugs needs to be invited to the channe... 
[10:05:11] (03CR) 10Elukey: [C: 04-1] Add new golang 1.15 image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/695173 (owner: 10Elukey) [10:11:10] (03CR) 10Klausman: Add new golang 1.15 image (032 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/695173 (owner: 10Elukey) [10:17:05] (03PS1) 10David Caro: wmcs: add cloudvirt drain cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/695205 (https://phabricator.wikimedia.org/T280641) [10:17:29] (03PS1) 10David Caro: wmcs.openstack: add safe_reboot cloudvirt cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/695206 (https://phabricator.wikimedia.org/T280641) [10:17:31] (03PS1) 10David Caro: wmcs.openstack: add live_upgrade cloudvirt cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/695207 (https://phabricator.wikimedia.org/T280641) [10:17:33] (03PS1) 10David Caro: wmcs.cloudvirt.safe_reboot: add log to SAL [cookbooks] - 10https://gerrit.wikimedia.org/r/695208 (https://phabricator.wikimedia.org/T279076) [10:18:09] (03Abandoned) 10David Caro: wmcs.cloudvirt.safe_reboot: add log to SAL [cookbooks] - 10https://gerrit.wikimedia.org/r/684812 (https://phabricator.wikimedia.org/T279076) (owner: 10David Caro) [10:18:14] (03Abandoned) 10David Caro: wmcs.openstack: add live_upgrade cloudvirt cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/683371 (https://phabricator.wikimedia.org/T280641) (owner: 10David Caro) [10:18:19] (03Abandoned) 10David Caro: wmcs.openstack: add safe_reboot cloudvirt cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/683888 (https://phabricator.wikimedia.org/T280641) (owner: 10David Caro) [10:18:24] (03Abandoned) 10David Caro: wmcs: add cloudvirt drain cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/683370 (https://phabricator.wikimedia.org/T280641) (owner: 10David Caro) [10:18:32] (03CR) 10Klausman: Add new golang 1.15 image (031 comment) [docker-images/production-images] - 
10https://gerrit.wikimedia.org/r/695173 (owner: 10Elukey) [10:18:48] (03Abandoned) 10David Caro: wmcs.cloudvirt.safe_reboot: add log to SAL [cookbooks] - 10https://gerrit.wikimedia.org/r/695208 (https://phabricator.wikimedia.org/T279076) (owner: 10David Caro) [10:18:58] (03Abandoned) 10David Caro: wmcs.openstack: add live_upgrade cloudvirt cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/695207 (https://phabricator.wikimedia.org/T280641) (owner: 10David Caro) [10:19:03] (03Abandoned) 10David Caro: wmcs.openstack: add safe_reboot cloudvirt cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/695206 (https://phabricator.wikimedia.org/T280641) (owner: 10David Caro) [10:19:10] (03Abandoned) 10David Caro: wmcs: add cloudvirt drain cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/695205 (https://phabricator.wikimedia.org/T280641) (owner: 10David Caro) [10:23:03] (03PS1) 10David Caro: wmcs: add cloudvirt drain cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/695210 (https://phabricator.wikimedia.org/T280641) [10:23:05] (03PS1) 10David Caro: wmcs.openstack: add safe_reboot cloudvirt cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/695211 (https://phabricator.wikimedia.org/T280641) [10:23:07] (03PS1) 10David Caro: wmcs.openstack: add live_upgrade cloudvirt cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/695212 (https://phabricator.wikimedia.org/T280641) [10:23:09] (03PS1) 10David Caro: wmcs.cloudvirt.safe_reboot: add log to SAL [cookbooks] - 10https://gerrit.wikimedia.org/r/695213 (https://phabricator.wikimedia.org/T279076) [10:24:34] (03Abandoned) 10David Caro: wmcs: add cloudvirt drain cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/695210 (https://phabricator.wikimedia.org/T280641) (owner: 10David Caro) [10:24:48] (03Abandoned) 10David Caro: wmcs.openstack: add safe_reboot cloudvirt cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/695211 (https://phabricator.wikimedia.org/T280641) (owner: 10David Caro) [10:24:54] 
(03Abandoned) 10David Caro: wmcs.openstack: add live_upgrade cloudvirt cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/695212 (https://phabricator.wikimedia.org/T280641) (owner: 10David Caro) [10:25:00] (03Abandoned) 10David Caro: wmcs.cloudvirt.safe_reboot: add log to SAL [cookbooks] - 10https://gerrit.wikimedia.org/r/695213 (https://phabricator.wikimedia.org/T279076) (owner: 10David Caro) [10:26:13] !log installing lz4 security updates on buster [10:26:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:53] (03PS1) 10Muehlenhoff: Add library hint for lz4 [puppet] - 10https://gerrit.wikimedia.org/r/695219 [10:32:21] (03PS1) 10David Caro: wmcs: add cloudvirt drain cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/695220 (https://phabricator.wikimedia.org/T280641) [10:32:23] (03PS1) 10David Caro: wmcs.openstack: add safe_reboot cloudvirt cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/695221 (https://phabricator.wikimedia.org/T280641) [10:32:25] (03PS1) 10David Caro: wmcs.openstack: add live_upgrade cloudvirt cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/695222 (https://phabricator.wikimedia.org/T280641) [10:32:27] (03PS1) 10David Caro: wmcs.cloudvirt.safe_reboot: add log to SAL [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/695223 (https://phabricator.wikimedia.org/T279076) [10:36:30] Nikerabbit: that ULS merge failure is a valid one, it is an issue on the CI infra and I will fix it shortly [10:36:36] it is related to the branch deletions for some reason [10:38:06] hashar: thanks for looking! 
I was afraid it was related [10:38:20] I don't get why they are not pruned though [10:38:24] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "My suggestion is:" (034 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/695173 (owner: 10Elukey) [10:38:28] I will investigate a bit and then just manually fix it [10:40:15] OH [10:40:27] that is a bug in git :) [10:40:56] (03CR) 10Muehlenhoff: [C: 03+2] Skip Cumin/Homer/Spicerack on cumin2001 [puppet] - 10https://gerrit.wikimedia.org/r/693130 (https://phabricator.wikimedia.org/T276589) (owner: 10Muehlenhoff) [10:40:58] huh? branch creation and deletion should be a common operation enough to not have bugs [10:47:28] Nikerabbit: yeah well it seems Zuul does not remote prune the branch properly somehow [10:48:53] (03CR) 10Kosta Harlan: "This change is ready for review." [extensions/GrowthExperiments] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/695042 (https://phabricator.wikimedia.org/T283661) (owner: 10Kosta Harlan) [10:50:12] (03PS2) 10Kosta Harlan: AddLinkSaveDialog: Ensure getActionProcess is called [extensions/GrowthExperiments] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/695042 (https://phabricator.wikimedia.org/T283661) [10:53:01] (03CR) 10Urbanecm: [C: 03+2] Add a link: update button states after acceptance changes [extensions/GrowthExperiments] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/695031 (https://phabricator.wikimedia.org/T283544) (owner: 10Kosta Harlan) [10:53:03] (03PS1) 10Jbond: profile::contacts: add a profile and define for adding contact metadata [puppet] - 10https://gerrit.wikimedia.org/r/695230 [10:53:06] (03CR) 10Urbanecm: [C: 03+2] Add a link: update button states after acceptance changes [extensions/GrowthExperiments] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/695032 (https://phabricator.wikimedia.org/T283544) (owner: 10Kosta Harlan) [10:53:28] (03CR) 10Jbond: system::role: add ability to specify role owners (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/692632 (owner: 10Jbond) [10:53:30] (03CR) 10Urbanecm: [C: 03+2] Add a link: Hide surface highlight overlay [extensions/GrowthExperiments] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/695033 (https://phabricator.wikimedia.org/T282899) (owner: 10Kosta Harlan) [10:53:35] (03CR) 10Urbanecm: [C: 03+2] Add a link: Hide surface highlight overlay [extensions/GrowthExperiments] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/695034 (https://phabricator.wikimedia.org/T282899) (owner: 10Kosta Harlan) [10:53:44] (03Abandoned) 10Jbond: system::role: add ability to specify role owners [puppet] - 10https://gerrit.wikimedia.org/r/692632 (owner: 10Jbond) [10:53:53] (03Abandoned) 10Jbond: P:role_data: create a new profile to call system::role [puppet] - 10https://gerrit.wikimedia.org/r/692635 (owner: 10Jbond) [10:53:55] (03Abandoned) 10Kosta Harlan: AddLinkSaveDialog: Ensure getActionProcess is called [extensions/GrowthExperiments] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/695042 (https://phabricator.wikimedia.org/T283661) (owner: 10Kosta Harlan) [10:54:19] (03CR) 10Urbanecm: [C: 03+2] PostEdit: Fix skip all suggestions on mobile, and don't reset session if task was cancelled [extensions/GrowthExperiments] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/695038 (https://phabricator.wikimedia.org/T282546) (owner: 10Kosta Harlan) [10:54:23] (03CR) 10Urbanecm: [C: 03+2] PostEdit: Fix skip all suggestions on mobile, and don't reset session if task was cancelled [extensions/GrowthExperiments] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/695039 (https://phabricator.wikimedia.org/T282546) (owner: 10Kosta Harlan) [10:54:30] (03CR) 10jerkins-bot: [V: 04-1] profile::contacts: add a profile and define for adding contact metadata [puppet] - 10https://gerrit.wikimedia.org/r/695230 (owner: 10Jbond) [10:57:38] (03PS1) 10Muehlenhoff: Readd cumin on cumin2001 [puppet] - 10https://gerrit.wikimedia.org/r/695233 [10:58:30] 
Nikerabbit: I have fixed the ULS git repositories in Zuul [11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for European mid-day backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210526T1100). [11:00:05] itamarWMDE and kostajh: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:12] i can deploy today [11:00:12] o/ [11:00:14] I can deploy [11:00:26] hello [11:00:32] two deployers Lucas_WMDE, nice :) [11:00:34] hi hi [11:00:35] hello [11:00:38] hi itamarWMDE :) [11:00:44] (03CR) 10Urbanecm: [C: 03+2] Test Wikidata: Enable empty list to object serialization [mediawiki-config] - 10https://gerrit.wikimedia.org/r/694339 (https://phabricator.wikimedia.org/T241422) (owner: 10Itamar Givon) [11:00:53] ok but who’s deploying [11:01:05] Lucas_WMDE: feel free to do the config patch if you wish [11:01:15] sure [11:01:22] i have a lot of backports to do then :) [11:01:46] (03Merged) 10jenkins-bot: Test Wikidata: Enable empty list to object serialization [mediawiki-config] - 10https://gerrit.wikimedia.org/r/694339 (https://phabricator.wikimedia.org/T241422) (owner: 10Itamar Givon) [11:02:26] itamarWMDE: change is on mwdebug1001 [11:02:28] testing… [11:02:40] :+1: Thx [11:03:44] seems to be behaving correctly – on test.wikidata.org, `claims: []` turns into `claims: {}`, while www.wikidata.org is unaffected for now [11:04:06] 100% Thank you Lucas_WMDE [11:04:13] alright, syncing then [11:05:28] (03CR) 10Muehlenhoff: [C: 03+2] Readd cumin on cumin2001 [puppet] - 10https://gerrit.wikimedia.org/r/695233 (owner: 10Muehlenhoff) [11:05:44] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:694339|Test Wikidata: Enable empty list to object serialization (T241422)]] 
(duration: 01m 19s) [11:05:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:48] T241422: Wikidata forms without statements use empty JSON array instead of empty JSON object - https://phabricator.wikimedia.org/T241422 [11:06:02] (03PS2) 10Jbond: profile::contacts: add a profile and define for adding contact metadata [puppet] - 10https://gerrit.wikimedia.org/r/695230 [11:06:04] (03PS1) 10Jbond: (Test): Example PR demonstrating the contacts profile [puppet] - 10https://gerrit.wikimedia.org/r/695236 [11:06:56] urbanecm: the stage is yours [11:07:00] unless you want me to do the backports too? [11:07:08] i can do 'em :) [11:07:10] thanks [11:07:12] ok :) [11:09:00] (03PS3) 10Jbond: profile::contacts: add a profile and define for adding contact metadata [puppet] - 10https://gerrit.wikimedia.org/r/695230 [11:09:14] (03PS2) 10Jbond: (Test): Example PR demonstrating the contacts profile [puppet] - 10https://gerrit.wikimedia.org/r/695236 [11:09:21] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:10:49] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:12:44] (03Merged) 10jenkins-bot: Add a link: update button states after acceptance changes [extensions/GrowthExperiments] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/695031 (https://phabricator.wikimedia.org/T283544) (owner: 10Kosta Harlan) [11:13:11] kostajh: do you want to test 'em at mwdebug? or should i just sync since add a link is not "easily available" anyway? 
[11:14:39] urbanecm: I could review on mwdebug [11:14:56] kostajh: 695031 was fetched there, please test [11:15:01] waiting for the rest of them to merge [11:15:24] urbanecm: ok looking [11:15:25] (03Merged) 10jenkins-bot: Add a link: update button states after acceptance changes [extensions/GrowthExperiments] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/695032 (https://phabricator.wikimedia.org/T283544) (owner: 10Kosta Harlan) [11:16:07] also pulled 695032 there (same but for wmf.6) [11:16:07] (03PS1) 10Kosta Harlan: Allow running fixLinkRecommendationData --search-index in production [extensions/GrowthExperiments] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/695044 (https://phabricator.wikimedia.org/T283606) [11:16:50] (03PS1) 10Kosta Harlan: Allow running fixLinkRecommendationData --search-index in production [extensions/GrowthExperiments] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/695045 (https://phabricator.wikimedia.org/T283606) [11:17:32] (03Merged) 10jenkins-bot: Add a link: Hide surface highlight overlay [extensions/GrowthExperiments] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/695033 (https://phabricator.wikimedia.org/T282899) (owner: 10Kosta Harlan) [11:17:34] urbanecm: mwdebug1001? 
[11:17:34] (03Merged) 10jenkins-bot: Add a link: Hide surface highlight overlay [extensions/GrowthExperiments] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/695034 (https://phabricator.wikimedia.org/T282899) (owner: 10Kosta Harlan) [11:17:36] (03Merged) 10jenkins-bot: PostEdit: Fix skip all suggestions on mobile, and don't reset session if task was cancelled [extensions/GrowthExperiments] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/695038 (https://phabricator.wikimedia.org/T282546) (owner: 10Kosta Harlan) [11:17:39] (03Merged) 10jenkins-bot: PostEdit: Fix skip all suggestions on mobile, and don't reset session if task was cancelled [extensions/GrowthExperiments] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/695039 (https://phabricator.wikimedia.org/T282546) (owner: 10Kosta Harlan) [11:17:41] yes kostajh [11:18:54] (03CR) 10Urbanecm: [C: 03+2] Allow running fixLinkRecommendationData --search-index in production [extensions/GrowthExperiments] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/695045 (https://phabricator.wikimedia.org/T283606) (owner: 10Kosta Harlan) [11:18:55] PROBLEM - DPKG on cumin2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:19:06] (03CR) 10Urbanecm: [C: 03+2] Allow running fixLinkRecommendationData --search-index in production [extensions/GrowthExperiments] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/695044 (https://phabricator.wikimedia.org/T283606) (owner: 10Kosta Harlan) [11:19:15] pulled patches that merged to mwdebug1001, too [11:19:38] (so all but the maint script should be there) [11:20:38] urbanecm: ok looking [11:21:30] (03PS4) 10Jbond: profile::contacts: add a profile and define for adding contact metadata [puppet] - 10https://gerrit.wikimedia.org/r/695230 [11:21:42] (03CR) 10Muehlenhoff: profile::contacts: add a profile and define for adding contact metadata (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/695230 
(owner: 10Jbond) [11:22:46] urbanecm: all looks good [11:23:08] kostajh: thanks, syncing [11:23:15] (03PS5) 10Jbond: profile::contacts: add a profile and define for adding contact metadata [puppet] - 10https://gerrit.wikimedia.org/r/695230 [11:24:55] (03CR) 10Muehlenhoff: profile::contacts: add a profile and define for adding contact metadata (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/695230 (owner: 10Jbond) [11:25:06] (03PS6) 10Jbond: profile::contacts: add a profile and define for adding contact metadata [puppet] - 10https://gerrit.wikimedia.org/r/695230 [11:25:16] (03PS3) 10Jbond: (Test): Example PR demonstrating the contacts profile [puppet] - 10https://gerrit.wikimedia.org/r/695236 [11:25:40] (03PS1) 10MSantos: maps: fix SQL modules paths in import script [puppet] - 10https://gerrit.wikimedia.org/r/695240 [11:26:29] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.6/extensions/GrowthExperiments/: GrowthExperiments backports (T283544; T282899; T282546) (duration: 01m 19s) [11:26:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:35] T283544: Add a link: both "Yes" and "No" can be selected - https://phabricator.wikimedia.org/T283544 [11:26:36] T282546: AddLink: Skipped all dialog throws exception on ve.init.target.tryTeardown.then on mobile - https://phabricator.wikimedia.org/T282546 [11:26:36] T282899: [wmf.5] Add link: Unresponsive controls when context item overlaps templates - https://phabricator.wikimedia.org/T282899 [11:27:45] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.7/extensions/GrowthExperiments/: GrowthExperiments backports (T283544; T282899; T282546) (duration: 01m 06s) [11:27:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:51] (03CR) 10Jbond: "thanks updated" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/695230 (owner: 10Jbond) [11:27:54] kostajh: should be done [11:28:21] urbanecm: great, thank you [11:28:37] ACKNOWLEDGEMENT - Check 
systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: daily_account_consistency_check.service Marostegui see task on the downtime comment https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:30:13] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on maps1009.eqiad.wmnet with reason: Planet reimport [11:30:14] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on maps1009.eqiad.wmnet with reason: Planet reimport [11:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:27] (03CR) 10Muehlenhoff: profile::contacts: add a profile and define for adding contact metadata (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/695230 (owner: 10Jbond) [11:32:49] (03PS7) 10Jbond: profile::contacts: add a profile and define for adding contact metadata [puppet] - 10https://gerrit.wikimedia.org/r/695230 [11:33:41] (03PS4) 10Jbond: (Test): Example PR demonstrating the contacts profile [puppet] - 10https://gerrit.wikimedia.org/r/695236 [11:36:37] (03Merged) 10jenkins-bot: Allow running fixLinkRecommendationData --search-index in production [extensions/GrowthExperiments] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/695045 (https://phabricator.wikimedia.org/T283606) (owner: 10Kosta Harlan) [11:39:48] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.7/extensions/GrowthExperiments/maintenance/fixLinkRecommendationData.php: 86bba48: Allow running fixLinkRecommendationData --search-index in production (T283606) (duration: 01m 06s) [11:39:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:52] T283606: Add a link: too many articles have no suggestions upon arrival - https://phabricator.wikimedia.org/T283606 [11:41:08] PROBLEM - Keyholder SSH agent on cumin2001 is CRITICAL: CRITICAL: Keyholder 
is not armed. Run keyholder arm to arm it. https://wikitech.wikimedia.org/wiki/Keyholder [11:41:08] (03Merged) 10jenkins-bot: Allow running fixLinkRecommendationData --search-index in production [extensions/GrowthExperiments] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/695044 (https://phabricator.wikimedia.org/T283606) (owner: 10Kosta Harlan) [11:44:04] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.6/extensions/GrowthExperiments/maintenance/fixLinkRecommendationData.php: b3c2941: Allow running fixLinkRecommendationData --search-index in production (T283606) (duration: 01m 07s) [11:44:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:24] kostajh: maint script done too, feel free to run it if you want to. [11:44:56] urbanecm: I'll leave that for Gergo, but thank you [11:45:04] fine with me :) [11:49:40] (03PS8) 10Jbond: profile::contacts: add a profile and define for adding contact metadata [puppet] - 10https://gerrit.wikimedia.org/r/695230 [11:49:44] (03PS5) 10Jbond: (Test): Example PR demonstrating the contacts profile [puppet] - 10https://gerrit.wikimedia.org/r/695236 [11:51:55] (03PS6) 10Jbond: (Test): Example PR demonstrating the contacts profile [puppet] - 10https://gerrit.wikimedia.org/r/695236 [11:54:19] (03PS9) 10Jbond: profile::contacts: add a profile and define for adding contact metadata [puppet] - 10https://gerrit.wikimedia.org/r/695230 [11:54:39] (03PS7) 10Jbond: (Test): Example PR demonstrating the contacts profile [puppet] - 10https://gerrit.wikimedia.org/r/695236 [11:54:46] (03PS8) 10Jbond: (Test): Example PR demonstrating the contacts profile [puppet] - 10https://gerrit.wikimedia.org/r/695236 [11:55:37] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29705/console" [puppet] - 10https://gerrit.wikimedia.org/r/695236 (owner: 10Jbond) [12:01:38] (03CR) 10Majavah: [C: 04-1] profile::contacts: add a profile and define for 
adding contact metadata (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/695230 (owner: 10Jbond) [12:03:29] (03PS10) 10Jbond: profile::contacts: add a profile and define for adding contact metadata [puppet] - 10https://gerrit.wikimedia.org/r/695230 [12:05:10] (03CR) 10Jbond: profile::contacts: add a profile and define for adding contact metadata (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/695230 (owner: 10Jbond) [12:05:24] (03PS9) 10Jbond: (Test): Example PR demonstrating the contacts profile [puppet] - 10https://gerrit.wikimedia.org/r/695236 [12:06:35] 10SRE, 10SRE-Access-Requests: access to analytics data for wdqs for jmixter - https://phabricator.wikimedia.org/T283632 (10CBogen) >>! In T283632#7113938, @Dzahn wrote: > - get approval from a manager I'm Jeff's manager and I approve this request. He is consulting with us to help analyze WDQS data to provide... [12:10:04] (03PS11) 10Jbond: profile::contacts: add a profile and define for adding contact metadata [puppet] - 10https://gerrit.wikimedia.org/r/695230 [12:10:14] (03PS10) 10Jbond: (Test): Example PR demonstrating the contacts profile [puppet] - 10https://gerrit.wikimedia.org/r/695236 [12:10:37] (03PS1) 10Muehlenhoff: Extend access for amy-wmde [puppet] - 10https://gerrit.wikimedia.org/r/695253 [12:10:38] 10SRE, 10SRE-Access-Requests: access to analytics data for wdqs for jmixter - https://phabricator.wikimedia.org/T283632 (10Marostegui) Thanks @CBogen. As Daniel mentioned, can we get this task to follow the procedure described at https://wikitech.wikimedia.org/wiki/Production_access#Access_Request_Process Tha... 
[12:11:21] (03CR) 10Jbond: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29706/console" [puppet] - 10https://gerrit.wikimedia.org/r/685496 (https://phabricator.wikimedia.org/T281673) (owner: 10Jbond) [12:11:29] (03CR) 10jerkins-bot: [V: 04-1] profile::contacts: add a profile and define for adding contact metadata [puppet] - 10https://gerrit.wikimedia.org/r/695230 (owner: 10Jbond) [12:11:35] (03CR) 10Jbond: "need to update the work on the scafolding for this will do so now" [puppet] - 10https://gerrit.wikimedia.org/r/695203 (https://phabricator.wikimedia.org/T283242) (owner: 10Muehlenhoff) [12:12:03] (03CR) 10jerkins-bot: [V: 04-1] (Test): Example PR demonstrating the contacts profile [puppet] - 10https://gerrit.wikimedia.org/r/695236 (owner: 10Jbond) [12:13:30] (03CR) 10Muehlenhoff: [C: 03+2] Extend access for amy-wmde [puppet] - 10https://gerrit.wikimedia.org/r/695253 (owner: 10Muehlenhoff) [12:14:44] (03PS12) 10Jbond: profile::contacts: add a profile and define for adding contact metadata [puppet] - 10https://gerrit.wikimedia.org/r/695230 [12:14:56] (03PS11) 10Jbond: (Test): Example PR demonstrating the contacts profile [puppet] - 10https://gerrit.wikimedia.org/r/695236 [12:17:07] (03PS5) 10Jbond: (WIP) create a logout.d profile for managing logout scripts [puppet] - 10https://gerrit.wikimedia.org/r/693149 [12:17:09] (03PS1) 10Jbond: (WIP) add logout script for mod_auth_cas [puppet] - 10https://gerrit.wikimedia.org/r/695255 [12:19:13] (03CR) 10jerkins-bot: [V: 04-1] (WIP) create a logout.d profile for managing logout scripts [puppet] - 10https://gerrit.wikimedia.org/r/693149 (owner: 10Jbond) [12:19:23] 10SRE, 10wikimedia-irc-libera: Move SRE-related IRC channels to Libera - https://phabricator.wikimedia.org/T283230 (10faidon) [12:20:32] 10SRE, 10wikimedia-irc-libera: Move SRE-related IRC channels to Libera - https://phabricator.wikimedia.org/T283230 (10Gehel) [12:20:52] (03PS6) 10Jbond: 
P:logoutd: create a logout.d profile for managing logout scripts [puppet] - 10https://gerrit.wikimedia.org/r/693149 [12:25:06] (03PS7) 10Jbond: P:logoutd: create a logout.d profile for managing logout scripts [puppet] - 10https://gerrit.wikimedia.org/r/693149 [12:28:06] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1077 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [12:28:17] (03CR) 10David Caro: "Actually, this might not be needed, added a "current retention" panel here:" [puppet] - 10https://gerrit.wikimedia.org/r/695194 (owner: 10David Caro) [12:34:36] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Contractor (Kay Wong) - https://phabricator.wikimedia.org/T283486 (10Marostegui) @diego in order to keep advancing on this task, can you please verify that the above key belongs to @0xkaywong (ie: via videocall) Thanks! [12:35:44] (03PS1) 10Jgreen: remove decommed host mintaka [dns] - 10https://gerrit.wikimedia.org/r/695267 (https://phabricator.wikimedia.org/T282056) [12:37:46] 10SRE, 10netops: Detect IP address collisions - https://phabricator.wikimedia.org/T189522 (10faidon) a:05faidon→03None [12:37:55] 10SRE, 10netops: OSPF metrics - https://phabricator.wikimedia.org/T200277 (10faidon) a:05faidon→03None [12:38:56] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1077 is OK: HTTP OK: HTTP/1.0 200 OK - 23642 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [12:42:27] (03CR) 10Jbond: Add logout script for sretest (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/695203 (https://phabricator.wikimedia.org/T283242) (owner: 10Muehlenhoff) [12:42:45] (03PS3) 10Jbond: Add logout script for sretest [puppet] - 10https://gerrit.wikimedia.org/r/695203 (https://phabricator.wikimedia.org/T283242) (owner: 10Muehlenhoff) [12:43:51] (03PS5) 
10JMeybohm: docker-registry: Add caching config for nginx [puppet] - 10https://gerrit.wikimedia.org/r/694330 (https://phabricator.wikimedia.org/T264209) [12:43:53] (03PS4) 10JMeybohm: httpbb: Add tests for docker-registry [puppet] - 10https://gerrit.wikimedia.org/r/694552 (https://phabricator.wikimedia.org/T273521) [12:44:17] (03CR) 10jerkins-bot: [V: 04-1] Add logout script for sretest [puppet] - 10https://gerrit.wikimedia.org/r/695203 (https://phabricator.wikimedia.org/T283242) (owner: 10Muehlenhoff) [12:44:33] (03CR) 10Jbond: "have also rebased on the other change" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/695203 (https://phabricator.wikimedia.org/T283242) (owner: 10Muehlenhoff) [12:46:06] 10SRE, 10Thumbor, 10serviceops, 10User-jijiki: Replace nutcracker with mcrouter on thumbor* - https://phabricator.wikimedia.org/T221081 (10jijiki) a:05jijiki→03None [12:49:13] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29707/console" [puppet] - 10https://gerrit.wikimedia.org/r/694330 (https://phabricator.wikimedia.org/T264209) (owner: 10JMeybohm) [12:49:36] (03CR) 10Jgreen: [C: 03+2] remove decommed host mintaka [dns] - 10https://gerrit.wikimedia.org/r/695267 (https://phabricator.wikimedia.org/T282056) (owner: 10Jgreen) [12:51:15] 10SRE, 10Advanced Mobile Contributions, 10Traffic, 10User-Joe: AMC – Opt-in for logged out users - https://phabricator.wikimedia.org/T215624 (10phuedx) a:05phuedx→03None [12:52:14] 10SRE, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install fran2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T282056 (10Jgreen) >>! In T282056#7112149, @Papaul wrote: > @Jgreen this server supposed to use for IP address 10.195.0.36 but this IP... [12:53:27] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "Let's release this, then we might need to fine-tune the php parameters and the allowed resources." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/692736 (https://phabricator.wikimedia.org/T281423) (owner: 10Legoktm) [12:55:59] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] docker-registry: Add caching config for nginx [puppet] - 10https://gerrit.wikimedia.org/r/694330 (https://phabricator.wikimedia.org/T264209) (owner: 10JMeybohm) [12:56:05] (03CR) 10JMeybohm: [C: 03+2] httpbb: Add tests for docker-registry [puppet] - 10https://gerrit.wikimedia.org/r/694552 (https://phabricator.wikimedia.org/T273521) (owner: 10JMeybohm) [12:56:27] (03CR) 10Giuseppe Lavagetto: httpd: Add directory for applications to add config (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/691287 (owner: 10Legoktm) [12:59:37] 10SRE, 10wikimedia-irc-libera: Move SRE-related IRC channels to Libera - https://phabricator.wikimedia.org/T283230 (10Volans) @faidon in light of the recent news regarding freenode actions on "closed" channel, should we keep applying the same procedure or do we want to do something different for the few channe... [12:59:48] (03PS1) 10Jbond: pontoon: disable the puppetdb microsite in pontoon [puppet] - 10https://gerrit.wikimedia.org/r/695274 [13:00:05] twentyafterfour and hashar: May I have your attention please! MediaWiki train - American+European Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210526T1300) [13:00:52] (03CR) 10Kormat: pontoon: disable the puppetdb microsite in pontoon (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/695274 (owner: 10Jbond) [13:01:06] (03CR) 10Filippo Giunchedi: [C: 03+1] pontoon: disable the puppetdb microsite in pontoon [puppet] - 10https://gerrit.wikimedia.org/r/695274 (owner: 10Jbond) [13:01:14] (03CR) 10Jbond: [C: 03+2] pontoon: disable the puppetdb microsite in pontoon [puppet] - 10https://gerrit.wikimedia.org/r/695274 (owner: 10Jbond) [13:02:02] (03CR) 10Volans: "I have no context but a question. 
Given the script mostly does only shell out, would it maybe make more sense to have it as a cookbook ins" [puppet] - 10https://gerrit.wikimedia.org/r/695203 (https://phabricator.wikimedia.org/T283242) (owner: 10Muehlenhoff) [13:03:38] (03PS2) 10Jbond: pontoon: disable the puppetdb microsite in pontoon [puppet] - 10https://gerrit.wikimedia.org/r/695274 [13:04:20] (03CR) 10Jbond: pontoon: disable the puppetdb microsite in pontoon (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/695274 (owner: 10Jbond) [13:04:23] (03PS3) 10Jbond: pontoon: disable the puppetdb microsite in pontoon [puppet] - 10https://gerrit.wikimedia.org/r/695274 [13:05:38] (03CR) 10Kormat: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/695274 (owner: 10Jbond) [13:05:54] (03CR) 10Jbond: [C: 03+2] pontoon: disable the puppetdb microsite in pontoon [puppet] - 10https://gerrit.wikimedia.org/r/695274 (owner: 10Jbond) [13:05:58] (03PS1) 10Ema: ATS: add instance name to traffic_exporter check description [puppet] - 10https://gerrit.wikimedia.org/r/695279 [13:09:56] (03CR) 10Muehlenhoff: Add logout script for sretest (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/695203 (https://phabricator.wikimedia.org/T283242) (owner: 10Muehlenhoff) [13:11:23] (03CR) 10Muehlenhoff: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/695203 (https://phabricator.wikimedia.org/T283242) (owner: 10Muehlenhoff) [13:16:07] (03PS2) 10Ema: ATS: add instance name to traffic_exporter check description [puppet] - 10https://gerrit.wikimedia.org/r/695279 [13:16:28] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/695279 (owner: 10Ema) [13:18:48] (03CR) 10Hashar: [C: 03+1] "Puppet compile https://puppet-compiler.wmflabs.org/compiler1002/757/" [puppet] - 10https://gerrit.wikimedia.org/r/694523 (https://phabricator.wikimedia.org/T268225) (owner: 10Hashar) [13:22:40] 10SRE, 10CAS-SSO, 10Patch-For-Review: Cookbook for centralised logouts and session status queries - 
https://phabricator.wikimedia.org/T283242 (10Volans) > The cookbook would simply traverse /etc/etc/wikimedia/logout.d/* I'm wondering if it would be quicker/simpler at this point to ship also a simple script... [13:23:21] (03PS3) 10Ema: ATS: add instance name to traffic_exporter check description [puppet] - 10https://gerrit.wikimedia.org/r/695279 [13:23:31] (03CR) 10Hashar: [C: 04-1] "https://puppet-compiler.wmflabs.org/compiler1003/758/ there are a bunch of java related paths that appears to be broken such as the nrpe " [puppet] - 10https://gerrit.wikimedia.org/r/694524 (https://phabricator.wikimedia.org/T268225) (owner: 10Hashar) [13:23:48] (03CR) 10Vgutierrez: [C: 03+1] ATS: add instance name to traffic_exporter check description [puppet] - 10https://gerrit.wikimedia.org/r/695279 (owner: 10Ema) [13:24:28] (03CR) 10Jbond: "gone through and marked ok for the other area im familiar with" (0312 comments) [puppet] - 10https://gerrit.wikimedia.org/r/692869 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [13:26:59] (03CR) 10Ema: [C: 03+2] ATS: add instance name to traffic_exporter check description [puppet] - 10https://gerrit.wikimedia.org/r/695279 (owner: 10Ema) [13:27:12] 10SRE, 10LDAP-Access-Requests: Access for User:Satdeep Wikitech - https://phabricator.wikimedia.org/T283708 (10SGill) [13:27:30] 10SRE, 10Analytics, 10SRE-Access-Requests: Requesting access to production shell for BUmeh - https://phabricator.wikimedia.org/T283648 (10Ottomata) Approved. 
[13:27:45] 10SRE, 10LDAP-Access-Requests: Access for User:Satdeep Wikitech - https://phabricator.wikimedia.org/T283708 (10SGill) 05Stalled→03Open [13:29:01] (03PS3) 10Hashar: gerrit: add Java 11 packages [puppet] - 10https://gerrit.wikimedia.org/r/694523 (https://phabricator.wikimedia.org/T268225) [13:29:53] (03PS3) 10Hashar: gerrit: switch to Java 11 [puppet] - 10https://gerrit.wikimedia.org/r/694524 (https://phabricator.wikimedia.org/T268225) [13:30:03] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/694523 (https://phabricator.wikimedia.org/T268225) (owner: 10Hashar) [13:30:11] (03CR) 10Hashar: [C: 04-1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/694524 (https://phabricator.wikimedia.org/T268225) (owner: 10Hashar) [13:33:55] (03PS3) 10Volans: wmf-auto-reimage: check the debian installer env [puppet] - 10https://gerrit.wikimedia.org/r/694366 [13:33:57] (03PS1) 10Volans: homer: deploy only on cumin1001 [puppet] - 10https://gerrit.wikimedia.org/r/695297 [13:34:17] 10SRE, 10LDAP-Access-Requests: Access for User:Satdeep Wikitech - https://phabricator.wikimedia.org/T283708 (10SGill) [13:35:48] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:35:51] (03CR) 10Volans: [C: 03+2] wmf-auto-reimage: check the debian installer env [puppet] - 10https://gerrit.wikimedia.org/r/694366 (owner: 10Volans) [13:36:04] 10SRE, 10LDAP-Access-Requests: Access for User:Satdeep Wikitech - https://phabricator.wikimedia.org/T283708 (10Marostegui) p:05Triage→03Medium a:03Marostegui [13:37:06] 10SRE, 10CAS-SSO, 10Patch-For-Review: Cookbook for centralised logouts and session status queries - https://phabricator.wikimedia.org/T283242 (10jbond) > I'm wondering if it would be quicker/simpler at this point to ship also a simple script that does the 
traversal so that the cookbook will simply run that o... [13:37:38] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:41:12] (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for lz4 [puppet] - 10https://gerrit.wikimedia.org/r/695219 (owner: 10Muehlenhoff) [13:44:06] 10SRE, 10CAS-SSO, 10Patch-For-Review: Cookbook for centralised logouts and session status queries - https://phabricator.wikimedia.org/T283242 (10Volans) The additional complexity that I foresee in the cookbook is this: ` # host: scripts host1: 10foo, 20bar, 30baz host2: 20bar host3: 10foo, 30baz ` You have... [13:44:08] (03PS1) 10Marostegui: data.yaml: Add Satdeep Gill to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/695298 (https://phabricator.wikimedia.org/T283708) [13:44:51] (03CR) 10jerkins-bot: [V: 04-1] data.yaml: Add Satdeep Gill to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/695298 (https://phabricator.wikimedia.org/T283708) (owner: 10Marostegui) [13:45:45] (03CR) 10Hashar: [C: 03+1] "I have added $java_home to profile::gerrit and removed the reference to the default path. That in turns fix the child change." 
[puppet] - 10https://gerrit.wikimedia.org/r/694523 (https://phabricator.wikimedia.org/T268225) (owner: 10Hashar) [13:46:11] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/695230 (owner: 10Jbond) [13:46:17] (03PS1) 10David Caro: ceph: add syslog logging [puppet] - 10https://gerrit.wikimedia.org/r/695299 (https://phabricator.wikimedia.org/T281247) [13:47:23] (03PS2) 10Marostegui: data.yaml: Add Satdeep Gill to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/695298 (https://phabricator.wikimedia.org/T283708) [13:48:01] (03CR) 10Hashar: "I fixed java_home in the parent change https://gerrit.wikimedia.org/r/c/operations/puppet/+/694523/2..3/modules/profile/manifests/gerrit.p" [puppet] - 10https://gerrit.wikimedia.org/r/694524 (https://phabricator.wikimedia.org/T268225) (owner: 10Hashar) [13:54:29] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/695298 (https://phabricator.wikimedia.org/T283708) (owner: 10Marostegui) [13:55:35] (03CR) 10Marostegui: [C: 03+2] data.yaml: Add Satdeep Gill to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/695298 (https://phabricator.wikimedia.org/T283708) (owner: 10Marostegui) [13:58:34] (03Abandoned) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/683517 (owner: 10Hashar) [13:59:00] (03Abandoned) 10Hashar: ci: add docker0 IP to /etc/hosts entry [puppet] - 10https://gerrit.wikimedia.org/r/684965 (https://phabricator.wikimedia.org/T281737) (owner: 10Hashar) [14:03:26] !log hashar@deploy1002 Started deploy [integration/docroot@ebee5d3]: composer/npm updates [14:03:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:35] !log hashar@deploy1002 Finished deploy [integration/docroot@ebee5d3]: composer/npm updates (duration: 00m 09s) [14:03:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:01] (03PS1) 10David 
Caro: ceph: send logs to kafka [puppet] - 10https://gerrit.wikimedia.org/r/695329 [14:05:37] !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@5d7c993]: (no justification provided) [14:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:51] !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@5d7c993]: (no justification provided) (duration: 00m 14s) [14:05:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:16] (03PS2) 10Volans: Add python-build-bullseye image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/685462 [14:06:17] RECOVERY - kartotherian endpoints health on maps1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [14:06:18] (03PS1) 10Volans: python-build/buster: fix typo in changelog [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/695330 [14:06:20] (03PS1) 10Giuseppe Lavagetto: fix changelog for python-build-buster [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/695331 [14:06:56] (03CR) 10Volans: "Tested locally, it seems to build it successfully." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/685462 (owner: 10Volans) [14:07:04] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Access for User:Satdeep Wikitech - https://phabricator.wikimedia.org/T283708 (10Marostegui) 05Open→03Resolved Done: ` # ldapsearch -x cn=wmf | grep -i sat member: uid=satdeep ... 
` [14:07:21] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] fix changelog for python-build-buster [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/695331 (owner: 10Giuseppe Lavagetto) [14:08:53] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:09:36] (03Abandoned) 10Volans: python-build/buster: fix typo in changelog [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/695330 (owner: 10Volans) [14:17:18] (03PS2) 10Krinkle: Allow talk pages to have a different ParserCache expiry [extensions/DiscussionTools] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/694314 (https://phabricator.wikimedia.org/T280605) [14:18:42] !log otto@deploy1002 Started deploy [analytics/refinery@e536abd]: Regular analytics weekly train [analytics/refinery@e536abd] [14:18:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:49] !log otto@deploy1002 deploy aborted: Regular analytics weekly train [analytics/refinery@e536abd] (duration: 00m 06s) [14:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:16] !log otto@deploy1002 Started deploy [analytics/refinery@b787999]: Regular analytics weekly train [analytics/refinery@e536abd] [14:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:44] (03PS4) 10Muehlenhoff: Add logout script for sretest [puppet] - 10https://gerrit.wikimedia.org/r/695203 (https://phabricator.wikimedia.org/T283242) [14:22:20] (03CR) 10jerkins-bot: [V: 04-1] Add logout script for sretest [puppet] - 10https://gerrit.wikimedia.org/r/695203 (https://phabricator.wikimedia.org/T283242) (owner: 10Muehlenhoff) [14:22:22] (03PS5) 10Ssingh: wikidough: update role to work towards anycast support [puppet] - 10https://gerrit.wikimedia.org/r/692368 (https://phabricator.wikimedia.org/T283027) [14:23:26] 10SRE, 
10Release-Engineering-Team, 10SRE-tools, 10Patch-For-Review: Support running puppet Beaker on CI - https://phabricator.wikimedia.org/T253635 (10hashar) [14:23:49] (03PS1) 10Ottomata: Bump refinery::job::canary_events to 0.1.12 [puppet] - 10https://gerrit.wikimedia.org/r/695340 (https://phabricator.wikimedia.org/T270138) [14:24:09] (03PS4) 10Superyetkin: Enable ULS webfonts by default on trwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/694315 (https://phabricator.wikimedia.org/T283626) [14:24:51] (03PS1) 10Jbond: logoutd: create logoutd base class [software/pywmflib] - 10https://gerrit.wikimedia.org/r/695341 (https://phabricator.wikimedia.org/T283242) [14:25:14] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/695329 (owner: 10David Caro) [14:27:46] (03CR) 10jerkins-bot: [V: 04-1] logoutd: create logoutd base class [software/pywmflib] - 10https://gerrit.wikimedia.org/r/695341 (https://phabricator.wikimedia.org/T283242) (owner: 10Jbond) [14:29:34] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Contractor (Kay Wong) - https://phabricator.wikimedia.org/T283486 (10diego) Hi @Marostegui , Keys confirmed, and I also can confirm that @0xkaywong signed the contract with us, so NDA should be covered. @KFrancis , pls can you c... [14:29:39] (03PS2) 10Jbond: logoutd: create logoutd base class [software/pywmflib] - 10https://gerrit.wikimedia.org/r/695341 (https://phabricator.wikimedia.org/T283242) [14:29:41] (03CR) 10RLazarus: "Thanks for adding this!" 
[software/httpbb] - 10https://gerrit.wikimedia.org/r/694556 (owner: 10JMeybohm) [14:31:30] !log updated bullseye d-i image to 2021-05-26 daily image T275873 [14:31:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:34] T275873: Prepare our base system layer for Debian 11/bullseye - https://phabricator.wikimedia.org/T275873 [14:31:49] 10SRE, 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for schoenbaechler - https://phabricator.wikimedia.org/T283190 (10schoenbaechler) [14:32:09] 10SRE, 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for schoenbaechler - https://phabricator.wikimedia.org/T283190 (10schoenbaechler) Thanks a lot @Ottomata — I filled it out! 👍 [14:32:27] (03CR) 10jerkins-bot: [V: 04-1] logoutd: create logoutd base class [software/pywmflib] - 10https://gerrit.wikimedia.org/r/695341 (https://phabricator.wikimedia.org/T283242) (owner: 10Jbond) [14:32:58] PROBLEM - etcd request latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 operation={list,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:34:19] 10SRE, 10wikimedia-irc-libera: Move SRE-related IRC channels to Libera - https://phabricator.wikimedia.org/T283230 (10fgiunchedi) [14:35:33] (03CR) 10Jbond: Add logout script for sretest (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/695203 (https://phabricator.wikimedia.org/T283242) (owner: 10Muehlenhoff) [14:35:54] (03PS3) 10Elukey: Add new golang 1.15 image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/695173 [14:35:56] (03PS25) 10Elukey: Add istio base images build support [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/688211 (https://phabricator.wikimedia.org/T278192) [14:35:58] (03PS10) 10Elukey: Add knative serving and 
net-istio images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/692899 (https://phabricator.wikimedia.org/T278194) [14:36:00] (03PS8) 10Elukey: Add base kubeflow kfserving images and kube-rbac-proxy [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/693644 (https://phabricator.wikimedia.org/T272919) [14:36:02] (03PS7) 10Elukey: Add Jetstack's cert-manager base go images. [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/693826 (https://phabricator.wikimedia.org/T280661) [14:36:46] RECOVERY - etcd request latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:39:33] (03PS3) 10JMeybohm: Add per test timeouts [software/httpbb] - 10https://gerrit.wikimedia.org/r/694556 [14:40:31] (03CR) 10Muehlenhoff: Add logout script for sretest (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/695203 (https://phabricator.wikimedia.org/T283242) (owner: 10Muehlenhoff) [14:40:53] (03PS3) 10Jbond: logoutd: create logoutd base class [software/pywmflib] - 10https://gerrit.wikimedia.org/r/695341 (https://phabricator.wikimedia.org/T283242) [14:41:00] (03CR) 10JMeybohm: "> Patch Set 2:" [software/httpbb] - 10https://gerrit.wikimedia.org/r/694556 (owner: 10JMeybohm) [14:43:28] (03CR) 10jerkins-bot: [V: 04-1] logoutd: create logoutd base class [software/pywmflib] - 10https://gerrit.wikimedia.org/r/695341 (https://phabricator.wikimedia.org/T283242) (owner: 10Jbond) [14:45:26] (03PS3) 10Hnowlan: osm: fix import issues in imposm-initial-import, run imposm as osmupdater [puppet] - 10https://gerrit.wikimedia.org/r/693876 [14:45:56] 10SRE: some hosts provisioned with 127.0.1.1 entries in /etc/hosts - https://phabricator.wikimedia.org/T84366 (10fgiunchedi) 05Open→03Declined Unlikely this is still relevant [14:46:24] (03CR) 10Jbond: Add logout script for sretest (034 comments) [puppet] - 
10https://gerrit.wikimedia.org/r/695203 (https://phabricator.wikimedia.org/T283242) (owner: 10Muehlenhoff) [14:46:34] 10SRE, 10observability: Update prometheus-node-exporter NTP metrics - https://phabricator.wikimedia.org/T208875 (10fgiunchedi) a:05fgiunchedi→03None [14:46:42] 10SRE, 10SRE-swift-storage: swift backend machines load spike: cause and remediation - https://phabricator.wikimedia.org/T84385 (10fgiunchedi) 05Open→03Invalid In the meantime XFS got better and load average is generally legit load on the host [14:47:41] !log volans@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1002.eqiad.wmnet with reason: REIMAGE [14:47:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:17] (03PS3) 10Cwhite: rsyslog: enable ecs_170 template and transition prometheus [puppet] - 10https://gerrit.wikimedia.org/r/689160 (https://phabricator.wikimedia.org/T234565) [14:48:19] (03CR) 10Volans: "Few comments inline. Don't forget to add tests once the API is defined ;)" (034 comments) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/695341 (https://phabricator.wikimedia.org/T283242) (owner: 10Jbond) [14:49:38] (03CR) 10MSantos: [C: 03+1] osm: fix import issues in imposm-initial-import, run imposm as osmupdater [puppet] - 10https://gerrit.wikimedia.org/r/693876 (owner: 10Hnowlan) [14:49:38] !log otto@deploy1002 Finished deploy [analytics/refinery@b787999]: Regular analytics weekly train [analytics/refinery@e536abd] (duration: 30m 22s) [14:49:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:49] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1002.eqiad.wmnet with reason: REIMAGE [14:49:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:55] !log otto@deploy1002 Started deploy [analytics/refinery@b787999] (thin): Regular analytics weekly train THIN [14:49:57] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:02] !log otto@deploy1002 Finished deploy [analytics/refinery@b787999] (thin): Regular analytics weekly train THIN (duration: 00m 07s) [14:50:03] (03CR) 10RLazarus: [C: 03+1] Add per test timeouts [software/httpbb] - 10https://gerrit.wikimedia.org/r/694556 (owner: 10JMeybohm) [14:50:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:17] (03PS5) 10Effie Mouzeli: (WIP) mwdebug: add helmfile configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/693875 [14:50:56] (03PS1) 10Cwhite: rsyslog: use message rather than log.original for raw message field [puppet] - 10https://gerrit.wikimedia.org/r/695348 [14:51:02] (03CR) 10Effie Mouzeli: (WIP) mwdebug: add helmfile configuration (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/693875 (owner: 10Effie Mouzeli) [14:51:39] RECOVERY - tilerator on maps1009 is OK: HTTP OK: HTTP/1.1 200 OK - 322 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [14:51:45] (03CR) 10jerkins-bot: [V: 04-1] (WIP) mwdebug: add helmfile configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/693875 (owner: 10Effie Mouzeli) [14:51:48] 10SRE, 10SRE-swift-storage, 10User-fgiunchedi: Python 3's eventlet.green getaddrinfo timeout in Cloud VPS + Bullseye - https://phabricator.wikimedia.org/T283714 (10fgiunchedi) +#SRE for visibility as this will be true in production too [14:53:34] (03CR) 10Hnowlan: [C: 03+2] osm: fix import issues in imposm-initial-import, run imposm as osmupdater [puppet] - 10https://gerrit.wikimedia.org/r/693876 (owner: 10Hnowlan) [14:53:43] !log otto@deploy1002 Started deploy [analytics/refinery@b787999] (hadoop-test): Regular analytics weekly train TEST [14:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:55] 10SRE, 10SRE-swift-storage, 10User-fgiunchedi: Python 3's eventlet.green getaddrinfo timeout in 
Bullseye - https://phabricator.wikimedia.org/T283714 (10fgiunchedi) [14:54:02] (03CR) 10Cwhite: [C: 03+2] rsyslog: use message rather than log.original for raw message field [puppet] - 10https://gerrit.wikimedia.org/r/695348 (owner: 10Cwhite) [14:54:10] (03PS2) 10Cwhite: rsyslog: use message rather than log.original for raw message field [puppet] - 10https://gerrit.wikimedia.org/r/695348 [14:54:30] (03PS1) 10Papaul: DNS: Add DNS entries for fran2001 [dns] - 10https://gerrit.wikimedia.org/r/695349 [14:55:42] (03Abandoned) 10Jbond: (WIP): clean out certs for old hosts [puppet] - 10https://gerrit.wikimedia.org/r/693424 (owner: 10Jbond) [14:57:04] (03CR) 10Papaul: [C: 03+2] DNS: Add DNS entries for fran2001 [dns] - 10https://gerrit.wikimedia.org/r/695349 (owner: 10Papaul) [14:57:10] jouncebot: now [14:57:10] For the next 0 hour(s) and 2 minute(s): MediaWiki train - American+European Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210526T1300) [14:57:13] jouncebot: next [14:57:13] In 0 hour(s) and 2 minute(s): Move otrs-wiki.wikimedia.org to vrt-wiki.wikimedia.org (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210526T1500) [14:57:45] 10SRE, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install fran2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T282056 (10Papaul) [14:58:25] urbanecm: can I sync the interwiki map first? [14:58:29] (03CR) 10Ottomata: [C: 03+2] Bump refinery::job::canary_events to 0.1.12 [puppet] - 10https://gerrit.wikimedia.org/r/695340 (https://phabricator.wikimedia.org/T270138) (owner: 10Ottomata) [14:58:44] legoktm: go for it, ping me when ready [14:58:53] *clear for me [14:58:54] 10SRE, 10netops: Netbox has incorrect email address for GTT - https://phabricator.wikimedia.org/T246564 (10ayounsi) a:05faidon→03wiki_willy When trying to login to the Ethervision dashboard I'm getting: > Additional Provisioning Required. 
Please contact your Customer Success Manager to request access. @wik... [14:59:06] 10SRE, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install fran2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T282056 (10Papaul) 05Open→03Resolved @Jgreen this is ready for install. [14:59:07] !log otto@deploy1002 Finished deploy [analytics/refinery@b787999] (hadoop-test): Regular analytics weekly train TEST (duration: 05m 24s) [14:59:09] thanks [14:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:42] (03PS1) 10Legoktm: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/695354 [14:59:44] (03CR) 10Legoktm: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/695354 (owner: 10Legoktm) [15:00:05] Urbanecm and Amir1: It is that lovely time of the day again! You are hereby commanded to deploy Move otrs-wiki.wikimedia.org to vrt-wiki.wikimedia.org. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210526T1500). [15:00:17] o/ [15:00:18] half around [15:00:34] ack Amir1. hopefully nothing will break anyway :) [15:01:13] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/695354 (owner: 10Legoktm) [15:02:41] !log legoktm@deploy1002 Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 03m 18s) [15:02:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:05] urbanecm: all done [15:03:13] thanks legoktm [15:03:17] Amir1: ok for me to start? 
[15:03:32] with 690680: Move otrs-wiki.wikimedia.org to vrt-wiki.wikimedia.org (part 1) | https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/690680 [15:03:33] sure [15:03:42] (03PS3) 10Urbanecm: Move otrs-wiki.wikimedia.org to vrt-wiki.wikimedia.org (part 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/690680 (https://phabricator.wikimedia.org/T280400) [15:03:47] (03CR) 10Urbanecm: [C: 03+2] Move otrs-wiki.wikimedia.org to vrt-wiki.wikimedia.org (part 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/690680 (https://phabricator.wikimedia.org/T280400) (owner: 10Urbanecm) [15:04:00] (03CR) 10Muehlenhoff: "Looks good, a few typos inline." (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/693149 (owner: 10Jbond) [15:04:43] (03Merged) 10jenkins-bot: Move otrs-wiki.wikimedia.org to vrt-wiki.wikimedia.org (part 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/690680 (https://phabricator.wikimedia.org/T280400) (owner: 10Urbanecm) [15:05:49] (03CR) 10Ayounsi: [C: 03+1] homer: deploy only on cumin1001 [puppet] - 10https://gerrit.wikimedia.org/r/695297 (owner: 10Volans) [15:06:07] https://vrt-wiki.wikimedia.org/ redirects to otrs-wiki.wikimedia.org at mwdebug1001, syncing [15:06:26] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/695297 (owner: 10Volans) [15:06:56] 100% now [15:07:39] (03PS3) 10Urbanecm: Move otrs-wiki.wikimedia.org to vrt-wiki.wikimedia.org (part 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692452 (https://phabricator.wikimedia.org/T280400) [15:07:42] (03CR) 10Urbanecm: [C: 03+2] Move otrs-wiki.wikimedia.org to vrt-wiki.wikimedia.org (part 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692452 (https://phabricator.wikimedia.org/T280400) (owner: 10Urbanecm) [15:08:14] !log urbanecm@deploy1002 Synchronized multiversion/MWMultiVersion.php: 945ee9c5e88166984bf12e4039d692fe06498e40: Move otrs-wiki.wikimedia.org to vrt-wiki.wikimedia.org (T280400; 1/2) 
(duration: 01m 06s) [15:08:16] (03CR) 10Ssingh: [C: 03+2] wikidough: update role to work towards anycast support [puppet] - 10https://gerrit.wikimedia.org/r/692368 (https://phabricator.wikimedia.org/T283027) (owner: 10Ssingh) [15:08:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:18] T280400: Change the user-visible domain of OTRS wiki - https://phabricator.wikimedia.org/T280400 [15:08:22] (03PS11) 10Effie Mouzeli: (WIP) profile::memcached::instance: Add TLS support (2) [puppet] - 10https://gerrit.wikimedia.org/r/694465 (https://phabricator.wikimedia.org/T271967) [15:08:31] (03Merged) 10jenkins-bot: Move otrs-wiki.wikimedia.org to vrt-wiki.wikimedia.org (part 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/692452 (https://phabricator.wikimedia.org/T280400) (owner: 10Urbanecm) [15:09:15] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:09:18] and now http://otrs-wiki.wikimedia.org/ redirects to vrt-wiki.wikimedia.org [15:09:39] (03CR) 10Volans: [C: 03+2] homer: deploy only on cumin1001 [puppet] - 10https://gerrit.wikimedia.org/r/695297 (owner: 10Volans) [15:09:48] logging in works too [15:10:28] so...syncing, too [15:10:31] does it? Maybe its cause I can't access it but clicking that link still just seems to send me to that url? 
[15:10:50] 10SRE, 10wikimedia-irc-libera: Move SRE-related IRC channels to Libera - https://phabricator.wikimedia.org/T283230 (10Gehel) [15:10:58] Asartea: with the debug host, mwdebug1001 [15:11:13] for you to see it it needs to be also synced [15:11:19] Ah okay [15:11:29] PROBLEM - Check systemd state on backup1001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_rsyslog.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:11:51] !log urbanecm@deploy1002 Synchronized wmf-config/: 490435edb4ea4cc10ba435125ba547231fc7f1e7: Move otrs-wiki.wikimedia.org to vrt-wiki.wikimedia.org (T280400) (duration: 01m 07s) [15:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:07] !log Merging https://gerrit.wikimedia.org/r/c/operations/homer/public/+/694305/ - Add Wikidough Anycast range to network config [15:12:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:22] (03CR) 10Cathal Mooney: [C: 03+2] Added Wikidough VMs to BGP Anycast codfw [homer/public] - 10https://gerrit.wikimedia.org/r/694305 (https://phabricator.wikimedia.org/T283503) (owner: 10Cathal Mooney) [15:12:34] ...and sounds it all worked [15:12:58] (03PS8) 10Jbond: P:logoutd: create a logout.d profile for managing logout scripts [puppet] - 10https://gerrit.wikimedia.org/r/693149 [15:13:16] (03Merged) 10jenkins-bot: Added Wikidough VMs to BGP Anycast codfw [homer/public] - 10https://gerrit.wikimedia.org/r/694305 (https://phabricator.wikimedia.org/T283503) (owner: 10Cathal Mooney) [15:13:19] (03CR) 10Jbond: "updated thanks" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/693149 (owner: 10Jbond) [15:13:59] legoktm: I'd appreciate if you could merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/692454, but it can wait :) [15:14:28] yes [15:14:49] (03CR) 10Legoktm: [C: 03+2] Add redirect from otrs-wiki.wikimedia.org to vrt-wiki.wikimedia.org [puppet] - 
10https://gerrit.wikimedia.org/r/692454 (https://phabricator.wikimedia.org/T280400) (owner: 10Urbanecm) [15:15:08] (03CR) 10Jbond: logoutd: create logoutd base class (031 comment) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/695341 (https://phabricator.wikimedia.org/T283242) (owner: 10Jbond) [15:15:19] urbanecm: do you need me to do a puppet run everywhere or should I let it gradually roll out? [15:15:37] 10SRE, 10ops-codfw, 10DBA: Degraded RAID on db2107 - https://phabricator.wikimedia.org/T282072 (10Papaul) a:05Papaul→03Marostegui @Marostegui disk replaced [15:15:45] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Contractor (Kay Wong) - https://phabricator.wikimedia.org/T283486 (10Marostegui) Thanks @diego! [15:15:59] legoktm: i think it can just roll out gradually [15:16:28] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Contractor (Kay Wong) - https://phabricator.wikimedia.org/T283486 (10Marostegui) [15:16:38] 10SRE, 10wikimedia-irc-libera: Move SRE-related IRC channels to Libera - https://phabricator.wikimedia.org/T283230 (10Gehel) [15:16:47] thanks Amir1, i think we're done here [15:16:55] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Contractor (Kay Wong) - https://phabricator.wikimedia.org/T283486 (10Marostegui) [15:16:55] ^^ [15:17:31] (03PS6) 10Effie Mouzeli: (WIP) mwdebug: add helmfile configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/693875 [15:17:44] I thought it wasn't supposed to be this smooth or easy [15:17:58] * urbanecm too [15:18:11] it's still possible we missed something :) [15:18:24] but also otrs_wikiwiki is a special wiki, not a language one [15:18:44] !log otrs_wikiwiki was moved to vrt-wiki.wikimedia.org (T280400) [15:18:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:51] T280400: Change the user-visible domain of OTRS wiki - https://phabricator.wikimedia.org/T280400 
[15:19:42] (03CR) 10jerkins-bot: [V: 04-1] (WIP) mwdebug: add helmfile configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/693875 (owner: 10Effie Mouzeli) [15:19:53] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:20:08] yeah all the big problems iirc were relating to the language-project wikis [15:20:18] special ones are ok, have done a couple of those in the past without big fanfare [15:20:36] the WebAuthn point on the task was a good one though, it might be harder for other wikis in future because of that [15:21:10] especially with SUL wikis. I can't find a simple way to see which domain was webauthn registered on. [15:21:37] * urbanecm goes to try to re-enable webauthn 2FA at the wiki [15:22:45] !log cmooney@cumin1001 Running homer to deploy Gerrit 694305 changes to cr1-codfw - Wikidough Anycast [15:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:27] webauthn appears to work on my side [15:23:33] !log sukhe@cumin1001 START - Cookbook sre.hosts.downtime for 0:45:00 on malmok.wikimedia.org with reason: applying anycast update: T283503 [15:23:33] !log sukhe@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:45:00 on malmok.wikimedia.org with reason: applying anycast update: T283503 [15:23:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:36] T283503: Please configure the routers for Wikidough's anycasted IP - https://phabricator.wikimedia.org/T283503 [15:23:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:54] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/693149 (owner: 10Jbond) [15:25:02] (03CR) 10Jbond: [C: 03+2] P:logoutd: create a logout.d profile for managing logout scripts [puppet] - 10https://gerrit.wikimedia.org/r/693149 (owner: 10Jbond) [15:26:09] (03PS1) 10Gergő Tisza: GrowthExperiments: Enable Add Links for 50% of new users and all old ones [mediawiki-config] - 10https://gerrit.wikimedia.org/r/695364 (https://phabricator.wikimedia.org/T277356) [15:26:09] thanks for taking care of this task urbanecm [15:26:31] thanks for the help in the task Krenair [15:27:35] 10SRE, 10ops-codfw, 10DBA: Degraded RAID on db2107 - https://phabricator.wikimedia.org/T282072 (10Marostegui) Thanks @Papaul - however the disk doesn't look to be rebuilding: ` seqNum: 0x000002e2 Time: Wed May 26 15:14:34 2021 Code: 0x000000b9 Class: 2 Locale: 0x04 Event Description: Enclosure PD 20(c None/... [15:27:44] 10SRE, 10ops-codfw, 10DBA: Degraded RAID on db2107 - https://phabricator.wikimedia.org/T282072 (10Marostegui) a:05Marostegui→03Papaul [15:30:33] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [15:31:01] !log Cold reset db2107 idrac T283727 [15:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:04] (03PS7) 10Effie Mouzeli: (WIP) mwdebug: add helmfile configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/693875 [15:31:05] T283727: db2107 idrac not responding - https://phabricator.wikimedia.org/T283727 [15:32:13] PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code=200 
handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [15:32:27] (03CR) 10jerkins-bot: [V: 04-1] (WIP) mwdebug: add helmfile configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/693875 (owner: 10Effie Mouzeli) [15:33:19] (03PS1) 10Jbond: P:base: add logoutd profile to base [puppet] - 10https://gerrit.wikimedia.org/r/695365 (https://phabricator.wikimedia.org/T283242) [15:34:49] PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [15:35:11] (03PS1) 10Ema: alertmanager: route Traffic team alerts [puppet] - 10https://gerrit.wikimedia.org/r/695367 (https://phabricator.wikimedia.org/T282806) [15:35:25] (03PS8) 10Effie Mouzeli: (WIP) mwdebug: add helmfile configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/693875 [15:35:55] 10SRE, 10ops-codfw, 10DBA: Degraded RAID on db2107 - https://phabricator.wikimedia.org/T282072 (10Marostegui) [15:36:07] RECOVERY - High average GET latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [15:36:49] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [15:37:06] (03CR) 10jerkins-bot: [V: 04-1] (WIP) mwdebug: add helmfile configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/693875 (owner: 10Effie Mouzeli) [15:37:46] (03PS2) 10Jbond: P:base: add logoutd profile to base [puppet] - 10https://gerrit.wikimedia.org/r/695365 (https://phabricator.wikimedia.org/T283242) [15:38:32] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29709/console" [puppet] - 10https://gerrit.wikimedia.org/r/695365 (https://phabricator.wikimedia.org/T283242) (owner: 10Jbond) [15:38:34] (03CR) 10Cwhite: [V: 03+1 C: 03+2] "Tested on two hosts and output looks largely unchanged (except prometheus)." 
[puppet] - 10https://gerrit.wikimedia.org/r/689160 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [15:39:21] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast, AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:39:35] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:base: add logoutd profile to base [puppet] - 10https://gerrit.wikimedia.org/r/695365 (https://phabricator.wikimedia.org/T283242) (owner: 10Jbond) [15:40:32] 10SRE, 10SRE-swift-storage, 10User-fgiunchedi: Python 3's eventlet.green getaddrinfo timeout in Bullseye - https://phabricator.wikimedia.org/T283714 (10Marostegui) p:05Triage→03Medium [15:41:41] !log enable puppet on mc2019 [15:41:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:59] (03PS5) 10Muehlenhoff: Add logout script for sretest [puppet] - 10https://gerrit.wikimedia.org/r/695203 (https://phabricator.wikimedia.org/T283242) [15:42:23] (03PS1) 10Jbond: Create empty directory [puppet] - 10https://gerrit.wikimedia.org/r/695369 [15:42:37] (03CR) 10Jbond: [V: 03+2 C: 03+2] Create empty directory [puppet] - 10https://gerrit.wikimedia.org/r/695369 (owner: 10Jbond) [15:43:26] (03CR) 10jerkins-bot: [V: 04-1] Add logout script for sretest [puppet] - 10https://gerrit.wikimedia.org/r/695203 (https://phabricator.wikimedia.org/T283242) (owner: 10Muehlenhoff) [15:44:23] (03CR) 10Giuseppe Lavagetto: (WIP) mwdebug: add helmfile configuration (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/693875 (owner: 10Effie Mouzeli) [15:44:23] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.04448 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [15:45:15] (03PS6) 10Muehlenhoff: Add logout script for sretest [puppet] - 10https://gerrit.wikimedia.org/r/695203 (https://phabricator.wikimedia.org/T283242) [15:45:25] 10Puppet, 10SRE, 
10User-jbond: facter3: use structured facts - https://phabricator.wikimedia.org/T222160 (10jbond) a:05jbond→03None [15:46:51] (03CR) 10jerkins-bot: [V: 04-1] Add logout script for sretest [puppet] - 10https://gerrit.wikimedia.org/r/695203 (https://phabricator.wikimedia.org/T283242) (owner: 10Muehlenhoff) [15:46:54] 10SRE, 10Analytics, 10SRE-Access-Requests: Requesting access to production shell for BUmeh - https://phabricator.wikimedia.org/T283648 (10JAnstee_WMF) [15:48:19] (03PS5) 10Dzahn: initial Blubberfile, placeholders for prod,stage,test HTML and httpd.conf [container/miscweb] - 10https://gerrit.wikimedia.org/r/690768 (https://phabricator.wikimedia.org/T281538) [15:48:55] (03CR) 10jerkins-bot: [V: 04-1] initial Blubberfile, placeholders for prod,stage,test HTML and httpd.conf [container/miscweb] - 10https://gerrit.wikimedia.org/r/690768 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [15:49:20] jbond: it seems some hosts didn't like the mkdir_p [15:49:30] (03PS7) 10Muehlenhoff: Add logout script for sretest [puppet] - 10https://gerrit.wikimedia.org/r/695203 (https://phabricator.wikimedia.org/T283242) [15:49:38] (see the above widespread puppet failures) [15:49:43] volans: allready sent a fix [15:49:49] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Contractor (Kay Wong) - https://phabricator.wikimedia.org/T283486 (10Marostegui) @diego when does the contract expires? 
Thanks [15:50:06] ack, thx [15:50:20] np, running puppet on failed nodes now [15:51:08] (03CR) 10jerkins-bot: [V: 04-1] Add logout script for sretest [puppet] - 10https://gerrit.wikimedia.org/r/695203 (https://phabricator.wikimedia.org/T283242) (owner: 10Muehlenhoff) [15:52:20] (03PS8) 10Muehlenhoff: Add logout script for sretest [puppet] - 10https://gerrit.wikimedia.org/r/695203 (https://phabricator.wikimedia.org/T283242) [15:52:50] 10SRE, 10Analytics, 10SRE-Access-Requests: Requesting access to production shell for BUmeh - https://phabricator.wikimedia.org/T283648 (10JAnstee_WMF) This is approved on GDI end however, I am not sure if you also need our Director (sbodington) to also sign off Also regarding NDA, @KFrancis Benjamin still... [15:53:09] (03PS1) 10Marostegui: data.yaml: Add Kay Wong to analytics-privatedata-users (with kerberos) [puppet] - 10https://gerrit.wikimedia.org/r/695371 (https://phabricator.wikimedia.org/T283486) [15:53:38] (03CR) 10Marostegui: [C: 04-2] "Needs NDA to be verified" [puppet] - 10https://gerrit.wikimedia.org/r/695371 (https://phabricator.wikimedia.org/T283486) (owner: 10Marostegui) [15:53:50] (03PS1) 10Cathal Mooney: wikidough: correct IP address and domain for Bird BGP in role [puppet] - 10https://gerrit.wikimedia.org/r/695372 [15:54:09] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytic Cluster for Research Contractor (Kay Wong) - https://phabricator.wikimedia.org/T283486 (10KFrancis) @diego Hi Diego, if Kay is a contractor with the WMF, once the contract has been fully executed, Kay would be covered under that... [15:54:50] 10Puppet, 10GitLab (Initialization), 10Patch-For-Review, 10Release-Engineering-Team (Radar), 10User-brennen: Puppetise gitlab-ansible playbook - https://phabricator.wikimedia.org/T283076 (10jbond) @brennen sounds find to me, its possible when the play book is run that there may still be some puppetised b... 
[15:55:38] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/695203 (https://phabricator.wikimedia.org/T283242) (owner: 10Muehlenhoff) [15:56:02] 10Puppet, 10GitLab (Initialization), 10Patch-For-Review, 10Release-Engineering-Team (Radar), 10User-brennen: Puppetise gitlab-ansible playbook - https://phabricator.wikimedia.org/T283076 (10brennen) > @brennen sounds fine to me, its possible when the play book is run that there may still be some puppetis... [15:56:08] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytic Cluster for Research Contractor (Kay Wong) - https://phabricator.wikimedia.org/T283486 (10Marostegui) @KFrancis Can you confirm whether we can proceed with granting access? It seems you said we should wait and then you said we ca... [15:58:45] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.001779 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [15:59:04] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytic Cluster for Research Contractor (Kay Wong) - https://phabricator.wikimedia.org/T283486 (10KFrancis) @Marostegui The signed agreement was confirmed, so please move forward with the access request. Thank you! [16:00:12] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytic Cluster for Research Contractor (Kay Wong) - https://phabricator.wikimedia.org/T283486 (10Marostegui) Thank you @KFrancis @diego please confirm when does the contract expires. I will add your email as `expiry_contact` too if th... [16:01:07] (03CR) 10Marostegui: [C: 04-2] "Pending also the expiry date confirmation." 
[puppet] - 10https://gerrit.wikimedia.org/r/695371 (https://phabricator.wikimedia.org/T283486) (owner: 10Marostegui) [16:01:18] (03Abandoned) 10Cathal Mooney: wikidough: correct IP address and domain for Bird BGP in role [puppet] - 10https://gerrit.wikimedia.org/r/695372 (owner: 10Cathal Mooney) [16:01:31] !log powerdown ms-be2038 for BBU replacement [16:01:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:38] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytic Cluster for Research Contractor (Kay Wong) - https://phabricator.wikimedia.org/T283486 (10Marostegui) 05Stalled→03Open [16:03:41] PROBLEM - Host ms-be2038 is DOWN: PING CRITICAL - Packet loss = 100% [16:03:43] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:03:46] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org [16:05:17] (03PS1) 10Muehlenhoff: Extend access for S&F contractors [puppet] - 10https://gerrit.wikimedia.org/r/695375 [16:06:07] ACKNOWLEDGEMENT - Host ms-be2038 is DOWN: PING CRITICAL - Packet loss = 100% Marostegui Know, onsite maintenance [16:08:46] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org [16:09:09] RECOVERY - Host ms-be2038 is UP: PING WARNING - Packet loss = 71%, RTA = 33.02 ms [16:09:11] 10SRE, 10Services, 10Patch-For-Review, 10Service-deployment-requests: New Service Request Shellbox - https://phabricator.wikimedia.org/T281423 (10Legoktm) [16:09:15] PROBLEM - Wikidough DoT Check on malmok is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough [16:09:26] (03CR) 10Muehlenhoff: [C: 03+2] Extend access for S&F contractors [puppet] - 10https://gerrit.wikimedia.org/r/695375 (owner: 10Muehlenhoff) [16:09:34] ^ expected 
[16:09:45] PROBLEM - Wikidough DoH Check on malmok is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough [16:09:47] !log sukhe@cumin1001 START - Cookbook sre.hosts.downtime for 0:45:00 on malmok.wikimedia.org with reason: [WIP] applying anycast update: T283503 [16:09:48] !log sukhe@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:45:00 on malmok.wikimedia.org with reason: [WIP] applying anycast update: T283503 [16:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:52] T283503: Please configure the routers for Wikidough's anycasted IP - https://phabricator.wikimedia.org/T283503 [16:09:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:56] !log Reboot db2103 (codfw master) T282072 [16:10:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:00] T282072: Degraded RAID on db2107 - https://phabricator.wikimedia.org/T282072 [16:11:19] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:12:51] !log Reboot db2107 (codfw master) T282072 [16:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:07] (03CR) 10Superyetkin: "Where is the merge conflict here?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/694315 (https://phabricator.wikimedia.org/T283626) (owner: 10Superyetkin) [16:14:24] (03PS9) 10Effie Mouzeli: (WIP) mwdebug: add helmfile configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/693875 [16:14:33] (03PS1) 10Jbond: WIP: add notls support for external addresses to memcached (1) [puppet] - 10https://gerrit.wikimedia.org/r/695377 (https://phabricator.wikimedia.org/T271967) [16:15:37] (03PS2) 10Jbond: WIP: add notls support for external addresses to memcached (1) [puppet] - 10https://gerrit.wikimedia.org/r/695377 (https://phabricator.wikimedia.org/T271967) [16:15:59] RECOVERY - HP RAID on ms-be2038 is OK: OK: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [16:16:31] 10SRE, 10ops-codfw, 10User-fgiunchedi: Degraded RAID on ms-be2038 - https://phabricator.wikimedia.org/T283401 (10Papaul) 05Open→03Resolved @fgiunchedi BBU replaced, server is happy now [16:19:44] (03CR) 10JMeybohm: [C: 03+2] Add per test timeouts [software/httpbb] - 10https://gerrit.wikimedia.org/r/694556 (owner: 10JMeybohm) [16:19:55] 10SRE, 10ops-codfw, 10DBA: Degraded RAID on db2107 - https://phabricator.wikimedia.org/T282072 (10Marostegui) After the reboot I can see the disk: ` Raw Size: 1.746 TB [0xdf8fe2b0 Sectors] Non Coerced Size: 1.745 TB [0xdf7fe2b0 Sectors] Coerced Size: 1.745 TB [0xdf7c0000 Sectors] Sector Size: 512 Logical Se... 
[16:21:48] (03Merged) 10jenkins-bot: Add per test timeouts [software/httpbb] - 10https://gerrit.wikimedia.org/r/694556 (owner: 10JMeybohm) [16:22:38] 10SRE, 10ops-codfw, 10DC-Ops, 10observability, 10User-fgiunchedi: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (10Papaul) Received the second test PDU {F34469889} [16:22:48] 10SRE, 10Analytics, 10SRE-Access-Requests: Requesting access to production shell for BUmeh - https://phabricator.wikimedia.org/T283648 (10KFrancis) @JAnstee_WMF Hi Jaime, Benjamin is on the current WMF contractors list, so his NDA is covered under the contractor employee agreement. Please proceed with the a... [16:23:35] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [16:23:40] 10SRE, 10Advanced Mobile Contributions, 10Traffic, 10Readers-Web-Backlog (Tracking), 10User-Joe: AMC – Opt-in for logged out users - https://phabricator.wikimedia.org/T215624 (10Jdlrobson) [16:24:33] (03PS1) 10Jbond: hiera - cloud: add defaults for logoutd [puppet] - 10https://gerrit.wikimedia.org/r/695381 [16:24:45] PROBLEM - Elevated latency for icinga checks in codfw on alert1001 is CRITICAL: cluster=alerting instance=alert2001 job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga [16:24:59] (03CR) 10Jbond: [V: 03+2 C: 03+2] hiera - cloud: add defaults for logoutd [puppet] - 10https://gerrit.wikimedia.org/r/695381 (owner: 10Jbond) [16:25:37] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [16:28:55] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [16:29:11] 10SRE, 10Analytics, 10SRE-Access-Requests: Requesting access to production shell for BUmeh - https://phabricator.wikimedia.org/T283648 (10Marostegui) [16:30:11] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:30:13] (03CR) 10Jbond: "See comments inline, I also drafted the following which i think achives the same end goal" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/693474 (https://phabricator.wikimedia.org/T271967) (owner: 10Effie Mouzeli) [16:30:15] (03PS1) 10Ssingh: wikidough: fix typo in wikidough.yaml (see detailed commit notes) [puppet] - 10https://gerrit.wikimedia.org/r/695387 [16:30:59] 10SRE, 10Analytics, 10SRE-Access-Requests: Requesting access to production shell for BUmeh - https://phabricator.wikimedia.org/T283648 (10Marostegui) @KFrancis or @JAnstee_WMF I would need the expiration date for the contract so I can prepare the patch with that info. [16:31:03] RECOVERY - Wikidough DoH Check on malmok is OK: OK - Certificate wikimedia-dns.org will expire on Thu 19 Aug 2021 01:55:11 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Wikidough [16:31:05] 10ops-codfw, 10DC-Ops: test fs.com and rahi new power cables - https://phabricator.wikimedia.org/T283739 (10RobH) p:05Triage→03High [16:31:13] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [16:31:27] 10ops-codfw, 10DC-Ops: test fs.com and rahi new power cables - https://phabricator.wikimedia.org/T283739 (10RobH) [16:31:36] (03CR) 10Ssingh: [C: 03+2] wikidough: fix typo in wikidough.yaml (see detailed commit notes) [puppet] - 10https://gerrit.wikimedia.org/r/695387 (owner: 10Ssingh) [16:33:10] 10SRE, 10Analytics, 10SRE-Access-Requests: Requesting access to production shell for BUmeh - https://phabricator.wikimedia.org/T283648 (10Marostegui) @Ottomata does this user need to have kerberos? [16:33:45] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:34:08] 10SRE, 10Analytics, 10SRE-Access-Requests: Requesting access to production shell for BUmeh - https://phabricator.wikimedia.org/T283648 (10Ottomata) Yes, thank you. [16:34:24] 10SRE, 10Analytics, 10SRE-Access-Requests: Requesting access to production shell for BUmeh - https://phabricator.wikimedia.org/T283648 (10Ottomata) Basically, if ssh + analytics-privatedata-users, kerberos is needed. [16:34:56] 10ops-codfw, 10DC-Ops: test fs.com and rahi new power cables - https://phabricator.wikimedia.org/T283739 (10RobH) 05Open→03Resolved > robh: 10:25 < papaul> wiki_willy: hello both cables work great on all 3 PDU's. 
but i prefere the fs > awesome, he did this before i filed the task ;D [16:35:13] RECOVERY - Elevated latency for icinga checks in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga [16:38:14] !log cmooney@cumin1001 Running homer to deploy Gerrit 694305 changes to cr2-codfw - Wikidough Anycast [16:38:15] 10SRE, 10Analytics, 10SRE-Access-Requests: Requesting access to production shell for BUmeh - https://phabricator.wikimedia.org/T283648 (10Marostegui) thanks! [16:38:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:32] 10SRE, 10ops-codfw, 10DBA: Degraded RAID on db2107 - https://phabricator.wikimedia.org/T282072 (10Papaul) a:05Papaul→03Marostegui [16:42:31] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast, AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:43:41] RECOVERY - Wikidough DoT Check on malmok is OK: TCP OK - 0.067 second response time on 185.71.138.138 port 853 https://wikitech.wikimedia.org/wiki/Wikidough [16:43:47] ACKNOWLEDGEMENT - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast Cathal Mooney doh2001 and doh2002 being enabled shortly. https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:44:07] RECOVERY - Check systemd state on backup1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:44:42] ACKNOWLEDGEMENT - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast, AS64605/IPv4: Connect - Anycast Cathal Mooney doh2001 and doh2002 being activated shortly, please ingore this for now. 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:45:13] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:45:28] (03PS2) 10Bartosz Dziewoński: Enable DiscussionTools on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/693897 (https://phabricator.wikimedia.org/T283119) (owner: 10Esanders) [16:45:30] (03PS1) 10Bartosz Dziewoński: Enable wgCiteResponsiveReferences on svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/695389 (https://phabricator.wikimedia.org/T281622) [16:45:32] (03PS1) 10Bartosz Dziewoński: Enable VisualEditor on ptwikinews by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/695390 (https://phabricator.wikimedia.org/T282846) [16:45:34] (03PS1) 10Bartosz Dziewoński: Enable VisualEditor on plwikinews by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/695391 (https://phabricator.wikimedia.org/T283033) [16:48:17] 10SRE, 10ops-codfw, 10DBA: Degraded RAID on db2107 - https://phabricator.wikimedia.org/T282072 (10Marostegui) This keeps progressing well: ` root@db2107:~# megacli -pdrbld -showprog -physdrv\[32:5\] -aALL Rebuild Progress on Device at Enclosure 32, Slot 5 Completed 37% in 31 Minutes. 
` [16:48:55] 10ops-codfw, 10DBA: codfw: db2079 memory issue on DIMM B8 - https://phabricator.wikimedia.org/T283743 (10Papaul) [16:49:06] 10ops-codfw, 10DBA: codfw: db2079 memory issue on DIMM B8 - https://phabricator.wikimedia.org/T283743 (10Papaul) p:05Triage→03Medium [16:49:44] (03PS1) 10Volans: script interface automation: fix re-assign of IP [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/695399 (https://phabricator.wikimedia.org/T276760) [16:50:03] (03CR) 10Volans: [V: 03+1] "Tested on netbox-next" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/695399 (https://phabricator.wikimedia.org/T276760) (owner: 10Volans) [16:50:05] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [16:50:15] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:50:35] 10ops-codfw, 10DBA: codfw: db2079 memory issue on DIMM B8 - https://phabricator.wikimedia.org/T283743 (10Marostegui) This is s8 master, so it needs some coordination. Let me know a day/time when you'd like to tackle this and I can have the host ready for you! 
[16:51:01] !log installing libx11 security updates [16:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:25] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search: hw troubleshooting: failure to power up for elastic2043.codfw.wmnet - https://phabricator.wikimedia.org/T281327 (10Papaul) After firmware upgrade, server is not powering up anymore when pressing the power button [16:52:36] (03PS1) 10Ssingh: Update tests for the new Wikidough host and IP [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/695400 (https://phabricator.wikimedia.org/T283027) [16:53:26] (03CR) 10Ssingh: [C: 03+2] Update tests for the new Wikidough host and IP [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/695400 (https://phabricator.wikimedia.org/T283027) (owner: 10Ssingh) [16:54:01] 10SRE, 10Traffic, 10Patch-For-Review: Offer Wikidough as an anycasted service - https://phabricator.wikimedia.org/T283027 (10ssingh) [16:55:48] 10SRE, 10serviceops: codfw appserver latency alerts flapping - https://phabricator.wikimedia.org/T283744 (10CDanis) [16:55:51] 10SRE, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Enable TLS on memcached for cross-dc replication - https://phabricator.wikimedia.org/T271967 (10jijiki) >>! In T271967#7112446, @jbond wrote: >> i wonder if we have considered just having the TLS port every where accept l... 
[16:57:17] (03CR) 10Bstorm: [C: 03+2] "I'll merge this (partly for record keeping) and send up another patch to improve the language in the README" [debs/nfsd-ldap] - 10https://gerrit.wikimedia.org/r/693500 (https://phabricator.wikimedia.org/T283385) (owner: 10Bstorm) [16:58:43] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [16:59:13] 10SRE, 10Traffic, 10netops: Please configure the routers for Wikidough's anycasted IP - https://phabricator.wikimedia.org/T283503 (10cmooney) Merged and pushed with homer to cr1-codfw and cr2-codfw, working ok with the first VM (Bird being enabled on others shortly): ` cmooney@re0.cr2-codfw> show route rece... [16:59:44] 10SRE, 10Services, 10Patch-For-Review, 10Service-deployment-requests: New Service Request Shellbox - https://phabricator.wikimedia.org/T281423 (10Legoktm) [17:00:06] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytic Cluster for Research Contractor (Kay Wong) - https://phabricator.wikimedia.org/T283486 (10diego) @Marostegui contract expires on June 30th. [17:00:23] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [17:01:53] (03PS2) 10Ssingh: site: update and consolidate Wikidough hosts [puppet] - 10https://gerrit.wikimedia.org/r/693933 [17:02:30] !log restarting FPM on mw canaries to pick up libx11 update [17:02:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:31] (03CR) 10Ssingh: [C: 03+2] site: update and consolidate Wikidough hosts [puppet] - 10https://gerrit.wikimedia.org/r/693933 (owner: 10Ssingh) [17:10:50] jouncebot: now [17:10:50] No deployments scheduled for the next 0 hour(s) and 49 minute(s) [17:11:02] I'm going to set the shellbox secret key now [17:11:56] 10SRE, 10Analytics, 10SRE-Access-Requests: Requesting access to production shell for BUmeh - https://phabricator.wikimedia.org/T283648 (10KFrancis) @Marostegui Hello, It's 8/31/2021 [17:14:55] (03PS6) 10Cwhite: logstash: add openstack ECS transition config and tests [puppet] - 10https://gerrit.wikimedia.org/r/689262 (https://phabricator.wikimedia.org/T234565) [17:15:03] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=wikidough site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:15:21] ^ expected, resolving [17:15:46] thx [17:16:37] !log legoktm@deploy1002 Synchronized private/PrivateSettings.php: Set $wgShellboxSecretKey - T281423 (duration: 01m 14s) [17:16:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:40] T281423: New Service Request Shellbox - https://phabricator.wikimedia.org/T281423 [17:16:54] applied the wikidough role to the new hosts for the first time and puppet seems unhappy :P [17:17:25] RECOVERY - BGP status on cr2-codfw is OK: BGP OK 
- up: 96, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:17:42] (03CR) 10Cwhite: [C: 03+1] ceph: send logs to kafka (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/695329 (owner: 10David Caro) [17:17:51] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 69, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:18:57] (03CR) 10Legoktm: [C: 03+2] Add helmfile.d for shellbox [deployment-charts] - 10https://gerrit.wikimedia.org/r/692736 (https://phabricator.wikimedia.org/T281423) (owner: 10Legoktm) [17:20:30] (03CR) 10Muehlenhoff: [C: 03+2] Remove obsolete profile cache:ssl:unified [puppet] - 10https://gerrit.wikimedia.org/r/685811 (owner: 10Muehlenhoff) [17:20:38] (03PS2) 10Muehlenhoff: Remove obsolete profile cache:ssl:unified [puppet] - 10https://gerrit.wikimedia.org/r/685811 [17:21:51] (03Merged) 10jenkins-bot: Add helmfile.d for shellbox [deployment-charts] - 10https://gerrit.wikimedia.org/r/692736 (https://phabricator.wikimedia.org/T281423) (owner: 10Legoktm) [17:22:16] sukhe: puppet on dough2001 seems to work fine [17:22:39] created a bunch of certs [17:22:45] nice, that is what it was missing [17:22:56] wonder why it didn't happen the first time? puppetboard says "skipped" and I was trying to find out why [17:23:28] ok! it resolved itself when I ran it again [17:23:47] 10SRE, 10Machine-Learning-Team, 10Release-Engineering-Team (Radar): Contact number of some WMDE staff should be avalible to SRE/RelEng - https://phabricator.wikimedia.org/T210721 (10WMDE-leszek) 05Open→03Resolved wikidata-emergency@wikimedia.de email address is meant to be used as a means of reaching WMD... [17:23:57] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:24:01] yea, I dunno why "skipped" but it seemed like the "site: update and consolidate Wikidough hosts" patch was applied the first time [17:24:37] yep and https://puppetboard.wikimedia.org/report/doh2002.wikimedia.org/1bd8f7c6f6d6586d2987bc01876a125561aae414 tells me it skipped the acme_chief certs during the first run [17:24:44] something about acme_chief needing 2 runs? dunno [17:24:45] (03CR) 10Legoktm: Add helmfile.d for shellbox (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/692736 (https://phabricator.wikimedia.org/T281423) (owner: 10Legoktm) [17:24:49] mutante: yeah that's my theory [17:24:53] *nod* [17:25:49] (03PS1) 10Legoktm: shellbox: Switch to main release for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/695423 [17:26:23] shouldn't it be named dough2002? [17:26:33] it was malmok before [17:26:55] it could be that acmechief needs a puppet run on the acmechief servers, to update the ACL for downloading the keys to include the new hosts [17:27:01] oh, yea [17:27:05] PROBLEM - etcd request latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:27:23] similar to how for monitoring changes you need puppet run on both icinga server and the host [17:27:37] right [17:27:52] for acmechief, the ACL part is managed there locally (it has its own puppet fileserver) [17:29:15] RECOVERY - etcd request latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:30:17] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:30:23] (03CR) 10Legoktm: [C: 03+2] shellbox: Switch to main release for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/695423 (owner: 10Legoktm) [17:32:15] (03PS6) 10Dzahn: initial Blubberfile, placeholders for prod,stage,test HTML and httpd.conf [container/miscweb] - 10https://gerrit.wikimedia.org/r/690768 (https://phabricator.wikimedia.org/T281538) [17:32:56] (03Merged) 10jenkins-bot: shellbox: Switch to main release for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/695423 (owner: 10Legoktm) [17:33:11] (03CR) 10jerkins-bot: [V: 04-1] initial Blubberfile, placeholders for prod,stage,test HTML and httpd.conf [container/miscweb] - 10https://gerrit.wikimedia.org/r/690768 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [17:33:53] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:37:38] 10SRE, 10Traffic: Offer Wikidough as an anycasted service - https://phabricator.wikimedia.org/T283027 (10ssingh) [17:39:58] !log legoktm@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'shellbox' for release 'main' . 
[17:40:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:48] (03PS1) 10MewOphaswongse: Add Link: Suppress the blue dot on the edit button [extensions/GrowthExperiments] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/695430 (https://phabricator.wikimedia.org/T283094) [17:42:23] (03PS1) 10MewOphaswongse: Add Link: Suppress the blue dot on the edit button [extensions/GrowthExperiments] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/695431 (https://phabricator.wikimedia.org/T283094) [17:44:00] (03PS1) 10Legoktm: shellbox: Set tls.public_port to 4008 as well [deployment-charts] - 10https://gerrit.wikimedia.org/r/695432 [17:45:19] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:48:16] (03CR) 10Legoktm: [C: 03+2] shellbox: Set tls.public_port to 4008 as well [deployment-charts] - 10https://gerrit.wikimedia.org/r/695432 (owner: 10Legoktm) [17:50:21] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:50:49] (03Merged) 10jenkins-bot: shellbox: Set tls.public_port to 4008 as well [deployment-charts] - 10https://gerrit.wikimedia.org/r/695432 (owner: 10Legoktm) [17:51:19] !log legoktm@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'shellbox' for release 'main' . [17:51:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:10] *twiddles thumbs* [17:55:42] jouncebot: refresh [17:55:43] I refreshed my knowledge about deployments. 
[17:58:19] legoktm I can try breaking the wiki in 3 interesting ways if you need something to do ;) [17:58:32] no I already broke what I'm doing [17:58:39] RECOVERY - MegaRAID on db2107 is OK: OK: optimal, 1 logical, 6 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:58:59] it's not a bug, it's a feature which spontaneously generated [18:00:04] twentyafterfour and hashar: (Dis)respected human, time to deploy Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210526T1800). Please do the needful. [18:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor I � Unicode. All rise for Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210526T1800). [18:00:04] MatmaRex and mewoph: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:37] I can deploy today [18:00:55] MatmaRex: hi, around?
[18:01:34] (03PS1) 10Dzahn: CI/pipeline: do not run docker image, just build it [container/miscweb] - 10https://gerrit.wikimedia.org/r/695435 (https://phabricator.wikimedia.org/T281538) [18:02:01] (03CR) 10jerkins-bot: [V: 04-1] CI/pipeline: do not run docker image, just build it [container/miscweb] - 10https://gerrit.wikimedia.org/r/695435 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [18:02:09] hi urbanecm [18:02:11] (03CR) 10Dzahn: [C: 03+2] CI/pipeline: do not run docker image, just build it [container/miscweb] - 10https://gerrit.wikimedia.org/r/695435 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [18:02:30] (03PS2) 10Dzahn: CI/pipeline: do not run docker image, just build it [container/miscweb] - 10https://gerrit.wikimedia.org/r/695435 (https://phabricator.wikimedia.org/T281538) [18:02:46] (03CR) 10jerkins-bot: [V: 04-1] CI/pipeline: do not run docker image, just build it [container/miscweb] - 10https://gerrit.wikimedia.org/r/695435 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [18:02:48] (03CR) 10Urbanecm: [C: 03+2] Add Link: Suppress the blue dot on the edit button [extensions/GrowthExperiments] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/695430 (https://phabricator.wikimedia.org/T283094) (owner: 10MewOphaswongse) [18:02:50] (03CR) 10Urbanecm: [C: 03+2] Add Link: Suppress the blue dot on the edit button [extensions/GrowthExperiments] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/695431 (https://phabricator.wikimedia.org/T283094) (owner: 10MewOphaswongse) [18:03:11] (03CR) 10Urbanecm: [C: 03+2] Enable DiscussionTools on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/693897 (https://phabricator.wikimedia.org/T283119) (owner: 10Esanders) [18:03:35] hello mewoph ! 
[18:03:40] yay im back [18:03:45] welcome :) [18:04:20] (03Merged) 10jenkins-bot: Enable DiscussionTools on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/693897 (https://phabricator.wikimedia.org/T283119) (owner: 10Esanders) [18:05:15] (03PS5) 10Majavah: Enable ULS webfonts by default on trwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/694315 (https://phabricator.wikimedia.org/T283626) (owner: 10Superyetkin) [18:05:23] (03PS3) 10Dzahn: CI/pipeline: do not run docker image, just build it [container/miscweb] - 10https://gerrit.wikimedia.org/r/695435 (https://phabricator.wikimedia.org/T281538) [18:05:42] (03CR) 10jerkins-bot: [V: 04-1] CI/pipeline: do not run docker image, just build it [container/miscweb] - 10https://gerrit.wikimedia.org/r/695435 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [18:05:52] MatmaRex: unfortunately, wikitech patches cannot be tested at mwdebug. I'll have to sync. Please let me know if it doesn't work once synced. 
[18:05:56] (03CR) 10Urbanecm: [C: 03+2] Enable wgCiteResponsiveReferences on svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/695389 (https://phabricator.wikimedia.org/T281622) (owner: 10Bartosz Dziewoński) [18:06:09] urbanecm: okay [18:06:45] (03Merged) 10jenkins-bot: Enable wgCiteResponsiveReferences on svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/695389 (https://phabricator.wikimedia.org/T281622) (owner: 10Bartosz Dziewoński) [18:07:22] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 07b804b48d65057a66461808f2647fee9aca12b7: Enable DiscussionTools on wikitech (T283119) (duration: 01m 05s) [18:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:28] T283119: Offer the Reply and New Discussion Tools at Wikitech - https://phabricator.wikimedia.org/T283119 [18:07:29] (03PS1) 10MewOphaswongse: Add Link: Prevent double-opening of the post-edit dialog [extensions/GrowthExperiments] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/695436 (https://phabricator.wikimedia.org/T283120) [18:07:38] MatmaRex: should be live. I hope it works. [18:08:13] MatmaRex: `Enable wgCiteResponsiveReferences on svwiki` is available at mwdebug, please have a look. 
[18:08:16] (03PS1) 10MewOphaswongse: Add Link: Prevent double-opening of the post-edit dialog [extensions/GrowthExperiments] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/695437 (https://phabricator.wikimedia.org/T283120) [18:08:50] wikitech looks good [18:09:11] great [18:10:29] svwiki also looks good [18:10:33] (03PS7) 10Dzahn: initial Blubberfile, placeholders for prod,stage,test HTML and httpd.conf [container/miscweb] - 10https://gerrit.wikimedia.org/r/690768 (https://phabricator.wikimedia.org/T281538) [18:10:37] thanks, syncing [18:10:45] (03CR) 10Urbanecm: [C: 03+2] Enable VisualEditor on ptwikinews by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/695390 (https://phabricator.wikimedia.org/T282846) (owner: 10Bartosz Dziewoński) [18:10:47] (03Abandoned) 10Dzahn: CI/pipeline: do not run docker image, just build it [container/miscweb] - 10https://gerrit.wikimedia.org/r/695435 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [18:10:49] (03CR) 10Urbanecm: [C: 03+2] Enable VisualEditor on plwikinews by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/695391 (https://phabricator.wikimedia.org/T283033) (owner: 10Bartosz Dziewoński) [18:11:23] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:11:54] (03PS1) 10JMeybohm: httpbb: Allow tests to be templates [puppet] - 10https://gerrit.wikimedia.org/r/695439 (https://phabricator.wikimedia.org/T264209) [18:12:10] (03Merged) 10jenkins-bot: Enable VisualEditor on ptwikinews by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/695390 (https://phabricator.wikimedia.org/T282846) (owner: 10Bartosz Dziewoński) [18:12:20] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 3f66b3b: Enable wgCiteResponsiveReferences on svwiki (T281622) (duration: 01m 06s) [18:12:25] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:26] T281622: Convert reference lists over to `responsive` on svwiki - https://phabricator.wikimedia.org/T281622 [18:13:08] (03Merged) 10jenkins-bot: Enable VisualEditor on plwikinews by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/695391 (https://phabricator.wikimedia.org/T283033) (owner: 10Bartosz Dziewoński) [18:13:35] MatmaRex: both visualeditor patches are ready at mwdebug1001, please test. [18:13:45] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [18:13:46] PROBLEM - LVS sessionstore codfw port 8081/tcp - Session store- sessionstore.svc.eqiad.wmnet IPv4 #page on sessionstore.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:13:52] uhoh [18:13:59] uhhh [18:14:08] hi [18:14:21] (03PS1) 10Andrew Bogott: Convert cloudvirt1018 to a local-storage hypervisor [puppet] - 10https://gerrit.wikimedia.org/r/695441 (https://phabricator.wikimedia.org/T283296) [18:14:22] codfw had such spike in last 3 hours, so hopefully not something _i_ did [18:14:38] but i'm waiting on clarification whether it's safe to continue deployment [18:14:39] sessionstore paged [18:14:42] https://grafana.wikimedia.org/d/000001590/sessionstore?orgId=1 has a drop off [18:15:17] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - sessionstore_8081: Servers kubernetes2007.codfw.wmnet, kubernetes2013.codfw.wmnet, kubernetes2012.codfw.wmnet, kubernetes2015.codfw.wmnet are marked down but pooled 
https://wikitech.wikimedia.org/wiki/PyBal [18:15:25] I can still login [18:15:39] looks like a bunch of deletes just before the dropoff? [18:15:57] oh no, I misread the graph [18:16:01] (ptwiki and plwiki changes look good, btw) [18:16:01] PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [18:16:21] is it eqiad or codfw? [18:16:25] <_joe_> codfw [18:16:27] ack MatmaRex, not syncing until the story behind alerts is clearer [18:16:29] might be me [18:16:31] yeah [18:16:37] <_joe_> jayme: what did you do? [18:16:42] the sessionstore alert is in eqiad though [18:16:43] * jbond here [18:16:45] here [18:16:46] the icinga message said both sessionstore.svc.eqiad.wmnet and sessionstore.svc.codfw.wmnet [18:16:49] maybe overloading the nodes _joe_ [18:17:07] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:17:16] RECOVERY - LVS sessionstore codfw port 8081/tcp - Session store- sessionstore.svc.eqiad.wmnet IPv4 #page on sessionstore.svc.codfw.wmnet is OK: OK - Certificate sessionstore.discovery.wmnet will expire on Tue 28 May 2024 05:38:58 PM GMT +0000. https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:17:16] <_joe_> jayme: a lot of evictions indeed [18:17:18] here [18:17:31] :) nice [18:17:32] <_joe_> I'm looking at kubectl get pods [18:17:38] a key calculation change? [18:17:49] RECOVERY - High average GET latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [18:17:51] we were talking about it on meet [18:18:16] codfw-only and caused by ongoing work [18:18:19] _joe_: I have not stopped the tests yet [18:18:40] <_joe_> jayme: this is a common issue for kask it seems [18:18:44] don't know what exactly happened that could have caused the nodes to drain pods [18:18:50] <_joe_> it's getting evicted a lot, and not just now [18:18:56] oh [18:18:57] just catching up, does anything still need doing? [18:19:13] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [18:19:19] I can't tell if this is still a mystery or if it's figured out :) [18:19:22] <_joe_> rzl: yes, understanding why sessionstore pods keep getting evicted [18:19:29] ack [18:19:39] anyways, the eqiad part in the sessionstore alert is a red herring [18:19:45] mutante: the alert mentioned eqiad, "PROBLEM - LVS sessionstore codfw port 8081/tcp - Session store- sessionstore.svc.eqiad.wmnet", are you sure it's "codfw-only"? [18:19:46] why did the error message say both eqiad? was it checking the certs with .eqiad. name on codfw pods?
[18:20:02] urbanecm: the host is codfw only, the service on the host just contains both strings [18:20:11] <_joe_> no, it's an error in some stupid icinga check [18:20:13] in hieradata/common/service.yaml, there's a description line: [18:20:14] description: Session store, sessionstore.svc.%{::site}.wmnet [18:20:21] <_joe_> we can think about it later [18:20:25] and that %site is templated from the icinga server's site, not the host being tested [18:20:31] <_joe_> yes that [18:20:36] the second hostname in the message is what matters [18:20:46] <_joe_> ok so, let's focus on the actual issues [18:21:00] <_joe_> jayme: do you remember if sessionstore has a pod affinity rule? [18:21:11] _joe_: could it be that re-scheduling is just really slow because of my tests? [18:21:16] <_joe_> 33m Warning Evicted pod/kask-production-6d6869b697-zxwlj The node had condition: [DiskPressure]. [18:21:18] <_joe_> sigh [18:21:22] <_joe_> disk pressure [18:21:31] <_joe_> jayme: that might well be your tests [18:21:42] hmm...that shouldn't be me as I delete the image right away [18:22:15] <_joe_> jayme: wasn't sessionstore running only on reserved nodes? [18:22:15] so the tests should not accumulate disk space at all [18:22:23] ah, yeah.
Indeed [18:22:49] tests stopped btw [18:23:24] <_joe_> we had also [18:23:27] <_joe_> Warning Unhealthy 19m kubelet, kubernetes2005.codfw.wmnet Readiness probe failed: Get https://10.192.71.14:8081/healthz: net/http: request canceled (Client.Timeout exceeded while awaiting headers) [18:23:39] (03Merged) 10jenkins-bot: Add Link: Suppress the blue dot on the edit button [extensions/GrowthExperiments] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/695430 (https://phabricator.wikimedia.org/T283094) (owner: 10MewOphaswongse) [18:23:42] (03Merged) 10jenkins-bot: Add Link: Suppress the blue dot on the edit button [extensions/GrowthExperiments] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/695431 (https://phabricator.wikimedia.org/T283094) (owner: 10MewOphaswongse) [18:23:46] <_joe_> that's bad, and seems to be a sessionstore/cassandra issue rather than a k8s one [18:24:02] but might ofc all be related to high load [18:24:37] <_joe_> can someone look at the cassandra metrics and pick up the investigation? and maybe involve urandom et al. :) [18:24:52] the reserved nodes are ganeti VMs [18:25:03] <_joe_> It's past 8 pm here and I had a couple beers, I'd rather not be driving this [18:25:08] 10SRE, 10Analytics, 10SRE-Access-Requests: Requesting access to production shell for BUmeh - https://phabricator.wikimedia.org/T283648 (10JAnstee_WMF) @KFrancis thanks for confirming that! @Ottomata Yes, Benjamin will need kerberos, he has also filed a separate [[ https://phabricator.wikimedia.org/T283710... [18:25:30] with only 20GB storage. 
So might be a candidate for disk pressure [18:26:00] I joined Giuseppe with beers already :-| [18:27:20] <_joe_> jayme: https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=6&orgId=1&var-server=kubernetes2005&var-datasource=thanos&var-cluster=kubernetes this looks a lot like you btw [18:27:26] <_joe_> :P [18:28:15] on kubernetes2005 it is just 10G (and using 50% of it) and the docker overlay fs is what we exclude from Icinga DISK alerts afair [18:28:16] yup. I wasn't aware that they are that low on disk space [18:30:41] so we're confident that the issue was most likely caused by the registry pull testing? [18:31:12] I thought so, but I'm not sure about the cassandra angle yet [18:32:08] so it all sounds like it was the tests (matches what we see in disk utilization graph).. just why isn't that back to normal yet [18:32:36] it might be that the docker daemon is finishing the last pull [18:33:08] as I don't use the docker client directly (but curl) and did just kill the curl commands [18:33:10] if there is not another new spike after that now... [18:34:38] per jayme: image pulled fine..
and disk utilization looks down [18:35:02] sorry all for the noise :-| [18:35:24] looks like we can call it over [18:35:37] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [18:36:03] PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [18:36:43] is the codfw appserver noise possibly related? [18:36:59] urbanecm: no, been flapping for a while now https://phabricator.wikimedia.org/T283744 [18:37:03] PROBLEM - Sessionstore codfw on sessionstore.svc.codfw.wmnet is CRITICAL: /sessions/v1/{key} (Get value for key) timed out before a response was received: /sessions/v1/{key} (Store value for key) timed out before a response was received https://www.mediawiki.org/wiki/Kask [18:37:05] was just checking - in this instance I think yes, unlike the false alarms we were talking about earlier :) should clear though [18:37:30] dont think it's related, what cdanis said [18:37:52] okay, great. OK to resume my (MW) deployments? 
[18:38:00] urbanecm: wait please [18:38:07] sure, i'm waiting [18:39:29] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - sessionstore_8081: Servers kubernetes2007.codfw.wmnet, kubernetes2004.codfw.wmnet, kubernetes2009.codfw.wmnet, kubernetes2013.codfw.wmnet, kubernetes2002.codfw.wmnet, kubernetes2001.codfw.wmnet, kubernetes2012.codfw.wmnet, kubernetes2016.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:39:43] PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [18:40:36] I'll look at the sessionstore deployment in codfw [18:40:40] disk utilization is still low as it should be [18:40:45] I'm messing up with mwdebug1001 [18:41:06] PROBLEM - LVS sessionstore codfw port 8081/tcp - Session store- sessionstore.svc.eqiad.wmnet IPv4 #page on sessionstore.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:41:38] (03PS1) 10Bstorm: nfs: fix the scratch mount setup [puppet] - 10https://gerrit.wikimedia.org/r/695447 (https://phabricator.wikimedia.org/T224747) [18:42:01] confirmed still codfw-only [18:42:13] https://grafana.wikimedia.org/d/000000418/cassandra?orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus%2Fservices&var-cluster=sessionstore&var-keyspace=sessions&var-table=values&var-quantile=99p [18:42:48] shows high read latency in cassandra ^ [18:43:05] codfw kubernetes nodes are all in ready state [18:43:07] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy 
https://wikitech.wikimedia.org/wiki/PyBal [18:43:20] <_joe_> jayme: yeah it's cassandra indeed [18:43:21] RECOVERY - High average GET latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [18:43:27] * volans here [18:43:53] <_joe_> as I said, please take a look at what's going on in cassandra (start from the logs) and summon urandom :) [18:43:58] contacted urandom [18:43:59] session store deployment looked okay apart from one pod that was crashlooping [18:44:02] <_joe_> and / or hnowlan [18:44:09] * urandom manifests [18:44:12] I restarted it and the recovery came in [18:44:18] we think it was the pybal check leaving the connection open [18:44:38] RECOVERY - LVS sessionstore codfw port 8081/tcp - Session store- sessionstore.svc.eqiad.wmnet IPv4 #page on sessionstore.svc.codfw.wmnet is OK: OK - Certificate sessionstore.discovery.wmnet will expire on Tue 28 May 2024 05:38:58 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:44:39] <_joe_> urandom: sessionstore in codfw is farting because cassandra has higher than normal latencies, it seems [18:44:48] (03CR) 10Bstorm: "A bit of commentary on this:" [puppet] - 10https://gerrit.wikimedia.org/r/695447 (https://phabricator.wikimedia.org/T224747) (owner: 10Bstorm) [18:45:53] could it be that cassandra has high latency because the bunch of sessionstores were restarting (probably quite often) [18:46:19] RECOVERY - Sessionstore codfw on sessionstore.svc.codfw.wmnet is OK: WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by ConnectTimeoutError(urllib3.connection.VerifiedHTTPSConnection object at 0x7fb9b460ae80, Connection to sessionstore.svc.codfw.wmnet timed out. (connect timeout=15)): /openapi https://www.mediawiki.org/wiki/Kask [18:47:40] is there anything I can help with? If not dinner is coming up around here... ;) [18:48:33] one pod was in "crashloopbackoff" state, was deleted, then pybal recovery came in [18:48:39] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_sessionstore_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:49:36] https://logstash.wikimedia.org/goto/5c8ebea56fce7b230499893bdc80b20c we had calico restarting on kubernetes2015 [18:49:57] that might have caused LVS health checks to get stuck as well [18:50:24] volans: don't think so, people are on it and the checks have recovered. all pods are in healthy states [18:50:28] and nodes [18:50:29] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:50:48] ack, feel free to ping me in VO and I can come back in a couple of minutes [18:50:51] ttyl [18:51:24] now it was just about why the health check was stuck / that pod had to be deleted [18:51:39] PROBLEM - Sessionstore codfw on sessionstore.svc.codfw.wmnet is CRITICAL: /sessions/v1/{key} (Store value for key) timed out before a response was received https://www.mediawiki.org/wiki/Kask [18:52:06] PROBLEM - LVS sessionstore codfw port 8081/tcp - Session store- sessionstore.svc.eqiad.wmnet IPv4 #page on sessionstore.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:52:45] it looks like it is cassandra having issues? [18:52:50] so sessionstore service and k8s are both in a good state now. This is cassandra [18:52:50] urandom: cassandra has not recovered [18:53:46] RECOVERY - LVS sessionstore codfw port 8081/tcp - Session store- sessionstore.svc.eqiad.wmnet IPv4 #page on sessionstore.svc.codfw.wmnet is OK: OK - Certificate sessionstore.discovery.wmnet will expire on Tue 28 May 2024 05:38:58 PM GMT +0000. https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:54:20] mutante: ?? [18:54:41] so we're looking at https://grafana.wikimedia.org/d/000000418/cassandra?orgId=1&from=now-1h&to=now&var-datasource=codfw%20prometheus%2Fservices&var-cluster=sessionstore&var-keyspace=sessions&var-table=values&var-quantile=99p [18:54:50] where there's higher read latency and lower read rate [18:55:11] RECOVERY - Sessionstore codfw on sessionstore.svc.codfw.wmnet is OK: All endpoints are healthy https://www.mediawiki.org/wiki/Kask [18:55:19] and the sessionstore pods are no longer having issues but it seems like cassandra in codfw is? 
[18:55:38] (I'm not familiar with sessionstore so not really sure) [18:56:16] so ss2002 in /var/log/cassandra/system-a.log has some interesting bits like: [18:56:25] INFO [CompactionExecutor:224244] 2021-05-26 13:53:43,120 NoSpamLogger.java:91 - Maximum memory usage reached (536870912), cannot allocate chunk of 1048576 [18:56:33] (03CR) 10Bstorm: "Also, note that I've changed to using nfs-maps.wikimedia.org. That IP was there to serve the maps hard mounts for home and project. Since " [puppet] - 10https://gerrit.wikimedia.org/r/695447 (https://phabricator.wikimedia.org/T224747) (owner: 10Bstorm) [18:56:42] Krinkle: I'm trying to drain a hypervisor and the VM named 'integration-agent-qemu-1001 ' is giving me a lot of trouble. Is it easy to rebuild that if I delete it or should I keep trying to rescue it? [18:57:02] bblack: that sounds scary, but isn't [18:57:09] ok [18:57:50] I'm trying to see what I can see, but the Cassandra dashboards look pretty normal, and I see nothing above INFO being logged [18:57:59] may i ask what is the status of the backport window? was the issue resolved, or determined to be unrelated? [18:58:24] MatmaRex: on hold while we sort this out -- the problem is almost certainly unrelated to anything that was deployed, but we want to get this resolved before continuing [18:58:52] thanks [18:59:22] soo.. the read latency on cassandra started to go up but only since the tests were stopped and sessionstore itself is healthy again [18:59:38] kask has logged "Error writing to storage (gocql: no hosts available in the pool)" twice in the last 10 minutes [19:00:04] twentyafterfour and hashar: Time to snap out of that daydream and deploy MediaWiki train - American+European Version. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210526T1900). 
[19:00:33] twentyafterfour, hashar: see above, please hold -- sorry to back up the calendar, we'll be through here as soon as we safely can [19:00:47] to what eric was saying https://logstash.wikimedia.org/goto/b319d90fe4434b79b94970df8b745ff5 [19:01:09] rzl: no problem, standing by [19:01:11] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [19:01:15] mutante: where are you seeing that? [19:01:29] disk utilization on kubernetes2005 (VM, one of the reserved nodes) completely stopped since around 18:30.. that is when read latency on cassandra started to go up [19:02:05] urandom: https://grafana.wikimedia.org/d/000000418/cassandra?orgId=1&from=now-1h&to=now&var-datasource=codfw%20prometheus%2Fservices&var-cluster=sessionstore&var-keyspace=sessions&var-table=values&var-quantile=99p [19:02:20] twentyafterfour: note I have few patches merged at deployment host from B&C, so once this resolves, I'd like to sync them (should be fast enough, won't be long). [19:02:21] <_joe_> urandom: did we let the cert expire? 
[19:02:39] tcp retransmits look high, could be side-effect though https://grafana-rw.wikimedia.org/d/000000377/host-overview?viewPanel=31&orgId=1&var-server=sessionstore2001&var-datasource=thanos&var-cluster=sessionstore [19:02:47] urbanecm: ack [19:03:06] mutante: what am I missing, I don't see any increase in read latency [19:03:07] those read latencies seem not-crazy, they only went up from ~150us to ~180us [19:03:48] <_joe_> please look at the logs jayme posted [19:04:06] <_joe_> it says that the pods have trouble completing the tls handshake to cassandra [19:04:07] oh I see, the top-5 99p went up a few usecs [19:04:19] (03PS2) 10Andrew Bogott: Convert cloudvirt1018 to a local-storage hypervisor [puppet] - 10https://gerrit.wikimedia.org/r/695441 (https://phabricator.wikimedia.org/T283296) [19:04:40] <_joe_> it's what you should be looking at (and might be connected to tcp retransmits, re: shdubsh) [19:05:00] (03CR) 10Andrew Bogott: [C: 03+2] Convert cloudvirt1018 to a local-storage hypervisor [puppet] - 10https://gerrit.wikimedia.org/r/695441 (https://phabricator.wikimedia.org/T283296) (owner: 10Andrew Bogott) [19:06:09] is the tls handshake in question, one to sessionstore.svc.codfw.wmnet:8081? [19:06:29] yeah, that's "normal", it's a health test as I remember [19:07:26] <_joe_> oh gosh [19:08:20] curl -S https://sessionstore.svc.codfw.wmnet:8081/healthz works for me [19:08:28] <_joe_> the tls handshake is to apertium.svc.codfw.wmnet [19:08:32] <_joe_> this makes no sense [19:08:35] I think the normal background noise might be more like: [19:08:36] http: TLS handshake error from 10.2.1.11:50812: EOF [19:08:39] not: [19:08:44] Error reading from storage (gocql: no hosts available in the pool) [19:08:49] what's apertium? 
[19:08:57] bblack: yes [19:08:59] <_joe_> it's a completely unrelated service [19:09:12] Error reading from storage (gocql: no hosts available in the pool) <-- that's not normal [19:09:12] confirmed, sessionstore logs that tls handshake error 70k times per day [19:09:15] going back 14 days [19:09:19] <_joe_> sigh [19:09:26] <_joe_> why on earth [19:09:33] it's the underlying http library in Go that logs it [19:09:35] <_joe_> anyways, scratch that then [19:10:24] (03CR) 10Kosta Harlan: [C: 03+1] GrowthExperiments: Enable Add Links for 50% of new users and all old ones [mediawiki-config] - 10https://gerrit.wikimedia.org/r/695364 (https://phabricator.wikimedia.org/T277356) (owner: 10Gergő Tisza) [19:12:51] <_joe_> so yeah, it seems that one specific pod can't connect right now [19:13:33] <_joe_> kask-production-6d6869b697-m2qjs [19:14:16] <_joe_> you might want to delete that pod, or try to figure out what's wrong there [19:14:41] _joe_: who is "you" in this context? :) [19:14:49] it's me :) [19:14:53] 👍 [19:14:56] <_joe_> it's running on kubernetes2005, you can just go to the server and look at it [19:14:59] !log otto@deploy1002 Started deploy [analytics/refinery@b787999]: Regular analytics weekly train [19:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:10] <_joe_> jayme: I was thinking more an sre in a more forgiving TZ [19:15:12] <_joe_> :) [19:15:28] the one definitely has the highest log volume [19:15:39] ohai [19:15:43] <_joe_> legoktm / mutante how do you feel about killing a pod ? [19:15:48] should I just delete the pod? 
[19:15:50] (03PS1) 10STran: Disable partial action blocks on beta dewiki instead of eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/695459 (https://phabricator.wikimedia.org/T283716) [19:16:17] <_joe_> legoktm: the alternative is ssh to kubernetes2005 and figure out what is going on with that process :) [19:16:22] !log otto@deploy1002 Finished deploy [analytics/refinery@b787999]: Regular analytics weekly train (duration: 01m 23s) [19:16:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:45] <_joe_> legoktm: just delete it :P [19:17:08] <_joe_> you will have to use the admin credentials [19:17:11] !log legoktm@deploy1002:~$ sudo -E kubectl delete pod kask-production-6d6869b697-m2qjs -n sessionstore [19:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:19] > pod "kask-production-6d6869b697-m2qjs" deleted [19:17:31] (03CR) 10Tchanders: [C: 03+1] Disable partial action blocks on beta dewiki instead of eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/695459 (https://phabricator.wikimedia.org/T283716) (owner: 10STran) [19:17:53] running afk a few minutes -- if everything is healthy before I get back please remember to give urbanecm and twentyafterfour the go-ahead to catch up on backports and then train [19:18:52] <_joe_> I'm keeping an eye on logstash [19:19:40] the logstash seems to have dropped off [19:19:54] (so far!) [19:20:06] * urandom knocks on wood [19:20:07] <_joe_> yeah no messages from any other pod over the last 15 minutes [19:21:05] seeing Evicted pods but learning we dont need to worry about them, just dangling API objects [19:21:13] <_joe_> ok, I think we can give the deployers the green light? 
[19:21:19] +1 [19:21:29] !log otto@deploy1002 Started deploy [analytics/refinery@c02cef1]: Regular analytics weekly train take 2 [19:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:31] <_joe_> urandom, twentyafterfour please resume your activities [19:21:37] :) [19:21:40] <_joe_> as you can see, ottomata never stopped [19:21:43] _joe_: i think you mean urbanecm ? [19:21:48] <_joe_> urbanecm: you too sorry :D [19:21:56] thanks [19:22:00] I plan to resume my activities too tho [19:22:04] MatmaRex: syncing the ve patches [19:22:16] oh, thanks [19:22:21] so somehow that one pod resolved the cassandra hosts wrongly..? [19:22:42] <_joe_> cdanis: no, somehow multiple pods had issues connecting to cassandra at some point [19:22:46] eh sorry ok reading backscroll.. [19:23:02] <_joe_> but after the storm was over, one pod kept seeing all cassandra nodes unhealthy [19:23:12] <_joe_> I'd bet some bug in the go cassandra driver [19:23:13] (03CR) 10Wolfgang Kandek: "Igor, the third contractor from speed and function does not have an expiration set." [puppet] - 10https://gerrit.wikimedia.org/r/695375 (owner: 10Muehlenhoff) [19:23:39] fwiw, compactions on each sessionstore node kicked off near the time the alerts started coming in. [19:23:42] (sorry i think refinery deploy is totally separate from mw deploy train? don't have full context) [19:23:55] _joe_: do we know what started all of this? 
[19:23:55] <_joe_> shdubsh: there *was* some cassandra pressure [19:24:06] <_joe_> urandom: nope [19:24:09] latency dropped off around the time the last node finished its compaction [19:24:20] <_joe_> this isn't the first time I see some issue with the codfw sessionstore btw [19:24:25] so, side-track followup: a lot of services in hieradata/common/service.yaml have %{::site}.wmnet in their descriptions, and some even in their check_command [19:24:35] <_joe_> there was some timeouts during the weekend while I was at the hackathon [19:24:39] (03PS1) 10Bstorm: clarify the language in the README a bit [debs/nfsd-ldap] - 10https://gerrit.wikimedia.org/r/695462 [19:24:43] shdubsh: that was 1 compaction! [19:24:43] AFAICS it's all a superficial confusion problem, but should probably clean those up [19:25:38] if it were any lower it'd be no compaction :) [19:25:38] !log urbanecm@deploy1002 Synchronized dblists/visualeditor-nondefault.dblist: 80abdf9: 92d2952: Enable VisualEditor by default at ptwikinews and plwikinews (T282846, T283033) (duration: 01m 09s) [19:25:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:44] T283033: Visual Editor for Polish WIkinews - https://phabricator.wikimedia.org/T283033 [19:25:44] T282846: Add the Visual Editor to ptwikinews - https://phabricator.wikimedia.org/T282846 [19:26:26] mewoph: your patches are at mwdebug1001 if you can check. [19:26:32] MatmaRex: your VE patches are live [19:27:02] thanks [19:27:06] at some point we lost calico components in k8s (https://logstash.wikimedia.org/goto/9bce82ec46f538de3396b3f3e31a0917) might be due to the pressure as well. Will take a closer look into that tomorrow [19:27:32] _joe_: are you seeing something I am not, or just using a generous version of "pressure"? [19:27:58] i.e. something one minute strictly greater than the previous minute [19:28:11] lgtm, thanks! 
[19:28:33] <_joe_> urandom: the most outstanding thing I see are the tcp retransmits tbh [19:29:07] <_joe_> so I'm not sure if this was just a networking issue (something wrong with that pod's networking for some reason) [19:29:17] (03PS6) 10Superyetkin: Enable ULS webfonts by default on trwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/694315 (https://phabricator.wikimedia.org/T283626) [19:29:33] mewoph: syncing [19:29:33] k [19:29:37] <_joe_> or something re:cassandra we don't see, because what we see is mild at best [19:29:54] (03CR) 10Superyetkin: "Can this be merged now?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/694315 (https://phabricator.wikimedia.org/T283626) (owner: 10Superyetkin) [19:29:57] if the theory is true that some calico components were lost then that would explain pod networking issues [19:30:28] <_joe_> mutante: you can try and dig in the calico logs :) [19:30:47] _joe_: yeah, on a scale of 1-10 that cluster is usually doing a .01, and hasn't gone over .03, and the margin for error is like .8 [19:31:36] <_joe_> even the number of retransmits is not big by any measure [19:31:41] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.6/extensions/GrowthExperiments/modules/homepage/suggestededits/ext.growthExperiments.SuggestedEdits.Guidance.js: 512d72e8df4ce0325778035d0bc6107e6e5dedf0: Add Link: Suppress the blue dot on the edit button (T283094) (duration: 01m 07s) [19:31:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:45] T283094: Implement code for hiding the blue dot on Edit tab for users who opt-in to link recommendation experiment - https://phabricator.wikimedia.org/T283094 [19:32:41] <_joe_> urandom: so, we have a lot of hypotheses here, the only thing that seems somewhat clear is that there is something funny going on with go's cassandra driver which seemed to be in a unrecoverable failure loop [19:33:00] <_joe_> now if this happens again, I'll dig a bit deeper [19:33:08] 
!log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.7/extensions/GrowthExperiments/modules/homepage/suggestededits/ext.growthExperiments.SuggestedEdits.Guidance.js: 9f3410b1fc5535b34d49e287846c0b3c08882bc5: Add Link: Suppress the blue dot on the edit button (T283094) (duration: 01m 07s) [19:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:21] twentyafterfour: the floor is yours. Sorry for stealing 30 mins from train :) [19:33:25] mewoph: your patches are live [19:33:32] urbanecm: no problem at all [19:33:34] _joe_: meaning, when it gets to the point when it's logging "no more hosts" that it never recovers, when we expect that it would? [19:33:42] urbanecm thanks! [19:33:46] <_joe_> urandom: exactly [19:33:53] <_joe_> for now, I'm going back to my evening though :) [19:33:54] _joe_: but didn't we restart everything? [19:34:06] mewoph: any time :) [19:35:49] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={LIST,PATCH} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:36:43] PROBLEM - etcd request latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:39:31] RECOVERY - etcd request latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:39:47] 10SRE, 10netops: Netbox has incorrect email address for GTT - https://phabricator.wikimedia.org/T246564 (10wiki_willy) Sure, no prob. I just sent out an email, for our rep to loop in the Customer Success Manager. You're copied on it, so feel free to chime in on the reply. Thanks, Willy >>! In T246564#711608... 
[19:40:01] RECOVERY - k8s API server requests latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:42:00] !log otto@deploy1002 Started deploy [analytics/refinery@c02cef1]: Regular analytics weekly train take 3 [19:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:01] !log otto@deploy1002 Finished deploy [analytics/refinery@c02cef1]: Regular analytics weekly train take 3 (duration: 01m 00s) [19:43:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:22] !log otto@deploy1002 Started deploy [analytics/refinery@c02cef1] (thin): Regular analytics weekly train THIN [19:44:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:29] !log otto@deploy1002 Finished deploy [analytics/refinery@c02cef1] (thin): Regular analytics weekly train THIN (duration: 00m 07s) [19:44:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:57] !log train is unblocked, proceeding to deploy wmf.7 to group1 wikis refs T281148 [19:45:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:00] T281148: 1.37.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T281148 [19:45:41] !log otto@deploy1002 Started deploy [analytics/refinery@c02cef1] (hadoop-test): Regular analytics weekly train [19:45:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:15] (03PS1) 1020after4: group1 wikis to 1.37.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/695475 [19:46:17] (03CR) 1020after4: [C: 03+2] group1 wikis to 1.37.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/695475 (owner: 1020after4) [19:48:25] (03Merged) 10jenkins-bot: group1 wikis to 1.37.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/695475 (owner: 1020after4) [19:50:33] !log 
twentyafterfour@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.37.0-wmf.7 [19:50:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:53] !log otto@deploy1002 Finished deploy [analytics/refinery@c02cef1] (hadoop-test): Regular analytics weekly train (duration: 05m 12s) [19:50:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:41] !log twentyafterfour@deploy1002 Synchronized php: group1 wikis to 1.37.0-wmf.7 (duration: 01m 07s) [19:51:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:36] (03PS4) 10Ottomata: [WIP] airflow 2 [puppet] - 10https://gerrit.wikimedia.org/r/694514 (https://phabricator.wikimedia.org/T272973) [19:55:27] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1018.eqiad.wmnet with reason: REIMAGE [19:55:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:01] (03CR) 10jerkins-bot: [V: 04-1] [WIP] airflow 2 [puppet] - 10https://gerrit.wikimedia.org/r/694514 (https://phabricator.wikimedia.org/T272973) (owner: 10Ottomata) [19:57:33] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1018.eqiad.wmnet with reason: REIMAGE [19:57:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:31] !log finished deploying wmf.7 and error levels appear unchanged. refs T281148 [19:58:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:34] T281148: 1.37.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T281148 [19:59:16] 10SRE, 10Analytics, 10SRE-Access-Requests: Requesting access to production shell for BUmeh - https://phabricator.wikimedia.org/T283648 (10Marostegui) @Bumeh-ctr can you post your ssh key on wikitech with your bumeh-ctr account on your, user page for instance, so it can be verified? Alternatively, it can also... 
[20:00:04] twentyafterfour and hashar: (Dis)respected human, time to deploy MediaWiki train - American+European Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210526T1900). Please do the needful. [20:00:04] chrisalbon and accraze: #bothumor I � Unicode. All rise for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210526T2000). [20:01:43] (03PS1) 10MewOphaswongse: Always delete from search index in AddLinkSubmissionHandler [extensions/GrowthExperiments] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/695478 (https://phabricator.wikimedia.org/T283606) [20:03:12] (03PS1) 10MewOphaswongse: Always delete from search index in AddLinkSubmissionHandler [extensions/GrowthExperiments] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/695479 (https://phabricator.wikimedia.org/T283606) [20:06:35] (03PS1) 10Ottomata: Bump refinery::job::canary_events to 0.1.13 [puppet] - 10https://gerrit.wikimedia.org/r/695480 (https://phabricator.wikimedia.org/T270138) [20:06:50] (03CR) 10jerkins-bot: [V: 04-1] Bump refinery::job::canary_events to 0.1.13 [puppet] - 10https://gerrit.wikimedia.org/r/695480 (https://phabricator.wikimedia.org/T270138) (owner: 10Ottomata) [20:07:24] (03PS2) 10Ottomata: Bump refinery::job::canary_events to 0.1.13 [puppet] - 10https://gerrit.wikimedia.org/r/695480 (https://phabricator.wikimedia.org/T270138) [20:09:56] (03CR) 10Ottomata: [C: 03+2] Bump refinery::job::canary_events to 0.1.13 [puppet] - 10https://gerrit.wikimedia.org/r/695480 (https://phabricator.wikimedia.org/T270138) (owner: 10Ottomata) [20:32:02] (03PS5) 10Ottomata: [WIP] airflow 2 [puppet] - 10https://gerrit.wikimedia.org/r/694514 (https://phabricator.wikimedia.org/T272973) [20:34:13] (03CR) 10Dzahn: [C: 03+2] initial Blubberfile, placeholders for prod,stage,test HTML and httpd.conf [container/miscweb] - 10https://gerrit.wikimedia.org/r/690768 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) 
[20:34:28] (03CR) 10jerkins-bot: [V: 04-1] [WIP] airflow 2 [puppet] - 10https://gerrit.wikimedia.org/r/694514 (https://phabricator.wikimedia.org/T272973) (owner: 10Ottomata) [20:35:17] (03Merged) 10jenkins-bot: initial Blubberfile, placeholders for prod,stage,test HTML and httpd.conf [container/miscweb] - 10https://gerrit.wikimedia.org/r/690768 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [20:45:36] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:49:27] 10SRE: Icinga alerts mention the wrong data center - https://phabricator.wikimedia.org/T283762 (10RLazarus) [20:58:33] (03PS6) 10Ottomata: [WIP] airflow 2 [puppet] - 10https://gerrit.wikimedia.org/r/694514 (https://phabricator.wikimedia.org/T272973) [20:59:59] (03CR) 10jerkins-bot: [V: 04-1] [WIP] airflow 2 [puppet] - 10https://gerrit.wikimedia.org/r/694514 (https://phabricator.wikimedia.org/T272973) (owner: 10Ottomata) [21:07:54] PROBLEM - SSH on mw1279.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:08:03] (03PS7) 10Ottomata: [WIP] airflow 2 [puppet] - 10https://gerrit.wikimedia.org/r/694514 (https://phabricator.wikimedia.org/T272973) [21:09:32] (03CR) 10jerkins-bot: [V: 04-1] [WIP] airflow 2 [puppet] - 10https://gerrit.wikimedia.org/r/694514 (https://phabricator.wikimedia.org/T272973) (owner: 10Ottomata) [21:13:10] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:18:25] 10SRE, 10serviceops: Redirect docker-registry URLs with tags in them to the static /tags/ HTML page - https://phabricator.wikimedia.org/T283764 (10Legoktm) [21:19:52] (03PS8) 10Ottomata: [WIP] airflow 2 [puppet] - 10https://gerrit.wikimedia.org/r/694514 (https://phabricator.wikimedia.org/T272973) 
[21:21:19] (CR) jerkins-bot: [V: -1] [WIP] airflow 2 [puppet] - https://gerrit.wikimedia.org/r/694514 (https://phabricator.wikimedia.org/T272973) (owner: Ottomata)
[21:22:21] !log T283606: running mwscript extensions/GrowthExperiments/maintenance/fixLinkRecommendationData.php --wiki={ar,bn,cs,vi}wiki --verbose --search-index
[21:22:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:22:26] T283606: Add a link: too many articles have no suggestions upon arrival - https://phabricator.wikimedia.org/T283606
[21:31:15] (PS9) Ottomata: [WIP] airflow 2 [puppet] - https://gerrit.wikimedia.org/r/694514 (https://phabricator.wikimedia.org/T272973)
[21:32:40] (CR) jerkins-bot: [V: -1] [WIP] airflow 2 [puppet] - https://gerrit.wikimedia.org/r/694514 (https://phabricator.wikimedia.org/T272973) (owner: Ottomata)
[21:34:08] SRE, serviceops: codfw appserver latency alerts flapping - https://phabricator.wikimedia.org/T283744 (jijiki) I think there is a correlation between the latency spikes and TCP retransmits [[https://grafana-rw.wikimedia.org/d/5E7tdiGWz/xxxx-effie?viewPanel=34&orgId=1&var-datasource=codfw%20prometheus%2Fo...
[21:40:47] (CR) Ottomata: "Looking good!" [puppet] - https://gerrit.wikimedia.org/r/694514 (https://phabricator.wikimedia.org/T272973) (owner: Ottomata)
[21:41:03] (PS10) Ottomata: Airflow puppetization + airflow@analytics on an-test-coord1001 [puppet] - https://gerrit.wikimedia.org/r/694514 (https://phabricator.wikimedia.org/T272973)
[21:41:32] (PS1) Ladsgroup: resourceloader: Avoid opening a connection to master when not needed [core] (wmf/1.37.0-wmf.6) - https://gerrit.wikimedia.org/r/695320
[21:41:49] (PS1) Ladsgroup: resourceloader: Avoid opening a connection to master when not needed [core] (wmf/1.37.0-wmf.7) - https://gerrit.wikimedia.org/r/695321
[21:49:27] (CR) Ladsgroup: [C: +2] resourceloader: Avoid opening a connection to master when not needed [core] (wmf/1.37.0-wmf.6) - https://gerrit.wikimedia.org/r/695320 (owner: Ladsgroup)
[21:49:31] (CR) Ladsgroup: [C: +2] resourceloader: Avoid opening a connection to master when not needed [core] (wmf/1.37.0-wmf.7) - https://gerrit.wikimedia.org/r/695321 (owner: Ladsgroup)
[22:01:03] (CR) jerkins-bot: [V: -1] resourceloader: Avoid opening a connection to master when not needed [core] (wmf/1.37.0-wmf.7) - https://gerrit.wikimedia.org/r/695321 (owner: Ladsgroup)
[22:04:28] (PS11) Ottomata: Airflow puppetization + airflow@analytics on an-test-coord1001 [puppet] - https://gerrit.wikimedia.org/r/694514 (https://phabricator.wikimedia.org/T272973)
[22:07:48] (Merged) jenkins-bot: resourceloader: Avoid opening a connection to master when not needed [core] (wmf/1.37.0-wmf.6) - https://gerrit.wikimedia.org/r/695320 (owner: Ladsgroup)
[22:08:34] RECOVERY - SSH on mw1279.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:10:08] (PS1) Jforrester: InfoAction: Cast wgNamespaceProtection to array [core] (wmf/1.37.0-wmf.7) - https://gerrit.wikimedia.org/r/695322 (https://phabricator.wikimedia.org/T283751)
[22:10:36] !log ladsgroup@deploy1002 Synchronized php-1.37.0-wmf.6/includes/resourceloader/dependencystore/SqlModuleDependencyStore.php: Backport: [[gerrit:695320|resourceloader: Avoid opening a connection to master when not needed]] (duration: 01m 07s)
[22:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:15:08] (Merged) jenkins-bot: resourceloader: Avoid opening a connection to master when not needed [core] (wmf/1.37.0-wmf.7) - https://gerrit.wikimedia.org/r/695321 (owner: Ladsgroup)
[22:17:20] !log ladsgroup@deploy1002 Synchronized php-1.37.0-wmf.7/includes/resourceloader/dependencystore/SqlModuleDependencyStore.php: Backport: [[gerrit:695321|resourceloader: Avoid opening a connection to master when not needed]] (duration: 01m 06s)
[22:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:25:22] (CR) Dzahn: [V: +1] "checked LDAP from mwmaint1002. UID matches, email matches, key does NOT match which is what we want (separate key cloud vs prod), looks go" [puppet] - https://gerrit.wikimedia.org/r/695371 (https://phabricator.wikimedia.org/T283486) (owner: Marostegui)
[22:32:27] (PS2) Dzahn: data.yaml: Add Kay Wong to analytics-privatedata-users (with kerberos) [puppet] - https://gerrit.wikimedia.org/r/695371 (https://phabricator.wikimedia.org/T283486) (owner: Marostegui)
[22:34:57] (CR) Dzahn: [V: +1 C: +1] "NDA confirmed by KFrancis on ticket. Expiry date added by Diego. Amended to add expiry_date and expiry_contact" [puppet] - https://gerrit.wikimedia.org/r/695371 (https://phabricator.wikimedia.org/T283486) (owner: Marostegui)
[22:40:54] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:42:44] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:42:47] (PS1) Ladsgroup: resourceloader: Avoid primary connection in SqlModuleDependencyStore (2) [core] (wmf/1.37.0-wmf.6) - https://gerrit.wikimedia.org/r/695324
[22:42:54] (CR) Ladsgroup: [C: +2] resourceloader: Avoid primary connection in SqlModuleDependencyStore (2) [core] (wmf/1.37.0-wmf.6) - https://gerrit.wikimedia.org/r/695324 (owner: Ladsgroup)
[22:43:05] (PS1) Ladsgroup: resourceloader: Avoid primary connection in SqlModuleDependencyStore (2) [core] (wmf/1.37.0-wmf.7) - https://gerrit.wikimedia.org/r/695325
[22:43:17] (CR) Ladsgroup: [C: +2] resourceloader: Avoid primary connection in SqlModuleDependencyStore (2) [core] (wmf/1.37.0-wmf.7) - https://gerrit.wikimedia.org/r/695325 (owner: Ladsgroup)
[22:46:40] SRE, DC-Ops, netops: allow mgmt network to access tftp servers for firmware updates - https://phabricator.wikimedia.org/T283771 (Dzahn) As one who worked on the installserver puppet roles in the past I'd say the cons of B aren't so bad. We should be able to reuse existing profiles and just combine th...
[22:49:04] SRE, DC-Ops, netops: allow mgmt network to access tftp servers for firmware updates - https://phabricator.wikimedia.org/T283771 (RobH) >>! In T283771#7117937, @Dzahn wrote: > As one who worked on the installserver puppet roles in the past I'd say the cons of B aren't so bad. We should be able to reus...
[22:56:55] SRE, serviceops: httpd-fcgi image is not using numerical UIDs - https://phabricator.wikimedia.org/T283774 (Legoktm)
[22:57:32] SRE, DC-Ops, netops: allow mgmt network to access tftp servers for firmware updates - https://phabricator.wikimedia.org/T283771 (Dzahn) But wouldn't the dedicated ganeti server need hardware then instead?
[22:59:09] SRE, serviceops: httpd-fcgi image is not using numerical UIDs - https://phabricator.wikimedia.org/T283774 (Legoktm)
[23:00:04] PROBLEM - Maps - OSM synchronization lag - eqiad on alert1001 is CRITICAL: 2.592e+05 ge 2.592e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=11&fullscreen&orgId=1
[23:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Evening backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210526T2300).
[23:00:05] mewoph and stran: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:00:59] (PS1) Cwhite: logstash: add ECS transition support for Oslo structured logs [puppet] - https://gerrit.wikimedia.org/r/695563 (https://phabricator.wikimedia.org/T234565)
[23:01:38] (Merged) jenkins-bot: resourceloader: Avoid primary connection in SqlModuleDependencyStore (2) [core] (wmf/1.37.0-wmf.6) - https://gerrit.wikimedia.org/r/695324 (owner: Ladsgroup)
[23:01:59] SRE, Services, Patch-For-Review, Service-deployment-requests: New Service Request Shellbox - https://phabricator.wikimedia.org/T281423 (Legoktm)
[23:02:02] SRE, serviceops: httpd-fcgi image is not using numerical UIDs - https://phabricator.wikimedia.org/T283774 (Legoktm)
[23:02:23] SRE, DC-Ops, netops: allow mgmt network to access tftp servers for firmware updates - https://phabricator.wikimedia.org/T283771 (RobH) >>! In T283771#7117961, @Dzahn wrote: > But wouldn't the dedicated ganeti server need hardware then instead? My initial task description is stating ganeti instance no...
[23:02:38] SRE, serviceops: httpd-fcgi image is not using numerical UIDs - https://phabricator.wikimedia.org/T283774 (Legoktm)
[23:03:53] !log ladsgroup@deploy1002 Synchronized php-1.37.0-wmf.6/includes/resourceloader/dependencystore/SqlModuleDependencyStore.php: Backport: [[gerrit:695324|resourceloader: Avoid primary connection in SqlModuleDependencyStore (2)]] (duration: 01m 06s)
[23:03:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:04:18] SRE, DC-Ops, netops: allow mgmt network to access tftp servers for firmware updates - https://phabricator.wikimedia.org/T283771 (RobH)
[23:05:06] SRE, DC-Ops, netops: allow mgmt network to access tftp servers for firmware updates - https://phabricator.wikimedia.org/T283771 (Dzahn) ACK, sorry, I read " create a ganeti server" as a server running ganeti that then hosts VMs on it. Yea, so then the biggest part of this would be the "make the mgmt...
[23:06:17] (Merged) jenkins-bot: resourceloader: Avoid primary connection in SqlModuleDependencyStore (2) [core] (wmf/1.37.0-wmf.7) - https://gerrit.wikimedia.org/r/695325 (owner: Ladsgroup)
[23:07:51] !log ladsgroup@deploy1002 Synchronized php-1.37.0-wmf.7/includes/resourceloader/dependencystore/SqlModuleDependencyStore.php: Backport: [[gerrit:695325|resourceloader: Avoid primary connection in SqlModuleDependencyStore (2)]] (duration: 01m 06s)
[23:07:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:08:21] I'm done
[23:09:31] Amir1: Any chance you could also do https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/695459 for us?
[23:09:53] Tchanders: it's labs
[23:10:05] I can just merge it and it'll be automatically there in ten minute-ish
[23:10:26] (CR) Ladsgroup: [C: +2] Disable partial action blocks on beta dewiki instead of eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/695459 (https://phabricator.wikimedia.org/T283716) (owner: STran)
[23:10:56] Great - thank you!
[23:11:55] (Merged) jenkins-bot: Disable partial action blocks on beta dewiki instead of eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/695459 (https://phabricator.wikimedia.org/T283716) (owner: STran)
[23:12:31] Tchanders: now it's rebased on deploy1002, we are all good
[23:12:37] just wait ten minutes
[23:13:05] Thanks
[23:38:11] (PS1) Dzahn: docker_registry_ha: add nginx rewrite for URLs with tags [puppet] - https://gerrit.wikimedia.org/r/695598 (https://phabricator.wikimedia.org/T283764)