[00:04:30] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:07:50] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:11:40] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:28:14] (03PS1) 10BryanDavis: toolhub: add LOGGING_CONSOLE_FORMATTER env var [deployment-charts] - 10https://gerrit.wikimedia.org/r/714656 (https://phabricator.wikimedia.org/T276374) [00:29:29] (03PS1) 10PipelineBot: shellbox-constraints: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/714657 [00:32:16] (03PS1) 10PipelineBot: shellbox: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/714658 [00:35:02] 10Puppet, 10Cloud Services Proposals, 10Cloud-VPS, 10Infrastructure-Foundations, and 3 others: Easing pain points caused by divergence between cloudservices and production puppet usecases - https://phabricator.wikimedia.org/T285539 (10Bstorm) Since I had a random conversation about this in IRC today with @... 
[00:42:31] (03PS1) 10Platonides: Add another deploy message [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/714659 [00:43:33] (03Abandoned) 10Legoktm: shellbox-constraints: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/714657 (owner: 10PipelineBot) [00:43:40] (03CR) 10Legoktm: [C: 03+2] shellbox: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/714658 (owner: 10PipelineBot) [00:46:31] (03Merged) 10jenkins-bot: shellbox: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/714658 (owner: 10PipelineBot) [00:47:20] !log legoktm@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'shellbox' for release 'main' . [00:47:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:50:05] !log legoktm@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'shellbox' for release 'main' . [00:50:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:53:33] !log legoktm@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'shellbox' for release 'main' . [00:53:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:53:40] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:55:34] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:03:48] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:10:56] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:12:50] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:13:20] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:26:50] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:32:34] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:36:18] PROBLEM - MariaDB Replica Lag: s4 on db2097 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1171.42 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [01:38:20] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:40:14] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:47:20] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:55:00] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:59:22] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:00:46] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:01:16] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:06:30] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:10:50] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:12:46] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:37:08] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:40:56] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:46:38] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:48:34] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:56:18] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:58:12] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:00:30] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:02:24] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:13:38] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:19:24] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:48:14] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:50:10] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:54:18] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:56:14] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:01:40] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:03:34] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:16:44] RECOVERY - MariaDB Replica Lag: s4 on db2097 is OK: OK slave_sql_lag Replication lag: 0.31 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:25:24] 10SRE, 10MediaWiki-extensions-Score, 10TestMe: Contrabass MIDI instrument is unusable - https://phabricator.wikimedia.org/T199356 (10Legoktm) We now use fluidsynth, fluid-soundfont-gs, and fluid-soundfont-gm so it seems like this might be solved now, but someone should test it to verify. [04:26:08] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:31:32] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:37:08] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:38:56] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:44:18] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:46:04] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:49:22] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:51:10] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:56:34] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:56:48] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:58:36] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:00:10] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:09:08] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:10:56] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:13:06] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:15:00] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:16:40] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:18:34] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:24:16] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:24:50] (03PS2) 10Kormat: db2121: Promote to s7 primary [puppet] - 10https://gerrit.wikimedia.org/r/713625 (https://phabricator.wikimedia.org/T289129) [05:25:18] (03PS2) 10Kormat: wmnet: Update s7-master to db2121 [dns] - 10https://gerrit.wikimedia.org/r/713626 (https://phabricator.wikimedia.org/T289129) [05:26:05] marostegui14: going to start the prep steps now [05:27:03] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:04:00 on 27 hosts with reason: Primary switchover s7 T289129 [05:27:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:27:08] T289129: Switchover s7 from db2118 to db2121 - https://phabricator.wikimedia.org/T289129 [05:27:24] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:04:00 on 27 hosts with reason: Primary switchover s7 T289129 [05:27:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:27:46] !log kormat@cumin1001 dbctl commit (dc=all): 'Set db2121 with weight 0 T289129', diff saved to Unable to 
send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20210825-052741-kormat.json [05:28:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:28:40] !log Moving s7 codfw replicas under db2121 - T289129 [05:28:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:30:02] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:32:12] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:34:08] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:41:34] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:42:10] (03CR) 10Kormat: [C: 03+2] db2121: Promote to s7 primary [puppet] - 10https://gerrit.wikimedia.org/r/713625 (https://phabricator.wikimedia.org/T289129) (owner: 10Kormat) [05:43:00] marostegui: all prep done. [05:43:30] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:43:50] kormat sweet [05:57:45] jouncebot: now [05:57:45] No deployments scheduled for the next 5 hour(s) and 2 minute(s) [05:57:52] oops. i didn't put in the entry.. 
[05:59:12] We need to add that heartbeat needs to be fixed for orchestrator, to the checklist template [05:59:49] ack [06:00:38] ok, it's too early for my brain to remember how to enter deployments, so i'm going to do that retroactively [06:00:48] !log Starting s7 codfw failover from db2118 to db2121 - T289129 [06:00:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:53] T289129: Switchover s7 from db2118 to db2121 - https://phabricator.wikimedia.org/T289129 [06:00:53] let's go! [06:01:13] !log kormat@cumin1001 dbctl commit (dc=all): 'Set s7 codfw as read-only for maintenance - T289129', diff saved to https://phabricator.wikimedia.org/P17075 and previous config saved to /var/cache/conftool/dbconfig/20210825-060112-kormat.json [06:01:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:24] confirmed RO [06:01:30] ty [06:02:23] !log kormat@cumin1001 dbctl commit (dc=all): 'Promote db2121 to s7 primary and set section read-write T289129', diff saved to https://phabricator.wikimedia.org/P17076 and previous config saved to /var/cache/conftool/dbconfig/20210825-060222-kormat.json [06:02:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:02:31] orchestratortree looks good [06:02:32] exited RO [06:02:44] I can write [06:02:45] frwiki shows an error "Wikimedia\Rdbms\DBReadOnlyError" instead of a page [06:02:54] Not anymore [06:03:08] But it's the first time I've seen this [06:03:16] sobanski: probably worth reporting as a task [06:03:23] Not on edit, when loading [06:03:25] sobanski: it might be related to centralauth being RO [06:03:28] Ah [06:03:37] kormat: want me to fix orchestrator heartbeat? [06:03:53] marostegui: which one did you check? [06:03:59] marostegui: where's the fun in that (i'll get it) [06:03:59] sobanski: eswiki [06:04:07] And that loaded fine? [06:04:11] sobanski: do you have a stack trace for that? [06:04:12] kormat: cool! 
[06:04:44] sobanski: I didn't try reads, only writes [06:04:59] majavah: sadly, no. I just copied the text above and then it went back to a working state [06:05:00] orchestrator look good  [06:05:01] marostegui: heartbeat cleaned up [06:05:10] sobanski: I can try to find that trace later in logstash [06:05:37] if you do and it's related to CA, I would be happy to take a look if we can do anything to it [06:05:53] I can login fine to eswiki btw [06:06:10] majavah: Thanks [06:06:23] kormat: so the script worked fine? [06:06:37] marostegui: no problems today at least 🤞 [06:06:44] (03CR) 10Kormat: [C: 03+2] wmnet: Update s7-master to db2121 [dns] - 10https://gerrit.wikimedia.org/r/713626 (https://phabricator.wikimedia.org/T289129) (owner: 10Kormat) [06:07:42] !log kormat@cumin1001 dbctl commit (dc=all): 'Depool db2118 until it's reimaged to buster T289129', diff saved to https://phabricator.wikimedia.org/P17077 and previous config saved to /var/cache/conftool/dbconfig/20210825-060742-kormat.json [06:07:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:07:46] T289129: Switchover s7 from db2118 to db2121 - https://phabricator.wikimedia.org/T289129 [06:07:51] (Primary inbound port utilisation over 80% #page) firing: Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org [06:07:51] (Primary inbound port utilisation over 80% #page) firing: Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org [06:08:22] Hi [06:08:27] (03CR) 10Volans: [C: 03+1] "Looks sane to me too" [cookbooks] - 10https://gerrit.wikimedia.org/r/713931 (https://phabricator.wikimedia.org/T280221) (owner: 10Gehel) [06:08:27] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2118.codfw.wmnet with reason: Reimaging T288244 [06:08:29] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2118.codfw.wmnet with reason: Reimaging T288244 [06:08:30] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:08:31] T288244: Upgrade s7 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T288244 [06:08:33] looking [06:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:08:43] Phew, bad timing for a page ;) [06:08:48] * volans here too [06:09:04] false positive [06:09:10] It's mr1 again? [06:09:16] same as previous one but in the other direction [06:09:25] I'll fix it too [06:10:33] done [06:10:50] :) [06:12:25] thx [06:17:51] (Primary inbound port utilisation over 80% #page) resolved: Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org [06:17:51] (Primary inbound port utilisation over 80% #page) resolved: Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org [06:19:04] mr1-esams afaics from librenms [06:19:06] majavah: the stack trace I found mentions CentralAuth [06:19:11] Here's the task: https://phabricator.wikimedia.org/T289649 [06:19:31] thx [06:19:54] And we only found out by accident because I meant to check fawiki and not frwiki ;) [06:20:06] Any suggestions on what teams to tag on this? [06:20:17] dear $deity, editing the deployments schedule is confusing AF [06:20:29] jouncebot: refresh [06:20:29] I refreshed my knowledge about deployments. [06:20:32] jouncebot: now [06:20:32] For the next 0 hour(s) and 9 minute(s): Database primary switchover for s7 (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210825T0600) [06:20:49] sobanski: none :( T252244 [06:20:50] T252244: CentralAuth extension: Code stewardship review - https://phabricator.wikimedia.org/T252244 [06:21:09] it's practically maintained by me and a few other volunteers these days [06:21:40] Ah [06:22:08] looks like you tried to visit frwiki for the first time on that account? 
[06:22:54] https://meta.wikimedia.org/wiki/Special:CentralAuth/LSobanski_(WMF) supports that theory too [06:23:06] Ah that would explain it indeed [06:23:20] As that'd use centralauth [06:23:34] yeah, we might be able to fail a bit more gracefully but not fix the real issue [06:24:23] I thought that might be the case as I saw something about account creation [06:24:52] I think there might have been a similar one for one of the other wikis I checked in logstash, maybe jawiki? [06:24:56] But that wasn't me [06:28:17] (03PS1) 10Marostegui: install_server: Reimage db1160 to Buster [puppet] - 10https://gerrit.wikimedia.org/r/714694 (https://phabricator.wikimedia.org/T288803) [06:30:08] (03CR) 10Marostegui: [C: 03+2] install_server: Reimage db1160 to Buster [puppet] - 10https://gerrit.wikimedia.org/r/714694 (https://phabricator.wikimedia.org/T288803) (owner: 10Marostegui) [06:30:52] 10Puppet, 10Infrastructure-Foundations, 10MW-on-K8s, 10Kubernetes, 10Patch-For-Review: Add a fact holding the type of a disk (spinning/ssd) - https://phabricator.wikimedia.org/T288509 (10Volans) It seems to me that looking only at `/sys/block/*/queue/rotational` would create quite large fact for hosts wi... [06:31:38] (03PS1) 10Kormat: dbtools/switchover-tmpl.sh: Phab template generator [software] - 10https://gerrit.wikimedia.org/r/714695 [06:32:25] (03CR) 10Kormat: [C: 03+2] dbtools/switchover-tmpl.sh: Phab template generator [software] - 10https://gerrit.wikimedia.org/r/714695 (owner: 10Kormat) [06:33:57] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:35:17] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:38:19] (03CR) 10Volans: "Some post-merge question/comment inline" [puppet] - 10https://gerrit.wikimedia.org/r/692286 (owner: 10Jbond) [06:42:35] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:43:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1160 for reimage T288803', diff saved to https://phabricator.wikimedia.org/P17078 and previous config saved to /var/cache/conftool/dbconfig/20210825-064319-marostegui.json [06:43:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:43:25] T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803 [06:44:15] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:44:27] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:46:09] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:59:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1160.eqiad.wmnet with reason: REIMAGE [06:59:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:41] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:01:48] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db1160.eqiad.wmnet with reason: REIMAGE [07:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:12] PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 8225399984 and 50198 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [07:06:24] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 4882900848 and 50210 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [07:15:04] RECOVERY - Postgres Replication Lag on maps1008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 160 and 49 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [07:16:14] RECOVERY - Postgres Replication Lag on maps1006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 558344 and 119 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [07:16:46] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:19:06] PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 25981711592 and 50973 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [07:20:44] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:23:10] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:43:51] 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10Jelto) >>! In T251305#7304773, @JMeybohm wrote: > That is pretty cool, thanks! Did you actually deploy something using helm3 with `tillerNamespace:` still set? Is it just ignored in that... [08:00:02] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:01:56] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:02:22] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:04:16] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:07:15] (03CR) 10MMandere: varnish: Containerize varnish test environment (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/713445 (https://phabricator.wikimedia.org/T286639) (owner: 10MMandere) [08:07:40] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:08:15] (03PS5) 10MMandere: varnish: Containerize varnish test environment [puppet] - 10https://gerrit.wikimedia.org/r/713445 (https://phabricator.wikimedia.org/T286639) [08:15:18] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:16:34] (03PS3) 10Jcrespo: dbbackups: Switch s7 backups from stretch (db2100) to buster (db2098) [puppet] - 10https://gerrit.wikimedia.org/r/710981 (https://phabricator.wikimedia.org/T288244) [08:17:38] !log swift codfw add ms-be20[62-65] with initial weight - T288458 [08:17:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:44] T288458: Put ms-be20[62-65] in service - https://phabricator.wikimedia.org/T288458 [08:18:58] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Switch s7 backups from stretch (db2100) to buster (db2098) [puppet] - 10https://gerrit.wikimedia.org/r/710981 (https://phabricator.wikimedia.org/T288244) (owner: 10Jcrespo) [08:19:22] (03CR) 10Filippo Giunchedi: [C: 03+2] profile: remove thanos alerts, moved to alerts.git [puppet] - 10https://gerrit.wikimedia.org/r/714541 (https://phabricator.wikimedia.org/T288726) (owner: 10Filippo Giunchedi) [08:19:28] (03CR) 10Filippo Giunchedi: [C: 03+2] o11y: add alerts ported from icinga/upstream [alerts] - 10https://gerrit.wikimedia.org/r/714543 (https://phabricator.wikimedia.org/T288726) (owner: 10Filippo Giunchedi) [08:19:39] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] o11y: add alerts ported from icinga/upstream [alerts] - 10https://gerrit.wikimedia.org/r/714543 (https://phabricator.wikimedia.org/T288726) (owner: 10Filippo Giunchedi) [08:19:45] (03PS2) 10Filippo Giunchedi: o11y: add alerts ported from icinga/upstream [alerts] - 10https://gerrit.wikimedia.org/r/714543 (https://phabricator.wikimedia.org/T288726) [08:20:28] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] o11y: add alerts ported from icinga/upstream [alerts] - 10https://gerrit.wikimedia.org/r/714543 (https://phabricator.wikimedia.org/T288726) (owner: 10Filippo Giunchedi) [08:23:46] Emperor: FYI I've started the first rebalance for https://phabricator.wikimedia.org/T288458 [08:24:56] PROBLEM - Router interfaces on cr2-esams is CRITICAL: 
CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:25:04] 10SRE-swift-storage, 10MediaWiki-extensions-Score, 10I18n: Add encoding HTML header to LilyPond output - https://phabricator.wikimedia.org/T184871 (10TheDJ) Can #sre-swift-storage assist with this or point to the correct project tag ? [08:26:52] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:30:15] (03CR) 10JMeybohm: "After some discussion and time spend to get them working at all, I removed the spec for this fact again being unable to create a working t" [puppet] - 10https://gerrit.wikimedia.org/r/714572 (https://phabricator.wikimedia.org/T288509) (owner: 10JMeybohm) [08:32:18] 10SRE-swift-storage, 10MediaWiki-extensions-Score, 10I18n: Add Content-Encoding HTTP header to LilyPond file output - https://phabricator.wikimedia.org/T184871 (10TheDJ) [08:33:26] PROBLEM - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 84203077584 and 1436 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [08:33:37] (03PS1) 10Jcrespo: dbbackups: Move s4 backup generation from db1130 to db1150/dbprov1003 [puppet] - 10https://gerrit.wikimedia.org/r/714704 (https://phabricator.wikimedia.org/T288803) [08:35:02] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 90386042648 and 1533 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [08:35:48] PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 93590331392 and 1578 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [08:35:54] 
PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 94335928616 and 1586 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [08:36:23] (03CR) 10Vgutierrez: varnish: Containerize varnish test environment (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/713445 (https://phabricator.wikimedia.org/T286639) (owner: 10MMandere) [08:37:05] 10SRE, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Reduce number of shards in redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mc2033.codfw.wmnet'... [08:37:16] (ThanosSidecarUploadFailure) firing: (6) Thanos Sidecar is not uploading blocks. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org [08:37:52] (ThanosCompactIsDown) firing: (7) Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org [08:37:52] (ThanosQueryIsDown) firing: (6) Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org [08:37:52] (ThanosRuleIsDown) firing: (6) Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org [08:37:56] (ThanosSidecarIsDown) firing: (6) Thanos component has disappeared. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org [08:38:01] (ThanosStoreIsDown) firing: (6) Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org [08:38:10] uh [08:38:26] mhh that's new alerts, I'll check [08:38:56] "new" as in, prometheus-based checks ported from icinga to AM [08:40:08] (03CR) 10JMeybohm: [C: 03+1] envoyproxy: Support ECDH curves configuration [puppet] - 10https://gerrit.wikimedia.org/r/710581 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [08:40:18] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:41:18] 10ops-eqiad, 10decommission-hardware: Decommission mc[1019-1023,1025-1026,1028-1036].eqiad.wmnet (WIP) - https://phabricator.wikimedia.org/T289657 (10jijiki) I will update description when I have performed the service owner actions [08:41:20] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] envoyproxy: Support ECDH curves configuration [puppet] - 10https://gerrit.wikimedia.org/r/710581 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [08:41:57] (03PS7) 10JMeybohm: k8s::apiserver: Add admission controller config file [puppet] - 10https://gerrit.wikimedia.org/r/714071 (https://phabricator.wikimedia.org/T289131) [08:42:12] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:42:13] 10Puppet, 10Cloud Services Proposals, 10Cloud-VPS, 10Infrastructure-Foundations, and 2 others: Audid puppet usage in cloud hosts - 
https://phabricator.wikimedia.org/T289658 (10jbond) [08:42:20] 10Puppet, 10Cloud Services Proposals, 10Cloud-VPS, 10Infrastructure-Foundations, and 2 others: Audid puppet usage in cloud hosts - https://phabricator.wikimedia.org/T289658 (10jbond) p:05Triage→03Medium [08:43:32] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/714572 (https://phabricator.wikimedia.org/T288509) (owner: 10JMeybohm) [08:44:11] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30828/console" [puppet] - 10https://gerrit.wikimedia.org/r/714071 (https://phabricator.wikimedia.org/T289131) (owner: 10JMeybohm) [08:46:56] (03PS7) 10Vgutierrez: envoyproxy: Add STEK configuration support [puppet] - 10https://gerrit.wikimedia.org/r/711399 (https://phabricator.wikimedia.org/T271421) [08:47:16] (ThanosSidecarUploadFailure) firing: (12) Thanos Sidecar is not uploading blocks. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org [08:47:17] (03CR) 10Vgutierrez: envoyproxy: Add STEK configuration support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/711399 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [08:47:52] (ThanosCompactIsDown) firing: (14) Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org [08:47:52] (ThanosQueryIsDown) firing: (13) Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org [08:47:52] (ThanosSidecarIsDown) firing: (12) Thanos component has disappeared. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org [08:47:56] (ThanosRuleIsDown) firing: (13) Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org [08:48:01] (ThanosStoreIsDown) firing: (13) Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org [08:48:59] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 10): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30829/console" [puppet] - 10https://gerrit.wikimedia.org/r/714071 (https://phabricator.wikimedia.org/T289131) (owner: 10JMeybohm) [08:49:11] (03PS1) 10Filippo Giunchedi: o11y: move thanos-absent to global rules [alerts] - 10https://gerrit.wikimedia.org/r/714706 [08:51:37] (03CR) 10Filippo Giunchedi: [C: 03+2] o11y: move thanos-absent to global rules [alerts] - 10https://gerrit.wikimedia.org/r/714706 (owner: 10Filippo Giunchedi) [08:51:51] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30830/console" [puppet] - 10https://gerrit.wikimedia.org/r/711399 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [08:52:20] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:52:52] (ThanosCompactIsDown) firing: (16) Thanos component has disappeared. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org [08:52:52] (ThanosRuleIsDown) firing: (15) Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org [08:52:52] (ThanosQueryIsDown) firing: (15) Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org [08:52:56] (ThanosStoreIsDown) firing: (15) Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org [08:53:22] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:53:30] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] "PCC DIFF 10 is fine, for all hosts apart from kubestagemaster2001 this is the absence of the admission config file." 
[puppet] - 10https://gerrit.wikimedia.org/r/714071 (https://phabricator.wikimedia.org/T289131) (owner: 10JMeybohm) [08:53:33] 10SRE-swift-storage, 10MediaWiki-extensions-Score, 10I18n: Fix mime type and text encoding in Content-Type HTTP header of LilyPond .ly file output - https://phabricator.wikimedia.org/T184871 (10TheDJ) [08:53:45] 10SRE, 10SRE-swift-storage, 10MediaWiki-extensions-Score: upload.wikimedia.org does not set content-encoding headers for Score-generated lilypond files - https://phabricator.wikimedia.org/T287326 (10TheDJ) Somewhat related: {T184871} [08:54:21] 10Puppet, 10Cloud Services Proposals, 10Cloud-VPS, 10Infrastructure-Foundations, and 2 others: Audit usages or the realm variable with a view to drop it - https://phabricator.wikimedia.org/T289661 (10jbond) [08:57:11] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2033.codfw.wmnet with reason: REIMAGE [08:57:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:23] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mc2033.codfw.wmnet with reason: REIMAGE [08:59:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:35] (03CR) 10MMandere: varnish: Containerize varnish test environment (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/713445 (https://phabricator.wikimedia.org/T286639) (owner: 10MMandere) [09:04:19] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:05:19] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:06:25] (03CR) 10Vgutierrez: varnish: Containerize varnish test environment (031 comment) 
[puppet] - 10https://gerrit.wikimedia.org/r/713445 (https://phabricator.wikimedia.org/T286639) (owner: 10MMandere) [09:06:32] (03PS1) 10Filippo Giunchedi: o11y: temp disable thanos sidecar not configured to upload blocks [alerts] - 10https://gerrit.wikimedia.org/r/714708 (https://phabricator.wikimedia.org/T289662) [09:08:55] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:09:29] (03CR) 10Filippo Giunchedi: [C: 03+2] o11y: temp disable thanos sidecar not configured to upload blocks [alerts] - 10https://gerrit.wikimedia.org/r/714708 (https://phabricator.wikimedia.org/T289662) (owner: 10Filippo Giunchedi) [09:09:51] 10SRE, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Reduce number of shards in redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mc2033.codfw.wmnet'] ` and were **ALL** successful. [09:10:09] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:10:59] 10SRE, 10vm-requests: Site: Eqiad - 1 VM request for analytics test cluster - coordinator replica role - https://phabricator.wikimedia.org/T289664 (10BTullis) [09:12:44] 10SRE, 10 Data-Engineering, 10Analytics-Clusters, 10Analytics-Kanban, and 2 others: Site: Eqiad - 1 VM request for analytics test cluster - coordinator replica role - https://phabricator.wikimedia.org/T289664 (10BTullis) p:05Triage→03Medium a:03BTullis [09:12:52] (ThanosCompactIsDown) firing: (16) Thanos component has disappeared. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org [09:12:52] (ThanosQueryIsDown) firing: (15) Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org [09:12:52] (ThanosRuleIsDown) firing: (15) Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org [09:12:56] (ThanosStoreIsDown) firing: (15) Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org [09:13:22] 10SRE, 10 Data-Engineering, 10Analytics-Clusters, 10Analytics-Kanban, and 2 others: Site: Eqiad - 1 VM request for analytics test cluster - coordinator replica role - https://phabricator.wikimedia.org/T289664 (10BTullis) [09:13:51] 10Puppet, 10Cloud Services Proposals, 10Cloud-VPS, 10Infrastructure-Foundations, and 2 others: Normalise hiera default values - https://phabricator.wikimedia.org/T289665 (10jbond) [09:14:06] 10Puppet, 10Cloud Services Proposals, 10Cloud-VPS, 10Infrastructure-Foundations, and 2 others: Normalise hiera default values - https://phabricator.wikimedia.org/T289665 (10jbond) p:05Triage→03Medium [09:14:27] 10Puppet, 10Cloud Services Proposals, 10Cloud-VPS, 10Infrastructure-Foundations, and 2 others: Normalise hiera default values - https://phabricator.wikimedia.org/T289665 (10jbond) [09:14:27] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:14:50] 10Puppet, 10Cloud Services 
Proposals, 10Cloud-VPS, 10Infrastructure-Foundations, and 2 others: Normalise hiera default values - https://phabricator.wikimedia.org/T289665 (10jbond) [09:15:51] 10Puppet, 10Cloud Services Proposals, 10Cloud-VPS, 10Infrastructure-Foundations, and 2 others: Audit usages or the realm variable with a view to drop it - https://phabricator.wikimedia.org/T289661 (10jbond) p:05Triage→03Medium [09:16:01] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:17:16] (ThanosSidecarUploadFailure) firing: (12) Thanos Sidecar is not uploading blocks. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org [09:17:52] (ThanosCompactIsDown) firing: (15) Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org [09:17:52] (ThanosQueryIsDown) firing: (14) Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org [09:17:52] (ThanosRuleIsDown) firing: (14) Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org [09:17:56] (ThanosSidecarIsDown) firing: (12) Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org [09:18:01] (ThanosStoreIsDown) firing: (14) Thanos component has disappeared. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org [09:21:03] (03CR) 10Zfilipin: [C: 03+1] "Why didn't this merge after +2? 😕" [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/713217 (https://phabricator.wikimedia.org/T282237) (owner: 10Sahilgrewalhere) [09:22:52] (ThanosCompactIsDown) resolved: (8) Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org [09:22:52] (ThanosQueryIsDown) resolved: (7) Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org [09:22:52] (ThanosRuleIsDown) resolved: (7) Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org [09:22:56] (ThanosSidecarIsDown) resolved: (6) Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org [09:23:01] (ThanosStoreIsDown) resolved: (7) Thanos component has disappeared. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org [09:26:55] 10Puppet, 10Cloud Services Proposals, 10Cloud-VPS, 10Infrastructure-Foundations, and 2 others: Gather a list of puppet modules shared between cloud and production - https://phabricator.wikimedia.org/T289666 (10jbond) [09:29:16] 10SRE, 10 Data-Engineering, 10Analytics-Clusters, 10Analytics-Kanban, and 2 others: Site: Eqiad - 1 VM request for analytics test cluster - coordinator replica role - https://phabricator.wikimedia.org/T289664 (10BTullis) I realize that this is a bit of a big VM at 32 GB, but I'm not sure that the required... [09:29:32] (03PS1) 10Kosta Harlan: ApiVisualEditorEdit: data-{plugin} is not multi [extensions/VisualEditor] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/714670 (https://phabricator.wikimedia.org/T289652) [09:30:35] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb1012.eqiad.wmnet [09:30:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:38] 10Puppet, 10Cloud Services Proposals, 10Cloud-VPS, 10Infrastructure-Foundations, and 2 others: Gather a list of puppet modules shared between cloud and production - https://phabricator.wikimedia.org/T289666 (10jbond) p:05Triage→03Medium [09:31:52] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:32:46] (03PS1) 10Ladsgroup: Introduce concept of generateHTMLOnEdit() for ContentHandler [core] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/714671 (https://phabricator.wikimedia.org/T285987) [09:33:07] (03PS1) 10Ladsgroup: Introduce concept of generateHTMLOnEdit() for ContentHandler [core] (wmf/1.37.0-wmf.19) - 10https://gerrit.wikimedia.org/r/714672 (https://phabricator.wikimedia.org/T285987) [09:33:16] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:34:39] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [09:34:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:02] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [09:35:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:09] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [09:35:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:19] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb1012.eqiad.wmnet [09:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:38] 10Puppet, 10Cloud Services Proposals, 10Cloud-VPS, 10Infrastructure-Foundations, and 2 others: Add more rspec test to the puppet code - https://phabricator.wikimedia.org/T289668 (10jbond) [09:35:40] (03CR) 10JMeybohm: [C: 03+2] kubernetes: Enable Priority admission plugin [puppet] - 10https://gerrit.wikimedia.org/r/713807 (https://phabricator.wikimedia.org/T289131) (owner: 10JMeybohm) [09:35:59] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[09:36:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:14] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [09:36:19] 10SRE, 10Continuous-Integration-Infrastructure, 10MW-1.37-notes (1.37.0-wmf.15; 2021-07-19), 10Patch-For-Review, 10Release-Engineering-Team (Done by Fri 03 Sep): Have linters/tests results show up as comments in files on gerrit - https://phabricator.wikimedia.org/T209149 (10awight) >>! In T209149#7303060... [09:36:27] 10Puppet, 10Cloud Services Proposals, 10Cloud-VPS, 10Infrastructure-Foundations, and 2 others: Add more rspec test to the puppet code - https://phabricator.wikimedia.org/T289668 (10jbond) p:05Triage→03Medium [09:36:52] (03CR) 10Ladsgroup: [C: 03+2] Introduce concept of generateHTMLOnEdit() for ContentHandler [core] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/714671 (https://phabricator.wikimedia.org/T285987) (owner: 10Ladsgroup) [09:37:16] (ThanosSidecarUploadFailure) resolved: (6) Thanos Sidecar is not uploading blocks. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org [09:41:14] (03CR) 10Jbond: [C: 03+1] "lgtm minor typo" [software/spicerack] - 10https://gerrit.wikimedia.org/r/705720 (owner: 10Volans) [09:44:06] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [09:44:15] (03PS1) 10JMeybohm: kubernetes/staging: Limit use of PriorityClass [puppet] - 10https://gerrit.wikimedia.org/r/714717 (https://phabricator.wikimedia.org/T289131) [09:44:18] (03PS1) 10JMeybohm: kubernetes: Limit use of PriorityClass [puppet] - 10https://gerrit.wikimedia.org/r/714718 (https://phabricator.wikimedia.org/T289131) [09:44:37] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb1011.eqiad.wmnet [09:44:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:47] (03CR) 10JMeybohm: [C: 03+2] kubernetes/staging: Limit use of PriorityClass [puppet] - 10https://gerrit.wikimedia.org/r/714717 (https://phabricator.wikimedia.org/T289131) (owner: 10JMeybohm) [09:46:44] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:47:30] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:48:28] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:49:21] (03PS1) 10Dzahn: admin: remove access for Jim Maddock [puppet] - 10https://gerrit.wikimedia.org/r/714719 [09:49:55] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb1011.eqiad.wmnet [09:49:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:40] (03PS5) 10Volans: Class API: add rollback() method [software/spicerack] - 10https://gerrit.wikimedia.org/r/705720 [09:50:43] (03CR) 10Volans: "done, thx" [software/spicerack] - 10https://gerrit.wikimedia.org/r/705720 (owner: 10Volans) [09:52:21] 10Puppet, 10Cloud Services Proposals, 10Cloud-VPS, 10Infrastructure-Foundations, and 2 others: Audit puppet usage in cloud hosts - https://phabricator.wikimedia.org/T289658 (10Volans) [09:52:40] (03CR) 10Dzahn: [C: 03+2] "expired" [puppet] - 10https://gerrit.wikimedia.org/r/714719 (owner: 10Dzahn) [09:53:04] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:53:12] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [09:53:21] (03PS1) 10Bartosz Dziewoński: Enable topic subscriptions as a beta feature on Wikipedias except enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714720 (https://phabricator.wikimedia.org/T287801) [09:56:43] addshore: hi! 
re: T287741 just to clarify, using grafana for alerts is still supported, but alerts should be directed to AM, or alternatively alerts as prometheus rules deployed to alerts.git [09:56:44] T287741: Convert wikidata-alerts grafana dashboard to AlertManager - https://phabricator.wikimedia.org/T287741 [09:57:28] the latter obviously doesn't work for graphite-based alerts [09:59:35] (03Merged) 10jenkins-bot: Introduce concept of generateHTMLOnEdit() for ContentHandler [core] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/714671 (https://phabricator.wikimedia.org/T285987) (owner: 10Ladsgroup) [09:59:58] !log btullis@cumin1001 START - Cookbook sre.ganeti.makevm for new host an-test-coord1002.eqiad.wmnet [10:00:00] 10SRE, 10 Data-Engineering, 10Analytics-Clusters, 10Analytics-Kanban, and 2 others: Site: Eqiad - 1 VM request for analytics test cluster - coordinator replica role - https://phabricator.wikimedia.org/T289664 (10BTullis) Proceeding with this now. ` btullis@cumin1001:~$ sudo cookbook sre.ganeti.makevm eqiad... 
[10:00:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:50] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [10:01:21] !log - removed jmads from wmf group [10:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:54] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:02:11] (03CR) 10Dzahn: "removed from wmf LDAP group with ldiff created by offboard-user" [puppet] - 10https://gerrit.wikimedia.org/r/714719 (owner: 10Dzahn) [10:03:19] !log ladsgroup@deploy1002 Synchronized php-1.37.0-wmf.20/includes: Backport: [[gerrit:714671|Introduce concept of generateHTMLOnEdit() for ContentHandler (T285987)]] (duration: 02m 17s) [10:03:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:22] T285987: Do not generate full html parser output at the end of Wikibase edit requests - https://phabricator.wikimedia.org/T285987 [10:04:45] (03PS2) 10Bartosz Dziewoński: Enable topic subscriptions as a beta feature on Wikipedias except enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714720 (https://phabricator.wikimedia.org/T287801) [10:04:47] (03PS1) 10Bartosz Dziewoński: Disable upcoming DiscussionTools automatic topic subscriptions for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714721 [10:05:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[10:05:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:15] (03PS1) 10David Caro: nova_fullstack: rephrase log message [puppet] - 10https://gerrit.wikimedia.org/r/714722 (https://phabricator.wikimedia.org/T289663) [10:07:38] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:07:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [10:07:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:21] (03PS1) 10Dzahn: admin: extend access for Benjamin Umeh until 2021-11-30 [puppet] - 10https://gerrit.wikimedia.org/r/714723 [10:09:46] 10SRE, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Reduce number of shards in redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mc2035.codfw.wmnet'... [10:09:50] (03PS2) 10Dzahn: admin: extend access for Benjamin Umeh until 2021-11-30 [puppet] - 10https://gerrit.wikimedia.org/r/714723 [10:10:41] (03PS3) 10Dzahn: admin: extend access for Benjamin Umeh until 2021-11-30 [puppet] - 10https://gerrit.wikimedia.org/r/714723 [10:11:18] godog: ty, i just updated the wording of the ticket [10:11:57] (03CR) 10Dzahn: [C: 03+2] admin: extend access for Benjamin Umeh until 2021-11-30 [puppet] - 10https://gerrit.wikimedia.org/r/714723 (owner: 10Dzahn) [10:12:10] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:12:12] addshore: neato, thank you ! 
[10:12:56] (CR) Volans: [C: +2] Class API: add rollback() method [software/spicerack] - https://gerrit.wikimedia.org/r/705720 (owner: Volans) [10:13:15] (PS1) Kormat: db2118: Disable notifications during reimage. [puppet] - https://gerrit.wikimedia.org/r/714724 (https://phabricator.wikimedia.org/T288244) [10:14:12] (PS1) Kormat: install_server: switch db2118 to buster [puppet] - https://gerrit.wikimedia.org/r/714725 (https://phabricator.wikimedia.org/T288244) [10:14:29] (CR) Kormat: [C: +2] db2118: Disable notifications during reimage. [puppet] - https://gerrit.wikimedia.org/r/714724 (https://phabricator.wikimedia.org/T288244) (owner: Kormat) [10:14:58] (PS1) H.krishna123: bernard: Changes to dashboard, minor fixes [software/bernard] - https://gerrit.wikimedia.org/r/714726 (https://phabricator.wikimedia.org/T289441) [10:15:07] (CR) jerkins-bot: [V: -1] bernard: Changes to dashboard, minor fixes [software/bernard] - https://gerrit.wikimedia.org/r/714726 (https://phabricator.wikimedia.org/T289441) (owner: H.krishna123) [10:16:00] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:16:09] (Abandoned) H.krishna123: bernard: Changes to dashboard, minor fixes [software/bernard] - https://gerrit.wikimedia.org/r/714726 (https://phabricator.wikimedia.org/T289441) (owner: H.krishna123) [10:17:01] SRE, Infrastructure-Foundations, Datacenter-Switchover, User-fgiunchedi: Record traffic flows in and out of eqiad during switchover - https://phabricator.wikimedia.org/T286038 (fgiunchedi) [10:17:16] SRE, Infrastructure-Foundations, Datacenter-Switchover, User-fgiunchedi: Record traffic flows in and out of eqiad during switchover - https://phabricator.wikimedia.org/T286038 (fgiunchedi) [10:17:46] (CR) Marostegui: [C: +1] dbbackups: Move s4 backup generation from db1130 to db1150/dbprov1003 [puppet] - 
https://gerrit.wikimedia.org/r/714704 (https://phabricator.wikimedia.org/T288803) (owner: Jcrespo) [10:17:48] (CR) Ladsgroup: [C: +2] Introduce concept of generateHTMLOnEdit() for ContentHandler [core] (wmf/1.37.0-wmf.19) - https://gerrit.wikimedia.org/r/714672 (https://phabricator.wikimedia.org/T285987) (owner: Ladsgroup) [10:18:01] (CR) Kormat: [C: +2] install_server: switch db2118 to buster [puppet] - https://gerrit.wikimedia.org/r/714725 (https://phabricator.wikimedia.org/T288244) (owner: Kormat) [10:18:42] SRE: bacula restore job waiting on higher jobs - https://phabricator.wikimedia.org/T95705 (jcrespo) @akosiaris Any thoughts on applying T95705#7118417, even if it were "do as you see fit, not managing this anymore" 0:-) [10:19:08] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:21:31] !log rolling out openssl updates [10:21:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:27] (Merged) jenkins-bot: Class API: add rollback() method [software/spicerack] - https://gerrit.wikimedia.org/r/705720 (owner: Volans) [10:23:26] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:27:53] SRE: bacula restore job waiting on higher jobs - https://phabricator.wikimedia.org/T95705 (akosiaris) >>! In T95705#7307855, @jcrespo wrote: > @akosiaris Any thoughts on applying T95705#7118417, even if it were "do as you see fit, not managing this anymore" 0:-) I 've never tested "Allow Mixed Priority" so... 
[10:30:58] (PS1) David Caro: nova_fullstack: Add last error output when timing out puppet check [puppet] - https://gerrit.wikimedia.org/r/714733 (https://phabricator.wikimedia.org/T289663) [10:31:44] (CR) jerkins-bot: [V: -1] nova_fullstack: Add last error output when timing out puppet check [puppet] - https://gerrit.wikimedia.org/r/714733 (https://phabricator.wikimedia.org/T289663) (owner: David Caro) [10:34:26] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:36:47] (PS1) Volans: CHANGELOG: add changelogs for release v0.0.58 [software/spicerack] - https://gerrit.wikimedia.org/r/714734 [10:36:50] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:37:52] (CR) Dzahn: "thank you for the extra effort to remove the tilde files! Let me just take the liberty and turn this from a module into a profile and then" [puppet] - https://gerrit.wikimedia.org/r/714414 (owner: Ahmon Dancy) [10:38:51] (Merged) jenkins-bot: Introduce concept of generateHTMLOnEdit() for ContentHandler [core] (wmf/1.37.0-wmf.19) - https://gerrit.wikimedia.org/r/714672 (https://phabricator.wikimedia.org/T285987) (owner: Ladsgroup) [10:40:30] PROBLEM - Host mc2035 is DOWN: PING CRITICAL - Packet loss = 100% [10:41:36] RECOVERY - Host mc2035 is UP: PING OK - Packet loss = 0%, RTA = 33.12 ms [10:41:59] SRE, serviceops, Patch-For-Review, Performance-Team (Radar), User-jijiki: Reduce number of shards in redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mc2035.codfw.wmnet'] ` and were **ALL** successful. 
[10:43:57] (PS2) David Caro: nova_fullstack: Add last error output when timing out puppet check [puppet] - https://gerrit.wikimedia.org/r/714733 (https://phabricator.wikimedia.org/T289663) [10:44:23] (CR) Volans: [C: +2] CHANGELOG: add changelogs for release v0.0.58 [software/spicerack] - https://gerrit.wikimedia.org/r/714734 (owner: Volans) [10:44:46] (CR) jerkins-bot: [V: -1] nova_fullstack: Add last error output when timing out puppet check [puppet] - https://gerrit.wikimedia.org/r/714733 (https://phabricator.wikimedia.org/T289663) (owner: David Caro) [10:44:48] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:45:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [10:45:21] works on mwdebug, deploying [10:45:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:51] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb2008.codfw.wmnet [10:45:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:08] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:46:22] PROBLEM - PHP opcache health on mwdebug2002 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:46:27] (PS6) Dzahn: Create profile for emacs with disabled backup files, use on releases [puppet] - https://gerrit.wikimedia.org/r/714414 (owner: Ahmon Dancy) [10:46:41] SRE, Platform Engineering, serviceops, Performance-Team (Radar): Phasing out "redis_sessions" MediaWiki cluster and 
away from the memcached cluster - https://phabricator.wikimedia.org/T267581 (jijiki) [10:46:51] SRE, serviceops, Patch-For-Review, Performance-Team (Radar), User-jijiki: Reduce number of shards in redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (jijiki) Open→Resolved [10:46:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [10:46:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:48] !log ladsgroup@deploy1002 Synchronized php-1.37.0-wmf.19/includes/content/ContentHandler.php: Backport: [[gerrit:714672|Introduce concept of generateHTMLOnEdit() for ContentHandler (T285987)]], Part I (duration: 01m 08s) [10:47:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:52] T285987: Do not generate full html parser output at the end of Wikibase edit requests - https://phabricator.wikimedia.org/T285987 [10:47:57] (CR) jerkins-bot: [V: -1] Create profile for emacs with disabled backup files, use on releases [puppet] - https://gerrit.wikimedia.org/r/714414 (owner: Ahmon Dancy) [10:48:04] (PS7) Dzahn: Create profile for emacs with disabled backup files, use on releases [puppet] - https://gerrit.wikimedia.org/r/714414 (owner: Ahmon Dancy) [10:49:16] RECOVERY - PHP opcache health on mwdebug2002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:49:33] !log ladsgroup@deploy1002 Synchronized php-1.37.0-wmf.19/includes/Storage/DerivedPageDataUpdater.php: Backport: [[gerrit:714672|Introduce concept of generateHTMLOnEdit() for ContentHandler (T285987)]], Part II (duration: 01m 04s) [10:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:06] (CR) jerkins-bot: [V: -1] Create profile for emacs with disabled backup files, use on releases [puppet] - https://gerrit.wikimedia.org/r/714414 
(owner: Ahmon Dancy) [10:50:24] (Merged) jenkins-bot: CHANGELOG: add changelogs for release v0.0.58 [software/spicerack] - https://gerrit.wikimedia.org/r/714734 (owner: Volans) [10:53:25] (PS8) Dzahn: Create profile for emacs with disabled backup files, use on releases [puppet] - https://gerrit.wikimedia.org/r/714414 (owner: Ahmon Dancy) [10:54:07] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb2008.codfw.wmnet [10:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:00] (PS1) Ladsgroup: Set EntityHandler::generateHTMLOnEdit to false [extensions/Wikibase] (wmf/1.37.0-wmf.19) - https://gerrit.wikimedia.org/r/714674 (https://phabricator.wikimedia.org/T285987) [10:56:08] (CR) Dzahn: [V: +1 C: +2] "https://puppet-compiler.wmflabs.org/compiler1003/30834/releases2002.codfw.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/714414 (owner: Ahmon Dancy) [10:56:14] (PS1) Ladsgroup: Set EntityHandler::generateHTMLOnEdit to false [extensions/Wikibase] (wmf/1.37.0-wmf.20) - https://gerrit.wikimedia.org/r/714675 (https://phabricator.wikimedia.org/T285987) [10:57:14] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:57:27] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb2010.codfw.wmnet [10:57:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:08] (PS1) Volans: Upstream release v0.0.58 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/714740 [10:58:49] Puppet, Infrastructure-Foundations, MW-on-K8s, Kubernetes, Patch-For-Review: Add a fact holding the type of a disk (spinning/ssd) - https://phabricator.wikimedia.org/T288509 (jbond) > I'm wondering if we should instead either iterate 
or even expand the built-in disks The implmentation JMeyboh... [10:58:58] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:59:52] (CR) Dzahn: "@Ahmon: Done! emacs-nox has been installed on releases2002 by puppet.on releases1002 it just changed the content of the existing 99disable" [puppet] - https://gerrit.wikimedia.org/r/714414 (owner: Ahmon Dancy) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for European mid-day backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210825T1100). [11:00:04] kostajh: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:11] \o [11:00:11] o/ [11:00:15] I can deploy today, but... [11:00:24] kostajh might want to do it himself? [11:00:57] Puppet, Cloud Services Proposals, Cloud-VPS, Infrastructure-Foundations, and 2 others: Audit puppet usage in cloud hosts - https://phabricator.wikimedia.org/T289658 (jbond) >>! In T285539#7306974, @Bstorm wrote: > Since I had a random conversation about this in IRC today with @nskaggs, I thought... [11:01:23] (CR) Dzahn: "> At the very least, I'd configure emacs not to leave tilde files, at least when running as root." [puppet] - https://gerrit.wikimedia.org/r/377721 (owner: Muehlenhoff) [11:01:44] is anyone around who knows how to deploy Mathoid? 
(T289674) [11:01:45] T289674: Deploy new Mathoid version to production - https://phabricator.wikimedia.org/T289674 [11:01:51] I’d be interested in attempting a deployment myself, as long as someone can hold my hand ^^ [11:02:02] (I guess that would be a deployment training, kubernetes version) [11:02:06] hi [11:02:11] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [11:02:26] urbanecm: um, sure, I should probably get some experience doing this. are you around to help out if I get stuck? [11:02:40] kostajh: certainly! [11:02:41] Puppet, Cloud Services Proposals, Cloud-VPS, Infrastructure-Foundations, and 3 others: Easing pain points caused by divergence between cloudservices and production puppet usecases - https://phabricator.wikimedia.org/T285539 (jbond) >>! In T285539#7306974, @Bstorm wrote: > Since I had a random con... [11:02:53] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb2010.codfw.wmnet [11:02:54] let me know if you want to open a screen share or something [11:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:00] * kostajh looks at documentation [11:03:10] urbanecm: ack. I'll see if I can get through it via the docs first [11:03:53] (CR) Dzahn: [C: +1] hieradata: Run httpbb hourly from cumin1001 against an eqiad appserver. [puppet] - https://gerrit.wikimedia.org/r/714646 (https://phabricator.wikimedia.org/T289202) (owner: RLazarus) [11:04:12] kostajh: for that, I recommend https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers :-). 
https://deploy-commands.toolforge.org/bacc/714670 is also a TLDR version with commands to run [11:04:31] (CR) Kosta Harlan: [C: +2] "backport" [extensions/VisualEditor] (wmf/1.37.0-wmf.20) - https://gerrit.wikimedia.org/r/714670 (https://phabricator.wikimedia.org/T289652) (owner: Kosta Harlan) [11:05:06] (CR) Volans: [C: +2] Upstream release v0.0.58 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/714740 (owner: Volans) [11:06:20] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:08:18] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:09:23] urbanecm: I'm looking at https://deploy-commands.toolforge.org/bacc/714670 . So, I'll do the first two blocks of commands, then after the `scap pull`, I can verify on mwdebug1002 (or 2001?) using WikimediaDebug extension? And after that I do the `scap sync-file` command? [11:09:46] (PS1) Dzahn: admin: extend access for Kay Wong until 2021-10-30 [puppet] - https://gerrit.wikimedia.org/r/714742 [11:10:07] kostajh: that is correct. Since MW runs now from codfw, you need to use mwdebug2xxx hosts. 
[11:10:20] (the eqiad ones will work, but they're set to RO) [11:11:04] ok [11:11:34] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [11:13:31] (Merged) jenkins-bot: Upstream release v0.0.58 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/714740 (owner: Volans) [11:13:53] SRE, Analytics, Patch-For-Review: Trash cleanup cron spams on an-test hosts - https://phabricator.wikimedia.org/T286442 (BTullis) OK, in that case I've done the following to clear this bit of cron spam temporarily. ` btullis@an-test-client1001:~$ ls -l /srv/home ls: cannot access '/srv/home': No suc... [11:14:00] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:14:43] (PS2) Dzahn: admin: extend access for Kay Wong until 2021-10-30 [puppet] - https://gerrit.wikimedia.org/r/714742 [11:15:54] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:16:10] (CR) Dzahn: [C: +2] admin: extend access for Kay Wong until 2021-10-30 [puppet] - https://gerrit.wikimedia.org/r/714742 (owner: Dzahn) [11:18:28] !log uploaded spicerack_0.0.58 to apt.wikimedia.org buster-wikimedia,bullseye-wikimedia [11:18:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:10] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [11:20:18] (CR) Dzahn: [C: +1] osm: migrate cron osm_sync_lag to systemd timer [puppet] - https://gerrit.wikimedia.org/r/713087 (https://phabricator.wikimedia.org/T273673) (owner: Zabe) [11:20:51] (Merged) jenkins-bot: ApiVisualEditorEdit: data-{plugin} is not multi [extensions/VisualEditor] (wmf/1.37.0-wmf.20) - https://gerrit.wikimedia.org/r/714670 (https://phabricator.wikimedia.org/T289652) (owner: Kosta Harlan) [11:22:23] SRE: bacula restore job waiting on higher jobs - https://phabricator.wikimedia.org/T95705 (jcrespo) a:jcrespo I am going to try it and test it. CC @LSobanski [11:22:34] SRE: bacula restore job waiting on higher jobs - https://phabricator.wikimedia.org/T95705 (jcrespo) p:Low→Medium [11:24:43] kostajh: i see that the patch merged -- let me know if i can be of any help [11:25:03] urbanecm: I just ran scap pull on mwdebug2001, should I see some output that indicates that the particular file I care about was synced? [11:25:27] I just see started/finished rsync common and started/finished scap-cdb-rebuild [11:25:34] kostajh: no, that's all what you're supposed to see [11:25:36] k [11:28:36] urbanecm: the `logspam-watch` command is showing the errors (just reproducing before switching to mwdebug2001) but I don't see in the logstash dashboard for mwdebug [11:28:48] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [11:28:49] oh right, because i'm not using mwdebug yet [11:28:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[11:28:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:00] * kostajh facepalm [11:29:11] SRE, SRE-Access-Requests: Requesting access to Stat1007 for jmando - https://phabricator.wikimedia.org/T289606 (jcrespo) [11:29:13] happens to all of us :) [11:29:14] (PS1) H.krishna123: bernard: Changes to dashboard, minor fixes [software/bernard] - https://gerrit.wikimedia.org/r/714746 (https://phabricator.wikimedia.org/T289441) [11:29:22] (CR) jerkins-bot: [V: -1] bernard: Changes to dashboard, minor fixes [software/bernard] - https://gerrit.wikimedia.org/r/714746 (https://phabricator.wikimedia.org/T289441) (owner: H.krishna123) [11:29:48] (Abandoned) H.krishna123: bernard: Changes to dashboard, minor fixes [software/bernard] - https://gerrit.wikimedia.org/r/714746 (https://phabricator.wikimedia.org/T289441) (owner: H.krishna123) [11:30:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:36] urbanecm: ok it looks good to me [11:31:11] kostajh: great! so, let's sync? [11:31:16] urbanecm: yep, syncing [11:31:19] (y) [11:32:46] !log kharlan@deploy1002 Synchronized php-1.37.0-wmf.20/extensions/VisualEditor/includes/ApiVisualEditorEdit.php: Backport: [[gerrit:714670|ApiVisualEditorEdit: data-{plugin} is not multi (T289652)]] (duration: 01m 06s) [11:32:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:50] T289652: PHP Warning: json_decode() expects parameter 1 to be string, array given - https://phabricator.wikimedia.org/T289652 [11:32:57] (PS1) H.krishna123: bernard: Changes to dashboard, minor fixes [software/bernard] - https://gerrit.wikimedia.org/r/714748 (https://phabricator.wikimedia.org/T289441) [11:33:06] urbanecm: do i need to do anything else now? 
[11:33:07] (CR) Hnowlan: [C: -1] osm: migrate cron osm_sync_lag to systemd timer (1 comment) [puppet] - https://gerrit.wikimedia.org/r/713087 (https://phabricator.wikimedia.org/T273673) (owner: Zabe) [11:33:14] kostajh: only if you have more patches to sync :) [11:33:58] (PS2) H.krishna123: bernard: Changes to dashboard, add indidiual section data, minor fixes [software/bernard] - https://gerrit.wikimedia.org/r/714748 (https://phabricator.wikimedia.org/T289441) [11:34:00] (CR) JMeybohm: [C: +2] kubernetes: Limit use of PriorityClass [puppet] - https://gerrit.wikimedia.org/r/714718 (https://phabricator.wikimedia.org/T289131) (owner: JMeybohm) [11:34:07] urbanecm: I'm good :) [11:34:12] so then no :) [11:34:19] (CR) H.krishna123: "recheck" [software/bernard] - https://gerrit.wikimedia.org/r/714748 (https://phabricator.wikimedia.org/T289441) (owner: H.krishna123) [11:34:42] urbanecm: do I need to do the `!log EU deploys done` bit? [11:35:39] kostajh: you can, but not all deployers do it. [11:35:55] (CR) jerkins-bot: [V: -1] bernard: Changes to dashboard, add indidiual section data, minor fixes [software/bernard] - https://gerrit.wikimedia.org/r/714748 (https://phabricator.wikimedia.org/T289441) (owner: H.krishna123) [11:35:56] then I'll leave it as is [11:36:01] thanks for your help! 
[11:36:14] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [11:38:42] !log btullis@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host an-test-coord1002.eqiad.wmnet [11:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:12] !log slowly restarting all pods in kube-system namespace in eqiad k8s cluster - T289131 [11:39:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:16] T289131: Enable the Priority admission plugin - https://phabricator.wikimedia.org/T289131 [11:40:06] (CR) JMeybohm: [C: +1] envoyproxy: Add STEK configuration support [puppet] - https://gerrit.wikimedia.org/r/711399 (https://phabricator.wikimedia.org/T271421) (owner: Vgutierrez) [11:40:49] (CR) JMeybohm: [C: +2] "Thanks for your time and help @jbond!" [puppet] - https://gerrit.wikimedia.org/r/714572 (https://phabricator.wikimedia.org/T288509) (owner: JMeybohm) [11:41:51] SRE, Data-Engineering, Analytics-Clusters, Analytics-Kanban, and 2 others: Site: Eqiad - 1 VM request for analytics test cluster - coordinator replica role - https://phabricator.wikimedia.org/T289664 (BTullis) `ganeti1016` was allocated as the primary and `ganeti1017` as the secondary, so this s... 
[11:42:54] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:44:32] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:44:40] (PS1) Btullis: Add replica hadoop coordinator role in the test cluster [puppet] - https://gerrit.wikimedia.org/r/714753 (https://phabricator.wikimedia.org/T287864) [11:44:57] (PS2) Btullis: Add replica hadoop coordinator role in the test cluster [puppet] - https://gerrit.wikimedia.org/r/714753 (https://phabricator.wikimedia.org/T287864) [11:45:18] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [11:47:54] (CR) MMandere: varnish: Containerize varnish test environment (1 comment) [puppet] - https://gerrit.wikimedia.org/r/713445 (https://phabricator.wikimedia.org/T286639) (owner: MMandere) [11:48:46] (PS6) MMandere: varnish: Containerize varnish test environment [puppet] - https://gerrit.wikimedia.org/r/713445 (https://phabricator.wikimedia.org/T286639) [11:48:56] SRE, Services, Patch-For-Review, Service-deployment-requests: New Service Request miscweb - https://phabricator.wikimedia.org/T281538 (Dzahn) all of these happened but have been linked to a wrong ticket by accident: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/713639 - fix whi... 
[11:49:57] Puppet, Cloud Services Proposals, Cloud-VPS, Infrastructure-Foundations, and 2 others: Audit puppet usage in cloud hosts - https://phabricator.wikimedia.org/T289658 (jbond) following on from brokes script i create a small python script to give us a list of used classes ` lang=python,lines=10 #!/... [11:50:08] Puppet, Infrastructure-Foundations, MW-on-K8s, Kubernetes, Patch-For-Review: Add a fact holding the type of a disk (spinning/ssd) - https://phabricator.wikimedia.org/T288509 (JMeybohm) The merged implementation creates a new fact `disk_type` based on built-in `disks` fact, using the same keys... [11:50:11] Puppet, Infrastructure-Foundations, MW-on-K8s, Kubernetes, Patch-For-Review: Add a fact holding the type of a disk (spinning/ssd) - https://phabricator.wikimedia.org/T288509 (JMeybohm) Open→Resolved [11:51:17] Puppet, Infrastructure-Foundations, MW-on-K8s, Kubernetes, Patch-For-Review: Add a fact holding the type of a disk (spinning/ssd) - https://phabricator.wikimedia.org/T288509 (JMeybohm) Oh and FTR: With Puppet 7 this is part of the `disks` fact already: https://puppet.com/docs/puppet/7/core_fa... 
[11:54:10] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:54:28] potentially me, looking [11:55:05] SRE, SRE-Access-Requests: Requesting access to Stat1007 for jmando - https://phabricator.wikimedia.org/T289606 (jcrespo) [11:55:12] (PS1) Dzahn: miscweb: bump staging version to 2021-08-24-074849-production [deployment-charts] - https://gerrit.wikimedia.org/r/714755 (https://phabricator.wikimedia.org/T281538) [11:56:18] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 79, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:57:20] (PS1) Jbond: wmflib::role_hosts: fix typos [puppet] - https://gerrit.wikimedia.org/r/714756 [11:57:41] (CR) Jbond: "fixed typos in" [puppet] - https://gerrit.wikimedia.org/r/692286 (owner: Jbond) [11:59:45] SRE, SRE-Access-Requests: Requesting access to Stat1007 for jmando - https://phabricator.wikimedia.org/T289606 (jcrespo) a:EYener @Jmando You don't seem to have an ssh key defined on wikitech/horizon (WMFCloud). That's ok- it is absolutely not needed for production access, but note that as per what y... 
[12:00:34] SRE, SRE-Access-Requests: Requesting access to Stat1007 for jmando - https://phabricator.wikimedia.org/T289606 (jcrespo) p:Triage→High [12:01:42] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [12:02:02] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:09:51] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [12:13:28] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:13:34] PROBLEM - Host kafka-jumbo1009.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:18:36] RECOVERY - Host kafka-jumbo1009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.07 ms [12:21:50] !log kormat@cumin1001 START - Cookbook sre.dns.netbox [12:21:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:27] !log kormat@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:25:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:16] (CR) Btullis: [V: +2 C: +2] Remove dummy keytabs for decommissioned druid servers [labs/private] - https://gerrit.wikimedia.org/r/714023 (https://phabricator.wikimedia.org/T255148) (owner: Btullis) [12:30:50] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:31:04] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:32:18] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:34:00] RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 1 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [12:37:16] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:44:34] Puppet, Infrastructure-Foundations, User-jbond: should we move $site global to a fact - https://phabricator.wikimedia.org/T289678 (jbond) p:Triage→Medium [12:44:57] (CR) Jbond: "Thanks for the input i have created a task to further discuss $site https://phabricator.wikimedia.org/T289678" [puppet] - https://gerrit.wikimedia.org/r/692286 (owner: Jbond) [12:45:06] (PS1) David Caro: nova_fullstack: try to get the puppet state from a couple places [puppet] - https://gerrit.wikimedia.org/r/714761 (https://phabricator.wikimedia.org/T289663) [12:45:39] (CR) Vgutierrez: [V: +1 C: +2] envoyproxy: Add STEK configuration support [puppet] - https://gerrit.wikimedia.org/r/711399 (https://phabricator.wikimedia.org/T271421) (owner: Vgutierrez) [12:46:11] (CR) jerkins-bot: [V: -1] nova_fullstack: try to get the puppet state from a couple places [puppet] - https://gerrit.wikimedia.org/r/714761 (https://phabricator.wikimedia.org/T289663) (owner: David Caro) [12:48:01] Puppet, 
10Infrastructure-Foundations, 10User-jbond: should we move $site global to a fact - https://phabricator.wikimedia.org/T289678 (10RhinosF1) [12:51:05] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:52:09] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:54:09] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:54:09] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 101, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:54:21] codfw was me again [12:55:16] he did it again! 🎶 [12:56:11] should do calico restarts in smaller batches to lower the risk of icinga catching one :) [12:56:32] (03CR) 10Volans: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/714756 (owner: 10Jbond) [12:56:59] 10SRE, 10Infrastructure-Foundations, 10Datacenter-Switchover, 10User-fgiunchedi: Record traffic flows in and out of eqiad during switchover - https://phabricator.wikimedia.org/T286038 (10fgiunchedi) [13:00:41] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:01:43] 10SRE, 10SRE-Access-Requests: Requesting access to Stat1007 for jmando - https://phabricator.wikimedia.org/T289606 (10EYener) Thank you for the notes, and absolutely approved! @JMando is a 40-hour / week. long-term contractor in a senior level position with a full confidentiality agreement who is a core membe... 
[13:01:53] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:02:58] !log restarted all pods in kube-system namespace in codfw k8s cluster - T289131 [13:03:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:03] T289131: Enable the Priority admission plugin - https://phabricator.wikimedia.org/T289131 [13:07:51] 10Puppet, 10Infrastructure-Foundations, 10netops, 10User-jbond: LLDP: Ganeti hosts dont correctly report lldp_parent - https://phabricator.wikimedia.org/T289679 (10jbond) p:05Triage→03Medium [13:08:33] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:10:36] !log joal@deploy1002 Started deploy [analytics/refinery@7bed213]: Regular analytics weekly train [analytics/refinery@7bed213] [13:10:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:35] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:12:26] 10SRE, 10SRE-Access-Requests: Requesting access to Stat1007 for jmando - https://phabricator.wikimedia.org/T289606 (10jcrespo) a:05EYener→03KFrancis @EYener a simple approve was enough, at least for us SREs :-). Thank you! This is mostly a formality to make sure we check he is who he says he is. Actual ap... 
[13:14:43] (03PS3) 10Zabe: osm: migrate cron osm_sync_lag to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/713087 (https://phabricator.wikimedia.org/T273673) [13:16:00] (03CR) 10Zabe: osm: migrate cron osm_sync_lag to systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/713087 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [13:16:19] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:17:07] PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [13:17:59] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:18:02] Something happened at :15 as it jumped straight up & back down [13:18:16] 10SRE, 10SRE-Access-Requests: Requesting access to Stat1007 for jmando - https://phabricator.wikimedia.org/T289606 (10jcrespo) [13:18:47] RECOVERY - High average GET latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [13:25:26] (03PS1) 10Jbond: labstore::drdb_role fact: update facter implementation to ignore stderr [puppet] - 10https://gerrit.wikimedia.org/r/714762 (https://phabricator.wikimedia.org/T289679) [13:25:36] PROBLEM - etcd request latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [13:25:54] PROBLEM - etcd request latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [13:26:08] well...that's interesting [13:27:51] 10SRE, 10SRE-Access-Requests: Requesting access to Stat1007 for jmando - https://phabricator.wikimedia.org/T289606 (10jcrespo) BTW, @Jmando 'researchers' is a deprecated role, DE team is likely to suggest a different group for your access, probably the other one you are asking for: analytics-privatedata-users... [13:29:56] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "Should be okay to backport." [extensions/Wikibase] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/714675 (https://phabricator.wikimedia.org/T285987) (owner: 10Ladsgroup) [13:30:01] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "Should be okay to backport." 
[extensions/Wikibase] (wmf/1.37.0-wmf.19) - 10https://gerrit.wikimedia.org/r/714674 (https://phabricator.wikimedia.org/T285987) (owner: 10Ladsgroup) [13:31:01] !log joal@deploy1002 Finished deploy [analytics/refinery@7bed213]: Regular analytics weekly train [analytics/refinery@7bed213] (duration: 20m 25s) [13:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:08] RECOVERY - etcd request latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [13:31:34] !log joal@deploy1002 Started deploy [analytics/refinery@7bed213] (thin): Regular analytics weekly train THIN [analytics/refinery@7bed213] [13:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:41] !log joal@deploy1002 Finished deploy [analytics/refinery@7bed213] (thin): Regular analytics weekly train THIN [analytics/refinery@7bed213] (duration: 00m 07s) [13:31:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:04] !log joal@deploy1002 Started deploy [analytics/refinery@7bed213] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@7bed213] [13:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:46] RECOVERY - etcd request latencies on kubestagemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [13:34:23] (03CR) 10Dzahn: [C: 03+2] miscweb: bump staging version to 2021-08-24-074849-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/714755 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [13:37:09] (03Merged) 10jenkins-bot: miscweb: bump staging version to 2021-08-24-074849-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/714755 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [13:37:40] (03CR) 10Volans: "Thanks for the replies!" [puppet] - 10https://gerrit.wikimedia.org/r/692286 (owner: 10Jbond) [13:37:59] !log joal@deploy1002 Finished deploy [analytics/refinery@7bed213] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@7bed213] (duration: 05m 55s) [13:38:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:06] PROBLEM - etcd request latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [13:39:36] PROBLEM - etcd request latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [13:40:34] RECOVERY - etcd request latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [13:41:08] RECOVERY - etcd request latencies on kubestagemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [13:42:12] (03CR) 10Elukey: Add replica hadoop coordinator role in the test cluster (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/714753 (https://phabricator.wikimedia.org/T287864) (owner: 10Btullis) [13:45:32] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:48:54] (03PS1) 10Urbanecm: Deploy Growth features to 44 new Wikipedias in dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714765 (https://phabricator.wikimedia.org/T289680) [13:48:58] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:50:08] (03CR) 10David Caro: [C: 03+1] labstore::drdb_role fact: update facter implementation to ignore stderr [puppet] - 10https://gerrit.wikimedia.org/r/714762 (https://phabricator.wikimedia.org/T289679) (owner: 10Jbond) [13:51:36] !log upgraded spicerack to 0.0.58 on cumin2002 [13:51:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:47] !log volans@cumin2002 START - Cookbook sre.hosts.downtime for 0:05:00 on cumin1001.eqiad.wmnet with reason: apostrophe's test [13:53:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:49] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:05:00 on cumin1001.eqiad.wmnet with reason: apostrophe's test [13:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:30] (03PS3) 10Btullis: Add replica hadoop coordinator role in the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/714753 
(https://phabricator.wikimedia.org/T287864) [13:57:34] (03PS3) 10David Caro: nova_fullstack: Add last error output when timing out puppet check [puppet] - 10https://gerrit.wikimedia.org/r/714733 (https://phabricator.wikimedia.org/T289663) [13:57:36] (03PS2) 10David Caro: nova_fullstack: try to get the puppet state from a couple places [puppet] - 10https://gerrit.wikimedia.org/r/714761 (https://phabricator.wikimedia.org/T289663) [13:58:05] (03CR) 10Btullis: "Great. Thanks for those comments. Edited accordingly." [puppet] - 10https://gerrit.wikimedia.org/r/714753 (https://phabricator.wikimedia.org/T287864) (owner: 10Btullis) [13:58:13] (03PS4) 10Btullis: Add replica hadoop coordinator role in the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/714753 (https://phabricator.wikimedia.org/T287864) [13:58:48] (03CR) 10jerkins-bot: [V: 04-1] nova_fullstack: Add last error output when timing out puppet check [puppet] - 10https://gerrit.wikimedia.org/r/714733 (https://phabricator.wikimedia.org/T289663) (owner: 10David Caro) [13:59:06] !log volans@cumin1001 START - Cookbook sre.hosts.downtime for 0:05:00 on cumin2001.codfw.wmnet with reason: apostrophe's test failure [13:59:07] !log volans@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 0:05:00 on cumin2001.codfw.wmnet with reason: apostrophe's test failure [13:59:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:16] PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code=205 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [14:02:37] 10SRE, 10DNS, 10Traffic: 
DNS entries for WikiLearn dev servers - https://phabricator.wikimedia.org/T289618 (10Asaf) [14:03:10] RECOVERY - High average GET latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [14:04:02] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:04:08] there was a spike a while ago, but not during the past half an hour [14:04:11] is icinga late? [14:04:20] (for mw appservers I mean) [14:05:24] (03PS1) 10Jbond: lldp fact: updated lldp parent fact to fall back to routers [puppet] - 10https://gerrit.wikimedia.org/r/714767 (https://phabricator.wikimedia.org/T289679) [14:05:41] elukey: it alerted for the last spike [14:05:48] At the time [14:06:39] (03PS4) 10David Caro: nova_fullstack: Add last error output when timing out puppet check [puppet] - 10https://gerrit.wikimedia.org/r/714733 (https://phabricator.wikimedia.org/T289663) [14:06:40] RhinosF1: at the time? 
it fired a couple of minutes ago [14:06:41] (03PS3) 10David Caro: nova_fullstack: try to get the puppet state from a couple places [puppet] - 10https://gerrit.wikimedia.org/r/714761 (https://phabricator.wikimedia.org/T289663) [14:07:03] (03CR) 10Jbond: [C: 03+2] wmflib::role_hosts: fix typos [puppet] - 10https://gerrit.wikimedia.org/r/714756 (owner: 10Jbond) [14:07:25] elukey: we had one at 13:17 UTC too [14:07:54] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:08:48] !log klausman@cumin2001 START - Cookbook sre.hosts.reboot-single for host ml-etcd2001.codfw.wmnet [14:08:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:14] !log klausman@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-etcd2001.codfw.wmnet [14:12:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:30] !log klausman@cumin2002 START - Cookbook sre.hosts.reboot-single for host ml-etcd2002.codfw.wmnet [14:15:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:40] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-etcd2002.codfw.wmnet [14:17:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:59] !log klausman@cumin2002 START - Cookbook sre.hosts.reboot-single for host ml-etcd2003.codfw.wmnet [14:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:56] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Spicerack downtime methods fail when the admin reason includes an apostrophe - https://phabricator.wikimedia.org/T288558 (10Volans) 05Open→03Resolved New spicerack release done, I've deployed it to the cumin hosts and tested that I can now... 
[14:19:24] (03PS1) 10Nikerabbit: Rename wgTranslateBlacklist to wgTranslateDisabledTargetLanguages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714770 [14:20:12] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-etcd2003.codfw.wmnet [14:20:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:18] !log Create GrowthExperiments DB tables for wikis listed in P17081 (T289680) [14:20:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:22] T289680: Deploy Growth features to Round 4 wikis - https://phabricator.wikimedia.org/T289680 [14:21:45] !log klausman@cumin2002 START - Cookbook sre.hosts.reboot-single for host ml-serve-ctrl2001.codfw.wmnet [14:21:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:46] !log Apply https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/714765/ at mwmaint2002 temporarily (T289680) [14:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:16] (03CR) 10Andrew Bogott: [C: 03+1] "Thank you for giving attention to this test!" 
[puppet] - 10https://gerrit.wikimedia.org/r/714722 (https://phabricator.wikimedia.org/T289663) (owner: 10David Caro) [14:23:22] !log [urbanecm@mwmaint2002 /srv/mediawiki/php]$ foreachwikiindblist growthexperiments extensions/GrowthExperiments/maintenance/initWikiConfig.php # T289680 # r714765 applied at mwmaint2002 [14:23:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:08] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve-ctrl2001.codfw.wmnet [14:25:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:45] (03CR) 10Andrew Bogott: [C: 03+1] nova_fullstack: Add last error output when timing out puppet check [puppet] - 10https://gerrit.wikimedia.org/r/714733 (https://phabricator.wikimedia.org/T289663) (owner: 10David Caro) [14:26:26] !log klausman@cumin2002 START - Cookbook sre.hosts.reboot-single for host ml-serve-ctrl2002.codfw.wmnet [14:26:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:13] (03PS1) 10Urbanecm: WikiPageConfigWriter: Fix `autopatrol` right name [extensions/GrowthExperiments] (wmf/1.37.0-wmf.19) - 10https://gerrit.wikimedia.org/r/714676 (https://phabricator.wikimedia.org/T288886) [14:28:19] jouncebot: nowandnext [14:28:19] No deployments scheduled for the next 3 hour(s) and 31 minute(s) [14:28:19] In 3 hour(s) and 31 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210825T1800) [14:28:19] In 3 hour(s) and 31 minute(s): Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210825T1800) [14:28:36] (03CR) 10Urbanecm: [C: 03+2] WikiPageConfigWriter: Fix `autopatrol` right name [extensions/GrowthExperiments] (wmf/1.37.0-wmf.19) - 10https://gerrit.wikimedia.org/r/714676 (https://phabricator.wikimedia.org/T288886) (owner: 10Urbanecm) [14:28:38] (03PS1) 10Elukey: kubeflow: change storage-init's AWS_DEFAULT_REGION 
value [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/714773 (https://phabricator.wikimedia.org/T272919) [14:29:48] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve-ctrl2002.codfw.wmnet [14:29:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:04] (03CR) 10Elukey: [C: 03+2] kubeflow: change storage-init's AWS_DEFAULT_REGION value [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/714773 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [14:30:06] (03CR) 10Elukey: [V: 03+2 C: 03+2] kubeflow: change storage-init's AWS_DEFAULT_REGION value [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/714773 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [14:30:13] !log klausman@cumin2002 START - Cookbook sre.hosts.reboot-single for host ml-serve2001.codfw.wmnet [14:30:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:44] !log mwmaint2002: scap pull # clearing temporary config changes [14:32:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:50] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:34:58] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:35:15] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2001.codfw.wmnet [14:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:06] 10SRE, 10MediaWiki-Uploading, 10Traffic, 10serviceops, 10Wikimedia-production-error: Unexpected upload speed to commons - 
https://phabricator.wikimedia.org/T288481 (10Majavah) [14:36:36] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:36:44] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:38:32] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:40:19] !log Run `User::newSystemUser( 'MediaWiki default', ['steal' => true] )` in brwiki shell.php session (T289690) [14:40:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:23] T289690: initWikiConfig.php GrowthExperiments script fatals for brwiki - https://phabricator.wikimedia.org/T289690 [14:42:48] !log [urbanecm@mwmaint2002 /srv/mediawiki/php]$ mwscript extensions/GrowthExperiments/maintenance/initWikiConfig.php --wiki=brwiki # T289690, T289680 [14:42:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:53] T289680: Deploy Growth features to Round 4 wikis - https://phabricator.wikimedia.org/T289680 [14:43:19] 10SRE, 10Infrastructure-Foundations, 10Datacenter-Switchover, 10User-fgiunchedi: Record traffic flows in and out of eqiad during switchover - https://phabricator.wikimedia.org/T286038 (10fgiunchedi) [14:46:10] !log klausman@cumin2002 START - Cookbook sre.hosts.reboot-single for host ml-serve2002.codfw.wmnet [14:46:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:34] (03CR) 10Klausman: [C: 03+1] kubeflow: change storage-init's AWS_DEFAULT_REGION value [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/714773 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [14:48:50] (03CR) 10Urbanecm: 
[C: 03+2] Deploy Growth features to 44 new Wikipedias in dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714765 (https://phabricator.wikimedia.org/T289680) (owner: 10Urbanecm) [14:49:38] (03Merged) 10jenkins-bot: Deploy Growth features to 44 new Wikipedias in dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714765 (https://phabricator.wikimedia.org/T289680) (owner: 10Urbanecm) [14:52:20] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2002.codfw.wmnet [14:52:22] (03Merged) 10jenkins-bot: WikiPageConfigWriter: Fix `autopatrol` right name [extensions/GrowthExperiments] (wmf/1.37.0-wmf.19) - 10https://gerrit.wikimedia.org/r/714676 (https://phabricator.wikimedia.org/T288886) (owner: 10Urbanecm) [14:52:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:31] !log klausman@cumin2002 START - Cookbook sre.hosts.reboot-single for host ml-serve2003.codfw.wmnet [14:52:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:56] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:54:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[14:54:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:21] !log urbanecm@deploy1002 sync-file aborted: 0ccac4b2816f01c4b035aa51cbe4651c715632e0: Deploy Growth features to 44 new Wikipedias in dark mode (T289680) (duration: 00m 01s) [14:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:24] T289680: Deploy Growth features to Round 4 wikis - https://phabricator.wikimedia.org/T289680 [14:55:32] !log urbanecm@deploy1002 Synchronized dblists/growthexperiments.dblist: 0ccac4b2816f01c4b035aa51cbe4651c715632e0: Deploy Growth features to 44 new Wikipedias in dark mode (T289680; 1/3) (duration: 01m 06s) [14:55:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [14:55:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:48] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:56:53] !log urbanecm@deploy1002 Synchronized wmf-config/config/: 0ccac4b2816f01c4b035aa51cbe4651c715632e0: Deploy Growth features to 44 new Wikipedias in dark mode (T289680; 2/3) (duration: 01m 05s) [14:56:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:53] 10SRE, 10Infrastructure-Foundations, 10Datacenter-Switchover, 10User-fgiunchedi: Record traffic flows in and out of eqiad during switchover - https://phabricator.wikimedia.org/T286038 (10fgiunchedi) Perhaps the most surprising result I found so far is kafka plaintext traffic from cp hosts to kafka-main1* (... 
[14:58:29] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2003.codfw.wmnet [14:58:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:09] !log klausman@cumin2002 START - Cookbook sre.hosts.reboot-single for host ml-serve2004.codfw.wmnet [14:59:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:34] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:00:11] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 0ccac4b2816f01c4b035aa51cbe4651c715632e0: Deploy Growth features to 44 new Wikipedias in dark mode (T289680; 3/3) (duration: 01m 06s) [15:00:12] (03PS4) 10Labdajiwa: Set the project namespace and sitename for Javanese Wikipedia and Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/710565 (https://phabricator.wikimedia.org/T287437) [15:00:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:14] T289680: Deploy Growth features to Round 4 wikis - https://phabricator.wikimedia.org/T289680 [15:00:24] 10SRE, 10wikidiff2, 10Community-Tech (CommTech-Sprint-7): Deploy wikidiff2 1.12.0 - https://phabricator.wikimedia.org/T285857 (10ldelench_wmf) Hi @jcrespo , are you the correct person to reach out to for assistance with triaging this task? It looks like you're the contact listed on [[ https://wikitech.wikime... [15:02:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[15:02:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:39] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.19/extensions/GrowthExperiments/includes/Config/WikiPageConfigWriter.php: 0b9ca1e11c1f0397847d4cfc7bc86220b6ebe9f6: WikiPageConfigWriter: Fix `autopatrol` right name (T288886) (duration: 01m 04s) [15:02:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:44] T288886: Community configuration should not save the edit as unpatrolled - https://phabricator.wikimedia.org/T288886 [15:02:48] * urbanecm is done [15:03:38] 10Puppet, 10Cloud Services Proposals, 10Cloud-VPS, 10Infrastructure-Foundations, and 2 others: Audit puppet usage in cloud hosts - https://phabricator.wikimedia.org/T289658 (10bd808) >>! In T289658#7308070, @jbond wrote: > following on from brokes script i create a small python script to give us a list of... [15:03:42] 10SRE, 10Infrastructure-Foundations, 10Datacenter-Switchover, 10User-fgiunchedi: Record traffic flows in and out of eqiad during switchover - https://phabricator.wikimedia.org/T286038 (10fgiunchedi) [15:04:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [15:04:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:07] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2004.codfw.wmnet [15:04:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:32] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): Cookbooks repository: avoid stale code in master branch - https://phabricator.wikimedia.org/T287465 (10Volans) a:03Volans [15:06:32] 10SRE, 10wikidiff2, 10Community-Tech (CommTech-Sprint-7): Deploy wikidiff2 1.12.0 - https://phabricator.wikimedia.org/T285857 (10jcrespo) Hey, @ldelench_wmf I am indeed the first point of contact for SREs this week. 
Both @jijiki and @MoritzMuehlenhoff maybe temporarily unavailable or on vacations. However, i... [15:07:16] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:07:24] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:09:16] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:13:55] I just accidentally brought https://zh-yue.wikipedia.org/ down. Fixing... [15:13:58] 10SRE, 10serviceops, 10wikidiff2, 10Community-Tech (CommTech-Sprint-7): Deploy wikidiff2 1.12.0 - https://phabricator.wikimedia.org/T285857 (10jcrespo) [15:14:51] and zh_yuewiki is back online now. Sorry! [15:15:04] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:16:00] urbanecm: out of curiosity, how? [15:16:26] majavah: by failing to create extension tables. My scripts to automate GE deployment somehow missed zh_yuewiki in that part. 
[15:16:42] !log [urbanecm@mwmaint2002 ~]$ mwscript extensions/WikimediaMaintenance/createExtensionTables.php --wiki=zh_yuewiki growthexperiments # T289680 [15:16:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:46] T289680: Deploy Growth features to Round 4 wikis - https://phabricator.wikimedia.org/T289680 [15:18:56] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:22:30] !log Run `User::newSystemUser( 'MediaWiki default', ['steal' => true] )` in mywiki shell.php session (same issue as T289690) [15:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:35] T289690: initWikiConfig.php GrowthExperiments script fatals for brwiki - https://phabricator.wikimedia.org/T289690 [15:27:40] (03CR) 10MMandere: "Looks good to me, see below suggestion." [puppet] - 10https://gerrit.wikimedia.org/r/711407 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [15:29:00] 10SRE, 10Traffic, 10vm-requests: Please create a Ganeti VM for durum in eqiad - https://phabricator.wikimedia.org/T289693 (10ssingh) [15:29:00] (03PS1) 10Mforns: Finalize Event Platform migration of EchoEmail and EchoInteraction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714794 (https://phabricator.wikimedia.org/T287210) [15:29:49] (03PS1) 10Vgutierrez: cache: Support TLS on kafka::statsv [puppet] - 10https://gerrit.wikimedia.org/r/714795 (https://phabricator.wikimedia.org/T286038) [15:29:51] (03PS1) 10Vgutierrez: hieradata: Enable SSL for statsv varnishkafka@cp4032 [puppet] - 10https://gerrit.wikimedia.org/r/714796 (https://phabricator.wikimedia.org/T286038) [15:31:32] 10SRE, 10serviceops: Clean up old Docker images on deneb - https://phabricator.wikimedia.org/T287222 (10elukey) Cleaned up some old istio/knative/kubeflow images, got down to this: ` Data 
Space Used: 66.39GB Data Space Total: 107.4GB Data Space Available: 40.98GB ` We should try to reduce the images even... [15:32:20] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:34:18] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:34:22] (03PS2) 10Vgutierrez: cache: Support TLS on kafka::statsv [puppet] - 10https://gerrit.wikimedia.org/r/714795 (https://phabricator.wikimedia.org/T286038) [15:34:24] (03PS2) 10Vgutierrez: hieradata: Enable SSL for statsv varnishkafka@cp4032 [puppet] - 10https://gerrit.wikimedia.org/r/714796 (https://phabricator.wikimedia.org/T286038) [15:35:37] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30839/console" [puppet] - 10https://gerrit.wikimedia.org/r/714795 (https://phabricator.wikimedia.org/T286038) (owner: 10Vgutierrez) [15:37:16] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30840/console" [puppet] - 10https://gerrit.wikimedia.org/r/714796 (https://phabricator.wikimedia.org/T286038) (owner: 10Vgutierrez) [15:38:16] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:40:12] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:42:07] (03PS1) 10Lucas Werkmeister (WMDE): Return normalized snaks from SetClaim, SetReference [extensions/Wikibase] (wmf/1.37.0-wmf.20) - 
10https://gerrit.wikimedia.org/r/714677 (https://phabricator.wikimedia.org/T289501) [15:43:21] (03CR) 10Ahmon Dancy: Create profile for emacs with disabled backup files, use on releases (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/714414 (owner: 10Ahmon Dancy) [15:43:32] (03CR) 10Ahmon Dancy: Add emacs-nox to standard packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/377721 (owner: 10Muehlenhoff) [15:44:19] (03PS2) 10Lucas Werkmeister (WMDE): Return normalized snaks from SetClaim, SetReference [extensions/Wikibase] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/714677 (https://phabricator.wikimedia.org/T289501) [15:44:41] (03CR) 10Filippo Giunchedi: [C: 03+1] cache: Support TLS on kafka::statsv (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/714795 (https://phabricator.wikimedia.org/T286038) (owner: 10Vgutierrez) [15:44:46] (03CR) 10Filippo Giunchedi: [C: 03+1] hieradata: Enable SSL for statsv varnishkafka@cp4032 [puppet] - 10https://gerrit.wikimedia.org/r/714796 (https://phabricator.wikimedia.org/T286038) (owner: 10Vgutierrez) [15:45:10] (03CR) 10Elukey: [C: 03+1] cache: Support TLS on kafka::statsv [puppet] - 10https://gerrit.wikimedia.org/r/714795 (https://phabricator.wikimedia.org/T286038) (owner: 10Vgutierrez) [15:45:19] (03CR) 10Elukey: [C: 03+1] hieradata: Enable SSL for statsv varnishkafka@cp4032 [puppet] - 10https://gerrit.wikimedia.org/r/714796 (https://phabricator.wikimedia.org/T286038) (owner: 10Vgutierrez) [15:46:57] (03PS1) 10Volans: wmcs.wikireplicas.add_wiki: rename [cookbooks] - 10https://gerrit.wikimedia.org/r/714797 (https://phabricator.wikimedia.org/T287465) [15:46:59] (03PS1) 10Volans: wmcs: remove wmcs/ subtree [cookbooks] - 10https://gerrit.wikimedia.org/r/714798 (https://phabricator.wikimedia.org/T287465) [15:47:16] (03PS1) 10Volans: admin: update sudo rule for renamed cookbook [puppet] - 10https://gerrit.wikimedia.org/r/714799 (https://phabricator.wikimedia.org/T287465) [15:47:41] (03PS1) 
10Elukey: kubeflow: add override in admin_ng for storage-init [deployment-charts] - 10https://gerrit.wikimedia.org/r/714800 (https://phabricator.wikimedia.org/T272919) [15:51:09] (03CR) 10Elukey: [C: 03+2] kubeflow: add override in admin_ng for storage-init [deployment-charts] - 10https://gerrit.wikimedia.org/r/714800 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [15:51:26] 10SRE, 10SRE-Access-Requests: Requesting access to Stat1007 for jmando - https://phabricator.wikimedia.org/T289606 (10JMando) @jcrespo I spoke with @EYener and it looks like I will need kerberos access. Should I make a separate ticket for that? [15:52:16] (03CR) 10Volans: "It should be merged together with I6b72a01dc50fac2b48e608b4fb121f999ce44f43" [cookbooks] - 10https://gerrit.wikimedia.org/r/714797 (https://phabricator.wikimedia.org/T287465) (owner: 10Volans) [15:53:59] (03CR) 10Volans: "And once merged I'll update https://wikitech.wikimedia.org/wiki/Add_a_wiki accordingly" [cookbooks] - 10https://gerrit.wikimedia.org/r/714797 (https://phabricator.wikimedia.org/T287465) (owner: 10Volans) [15:54:14] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [15:54:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:19] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [15:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:21] (03CR) 10David Caro: [C: 03+1] wmcs: remove wmcs/ subtree [cookbooks] - 10https://gerrit.wikimedia.org/r/714798 (https://phabricator.wikimedia.org/T287465) (owner: 10Volans) [15:55:28] 10SRE, 10SRE-Access-Requests: Requesting access to Stat1007 for jmando - https://phabricator.wikimedia.org/T289606 (10jcrespo) No need for another ticket, just making it explicit on the 'Requested group membership' section on the description above will make me not forget it when deploying the change 0:-) Thank... 
[15:58:16] (03CR) 10David Caro: [C: 03+1] wmcs.wikireplicas.add_wiki: rename (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/714797 (https://phabricator.wikimedia.org/T287465) (owner: 10Volans) [16:02:06] (03CR) 10DLynch: [C: 03+1] "Looks good to me" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714720 (https://phabricator.wikimedia.org/T287801) (owner: 10Bartosz Dziewoński) [16:05:38] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:06:22] (03PS2) 10Legoktm: services_proxy: Add mwapi envoyproxy for MediaWiki-internal requests [puppet] - 10https://gerrit.wikimedia.org/r/714420 (https://phabricator.wikimedia.org/T288848) [16:07:41] (03PS2) 10RLazarus: 08-start-maintenance: Remove cron-specific maintenance implementation details [cookbooks] - 10https://gerrit.wikimedia.org/r/713532 (https://phabricator.wikimedia.org/T289078) [16:07:59] (03CR) 10Legoktm: [C: 03+2] services_proxy: Add mwapi envoyproxy for MediaWiki-internal requests [puppet] - 10https://gerrit.wikimedia.org/r/714420 (https://phabricator.wikimedia.org/T288848) (owner: 10Legoktm) [16:08:54] (03CR) 10RLazarus: [C: 03+2] "Merging now that Spicerack 0.0.58 is released" [cookbooks] - 10https://gerrit.wikimedia.org/r/713532 (https://phabricator.wikimedia.org/T289078) (owner: 10RLazarus) [16:09:06] 10SRE, 10Traffic, 10vm-requests: Please create a Ganeti VM for durum in eqiad - https://phabricator.wikimedia.org/T289693 (10jcrespo) Hey, @ssingh I've seen you have created this task. Recently there was a discussion about [[ https://wikitech.wikimedia.org/w/index.php?title=Phabricator&diff=1892983&oldid=18... 
[16:11:00] (03CR) 10David Caro: [C: 03+2] nova_fullstack: rephrase log message (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/714722 (https://phabricator.wikimedia.org/T289663) (owner: 10David Caro) [16:11:07] (03CR) 10David Caro: [C: 03+2] nova_fullstack: Add last error output when timing out puppet check [puppet] - 10https://gerrit.wikimedia.org/r/714733 (https://phabricator.wikimedia.org/T289663) (owner: 10David Caro) [16:11:35] (03Merged) 10jenkins-bot: 08-start-maintenance: Remove cron-specific maintenance implementation details [cookbooks] - 10https://gerrit.wikimedia.org/r/713532 (https://phabricator.wikimedia.org/T289078) (owner: 10RLazarus) [16:12:35] (03CR) 10Nskaggs: "I'm happy to confirm I can still run it with my reduced permission set once it's merged. Feel free to ping." [cookbooks] - 10https://gerrit.wikimedia.org/r/714797 (https://phabricator.wikimedia.org/T287465) (owner: 10Volans) [16:12:50] (03CR) 10Volans: "reply inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/714797 (https://phabricator.wikimedia.org/T287465) (owner: 10Volans) [16:15:12] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:15:16] (03CR) 10David Caro: [C: 03+1] wmcs.wikireplicas.add_wiki: rename (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/714797 (https://phabricator.wikimedia.org/T287465) (owner: 10Volans) [16:17:45] 10SRE-swift-storage, 10Beta-Cluster-Infrastructure: swift-ring: Add support for Cinder based Cloud VPS VMs - https://phabricator.wikimedia.org/T281699 (10fgiunchedi) FWIW for Pontoon I've developed a solution to emulate block devices via loop files, in a way that resembles production. It can be activated in pu... 
[16:18:44] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:22:51] 10SRE, 10Traffic, 10vm-requests: Please create a Ganeti VM for Wikidough in ulsfo - https://phabricator.wikimedia.org/T284349 (10jcrespo) This seems resolved based on T284349#7144169 ? [16:24:34] 10SRE, 10Traffic, 10vm-requests: Please create a Ganeti VM for Wikidough in ulsfo - https://phabricator.wikimedia.org/T284349 (10ssingh) 05Open→03Resolved Yes please, this is resolved. Thanks! [16:26:05] (03PS7) 10Vgutierrez: cache: Provide an envoy STEK manager script [puppet] - 10https://gerrit.wikimedia.org/r/711407 (https://phabricator.wikimedia.org/T271421) [16:26:26] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:27:43] (03CR) 10Vgutierrez: cache: Provide an envoy STEK manager script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/711407 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [16:29:29] 10SRE, 10Traffic, 10vm-requests: Please create a Ganeti VM for durum in eqiad - https://phabricator.wikimedia.org/T289693 (10ssingh) >>! In T289693#7308865, @jcrespo wrote: > Hey, > > @ssingh I've seen you have created this task. Recently there was a discussion about [[ https://wikitech.wikimedia.org/w/inde... 
[16:31:45] (03CR) 10Abijeet Patro: [C: 03+1] Rename wgTranslateBlacklist to wgTranslateDisabledTargetLanguages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714770 (owner: 10Nikerabbit) [16:32:16] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:32:47] 10SRE, 10Traffic, 10vm-requests: Please create a Ganeti VM for durum in eqiad - https://phabricator.wikimedia.org/T289693 (10jcrespo) No protocol breaking- I just asked in case you needed help, as I happen to be on clinic duty this week. I am also offering my help to do this together, I am not @Dzahn, but I... [16:33:04] 10SRE, 10Traffic, 10vm-requests: Please create a Ganeti VM for durum in eqiad - https://phabricator.wikimedia.org/T289693 (10jcrespo) p:05Triage→03High [16:34:12] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:36:21] (03PS1) 10Gergő Tisza: Fix reference to renamed abortAllApiRequests method [extensions/Flow] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/714678 (https://phabricator.wikimedia.org/T289648) [16:41:28] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): cloud cumin: exclude certain projects from "A:all" - https://phabricator.wikimedia.org/T289706 (10Andrew) [16:43:26] jouncebot: now [16:43:26] No deployments scheduled for the next 1 hour(s) and 16 minute(s) [16:43:45] I’ll deploy some Wikibase backports with Amir1 (+2ing them now, will take ca 20 minutes in CI) [16:44:34] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "let’s start with this backport" [extensions/Wikibase] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/714675 (https://phabricator.wikimedia.org/T285987) (owner: 10Ladsgroup) 
[16:45:38] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:45:47] (03CR) 10Alexandros Kosiaris: [C: 03+1] lldp fact: updated lldp parent fact to fall back to routers [puppet] - 10https://gerrit.wikimedia.org/r/714767 (https://phabricator.wikimedia.org/T289679) (owner: 10Jbond) [16:47:04] (03PS3) 10Lucas Werkmeister (WMDE): Return normalized snaks from SetClaim, SetReference [extensions/Wikibase] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/714677 (https://phabricator.wikimedia.org/T289501) [16:51:24] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:53:57] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): cloud cumin: exclude certain projects from "A:all" - https://phabricator.wikimedia.org/T289706 (10Volans) FYI the alias is defined in `hieradata/eqiad/profile/openstack/eqiad1/cumin.yaml` and does already exclude a project: ` all: '... [16:57:37] 10SRE, 10Traffic, 10vm-requests: Please create a Ganeti VM for durum in eqiad - https://phabricator.wikimedia.org/T289693 (10jcrespo) I talked to @ssingh, he told me he planned to ask for review to @Dzahn so I will leave it to you two :-). [16:58:32] I arrived, setting up now [17:02:27] \o/ [17:02:50] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "let’s kick off gate-and-submit, this should be good to go by the time it merges" [extensions/Wikibase] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/714677 (https://phabricator.wikimedia.org/T289501) (owner: 10Lucas Werkmeister (WMDE)) [17:06:02] Lucas_WMDE: did you mean the wmf.19 one? [17:06:13] no? 
[17:06:18] I thought you’d want to deploy that last [17:06:22] 10SRE, 10serviceops, 10wikidiff2, 10Community-Tech (CommTech-Sprint-7): Deploy wikidiff2 1.12.0 - https://phabricator.wikimedia.org/T285857 (10ldelench_wmf) Great, thank you for your help connecting us with serviceops @jcrespo ! @Daimona is the engineer leading this project; he will be better equipped than... [17:06:31] yeah but +2'ed the wmf.20 twice [17:06:39] at least from the one I see in IRC [17:06:41] no, I +2ed your wmf.20 and then mine, I think? [17:07:05] aha I see two different patches [17:07:20] yup [17:07:20] (03Merged) 10jenkins-bot: Set EntityHandler::generateHTMLOnEdit to false [extensions/Wikibase] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/714675 (https://phabricator.wikimedia.org/T285987) (owner: 10Ladsgroup) [17:07:28] ayyy [17:07:31] yay [17:07:33] do you want to deploy? [17:07:43] yup [17:08:32] okay, it's on mwdebug2002 for wmf.20 [17:08:42] let's test [17:10:34] done some tests on mwdebug2002 on test wikidata [17:10:38] so far looks good [17:10:45] let me check xhgui [17:10:53] 10SRE, 10serviceops, 10wikidiff2, 10Community-Tech (CommTech-Sprint-7): Deploy wikidiff2 1.12.0 - https://phabricator.wikimedia.org/T285857 (10jcrespo) Don't worry, pings have been sent to the right people- and the manager confirmed someone will get back to you soon. Please ping me again if it doesn't happ... [17:11:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [17:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:55] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: ASAP) rack/setup/install ms-be10[64-67] - https://phabricator.wikimedia.org/T285808 (10Cmjohnson) @wiki_willy looking for racking space for these big and very heavy servers. A4 u17/u18 B4 U2/U3 (U3 has maps1002 that looks to be off and ready for decom but I do not h... 
[17:12:28] yup, all good [17:12:46] no parsing has been triggered (except one for a message) [17:12:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [17:12:55] cool [17:12:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:25] !log ladsgroup@deploy1002 Synchronized php-1.37.0-wmf.20/extensions/Wikibase/repo/includes/Content/EntityHandler.php: Backport: [[gerrit:714675|Set EntityHandler::generateHTMLOnEdit to false (T285987)]] (duration: 01m 18s) [17:14:26] okay deployed, do you want to +2 the wmf.19 one? it'll take twenty minutes to finish. i.e. what is your plan? [17:14:26] !log T289483 Depooled `wdqs1013` [17:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:29] T285987: Do not generate full html parser output at the end of Wikibase edit requests - https://phabricator.wikimedia.org/T285987 [17:14:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:32] T289483: asw2-c-eqiad:ge-5/0/39 - wdqs1013 - Inbound interface errors - https://phabricator.wikimedia.org/T289483 [17:14:40] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:17:41] sorry, didn’t pay attention for a minute [17:17:49] if you think the wmf.19 is ready to go as well then I can +2 it [17:17:54] I wasn’t sure how much time you wanted to leave between them [17:18:30] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:18:57] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Set EntityHandler::generateHTMLOnEdit to false [extensions/Wikibase] (wmf/1.37.0-wmf.19) - 10https://gerrit.wikimedia.org/r/714674 (https://phabricator.wikimedia.org/T285987) (owner: 10Ladsgroup) [17:20:34] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:20:54] 10SRE, 10ops-eqiad: asw2-c-eqiad:ge-5/0/39 - wdqs1013 - Inbound interface errors - https://phabricator.wikimedia.org/T289483 (10Cmjohnson) 05Open→03Resolved swapped the cable [17:21:37] 10ops-eqiad, 10DC-Ops, 10decommission-hardware, 10observability: decommission logstash102[012] - https://phabricator.wikimedia.org/T283507 (10wiki_willy) Hi @herron - just a heads up to add "ops-eqiad" as a project task, when this is ready for dc-ops to unrack. Much appreciated! Thanks, Willy [17:21:37] I wanted to test properly [17:21:41] I just tested it [17:21:48] 10ops-eqiad, 10Maps, 10decommission-hardware, 10Platform Team Workboards (Platform Engineering Reliability), 10Product-Infrastructure-Team-Backlog (Kanban): Decommission maps1002 - https://phabricator.wikimedia.org/T289271 (10Cmjohnson) [17:22:06] I'm sure it works fine e.g. 
a visit triggers a render, meaning the parser cache entry didn't exist [17:22:08] https://performance.wikimedia.org/xhgui/run/symbol?id=61267b4ed6eae4b45cae14b8&symbol=MediaWiki%5CRevision%5CRenderedRevision%3A%3AgetRevisionParserOutput [17:22:51] alright [17:23:11] hm, what if the render on visit comes from a lagged replica? [17:23:29] I waited :D [17:23:37] (03Merged) 10jenkins-bot: Return normalized snaks from SetClaim, SetReference [extensions/Wikibase] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/714677 (https://phabricator.wikimedia.org/T289501) (owner: 10Lucas Werkmeister (WMDE)) [17:23:55] alright, I’ll test this [17:23:56] 10ops-eqiad, 10DC-Ops, 10Dumps-Generation, 10decommission-hardware: decommission snapshot100[5,6,7].eqiad.wmnet - https://phabricator.wikimedia.org/T282078 (10wiki_willy) Hi @ArielGlenn - just a heads up to add "ops-eqiad" as a project task, when this is ready for dc-ops to unrack. Much appreciated! Than... [17:24:25] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [17:24:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:28] (03PS1) 10Jforrester: Remove call to text() on string. [skins/WikimediaApiPortal] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/714679 (https://phabricator.wikimedia.org/T289692) [17:24:29] pulled to mwdebug2001, testing [17:24:49] 10ops-eqiad, 10DC-Ops, 10decommission-hardware, 10SRE Observability (FY2021/2022-Q1): reclaim icinga1001.wikimedia.org - https://phabricator.wikimedia.org/T279601 (10wiki_willy) Hi @colewhite - just a heads up to add "ops-eqiad" as a project task, when this is ready for dc-ops to unrack. Much appreciated!... 
[17:25:13] it works \o/ let’s sync [17:25:19] 10ops-eqiad, 10Maps, 10decommission-hardware, 10Platform Team Workboards (Platform Engineering Reliability), 10Product-Infrastructure-Team-Backlog (Kanban): Decommission maps1002 - https://phabricator.wikimedia.org/T289271 (10Cmjohnson) [17:25:25] 10ops-eqiad, 10Maps, 10decommission-hardware, 10Platform Team Workboards (Platform Engineering Reliability), 10Product-Infrastructure-Team-Backlog (Kanban): Decommission maps1002 - https://phabricator.wikimedia.org/T289271 (10Cmjohnson) 05Open→03Resolved a:03Cmjohnson [17:25:27] I’ll sync all of Wikibase at once because the two files don’t depend on each other [17:26:22] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:27:13] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.37.0-wmf.20/extensions/Wikibase: Backport: [[gerrit:714677|Return normalized snaks from SetClaim, SetReference (T289501)]] (duration: 01m 11s) [17:27:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:18] T289501: wbsetclaim and wbsetreference return unnormalized data - https://phabricator.wikimedia.org/T289501 [17:27:51] alright, now waiting for wmf.19 [17:29:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware, and 2 others: decommission mwlog1001 - https://phabricator.wikimedia.org/T282575 (10wiki_willy) Hi @herron - just a heads up to add "ops-eqiad" as a project tag, when this is ready for dc-ops to unrack. Thanks, Willy [17:29:21] 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission druid1001.eqiad.wmnet - https://phabricator.wikimedia.org/T289339 (10wiki_willy) Hi @BTullis - just a heads up to add "ops-eqiad" as a project tag, when this is ready for dc-ops to unrack. 
Thanks, Willy [17:29:46] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:29:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [17:30:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:08] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10decommission-hardware: decommission bast4002.wikimedia.org - https://phabricator.wikimedia.org/T288579 (10wiki_willy) a:05Jclark-ctr→03RobH Hi @ssingh - just a heads up to add "ops-ulsfo" as a project tag, when this is ready for dc-ops to unrack. Thanks, Willy [17:31:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [17:31:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:16] 10ops-eqiad, 10Maps, 10decommission-hardware, 10Platform Team Workboards (Platform Engineering Reliability), 10Product-Infrastructure-Team-Backlog (Kanban): Decommission maps1001 - https://phabricator.wikimedia.org/T289270 (10hnowlan) [17:33:29] 10ops-eqiad, 10Maps, 10decommission-hardware, 10Platform Team Workboards (Platform Engineering Reliability), 10Product-Infrastructure-Team-Backlog (Kanban): Decommission maps1003 - https://phabricator.wikimedia.org/T289272 (10hnowlan) [17:33:42] 10ops-eqiad, 10Maps, 10decommission-hardware, 10Platform Team Workboards (Platform Engineering Reliability), 10Product-Infrastructure-Team-Backlog (Kanban): Decommission maps1004.eqiad.wmnet - https://phabricator.wikimedia.org/T289269 (10hnowlan) [17:34:47] (03CR) 10Ahmon Dancy: "I can deploy during the morning backport window." 
[skins/WikimediaApiPortal] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/714679 (https://phabricator.wikimedia.org/T289692) (owner: 10Jforrester) [17:36:02] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:38:18] 10SRE, 10DNS, 10Traffic: DNS entries for WikiLearn dev servers - https://phabricator.wikimedia.org/T289618 (10jcrespo) Adding @Brandon @vgutierrez as sometimes notifications on newly created tickets are a bit unreliable. I am on clinic duty this week, so trivial patches is something I am happy to help with,... [17:38:52] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [17:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:15] 10ops-eqiad, 10Maps, 10decommission-hardware, 10Platform Team Workboards (Platform Engineering Reliability), 10Product-Infrastructure-Team-Backlog (Kanban): Decommission maps1004.eqiad.wmnet - https://phabricator.wikimedia.org/T289269 (10Cmjohnson) a:03Cmjohnson [17:39:37] 10ops-eqiad, 10Maps, 10decommission-hardware, 10Platform Team Workboards (Platform Engineering Reliability), 10Product-Infrastructure-Team-Backlog (Kanban): Decommission maps1004.eqiad.wmnet - https://phabricator.wikimedia.org/T289269 (10Cmjohnson) [17:39:51] 10ops-eqiad, 10Maps, 10decommission-hardware, 10Platform Team Workboards (Platform Engineering Reliability), 10Product-Infrastructure-Team-Backlog (Kanban): Decommission maps1004.eqiad.wmnet - https://phabricator.wikimedia.org/T289269 (10Cmjohnson) 05Open→03Resolved [17:39:54] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:40:11] Lucas_WMDE: I have to leave rn [17:40:15] can you deploy the last bit? 
[17:40:18] ok [17:40:19] 10ops-eqiad, 10Maps, 10decommission-hardware, 10Platform Team Workboards (Platform Engineering Reliability), 10Product-Infrastructure-Team-Backlog (Kanban): Decommission maps1003 - https://phabricator.wikimedia.org/T289272 (10Cmjohnson) [17:40:27] 10ops-eqiad, 10Maps, 10decommission-hardware, 10Platform Team Workboards (Platform Engineering Reliability), 10Product-Infrastructure-Team-Backlog (Kanban): Decommission maps1003 - https://phabricator.wikimedia.org/T289272 (10Cmjohnson) a:03Cmjohnson [17:40:30] Thanks [17:40:33] without testing in xhgui? [17:40:35] 10ops-eqiad, 10Maps, 10decommission-hardware, 10Platform Team Workboards (Platform Engineering Reliability), 10Product-Infrastructure-Team-Backlog (Kanban): Decommission maps1003 - https://phabricator.wikimedia.org/T289272 (10Cmjohnson) 05Open→03Resolved [17:40:39] I’ll just quickly test that editing works at all [17:40:51] 10ops-eqiad, 10Maps, 10decommission-hardware, 10Platform Team Workboards (Platform Engineering Reliability), 10Product-Infrastructure-Team-Backlog (Kanban): Decommission maps1001 - https://phabricator.wikimedia.org/T289270 (10Cmjohnson) [17:40:56] 10ops-eqiad, 10Maps, 10decommission-hardware, 10Platform Team Workboards (Platform Engineering Reliability), 10Product-Infrastructure-Team-Backlog (Kanban): Decommission maps1001 - https://phabricator.wikimedia.org/T289270 (10Cmjohnson) a:03Cmjohnson [17:41:03] 10ops-eqiad, 10Maps, 10decommission-hardware, 10Platform Team Workboards (Platform Engineering Reliability), 10Product-Infrastructure-Team-Backlog (Kanban): Decommission maps1001 - https://phabricator.wikimedia.org/T289270 (10Cmjohnson) 05Open→03Resolved [17:42:34] (03Merged) 10jenkins-bot: Set EntityHandler::generateHTMLOnEdit to false [extensions/Wikibase] (wmf/1.37.0-wmf.19) - 10https://gerrit.wikimedia.org/r/714674 (https://phabricator.wikimedia.org/T285987) (owner: 10Ladsgroup) [17:42:45] alright, let’s test ^ [17:43:18] pulled to mwdebug2001, testing… 
[17:44:25] everything working as far as I can tell [17:45:29] syncing [17:45:38] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:45:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:52] I'm around with phone [17:46:06] Aka emotional support [17:46:11] ^^ [17:46:34] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.37.0-wmf.19/extensions/Wikibase/repo/includes/Content/EntityHandler.php: Backport: [[gerrit:714674|Set EntityHandler::generateHTMLOnEdit to false (T285987)]] (duration: 01m 06s) [17:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:38] T285987: Do not generate full html parser output at the end of Wikibase edit requests - https://phabricator.wikimedia.org/T285987 [17:46:56] the Wikibase ParserOutputGenerator Grafana board only has 10-minute resolution apparently :/ [17:47:28] let’s see if wbeditentity execution time changes https://grafana.wikimedia.org/d/000000559/api-requests-breakdown?refresh=5m&orgId=1&from=now-2h&to=now&var-metric=p95&var-module=wbeditentity [17:48:01] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [17:48:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:15] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10decommission-hardware: decommission bast4002.wikimedia.org - https://phabricator.wikimedia.org/T288579 (10RobH) [17:48:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission druid1001.eqiad.wmnet - https://phabricator.wikimedia.org/T289339 (10Cmjohnson) [17:48:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[17:48:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission druid1001.eqiad.wmnet - https://phabricator.wikimedia.org/T289339 (10Cmjohnson) 05Open→03Resolved [17:49:35] 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission druid1002.eqiad.wmnet - https://phabricator.wikimedia.org/T288744 (10wiki_willy) Hi @BTullis - just following up. to see if we can proceed with the dc-ops steps, since "remove all remaining puppet references and all host entries in the puppet repo"... [17:50:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [17:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:32] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=rails site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:52:32] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:52:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:16] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:53:16] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [17:53:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic, 10decommission-hardware: decommission cescout1001.eqiad.wmnet - https://phabricator.wikimedia.org/T275696 (10wiki_willy) a:05Jclark-ctr→03Cmjohnson Hi @ssingh - just a heads up to add the "ops-eqiad" project tag, when its ready for the dc-ops steps. Thanks, W... 
[17:53:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware, and 2 others: decommission mwlog1001 - https://phabricator.wikimedia.org/T282575 (10Cmjohnson) [17:53:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware, and 2 others: decommission mwlog1001 - https://phabricator.wikimedia.org/T282575 (10Cmjohnson) 05Open→03Resolved [17:56:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic, 10decommission-hardware: decommission cescout1001.eqiad.wmnet - https://phabricator.wikimedia.org/T275696 (10ssingh) Thanks, Willy! I will make sure to do it in the future. [17:56:38] 10SRE: bacula restore job waiting on higher jobs - https://phabricator.wikimedia.org/T95705 (10jcrespo) If Allow Mixed Priority wouldn't work, we could alternative set the default restore priority (RestoreFiles job) to be 10, like the regular backups. Having the same priority, counterintuitively, would lead to s... [17:56:39] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:56:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:49] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [17:56:50] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:56:50] 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission druid1002.eqiad.wmnet - https://phabricator.wikimedia.org/T288744 (10Cmjohnson) [17:56:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:09] everything looks fine after those Wikibase backports as far as I can tell, so I’m going to sign off [17:57:17] 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission druid1002.eqiad.wmnet - https://phabricator.wikimedia.org/T288744 (10Cmjohnson) 05Open→03Resolved [17:57:19] contact my emotional support Amir if anything is wrong ;) [17:57:38] (or find my phone number in my home dir on the deployment server ^^) [17:57:59] look what's happening https://usercontent.irccloud-cdn.com/file/uLUsH5WQ/Screenshot_20210825-195544_Firefox.jpg [17:58:13] kormat: ^ [17:58:16] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:58:58] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:59:19] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:59:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic, 10decommission-hardware: reclaim cescout1001.eqiad.wmnet - https://phabricator.wikimedia.org/T275696 (10Cmjohnson) [18:00:05] dancy and brennen: #bothumor My software never has bugs. It just develops random features. Rise for Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210825T1800). 
[18:00:05] RoanKattouw, Niharika, and Urbanecm: #bothumor I � Unicode. All rise for Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210825T1800). [18:00:05] MatmaRex, Arlolra, tgr, and dancy: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:12] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10decommission-hardware: decommission bast4002.wikimedia.org - https://phabricator.wikimedia.org/T288579 (10RobH) So I'm not sure if we want to power this off and not use it at all, or re-allocate it as another service/host in ulsfo. My first thought was potentia... [18:00:15] I can deploy today [18:00:27] hello [18:00:27] o/ [18:00:38] (03CR) 10Urbanecm: [C: 03+2] Fix reference to renamed abortAllApiRequests method [extensions/Flow] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/714678 (https://phabricator.wikimedia.org/T289648) (owner: 10Gergő Tisza) [18:00:44] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:00:49] (03PS3) 10Urbanecm: Enable topic subscriptions as a beta feature on Wikipedias except enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714720 (https://phabricator.wikimedia.org/T287801) (owner: 10Bartosz Dziewoński) [18:00:56] Amir1: 😮 [18:01:01] (03CR) 10Urbanecm: [C: 03+2] Enable topic subscriptions as a beta feature on Wikipedias except enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714720 (https://phabricator.wikimedia.org/T287801) (owner: 10Bartosz Dziewoński) [18:01:05] (03PS2) 10Urbanecm: Disable upcoming DiscussionTools automatic topic subscriptions for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714721 (owner: 10Bartosz Dziewoński) [18:01:09] (03CR) 10Urbanecm: [C: 03+2] Disable upcoming DiscussionTools automatic topic 
subscriptions for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714721 (owner: 10Bartosz Dziewoński) [18:01:39] arlolra: hi, are you around? [18:01:43] yes [18:01:54] (03Merged) 10jenkins-bot: Enable topic subscriptions as a beta feature on Wikipedias except enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714720 (https://phabricator.wikimedia.org/T287801) (owner: 10Bartosz Dziewoński) [18:01:55] great [18:02:03] (03Merged) 10jenkins-bot: Disable upcoming DiscussionTools automatic topic subscriptions for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714721 (owner: 10Bartosz Dziewoński) [18:02:29] MatmaRex: hello, your patches are available at mwdebug2001, please have a look [18:03:21] (03CR) 10Cicalese: Remove call to text() on string. (031 comment) [skins/WikimediaApiPortal] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/714679 (https://phabricator.wikimedia.org/T289692) (owner: 10Jforrester) [18:03:28] urbanecm: seems good [18:03:34] thanks, syncing [18:03:46] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10decommission-hardware: decommission bast4002.wikimedia.org - https://phabricator.wikimedia.org/T288579 (10RobH) IRC Update from my chat with @BBlack This old host is non-ideal but would work as a fallback ganeti host. I'll create a setup task for the host to r... [18:04:42] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [18:04:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:48] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:05:00] dancy: hey, ok to +2 your backport? [18:05:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware, 10observability: decommission logstash102[012] - https://phabricator.wikimedia.org/T283507 (10Cmjohnson) [18:05:06] yes please! 
[18:05:10] (03CR) 10Urbanecm: [C: 03+2] Remove call to text() on string. [skins/WikimediaApiPortal] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/714679 (https://phabricator.wikimedia.org/T289692) (owner: 10Jforrester) [18:05:15] done :) [18:05:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware, 10observability: decommission logstash102[012] - https://phabricator.wikimedia.org/T283507 (10Cmjohnson) 05Open→03Resolved [18:05:23] 10SRE, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q1): Decommission old ELK5 Logstash cluster - https://phabricator.wikimedia.org/T281266 (10Cmjohnson) [18:05:44] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 2b14eb525e99008d5103a93c5bd01f75211dca99: Enable topic subscriptions as a beta feature on Wikipedias except enwiki (T287801) (duration: 01m 06s) [18:05:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:48] T287801: Deploy config to introduce manual topic subscriptions as Beta Feature at Phase 2 projects - https://phabricator.wikimedia.org/T287801 [18:06:07] (03PS3) 10Urbanecm: Disable legacy media dom on a few more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714635 (https://phabricator.wikimedia.org/T51097) (owner: 10Arlolra) [18:06:11] (03CR) 10Urbanecm: [C: 03+2] Disable legacy media dom on a few more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714635 (https://phabricator.wikimedia.org/T51097) (owner: 10Arlolra) [18:06:41] kormat: wait until the refresh script kicks in [18:06:44] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:06:47] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [18:06:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:49] !log cmjohnson@cumin1001 START - Cookbook 
sre.dns.netbox [18:06:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:54] (03Merged) 10jenkins-bot: Disable legacy media dom on a few more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714635 (https://phabricator.wikimedia.org/T51097) (owner: 10Arlolra) [18:07:06] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 5182ac88263f23c15a3b10d0f3bc2e492fe425d5: Disable upcoming DiscussionTools automatic topic subscriptions for now (duration: 01m 04s) [18:07:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:07:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:21] arlolra: your patch is at mwdebug2001, can you please have a look? [18:07:27] sure, one sec [18:07:36] (03PS3) 10Urbanecm: Add Wikimedia ES to $wgCopyUploadsDomains whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714211 (https://phabricator.wikimedia.org/T289446) (owner: 10Platonides) [18:08:09] (03CR) 10Andrew Bogott: [C: 03+2] nova_fullstack: try to get the puppet state from a couple places [puppet] - 10https://gerrit.wikimedia.org/r/714761 (https://phabricator.wikimedia.org/T289663) (owner: 10David Caro) [18:08:30] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:08:49] (03PS4) 10Urbanecm: Add Wikimedia ES to $wgCopyUploadsDomains whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714211 (https://phabricator.wikimedia.org/T289446) (owner: 10Platonides) [18:08:53] (03CR) 10Urbanecm: [C: 03+2] Add Wikimedia ES to $wgCopyUploadsDomains whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714211 (https://phabricator.wikimedia.org/T289446) (owner: 10Platonides) [18:08:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:08:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:02] hmm [18:10:06] arlolra: yes? [18:10:26] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:11:16] So I should see the effect with [18:11:16] curl -H 'X-Wikimedia-Debug: backend=mwdebug2001.codfw.wmnet' https://www.mediawiki.org/wiki/Parsoid [18:11:25] that is correct [18:11:51] and yet not ... [18:12:13] arlolra: hmm, wgParserEnableLegacyMediaDOM is evaluated at the parse hosts? or at mw* hosts? 
[18:12:37] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:12:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:02] 10ops-ulsfo, 10DC-Ops, 10Traffic: (Need By: TBD) rack/setup/install ganeti4004 - https://phabricator.wikimedia.org/T289715 (10RobH) [18:13:16] 10ops-ulsfo, 10DC-Ops, 10Traffic: (Need By: TBD) rack/setup/install ganeti4004 - https://phabricator.wikimedia.org/T289715 (10RobH) [18:13:22] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10decommission-hardware: decommission bast4002.wikimedia.org - https://phabricator.wikimedia.org/T288579 (10RobH) [18:13:29] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [18:13:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:46] urbanecm: can you explain the difference? [18:13:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation, 10decommission-hardware: decommission snapshot100[5,6,7].eqiad.wmnet - https://phabricator.wikimedia.org/T282078 (10Cmjohnson) [18:14:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation, 10decommission-hardware: decommission snapshot100[5,6,7].eqiad.wmnet - https://phabricator.wikimedia.org/T282078 (10Cmjohnson) 05Open→03Resolved [18:14:20] arlolra: I'm afraid that parsing could happen outside of the debug host, which could explain what you're seeing [18:14:37] I see [18:14:43] 10SRE, 10ops-eqiad, 10decommission-hardware: Decommission mc[1019-1023,1025-1026,1028-1036].eqiad.wmnet (WIP) - https://phabricator.wikimedia.org/T289657 (10Cmjohnson) a:03Cmjohnson [18:14:50] it's fine to proceed [18:14:54] arlolra: i can get it out and you can test after that? 
[18:14:55] okay [18:15:26] !log robh@cumin1001 START - Cookbook sre.dns.netbox [18:15:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:35] 10SRE, 10SRE-Access-Requests: Requesting access to Stat1007 for jmando - https://phabricator.wikimedia.org/T289606 (10KFrancis) @jcrespo Hi Jaime, Joseph Mando is a currently contractor wit the WMF and therefore the NDA is covered under the contractor agreement signed. Thanks! [18:15:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:15:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:05] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10decommission-hardware: decommission bast4002.wikimedia.org - https://phabricator.wikimedia.org/T288579 (10RobH) 05Open→03Resolved [18:16:22] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: e6df0803e4eaca91bd725bcd376b260b97917de3: Disable legacy media dom on a few more wikis (T51097) (duration: 01m 05s) [18:16:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:25] arlolra: here you go [18:16:26] T51097: Use figure and figcaption HTML5 elements when possible - https://phabricator.wikimedia.org/T51097 [18:16:31] hopefully it works now :) [18:17:07] (03CR) 10Urbanecm: [C: 03+2] Add Wikimedia ES to $wgCopyUploadsDomains whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714211 (https://phabricator.wikimedia.org/T289446) (owner: 10Platonides) [18:17:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:22] urbanecm: yes, working [18:17:22] thanks [18:17:28] great! 
[18:17:29] any time [18:17:31] 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission eventlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T282025 (10wiki_willy) a:03Cmjohnson [18:17:49] fwiw backports should be ready any second [18:18:37] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: e7c8c041faa974585128c48631522a401fb3d41d: Add Wikimedia ES to $wgCopyUploadsDomains whitelist (T289446) (duration: 01m 04s) [18:18:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:41] T289446: Please add *.wikimedia.es to the wgCopyUploadsDomains whitelist - https://phabricator.wikimedia.org/T289446 [18:18:51] urbanecm: I might have needed to purged the parser cache for my request above [18:18:57] in any case [18:19:02] it works :) [18:19:15] :) [18:19:22] (03Merged) 10jenkins-bot: Fix reference to renamed abortAllApiRequests method [extensions/Flow] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/714678 (https://phabricator.wikimedia.org/T289648) (owner: 10Gergő Tisza) [18:19:32] (03Merged) 10jenkins-bot: Remove call to text() on string. [skins/WikimediaApiPortal] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/714679 (https://phabricator.wikimedia.org/T289692) (owner: 10Jforrester) [18:20:13] tgr: dancy: your backports are at mwdebug2001! [18:21:29] thanks urbanecm, it works [18:21:33] thanks tgr, syncing [18:21:37] 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission mc1024 - https://phabricator.wikimedia.org/T272074 (10wiki_willy) a:03Cmjohnson Hi @Dzahn - just a quick reminder to add the "ops-eqiad" project tag when the servers are ready for dc-ops to unrack. Much appreciated. Thanks, Willy [18:21:39] 10SRE, 10SRE-Access-Requests: Requesting access to Stat1007 for jmando - https://phabricator.wikimedia.org/T289606 (10jcrespo) [18:21:58] urbanecm: OK to proceed. 
[18:22:05] thanks dancy [18:23:09] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:23:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:38] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.20/skins/WikimediaApiPortal/src/Component/NotificationAlertComponent.php: a5bfcc8def96ad1b44fff31c4c1965311be2982a: Remove call to text() on string (T289692) (duration: 01m 04s) [18:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:42] T289692: api.wikimedia.org fatal exception: Error: Call to a member function text() on string - https://phabricator.wikimedia.org/T289692 [18:24:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:24:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:13] 10SRE, 10SRE-Access-Requests: Requesting access to Stat1007 for jmando - https://phabricator.wikimedia.org/T289606 (10jcrespo) a:05KFrancis→03odimitrijevic Thank you for the quick clarification. Assigning to @odimitrijevic for the Data Engineering approval. [18:25:11] 10SRE, 10ops-eqiad, 10decommission-hardware: Decommission mc[1019-1023,1025-1026,1028-1036].eqiad.wmnet (WIP) - https://phabricator.wikimedia.org/T289657 (10wiki_willy) Hi @jijiki - hope all is well. We were wondering if it would be possible to prioritize the decom of mc1033 and 1034? It would help us with... 
[18:25:19] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.20/extensions/Flow/modules/editor/editors/visualeditor/ui/inspectors/mw.flow.ve.ui.MentionInspector.js: dd464b4522effbfabea371f8b95b0b25d53da43e: Fix reference to renamed abortAllApiRequests method (T289648) (duration: 01m 04s) [18:25:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:24] T289648: Impossible to close "Mention" popup: Uncaught TypeError: this.transclusionModel.abortRequests is not a function - https://phabricator.wikimedia.org/T289648 [18:25:24] tgr: should be live [18:25:30] anything else from anyone? [18:25:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:25:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:45] thank you! [18:25:46] thanks! [18:26:02] any time! [18:28:46] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: ASAP) rack/setup/install ms-be10[64-67] - https://phabricator.wikimedia.org/T285808 (10wiki_willy) Just a quick summary of what Chris and I went over: - the decom of maps1002 has been taken care of via T289271 to free up rack space in B4 - we're asking Service-Ops if... 
[18:30:30] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [18:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:55] 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission eventlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T282025 (10Cmjohnson) [18:31:16] 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission eventlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T282025 (10Cmjohnson) 05Open→03Resolved [18:31:38] 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission mc1024 - https://phabricator.wikimedia.org/T272074 (10Cmjohnson) [18:31:42] 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission mc1024 - https://phabricator.wikimedia.org/T272074 (10Cmjohnson) 05Open→03Resolved [18:31:52] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:32:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:32:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic, 10decommission-hardware: reclaim cescout1001.eqiad.wmnet - https://phabricator.wikimedia.org/T275696 (10wiki_willy) Much appreciated @ssingh, thanks! [18:33:32] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:33:32] PROBLEM - Long running screen/tmux on maps2004 is CRITICAL: CRIT: Long running SCREEN process. (user: root PID: 17726, 1741826s 1728000s). 
https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [18:34:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:34:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:24] (03CR) 10RLazarus: [V: 03+1 C: 03+2] hieradata: Run httpbb hourly from cumin1001 against an eqiad appserver. [puppet] - 10https://gerrit.wikimedia.org/r/714646 (https://phabricator.wikimedia.org/T289202) (owner: 10RLazarus) [18:37:40] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:37:53] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Three ports on asw2-d-eqiad are not working as expected - https://phabricator.wikimedia.org/T247881 (10Cmjohnson) @ayounsi I did some investigating on this today and there have been servers plugged into 2 of 3 (ge-1/0/5 and 1/0/6) ports now for qu... [18:40:48] (03CR) 10Jforrester: "Thank you for working on this." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714770 (owner: 10Nikerabbit) [18:41:00] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): cloud cumin: exclude certain projects from "A:all" - https://phabricator.wikimedia.org/T289706 (10Andrew) awesome! [18:41:20] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:42:42] !log robh@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [18:42:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:05] (03PS1) 10Herron: prometheus: add recording rules for etcd error slo [puppet] - 10https://gerrit.wikimedia.org/r/714814 (https://phabricator.wikimedia.org/T289615) [18:43:27] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:14] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:45:12] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:46:10] 10SRE, 10ops-eqiad, 10decommission-hardware: Decommission mc[1019-1023,1025-1026,1028-1036].eqiad.wmnet (WIP) - https://phabricator.wikimedia.org/T289657 (10Cmjohnson) a:05Cmjohnson→03jijiki [18:46:18] rzl ^ I am assuming httpbb is WIP/in setup, right? 
[18:48:21] 10SRE, 10DNS, 10Traffic, 10WMF-Communications, and 3 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776 (10Varnent) [18:48:51] jynus: yep, I've got it, thanks [18:49:31] (03CR) 10Herron: "My thinking here is the query used by SLO dashboards would become:" [puppet] - 10https://gerrit.wikimedia.org/r/714814 (https://phabricator.wikimedia.org/T289615) (owner: 10Herron) [18:52:28] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [18:53:28] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [18:53:51] weird spike [18:53:55] hm, httpbb succeeded on a retry [18:54:05] (and the timing doesn't align with that latency spike) [18:54:20] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:54:21] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-2d&orgId=1&to=now&var-cluster=api_appserver&var-datasource=eqiad%20prometheus%2Fops&var-method=GET&viewPanel=9 [18:54:29] gonna leave it there, but if it turns out to be flaky in general, I might either extend the timeouts or give it a retry or something [18:55:12] legoktm: yeah, and the appservers and parsoid saw it too [18:58:29] 10SRE-swift-storage, 10Thumbor, 10Traffic: Thumbnail of deleted image shown in "File history" after new image with same filename got uploaded - https://phabricator.wikimedia.org/T281780 (10AntiCompositeNumber) [18:59:40] legoktm: something memcachey I guess https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=1629914367626&orgId=1&to=1629917967626&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-method=GET&var-code=200 [18:59:52] everything on the actual memcache dashboards looks fine afaict though [19:00:00] OH WAIT the alert was for eqiad [19:00:04] dancy and brennen: Time to snap out of that daydream and deploy MediaWiki train - American Version. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210825T1900). [19:00:52] I still don't see much on the dashboard but now I don't care as much :P [19:00:56] ..... 
[19:01:05] I didn't notice that either haha [19:01:56] yeah, at 25 reqs/s if a few slow down that could easily be a significant spike [19:02:03] yeah [19:06:46] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [19:07:16] 10SRE-swift-storage, 10Thumbor, 10Traffic: Thumbnail of deleted image shown in "File history" after new image with same filename got uploaded - https://phabricator.wikimedia.org/T281780 (10AntiCompositeNumber) File:Benjamin Ola Akande.jpg was deleted after this task was filed, but the [[https://upload.wikime... [19:10:52] (03PS1) 10Ahmon Dancy: group1 wikis to 1.37.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714816 [19:10:54] (03CR) 10Ahmon Dancy: [C: 03+2] group1 wikis to 1.37.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714816 (owner: 10Ahmon Dancy) [19:11:39] (03Merged) 10jenkins-bot: group1 wikis to 1.37.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714816 (owner: 10Ahmon Dancy) [19:13:10] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.37.0-wmf.20 [19:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:15] !log dancy@deploy1002 Synchronized php: group1 wikis to 1.37.0-wmf.20 (duration: 01m 04s) [19:14:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:16:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[19:17:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:10] (03PS1) 10Ahmon Dancy: group1 wikis to 1.37.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714817 [19:25:12] (03CR) 10Ahmon Dancy: [C: 03+2] group1 wikis to 1.37.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714817 (owner: 10Ahmon Dancy) [19:25:22] Rolling back the train due [19:25:29] .. due to errors [19:25:55] dancy: i'll file one for that PageNumberNotFound error; also looks like T289717 should probably block [19:25:56] T289717: Wikimedia\Assert\PostconditionException: Postcondition failed: Revision had no page - https://phabricator.wikimedia.org/T289717 [19:26:00] (03Merged) 10jenkins-bot: group1 wikis to 1.37.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714817 (owner: 10Ahmon Dancy) [19:26:04] Agreed and thanks. [19:27:13] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.37.0-wmf.19 [19:27:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:19] !log dancy@deploy1002 Synchronized php: group1 wikis to 1.37.0-wmf.19 (duration: 01m 05s) [19:28:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:46] (03CR) 10RLazarus: "n" [puppet] - 10https://gerrit.wikimedia.org/r/714814 (https://phabricator.wikimedia.org/T289615) (owner: 10Herron) [19:29:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:29:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:55] (03CR) 10RLazarus: prometheus: add recording rules for etcd error slo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/714814 (https://phabricator.wikimedia.org/T289615) (owner: 10Herron) [19:31:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[19:31:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:36] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:34:32] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:40:54] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:42:48] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:45:23] !log Start server-side upload for ~2 GB tiff file (T289711) [19:45:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:28] T289711: Server side upload to commons: (>2GB TIFF) - https://phabricator.wikimedia.org/T289711 [19:48:34] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:48:45] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10User-fgiunchedi: Decom ms-be[1019-1026] - https://phabricator.wikimedia.org/T272836 (10Jclark-ctr) [19:54:02] !log enwikisource: Start server-side upload for one video file (T289698) [19:54:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:06] T289698: Server side upload to enwikisource (>200MB DJVU) - https://phabricator.wikimedia.org/T289698 [19:55:47] (03PS1) 10Urbanecm: knwiki: Disable wmgNewUserMessageOnAutoCreate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714827 (https://phabricator.wikimedia.org/T289333) [19:56:00] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:00:04] dancy and brennen: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - American Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210825T1900). [20:00:04] chrisalbon and accraze: #bothumor My software never has bugs. It just develops random features. Rise for Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210825T2000). [20:00:28] really going to have to fix that jouncebot bug one of these weeks. [20:00:56] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:01:20] brennen: which one [20:01:40] RhinosF1: it doubles up on announcing train windows since they overlap with later ones. [20:01:56] Oh I see [20:02:52] !log 1.37.0-wmf.20 (T281161) status: blocked at group0; 2/3 blockers have probable patches, all seem to be getting attention, so holding off on blocker mail for now. 
[20:02:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:57] T281161: 1.37.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T281161 [20:03:22] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:04:03] (03PS3) 10H.krishna123: bernard: Changes to dashboard, add indidiual section data, minor fixes [software/bernard] - 10https://gerrit.wikimedia.org/r/714748 (https://phabricator.wikimedia.org/T289441) [20:04:21] (03CR) 10H.krishna123: "recheck" [software/bernard] - 10https://gerrit.wikimedia.org/r/714748 (https://phabricator.wikimedia.org/T289441) (owner: 10H.krishna123) [20:05:21] (03CR) 10RhinosF1: bernard: Changes to dashboard, add indidiual section data, minor fixes (032 comments) [software/bernard] - 10https://gerrit.wikimedia.org/r/714748 (https://phabricator.wikimedia.org/T289441) (owner: 10H.krishna123) [20:09:18] (03CR) 10H.krishna123: "Ah interesting, didn't know. 
Many thanks" [software/bernard] - 10https://gerrit.wikimedia.org/r/714748 (https://phabricator.wikimedia.org/T289441) (owner: 10H.krishna123) [20:09:43] (03PS4) 10H.krishna123: bernard: Changes to dashboard, add individual sections data, minor fixes [software/bernard] - 10https://gerrit.wikimedia.org/r/714748 (https://phabricator.wikimedia.org/T289441) [20:09:45] 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install - https://phabricator.wikimedia.org/T289732 (10RobH) [20:09:51] 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install - https://phabricator.wikimedia.org/T289732 (10RobH) [20:10:37] 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: (Need By: TBD) rack/setup/install - https://phabricator.wikimedia.org/T289732 (10RobH) a:03Jclark-ctr [20:11:45] (03CR) 10RhinosF1: bernard: Changes to dashboard, add individual sections data, minor fixes (031 comment) [software/bernard] - 10https://gerrit.wikimedia.org/r/714748 (https://phabricator.wikimedia.org/T289441) (owner: 10H.krishna123) [20:14:13] 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: (Need By: TBD) rack/setup/install puppetmaster100[45] - https://phabricator.wikimedia.org/T289732 (10RobH) [20:14:30] 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: (Need By: TBD) rack/setup/install puppetmaster200[45] - https://phabricator.wikimedia.org/T289733 (10RobH) [20:15:09] 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: (Need By: TBD) rack/setup/install puppetmaster200[45] - https://phabricator.wikimedia.org/T289733 (10RobH) a:03Papaul [20:16:12] (03PS1) 10Zabe: Make sure params is an array [extensions/VisualEditor] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/714681 (https://phabricator.wikimedia.org/T289730) [20:16:18] 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: (Need By: TBD) rack/setup/install puppetmaster200[45] - https://phabricator.wikimedia.org/T289733 (10RobH) [20:19:46] (03PS1) 10Andrew Bogott: nova vendor-data: another mild attempt to avoid races with the puppet 
agent [puppet] - 10https://gerrit.wikimedia.org/r/714831 (https://phabricator.wikimedia.org/T289663) [20:20:03] 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: (Need By: TBD) rack/setup/install puppetmaster100[45] - https://phabricator.wikimedia.org/T289732 (10RhinosF1) [20:21:54] (03CR) 10Andrew Bogott: [C: 03+2] nova vendor-data: another mild attempt to avoid races with the puppet agent [puppet] - 10https://gerrit.wikimedia.org/r/714831 (https://phabricator.wikimedia.org/T289663) (owner: 10Andrew Bogott) [20:23:53] PROBLEM - Check whether ferm is active by checking the default input chain on labstore1006 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:25:30] 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: (Need By: TBD) rack/setup/install puppetmaster200[45].codfw.wmnet - https://phabricator.wikimedia.org/T289733 (10RobH) [20:26:03] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:27:03] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:27:12] 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: (Need By: TBD) rack/setup/install puppetmaster100[45].eqiad.wmnet - https://phabricator.wikimedia.org/T289732 (10RobH) [20:36:43] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:37:33] dancy: hi, i just +2'ed the fix for T289731. Do you want me to backport it to .20 too? 
[20:37:33] T289731: PHP Deprecated: Use of Parser::getUser was deprecated in MediaWiki 1.36. [Called from SimpleCaptcha::findLinks] - https://phabricator.wikimedia.org/T289731 [20:38:01] Yes please! [20:38:12] ok, will do once it merges [20:39:44] jouncebot: nowandnext [20:39:44] For the next 0 hour(s) and 20 minute(s): MediaWiki train - American Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210825T1900) [20:39:44] For the next 0 hour(s) and 20 minute(s): Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210825T2000) [20:39:44] In 2 hour(s) and 20 minute(s): Evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210825T2300) [20:40:02] urbanecm: Thanks, I see you +2 the change to mediawiki-config :) [20:40:23] dancy: ok if i claim a mwdebug host now? want to test a fix for train blocker, too :-) [20:40:31] All yours. [20:40:35] thanks [20:41:11] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:41:24] Platonides: no problem. 
Ordinarily I'd ask you to schedule it for deployment, but since it's so trivial, i just went ahead since i was deploying other stuff anyway :) [20:46:16] 10SRE, 10MW-on-K8s, 10Release Pipeline, 10Release-Engineering-Team: Unable to pull restricted/mediawiki-multiversion image to kubestage1002.eqiad.wmnet - https://phabricator.wikimedia.org/T289737 (10dancy) [20:49:00] (03PS1) 10Tpt: Fixes exception thrown by FilePagination::getPageNumber [extensions/ProofreadPage] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/714682 (https://phabricator.wikimedia.org/T289728) [20:49:26] 13 [20:50:15] ^^I'll deploy once the master patch merges [20:50:39] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:52:48] (03PS1) 10Zabe: Use Parser::getUserIdentity() instead of ::getUser() in SimpleCaptcha [extensions/ConfirmEdit] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/714683 (https://phabricator.wikimedia.org/T289731) [20:52:50] (03PS2) 10Urbanecm: Fixes exception thrown by FilePagination::getPageNumber [extensions/ProofreadPage] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/714682 (https://phabricator.wikimedia.org/T289728) (owner: 10Tpt) [20:52:55] (03CR) 10Urbanecm: [C: 03+2] Fixes exception thrown by FilePagination::getPageNumber [extensions/ProofreadPage] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/714682 (https://phabricator.wikimedia.org/T289728) (owner: 10Tpt) [20:53:13] (03PS2) 10Urbanecm: Use Parser::getUserIdentity() instead of ::getUser() in SimpleCaptcha [extensions/ConfirmEdit] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/714683 (https://phabricator.wikimedia.org/T289731) (owner: 10Zabe) [20:53:15] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:53:18] meh, too late [20:53:20] thanks zabe [20:53:24] (03CR) 10Urbanecm: [C: 03+2] Use Parser::getUserIdentity() instead of ::getUser() in SimpleCaptcha [extensions/ConfirmEdit] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/714683 (https://phabricator.wikimedia.org/T289731) (owner: 10Zabe) [20:54:47] RECOVERY - Check whether ferm is active by checking the default input chain on labstore1006 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:57:25] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:59:07] (03Merged) 10jenkins-bot: Fixes exception thrown by FilePagination::getPageNumber [extensions/ProofreadPage] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/714682 (https://phabricator.wikimedia.org/T289728) (owner: 10Tpt) [20:59:21] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:00:57] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:01:28] urbanecm: thanks for deploying that fix :) [21:01:30] np [21:01:45] Many thanks to urbanecm today. 
[21:03:13] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.20/extensions/ProofreadPage/: 913043a5ca7982e07ab0c01f88076af866a43cc3: Fixes exception thrown by FilePagination::getPageNumber (T289728) (duration: 01m 06s) [21:03:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:18] T289728: ProofreadPage\PageNumberNotFoundException: Page:[page] provides invalid page number - https://phabricator.wikimedia.org/T289728 [21:03:34] urbanecm is my go to for deploys [21:03:40] 🙂 [21:04:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [21:04:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [21:05:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:53] (03PS1) 10Bartosz Dziewoński: PageStore: Pass query flags to getPageByName() [core] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/714684 (https://phabricator.wikimedia.org/T289717) [21:08:00] (03PS1) 10Bartosz Dziewoński: EventDispatcher: Try really, really hard to read from master [extensions/DiscussionTools] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/714685 (https://phabricator.wikimedia.org/T289717) [21:09:07] MatmaRex: hi, if you can help me making sure it works, happy to get it live too [21:09:29] urbanecm: sure, i was just going to ask if anyone wants to deploy them [21:09:42] urbanecm: there isn't really a test for mine though, just watching the logs [21:09:49] (03Merged) 10jenkins-bot: Use Parser::getUserIdentity() instead of ::getUser() in SimpleCaptcha [extensions/ConfirmEdit] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/714683 (https://phabricator.wikimedia.org/T289731) (owner: 10Zabe) [21:09:58] so there's no way to test it directly? 
[21:10:29] not really [21:10:35] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:10:40] pretty sure the volume was low enough that it couldn't be happening every time [21:10:45] okay [21:10:58] i'll just get it live then and hope for the best [21:11:00] (03CR) 10Urbanecm: [C: 03+2] EventDispatcher: Try really, really hard to read from master [extensions/DiscussionTools] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/714685 (https://phabricator.wikimedia.org/T289717) (owner: 10Bartosz Dziewoński) [21:11:02] (03CR) 10Urbanecm: [C: 03+2] PageStore: Pass query flags to getPageByName() [core] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/714684 (https://phabricator.wikimedia.org/T289717) (owner: 10Bartosz Dziewoński) [21:12:31] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:14:15] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [21:14:46] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.20/extensions/ConfirmEdit/SimpleCaptcha/SimpleCaptcha.php: 190d8b7579af981cf2f5e4a6d9457ee0a7edca3f: Use Parser::getUserIdentity() instead of ::getUser() in SimpleCaptcha (T289731) (duration: 01m 05s) [21:14:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:51] T289731: PHP Deprecated: Use of Parser::getUser was deprecated in MediaWiki 1.36. 
[Called from SimpleCaptcha::findLinks] - https://phabricator.wikimedia.org/T289731 [21:14:56] zabe: ^ [21:15:29] thx [21:15:45] MatmaRex: is there a good way to test https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/714681/ ? [21:16:58] yeah, one sec [21:17:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [21:17:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:04] or, there should be, i'm struggling to reproduce locally :D [21:19:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [21:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:44] i guess i don't actually know how to test it [21:22:43] oh wait, i do [21:22:51] duh [21:23:21] https://test.wikipedia.org/w/api.php?action=visualeditor&format=json&page=A&paction=parse&section=new&pst=1&preload=B [21:23:24] zabe: urbanecm: ^ [21:23:26] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: ASAP) rack/setup/install ms-be10[64-67] - https://phabricator.wikimedia.org/T285808 (10Jclark-ctr) ms-be1064 A4 U17. Cable#11035 port#25 ms-be1065 B4 U2. Cable#11036. port#29 ms-be1066 C2 U19. Cable#11037. port#22 [21:23:33] good [21:23:36] (that should not output an exception) [21:23:40] MatmaRex: so, what should i +2? 
:D [21:23:55] urbanecm: we would like to backport https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/714681/ [21:24:05] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:24:07] (03CR) 10Urbanecm: [C: 03+2] Make sure params is an array [extensions/VisualEditor] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/714681 (https://phabricator.wikimedia.org/T289730) (owner: 10Zabe) [21:24:09] okay [21:24:19] Happy to see all these fixes going in. [21:24:26] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: ASAP) rack/setup/install ms-be10[64-67] - https://phabricator.wikimedia.org/T285808 (10Jclark-ctr) [21:25:00] dancy: better than errors going out i guess : [21:25:01] :) [21:27:59] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:30:34] (03Merged) 10jenkins-bot: PageStore: Pass query flags to getPageByName() [core] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/714684 (https://phabricator.wikimedia.org/T289717) (owner: 10Bartosz Dziewoński) [21:30:38] (03Merged) 10jenkins-bot: EventDispatcher: Try really, really hard to read from master [extensions/DiscussionTools] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/714685 (https://phabricator.wikimedia.org/T289717) (owner: 10Bartosz Dziewoński) [21:31:46] syncing the core patch [21:32:48] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.20/includes/page/PageStore.php: 34fb2b99104d0a2bda8aa202f4cdeb07cb983531: PageStore: Pass query flags to getPageByName() (T289717; T195069) (duration: 01m 06s) [21:32:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:53] T289717: Wikimedia\Assert\PostconditionException: 
Postcondition failed: Revision had no page - https://phabricator.wikimedia.org/T289717 [21:32:54] T195069: Factor PageStore and PageRecord out of WikiPage - https://phabricator.wikimedia.org/T195069 [21:34:10] and the DT patch [21:35:09] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.20/extensions/DiscussionTools/includes/Notifications/EventDispatcher.php: cc04b33dec6b9aed1d7621957c4de527266600d1: EventDispatcher: Try really, really hard to read from master (T289717) (duration: 01m 04s) [21:35:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:38] MatmaRex: ^^, please verify as train goes forward and close the task when possible :) [21:35:44] a commit message after my own heart [21:36:16] heh [21:36:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [21:36:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [21:37:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:48] urbanecm: do you know if folks are planning to retry the train today? [21:40:08] MatmaRex: no, i think that's questions for dancy [21:40:29] i was going to just close it, and folks to reopen it if the exceptions reappear [21:40:35] (03CR) 10jerkins-bot: [V: 04-1] Make sure params is an array [extensions/VisualEditor] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/714681 (https://phabricator.wikimedia.org/T289730) (owner: 10Zabe) [21:41:00] ... [21:41:07] I do not expect to roll forward again today. 
[21:41:21] looks like unrelated failure [21:41:22] trying again [21:41:28] (03CR) 10Urbanecm: [C: 03+2] Make sure params is an array [extensions/VisualEditor] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/714681 (https://phabricator.wikimedia.org/T289730) (owner: 10Zabe) [21:42:02] oh, I zee that Zabe's fixes for the remaining blockers are landing. [21:42:05] so I can reconsider. [21:42:30] I'll check with my train partner. [21:42:53] the Flow blocker is also backportable [21:43:08] MatmaRex: which one please? [21:43:16] oh, that's marked for next week [21:43:31] urbanecm: yours :D https://phabricator.wikimedia.org/T289625 [21:43:43] oh, someone merged it already [21:43:48] someone called MatmaRex [21:43:49] thanks [21:44:05] (03PS1) 10Urbanecm: BoardContent: Fix deprecation warning [extensions/Flow] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/714846 (https://phabricator.wikimedia.org/T289625) [21:44:11] (03CR) 10Urbanecm: [C: 03+2] BoardContent: Fix deprecation warning [extensions/Flow] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/714846 (https://phabricator.wikimedia.org/T289625) (owner: 10Urbanecm) [21:44:13] when i'm at it... [21:46:28] dancy: if everything is already patched within a few minutes from now, i'm inclined to say go for it. if it's going to drift past 15:00 pacific before that happens, we officially have a cutoff then. [21:47:01] * urbanecm is currently waiting for CI on the last blocker's fix [21:47:06] it's a cutoff sometimes honored more in the breach than the observance - often just a judgment call and depends on how fried the conductor's nerves are. [21:47:33] Sounds good. I need to leave early today so 15:00 is a good cutoff. [21:47:46] cool. let's roll forward first thing in US morning. 
[21:48:05] (03PS5) 10Dduvall: gitlab: Provide profile for docker based GitLab runners [puppet] - 10https://gerrit.wikimedia.org/r/708339 (https://phabricator.wikimedia.org/T287504) [21:49:58] (03PS1) 10Legoktm: shellbox: Remove unused service.port.nodePort [deployment-charts] - 10https://gerrit.wikimedia.org/r/714838 [21:50:00] (03PS1) 10Legoktm: shellbox-constraints: Remove unused service.port.nodePort [deployment-charts] - 10https://gerrit.wikimedia.org/r/714839 [21:51:52] (03CR) 10Dduvall: gitlab: Provide profile for docker based GitLab runners (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/708339 (https://phabricator.wikimedia.org/T287504) (owner: 10Dduvall) [21:53:49] (03CR) 10Legoktm: [C: 03+2] "Should be a no-op" [deployment-charts] - 10https://gerrit.wikimedia.org/r/714838 (owner: 10Legoktm) [21:53:57] (03CR) 10Legoktm: [C: 03+2] "Should be a no-op" [deployment-charts] - 10https://gerrit.wikimedia.org/r/714839 (owner: 10Legoktm) [21:56:25] (03Merged) 10jenkins-bot: shellbox: Remove unused service.port.nodePort [deployment-charts] - 10https://gerrit.wikimedia.org/r/714838 (owner: 10Legoktm) [21:56:33] (03Merged) 10jenkins-bot: shellbox-constraints: Remove unused service.port.nodePort [deployment-charts] - 10https://gerrit.wikimedia.org/r/714839 (owner: 10Legoktm) [21:56:48] brennen: Morning +1 [21:58:11] !log legoktm@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'shellbox-constraints' for release 'main' . [21:58:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:27] i'll note on task. 
[21:58:29] forgot I left something undeployed from last time in shellbox-constraints [21:58:51] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:59:33] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:59:40] hm. [21:59:53] !log 1.37.0-wmf.20 train status (T281161) blockers should be patched shortly; as we've reached the 15:00 Pacific deploy cutoff for the day, train will resume first thing in US morning [21:59:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:57] T281161: 1.37.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T281161 [22:00:31] the httpbb failure is read timeouts from mw1414 again, that's weird [22:00:40] !log legoktm@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'shellbox-constraints' for release 'main' . [22:00:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:14] (03Merged) 10jenkins-bot: Make sure params is an array [extensions/VisualEditor] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/714681 (https://phabricator.wikimedia.org/T289730) (owner: 10Zabe) [22:02:17] (03Merged) 10jenkins-bot: BoardContent: Fix deprecation warning [extensions/Flow] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/714846 (https://phabricator.wikimedia.org/T289625) (owner: 10Urbanecm) [22:02:22] finally [22:02:57] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:03:26] https://test.wikipedia.org/w/api.php?action=visualeditor&format=json&page=A&paction=parse&section=new&pst=1&preload=B does not throw, syncing [22:04:55] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:04:58] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.20/extensions/VisualEditor/includes/ApiVisualEditor.php: 73478bc9c72286123cef69e57e0aef9e745dcff9: Make sure params is an array (T289730) (duration: 01m 04s) [22:05:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:02] T289730: TypeError: Argument 4 passed to MediaWiki\Content\Transform\ContentTransformer::preloadTransform() must be of the type array, null given - https://phabricator.wikimedia.org/T289730 [22:06:17] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [22:07:12] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.20/extensions/Flow/includes/Content/BoardContent.php: 694b94657d251df64145e8153b269094bba75be9: BoardContent: Fix deprecation warning (T289625) (duration: 01m 04s) [22:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:16] T289625: PHP Deprecated: Use of ParserOptions::getUser was deprecated in MediaWiki 1.36. [Called from Flow\Content\BoardContent::getParserOutput] - https://phabricator.wikimedia.org/T289625 [22:07:20] so, that should be all for today [22:07:28] zabe: MatmaRex: dancy: brennen: fyi ^š [22:07:32] thanks urbanecm [22:07:36] np [22:07:36] thanks :) [22:07:38] :o thanks [22:07:40] Beautiful [22:08:22] since it's all clear... 
[22:08:25] (03CR) 10Legoktm: [C: 03+2] debug.json: List primary DC servers first [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714418 (https://phabricator.wikimedia.org/T289246) (owner: 10Legoktm) [22:09:06] (03Merged) 10jenkins-bot: debug.json: List primary DC servers first [mediawiki-config] - 10https://gerrit.wikimedia.org/r/714418 (https://phabricator.wikimedia.org/T289246) (owner: 10Legoktm) [22:09:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [22:09:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:29] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:10:45] !log legoktm@deploy1002 Synchronized debug.json: List primary DC servers first (T289246) (duration: 01m 04s) [22:10:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:49] T289246: Unable to select backend server in WikimediaDebug extension - https://phabricator.wikimedia.org/T289246 [22:11:25] I still can't select a specific server, but now at least it's a usable server... [22:11:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [22:11:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:34] thanks legoktm :) [22:13:20] I'm waiting for the xkcd-space-bar complaint for the person who autopilots the selector and always goes to the third entry [22:13:35] legoktm: maybe add it to switch back docs ? [22:14:38] yeah, let me file a bug first [22:18:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[22:18:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [22:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:41] https://wikitech.wikimedia.org/w/index.php?title=Switch_Datacenter&type=revision&diff=1923257&oldid=1922601 [22:27:09] 10SRE, 10Gerrit, 10GitLab, 10Release-Engineering-Team, 10User-brennen: RelEng access to downtime alerts in Icinga for gitlab, gerrit, possibly other services? - https://phabricator.wikimedia.org/T289746 (10brennen) [22:31:45] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:35:37] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:43:36] 10SRE, 10Gerrit, 10GitLab, 10Icinga, and 3 others: RelEng access to downtime alerts in Icinga for gitlab, gerrit, possibly other services? 
- https://phabricator.wikimedia.org/T289746 (10Peachey88) [22:45:17] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:53:03] (03PS2) 10BryanDavis: toolhub: add LOGGING_CONSOLE_FORMATTER env var [deployment-charts] - 10https://gerrit.wikimedia.org/r/714656 (https://phabricator.wikimedia.org/T276374) [22:53:05] (03PS1) 10BryanDavis: toolhub: Add helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/714867 (https://phabricator.wikimedia.org/T280881) [23:00:05] RoanKattouw, Niharika, and Urbanecm: Dear deployers, time to do the Evening backport window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210825T2300). [23:00:05] No GERRIT patches in the queue for this window AFAICS. [23:07:42] I'd like to add a patch urbanecm Niharika [23:07:51] DannyS712: link? [23:08:17] (03PS1) 10DannyS712: GlobalWatchlistEntryLog: fix storing log id [extensions/GlobalWatchlist] (wmf/1.37.0-wmf.20) - 10https://gerrit.wikimedia.org/r/714847 (https://phabricator.wikimedia.org/T288385) [23:08:25] ^ [23:08:34] looking [23:08:55] (03CR) 10BryanDavis: [C: 03+2] toolhub: add LOGGING_CONSOLE_FORMATTER env var [deployment-charts] - 10https://gerrit.wikimedia.org/r/714656 (https://phabricator.wikimedia.org/T276374) (owner: 10BryanDavis) [23:09:25] DannyS712: could you clarify what the impact of the bug is? [23:10:13] I broke something on GlobalWatchlist in wmf.20, and the only reason it hasn't been noticed on prod yet is that the train was blocked and meta hasn't been upgraded. 
In the global watchlist, there is a link to a specific log entry - it was having eg https://meta.wikimedia.org/w/index.php?title=Special:Log&logid=undefined instead of the correct log [23:10:13] id, because the id was saved as .logId instead of .logid [23:10:55] gétít [23:11:02] (CR) Urbanecm: [C: +2] GlobalWatchlistEntryLog: fix storing log id [extensions/GlobalWatchlist] (wmf/1.37.0-wmf.20) - https://gerrit.wikimedia.org/r/714847 (https://phabricator.wikimedia.org/T288385) (owner: DannyS712) [23:11:05] since GlobalWatchlist is also deployed to testwiki you can confirm there, and confirm it works again on beta [23:11:06] let's ship it [23:11:17] you might want to fix the docs too DannyS712 [23:11:47] (Merged) jenkins-bot: toolhub: add LOGGING_CONSOLE_FORMATTER env var [deployment-charts] - https://gerrit.wikimedia.org/r/714656 (https://phabricator.wikimedia.org/T276374) (owner: BryanDavis) [23:12:16] DannyS712: can you also add it to the calendar please? [23:13:15] added [23:13:18] thx [23:14:51] good point about fixing the doc, sent https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GlobalWatchlist/+/714848 but that doesn't need to be backported [23:15:17] should I be ready to test on testwiki? Or can it be deployed directly?
[23:15:32] (Merged) jenkins-bot: GlobalWatchlistEntryLog: fix storing log id [extensions/GlobalWatchlist] (wmf/1.37.0-wmf.20) - https://gerrit.wikimedia.org/r/714847 (https://phabricator.wikimedia.org/T288385) (owner: DannyS712) [23:15:57] DannyS712: if you can test it easily, let's do it [23:16:21] that docs patch is not ready for review yet fwiw [23:16:49] yeah, should be fairly easy as long as I bypass my RL cache, I'll use debug made; set doc patch ready for review, thanks [23:16:59] *debug mode* [23:17:10] pulled to mwdebug2001 [23:17:21] (PS2) BryanDavis: toolhub: Add helmfile.d [deployment-charts] - https://gerrit.wikimedia.org/r/714867 (https://phabricator.wikimedia.org/T280881) [23:17:44] confirmed to work [23:18:01] syncing [23:19:11] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:19:34] (PS1) H.krishna123: bernard: Add simple documentation into README.md [software/bernard] - https://gerrit.wikimedia.org/r/714870 (https://phabricator.wikimedia.org/T289735) [23:19:54] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.20/extensions/GlobalWatchlist/modules/EntryLog.js: 230aec3fe7f3d0e325882a5fc926e9f3e4e86717: GlobalWatchlistEntryLog: fix storing log id (T288385) (duration: 01m 07s) [23:19:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:59] T288385: Add objects to represent data for entry rows - https://phabricator.wikimedia.org/T288385 [23:20:02] (CR) H.krishna123: "I've added the documentation to the README.md file" [software/bernard] - https://gerrit.wikimedia.org/r/714870 (https://phabricator.wikimedia.org/T289735) (owner: H.krishna123) [23:20:03] should be live [23:20:11] DannyS712: anything else?
[23:20:40] nope, it works, thanks [23:20:43] great [23:20:52] !log Evening B&C window completed [23:20:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:45] (CR) BryanDavis: [C: -1] "Blocked on figuring out the memcached setup which may require amending the chart." [deployment-charts] - https://gerrit.wikimedia.org/r/714867 (https://phabricator.wikimedia.org/T280881) (owner: BryanDavis) [23:22:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:43:27] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:44:11] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:48:25] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:53:31] (PS1) 4nn1l2: Install Extension Quiz on fa.wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/714872 (https://phabricator.wikimedia.org/T289381) [23:53:51] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:54:05] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is
fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:57:37] (CR) Zabe: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/714872 (https://phabricator.wikimedia.org/T289381) (owner: 4nn1l2) [23:59:07] (PS1) 4nn1l2: Install Extension Quiz on ja.wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/714873 (https://phabricator.wikimedia.org/T289383)